Apache Solr Reference Guide
Covering Apache Solr 5.1
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
Apache and the Apache feather logo are trademarks of The Apache Software Foundation. Apache Lucene, Apache
Solr and their respective logos are trademarks of the Apache Software Foundation. Please see the Apache
Trademark Policy for more information.
Fonts used in the Solr Reference Guide include Raleway, licensed under the SIL Open Font License, 1.1.
Apache Solr Reference Guide
This reference guide describes Apache Solr, the open source solution for search. You can download Apache
Solr from the Solr website at http://lucene.apache.org/solr/.
This Guide contains the following sections:
Getting Started: This section guides you through the installation and setup of Solr.
Using the Solr Administration User Interface: This section introduces the Solr Web-based user interface.
From your browser you can view configuration files, submit queries, view logfile settings and Java environment
settings, and monitor and control distributed configurations.
Documents, Fields, and Schema Design: This section describes how Solr organizes its data for indexing. It
explains how a Solr schema defines the fields and field types which Solr uses to organize data within the
document files it indexes.
Understanding Analyzers, Tokenizers, and Filters: This section explains how Solr prepares text for indexing
and searching. Analyzers parse text and produce a stream of tokens, lexical units used for indexing and
searching. Tokenizers break field data down into tokens. Filters perform other transformational or selective work
on token streams.
Indexing and Basic Data Operations: This section describes the indexing process and basic index operations,
such as commit, optimize, and rollback.
Searching: This section presents an overview of the search process in Solr. It describes the main components
used in searches, including request handlers, query parsers, and response writers. It lists the query parameters
that can be passed to Solr, and it describes features such as boosting and faceting, which can be used to
fine-tune search results.
The Well-Configured Solr Instance: This section discusses performance tuning for Solr. It begins with an
overview of the solrconfig.xml file, then tells you how to configure cores with solr.xml, how to configure
the Lucene index writer, and more.
Managing Solr: This section discusses important topics for running and monitoring Solr. Other topics include
how to back up a Solr instance, and how to run Solr with Java Management Extensions (JMX).
SolrCloud: This section describes SolrCloud, one of Solr's newest and most exciting features, which
provides comprehensive distributed capabilities.
Legacy Scaling and Distribution: This section tells you how to grow a Solr distribution by dividing a large index
into sections called shards, which are then distributed across multiple servers, or by replicating a single index
across multiple servers.
Client APIs: This section tells you how to access Solr through various client APIs, including JavaScript, JSON,
and Ruby.
About This Guide
This guide describes all of the important features and functions of Apache Solr. It is free to download from
http://lucene.apache.org/solr/.
Designed to provide high-level documentation, this guide is intended to be more encyclopedic and less of a
cookbook. It is structured to address a broad spectrum of needs, ranging from new developers getting started to
well-experienced developers extending their application or troubleshooting. It will be of use at any point in the
application life cycle, for whenever you need authoritative information about Solr.
The material as presented assumes that you are familiar with some basic search concepts and that you can read
XML. It does not assume that you are a Java programmer, although knowledge of Java is helpful when working
directly with Lucene or when developing custom extensions to a Lucene/Solr installation.
Special Inline Notes
Special notes are included throughout these pages.
Information: Notes with a blue background are used for information that is important for you to know.
Notes: Yellow notes are further clarifications of important points to keep in mind while using Solr.
Tip: Notes with a green background are Helpful Tips.
Warning: Notes with a red background are warning messages.
Hosts and Port Examples
The default port when running Solr is 8983. The samples, URLs and screenshots in this guide may show
different ports, because the port number that Solr uses is configurable. If you have not customized your
installation of Solr, please make sure that you use port 8983 when following the examples, or configure your own
installation to use the port numbers shown in the examples. For information about configuring port numbers, see
Managing Solr.
Similarly, URL examples use 'localhost' throughout; if you are accessing Solr from a location remote to the server
hosting Solr, replace 'localhost' with the proper domain or IP where Solr is running.
Paths
Path information is given relative to solr.home, which is the location under the main Solr installation where
Solr's collections and their conf and data directories are stored. When running the various examples
mentioned throughout this tutorial (i.e., bin/solr -e techproducts), the solr.home will be a sub-directory
of example/, created for you automatically.
Getting Started
Solr makes it easy for programmers to develop sophisticated, high-performance search applications with
advanced features such as faceting (arranging search results in columns with numerical counts of key terms).
Solr builds on another open source search technology: Lucene, a Java library that provides indexing and search
technology, as well as spellchecking, hit highlighting and advanced analysis/tokenization capabilities. Both Solr
and Lucene are managed by the Apache Software Foundation (www.apache.org).
The Lucene search library currently ranks among the top 15 open source projects and is one of the top 5 Apache
projects, with installations at over 4,000 companies. Lucene/Solr downloads have grown nearly ten times over
the past three years, with a current run-rate of over 6,000 downloads a day. The Solr search server, which
provides application builders a ready-to-use search platform on top of the Lucene search library, is the fastest
growing Lucene sub-project. Apache Lucene/Solr offers an attractive alternative to proprietary, licensed
search and discovery software.
This section helps you get Solr up and running quickly, and introduces you to the basic Solr architecture and
features. It covers the following topics:
Installing Solr: A walkthrough of the Solr installation process.
Running Solr: An introduction to running Solr. Includes information on starting up the servers, adding documents,
and running queries.
A Quick Overview: A high-level overview of how Solr works.
A Step Closer: An introduction to Solr's home directory and configuration options.
Solr Start Script Reference: A complete reference of all of the commands and options available with the bin/solr
script.
Installing Solr
This section describes how to install Solr. You can install Solr in any system where a suitable Java Runtime
Environment (JRE) is available, as detailed below. Currently this includes Linux, OS X, and Microsoft Windows.
The instructions in this section should work for any platform, with a few exceptions for Windows as noted.
Got Java?
You will need the Java Runtime Environment (JRE) version 1.7 or higher. At a command line, check your Java
version like this:
$ java -version
java version "1.7.0_55"
Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)
The output will vary, but you need to make sure you have version 1.7 or higher. If you don't have the required
version, or if the java command is not found, download and install the latest version from Oracle at http://www.or
.acle.com/technetwork/java/javase/downloads/index.html
Installing Solr
Solr is available from the Solr website at http://lucene.apache.org/solr/.
For Linux/Unix/OSX systems, download the .tgz file. For Microsoft Windows systems, download the .zip file.
When getting started, all you need to do is extract the Solr distribution archive to a directory of your choosing.
When you're ready to set up Solr for a production environment, please refer to the instructions provided on the
Taking Solr to Production page. To keep things simple for now, extract the Solr distribution archive to your local
home directory; for instance, on Linux, do:
$ cd ~/
$ tar zxf solr-5.0.0.tgz
Once extracted, you are now ready to run Solr using the instructions provided in the Running Solr section.
Running Solr
This section describes how to run Solr with an example schema, how to add documents, and how to run queries.
Start the Server
If you didn't start Solr after installing it, you can start it by running bin/solr from the Solr directory.
$ bin/solr start
If you are running Windows, you can start Solr by running bin\solr.cmd instead.
bin\solr.cmd start
This will start Solr in the background, listening on port 8983.
When you start Solr in the background, the script will wait to make sure Solr starts correctly before returning to
the command line prompt.
The bin/solr and bin\solr.cmd scripts allow you to customize how you start Solr. Let's work through a few
examples of using the bin/solr script (if you're running Solr on Windows, the bin\solr.cmd works the same
as what is shown in the examples below):
Solr Script Options
The bin/solr script has several options.
Script Help
To see how to use the bin/solr script, execute:
$ bin/solr -help
For specific usage instructions for the start command, do:
$ bin/solr start -help
Start Solr in the Foreground
Since Solr is a server, it is more common to run it in the background, especially on Unix/Linux. However, to start
Solr in the foreground, simply do:
$ bin/solr start -f
If you are running Windows, you can run:
bin\solr.cmd start -f
Start Solr with a Different Port
To change the port Solr listens on, you can use the -p parameter when starting, such as:
$ bin/solr start -p 8984
Stop Solr
When running Solr in the foreground (using -f), you can stop it using Ctrl-c. However, when running in the
background, you should use the stop command, such as:
$ bin/solr stop -p 8983
The stop command requires you to specify the port Solr is listening on, or you can use the -all parameter to
stop all running Solr instances.
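For example, to stop every Solr instance started from this installation (a minimal sketch of the -all option described above):
$ bin/solr stop -all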
Start Solr with a Specific Example Configuration
Solr also provides a number of useful examples to help you learn about key features. You can launch the
examples using the -e flag. For instance, to launch the "techproducts" example, you would do:
$ bin/solr -e techproducts
Currently, the available examples you can run are: techproducts, dih, schemaless, and cloud. See the section
Running with Example Configurations for details on each example.
Check if Solr is Running
If you're not sure if Solr is running locally, you can use the status command:
Getting Started with SolrCloud
Running the cloud example starts Solr in SolrCloud mode. For more information on starting Solr in
cloud mode, see the section Getting Started with SolrCloud.
$ bin/solr status
This will search for running Solr instances on your computer and then gather basic information about them, such
as the version and memory usage.
That's it! Solr is running. If you need convincing, use a Web browser to see the Admin Console at
http://localhost:8983/solr/.
The Solr Admin interface.
If Solr is not running, your browser will complain that it cannot connect to the server. Check your port number
and try again.
Create a Core
If you did not start Solr with an example configuration, you would need to create a core in order to be able to
index and search. You can do so by running:
$ bin/solr create -c <name>
This will create a core that uses a data-driven schema which tries to guess the correct field type when you add
documents to the index.
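For example, the indexing examples later in this section post documents to a core named "gettingstarted"; assuming Solr is already running locally, such a core could be created with (a sketch, the name is only a convention used in this guide):
$ bin/solr create -c gettingstarted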
To see all available options for creating a new core, execute:
$ bin/solr create -help
Add Documents
Solr is built to find documents that match queries. Solr's schema provides an idea of how content is structured
(more on the schema later), but without documents there is nothing to find. Solr needs input before it can do
much.
You may want to add a few sample documents before trying to index your own content. The Solr installation
comes with different types of example documents located under the sub-directories of the example/ directory of
your installation.
In the bin/ directory is the post script, a command line tool which can be used to index different types of
documents. Do not worry too much about the details for now. The Indexing and Basic Data Operations section
has all the details on indexing.
To see some information about the usage of bin/post, use the -help option.
bin/post can post various types of content to Solr, including files in Solr's native XML and JSON formats, CSV
files, a directory tree of rich documents, or even a simple short web crawl. See the examples at the end of
`bin/post -help` for various commands to easily get started posting your content into Solr.
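For instance, a JSON sample file can be posted the same way as the XML files shown next (a sketch; it assumes the books.json sample file is present in example/exampledocs of your distribution):
$ bin/post -c gettingstarted example/exampledocs/books.json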
Go ahead and add all the documents in some example XML files:
$ bin/post -c gettingstarted example/exampledocs/*.xml
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/gettingstarted/update...
Entering auto mode. File endings considered are
xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt
,log
POSTing file gb18030-example.xml (application/xml) to [base]
POSTing file hd.xml (application/xml) to [base]
POSTing file ipod_other.xml (application/xml) to [base]
POSTing file ipod_video.xml (application/xml) to [base]
POSTing file manufacturers.xml (application/xml) to [base]
POSTing file mem.xml (application/xml) to [base]
POSTing file money.xml (application/xml) to [base]
POSTing file monitor.xml (application/xml) to [base]
POSTing file monitor2.xml (application/xml) to [base]
POSTing file mp500.xml (application/xml) to [base]
POSTing file sd500.xml (application/xml) to [base]
POSTing file solr.xml (application/xml) to [base]
POSTing file utf8-example.xml (application/xml) to [base]
POSTing file vidcard.xml (application/xml) to [base]
14 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/gettingstarted/update...
Time spent: 0:00:00.153
That's it! Solr has indexed the documents contained in those files.
Ask Questions
Now that you have indexed documents, you can perform queries. The simplest way is by building a URL that
includes the query parameters. This is exactly the same as building any other HTTP URL.
For example, the following query searches all document fields for "video":
http://localhost:8983/solr/gettingstarted/select?q=video
Notice how the URL includes the host name (localhost), the port number where the server is listening (8983),
the application name (solr), the request handler for queries (select), and finally, the query itself (q=video).
The results are contained in an XML document, which you can examine directly by clicking on the link above.
The document contains two parts. The first part is the responseHeader, which contains information about the
response itself. The main part of the reply is in the result tag, which contains one or more doc tags, each of
which contains fields from documents that match the query. You can use standard XML transformation
techniques to mold Solr's results into a form that is suitable for displaying to users. Alternatively, Solr can output
the results in JSON, PHP, Ruby and even user-defined formats.
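For example, the same query can be issued from a command line with curl, and the wt parameter can be used to request a JSON response instead of XML (a sketch, assuming a default local installation and the gettingstarted core used in this section):
$ curl 'http://localhost:8983/solr/gettingstarted/select?q=video&wt=json&indent=true'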
Just in case you are not running Solr as you read, the following screen shot shows the result of a query (the next
example, actually) as viewed in Mozilla Firefox. The top-level response contains a lst named responseHeader
and a result named response. Inside result, you can see the three docs that represent the search results.
An XML response to a query.
Once you have mastered the basic idea of a query, it is easy to add enhancements to explore the query syntax.
This one is the same as before but the results only contain the ID, name, and price for each returned document.
If you don't specify which fields you want, all of them are returned.
http://localhost:8983/solr/gettingstarted/select?q=video&fl=id,name,price
Here is another example which searches for "black" in the name field only. If you do not tell Solr which field to
search, it will search default fields, as specified in the schema.
http://localhost:8983/solr/gettingstarted/select?q=name:black
You can provide ranges for fields. The following query finds every document whose price is between $0 and
$400.
http://localhost:8983/solr/gettingstarted/select?q=price:[0%20TO%20400]&fl=id,name,price
Faceted browsing is one of Solr's key features. It allows users to narrow search results in ways that are
meaningful to your application. For example, a shopping site could provide facets to narrow search results by
manufacturer or price.
Faceting information is returned as a third part of Solr's query response. To get a taste of this power, take a look
at the following query. It adds facet=true and facet.field=cat.
http://localhost:8983/solr/gettingstarted/select?q=price:[0%20TO%20400]&fl=id,name,price&facet=true&facet.field=cat
In addition to the familiar responseHeader and response from Solr, a facet_counts element is also present.
Here is a view of an XML response with faceting, with the responseHeader and response collapsed so you can
see the faceting information clearly.
<response>
<lst name="responseHeader">
...
</lst>
<result name="response" numFound="9" start="0">
<doc>
<str name="id">SOLR1000</str>
<str name="name">Solr, the Enterprise Search Server</str>
<float name="price">0.0</float></doc>
...
</result>
<lst name="facet_counts">
<lst name="facet_queries"/>
<lst name="facet_fields">
<lst name="cat">
<int name="electronics">6</int>
<int name="memory">3</int>
<int name="search">2</int>
<int name="software">2</int>
<int name="camera">1</int>
<int name="copier">1</int>
<int name="multifunction printer">1</int>
<int name="music">1</int>
<int name="printer">1</int>
<int name="scanner">1</int>
<int name="connector">0</int>
<int name="currency">0</int>
<int name="graphics card">0</int>
<int name="hard drive">0</int>
<int name="monitor">0</int>
</lst>
</lst>
<lst name="facet_dates"/>
<lst name="facet_ranges"/>
</lst>
</response>
The facet information shows how many of the query results have each possible value of the cat field. You could
easily use this information to provide users with a quick way to narrow their query results. You can filter results
by adding one or more filter queries to the Solr request. Here is a request further constraining the request to
documents with a category of "software".
http://localhost:8983/solr/gettingstarted/select?q=price:[0%20TO%20400]&fl=id,name,price&facet=true&facet.field=cat&fq=cat:software
A Quick Overview
Having had some fun with Solr, you will now learn about all the cool things it can do.
Here is an example of how Solr might be integrated into an application:
In the scenario above, Solr runs alongside other server applications. For example, an online store application
would provide a user interface, a shopping cart, and a way to make purchases for end users; while an inventory
management application would allow store employees to edit product information. The product metadata would
be kept in some kind of database, as well as in Solr.
Solr makes it easy to add the capability to search through the online store through the following steps:
1. Define a schema. The schema tells Solr about the contents of documents it will be indexing. In the online
store example, the schema would define fields for the product name, description, price, manufacturer, and
so on. Solr's schema is powerful and flexible and allows you to tailor Solr's behavior to your application.
See Documents, Fields, and Schema Design for all the details.
2. Deploy Solr to your application server.
3. Feed Solr the documents for which your users will search.
4. Expose search functionality in your application.
Because Solr is based on open standards, it is highly extensible. Solr queries are RESTful, which means, in
essence, that a query is a simple HTTP request URL and the response is a structured document: mainly XML,
but it could also be JSON, CSV, or some other format. This means that a wide variety of clients will be able to
use Solr, from other web applications to browser clients, rich client applications, and mobile devices. Any
platform capable of HTTP can talk to Solr. See for details on client APIs.Client APIs
Solr is based on the Apache Lucene project, a high-performance, full-featured search engine. Solr offers support
for the simplest keyword searching through to complex queries on multiple fields and faceted search results.
Searching has more information about searching and queries.
If Solr's capabilities are not impressive enough, its ability to handle very high-volume applications should do the
trick.
A relatively common scenario is that you have so much data, or so many queries, that a single Solr server is
unable to handle your entire workload. In this case, you can scale up the capabilities of your application using
SolrCloud to better distribute the data, and the processing of requests, across many servers. Multiple options can
be mixed and matched depending on the type of scalability you need.
For example: "Sharding" is a scaling technique in which a collection is split into multiple logical pieces called
"shards" in in order to scale up the number of documents in a collection beyond what could physically fit on a
single server. Incoming queries are distributed to every shard in the collection, which respond with merged
results. Another technique available is to increase the "Replication Factor" of your collection, which allows you to
add servers with additional copies of your collection to handle higher concurrent query load by spreading the
requests around to multiple machines. Sharding and Replication are not mutually exclusive, and together make
Solr an extremely powerful and scalable platform.
Best of all, this talk about high-volume applications is not just hypothetical: some of the famous Internet sites that
use Solr today are Macy's, eBay, and Zappos.
For more information, take a look at https://wiki.apache.org/solr/PublicServers.
A Step Closer
You already have some idea of Solr's schema. This section describes Solr's home directory and other
configuration options.
When Solr runs in an application server, it needs access to a home directory. The home directory contains
important configuration information and is the place where Solr will store its index. The layout of the home
directory will look a little different when you are running Solr in standalone mode vs when you are running in
SolrCloud mode.
The crucial parts of the Solr home directory are shown in these examples:
Standalone Mode
<solr-home-directory>/
solr.xml
core_name1/
core.properties
conf/
solrconfig.xml
schema.xml
data/
core_name2/
core.properties
conf/
solrconfig.xml
schema.xml
data/
SolrCloud Mode
<solr-home-directory>/
solr.xml
core_name1/
core.properties
data/
core_name2/
core.properties
data/
You may see other files, but the main ones you need to know are:
solr.xml specifies configuration options for your Solr server instance. For more information on solr.xml,
see Solr Cores and solr.xml.
Per Solr Core:
core.properties defines specific properties for each core such as its name, the collection the
core belongs to, the location of the schema, and other parameters. For more details on
core.properties, see the section Defining core.properties.
solrconfig.xml controls high-level behavior. You can, for example, specify an alternate location
for the data directory. For more information on solrconfig.xml, see Configuring solrconfig.xml.
schema.xml describes the documents you will ask Solr to index. Inside schema.xml, you define
a document as a collection of fields. You get to define both the field types and the fields
themselves. Field type definitions are powerful and include information about how Solr processes
incoming field values and query values. For more information on schema.xml, see Documents,
Fields, and Schema Design.
data/ The directory containing the low level index files.
Note that the SolrCloud example does not include a conf directory for each Solr Core (so there is no
solrconfig.xml or schema.xml). This is because the configuration files usually found in the conf directory are
stored in ZooKeeper so they can be propagated across the cluster.
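As a sketch of how a configuration directory gets into ZooKeeper, the zkcli.sh script (shown in the Solr Start Script Reference below) has an upconfig command; the ZooKeeper address, configuration directory, and configuration name below are placeholders:
$ server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:9983 -cmd upconfig \
  -confdir server/solr/configsets/basic_configs/conf -confname basic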
If you are using SolrCloud with the embedded ZooKeeper instance, you may also see zoo.cfg and zoo.data,
which are ZooKeeper configuration and data files. However, if you are running your own ZooKeeper ensemble,
you would supply your own ZooKeeper configuration file when you start it and the copies in Solr would be
unused. For more information about ZooKeeper and SolrCloud, see the section SolrCloud.
Solr Start Script Reference
Solr includes a script known as "bin/solr" that allows you to start and stop Solr, create and delete collections
or cores, and check the status of Solr and configured shards. You can find the script in the bin/ directory of your
Solr installation. The bin/solr script makes Solr easier to work with by providing simple commands and
options to quickly accomplish common goals.
In this section, the headings below correspond to available commands. For each command, the available options
are described with examples.
More examples of bin/solr in use are available throughout the Solr Reference Guide, but particularly in the
sections Running Solr and Getting Started with SolrCloud.
Starting and Stopping
Start and Restart
Status
Stop
Healthcheck
Collections and Cores
Create
Delete
Starting and Stopping
Start and Restart
The start command starts Solr. The restart command allows you to restart Solr while it is already running or if it
has been stopped already.
The start and restart commands have several options to allow you to run in SolrCloud mode, use an example
configuration set, start with a hostname or port that is not the default and point to a local ZooKeeper ensemble.
bin/solr start [options]
bin/solr start -help
bin/solr restart [options]
bin/solr restart -help
When using the restart command, you must pass all of the parameters you initially passed when you started
Solr. Behind the scenes, a stop request is initiated, so Solr will be stopped before being started again. If no
nodes are already running, restart will skip the step to stop and proceed to starting Solr.
Available Parameters
The bin/solr script provides many options to allow you to customize the server in common ways, such as
changing the listening port. However, the defaults are adequate for most Solr installations, especially
when just getting started.
-a "<string>"
  Start Solr with additional JVM parameters, such as those starting with -X. If you are passing JVM parameters that begin with "-D", you can omit the -a option.
  Example: bin/solr start -a "-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=1044"

-cloud
  Start Solr in SolrCloud mode, which will also launch the embedded ZooKeeper instance included with Solr. This option can be shortened to simply -c.
  If you are already running a ZooKeeper ensemble that you want to use instead of the embedded (single-node) ZooKeeper, you should also pass the -z parameter.
  For more details, see the section SolrCloud Mode below.
  Example: bin/solr start -c

-d <dir>
  Define a server directory, defaults to server (as in, $SOLR_HOME/server). It is uncommon to override this option. When running multiple instances of Solr on the same host, it is more common to use the same server directory for each instance and use a unique Solr home directory using the -s option.
  Example: bin/solr start -d newServerDir

-e <name>
  Start Solr with an example configuration. These examples are provided to help you get started faster with Solr generally, or just try a specific feature. The available options are: cloud, techproducts, dih, and schemaless.
  See the section Running with Example Configurations below for more details on the example configurations.
  Example: bin/solr start -e schemaless

-f
  Start Solr in the foreground; you cannot use this option when running examples with the -e option.
  Example: bin/solr start -f

-h <hostname>
  Start Solr with the defined hostname. If this is not specified, 'localhost' will be assumed.
  Example: bin/solr start -h search.mysolr.com

-m <memory>
  Start Solr with the defined value as the min (-Xms) and max (-Xmx) heap size for the JVM.
  Example: bin/solr start -m 1g

-noprompt
  Start Solr and suppress any prompts that may be seen with another option. This would have the side effect of accepting all defaults implicitly.
  For example, when using the "cloud" example, an interactive session guides you through several options for your SolrCloud cluster. If you want to accept all of the defaults, you can simply add the -noprompt option to your request.
  Example: bin/solr start -e cloud -noprompt

-p <port>
  Start Solr on the defined port. If this is not specified, '8983' will be used.
  Example: bin/solr start -p 8655

-s <dir>
  Sets the solr.solr.home system property; Solr will create core directories under this directory. This allows you to run multiple Solr instances on the same host while reusing the same server directory set using the -d parameter. If set, the specified directory should contain a solr.xml file. The default value is server/solr.
  This parameter is ignored when running examples (-e), as the solr.solr.home depends on which example is run.
  Example: bin/solr start -s newHome

-V
  Start Solr with verbose messages from the start script.
  Example: bin/solr start -V

-z <zkHost>
  Start Solr with the defined ZooKeeper connection string. This option is only used with the -c option, to start Solr in SolrCloud mode. If this option is not provided, Solr will start the embedded ZooKeeper instance and use that instance for SolrCloud operations.
  Example: bin/solr start -c -z server1:2181,server2:2181
To emphasize how the default settings work, take a moment to understand that the following commands are
equivalent:
bin/solr start
bin/solr start -h localhost -p 8983 -d server -s solr -m 512m
It is not necessary to define all of the options when starting if the defaults are fine for your needs.
Setting Java System Properties
The bin/solr script will pass any additional parameters that begin with -D to the JVM, which allows you to set
arbitrary Java system properties. For example, to set the auto soft-commit frequency to 3 seconds, you can do:
bin/solr start -Dsolr.autoSoftCommit.maxTime=3000
SolrCloud Mode
The -c and -cloud options are equivalent:
bin/solr start -c
bin/solr start -cloud
If you specify a ZooKeeper connection string, such as -z 192.168.1.4:2181, then Solr will connect to
ZooKeeper and join the cluster. If you do not specify the -z option when starting Solr in cloud mode, then Solr will
launch an embedded ZooKeeper server listening on the Solr port + 1000, i.e., if Solr is running on port 8983,
then the embedded ZooKeeper will be listening on port 9983.
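For example (a sketch of the port rule described above), starting a cloud node on a non-default port moves the embedded ZooKeeper accordingly:
$ bin/solr start -c -p 8984
# the embedded ZooKeeper for this node listens on port 9984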
IMPORTANT: If your ZooKeeper connection string uses a chroot, such as localhost:2181/solr, then you
need to bootstrap the /solr znode before launching SolrCloud using the bin/solr script. To do this, you need to
use the zkcli.sh script shipped with Solr, such as:
server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:2181/solr -cmd bootstrap
-solrhome server/solr
When starting in SolrCloud mode, the interactive script session will prompt you to choose a configset to use.
For more information about starting Solr in SolrCloud mode, see also the section Getting Started with SolrCloud.
Running with Example Configurations
bin/solr start -e <name>
The example configurations allow you to get started quickly with a configuration that mirrors what you hope to
accomplish with Solr. The following examples are provided:
cloud: This example starts a 1-4 node SolrCloud cluster on a single machine. When chosen, an
interactive session will start to guide you through options to select the number of nodes for your example
cluster, the ports to use, and name of the collection to be created.
techproducts: This example starts Solr in standalone mode with a schema designed for the sample
documents included in the $SOLR_HOME/example/exampledocs directory. The configset used does
not have SolrCloud or schemaless modes enabled, so fields must be explicitly defined in schema.xml in
order for documents including those fields to be added to the index. The configset used can be found in
$SOLR_HOME/server/solr/configsets/sample_techproducts_configs.
dih: This example starts Solr in standalone mode with the DataImportHandler (DIH) enabled and several
example dataconfig.xml files pre-configured for different types of data supported with DIH (such as
database contents, email, RSS feeds, etc.). For more information about DIH, see the section Uploading
Structured Data Store Data with the Data Import Handler.
schemaless: This example starts Solr in standalone mode using a managed schema, as described in the
section Managed Schema Definition in SolrConfig, and provides a very minimal pre-defined schema. Solr
will run in Schemaless Mode with this configuration, where Solr will create fields in the schema on the fly
and will guess field types used in incoming documents. The configset used can be found in
$SOLR_HOME/server/solr/configsets/data_driven_schema_configs.
Note: The run-in-foreground option (-f) does not work with the -e option since the script needs to perform
additional tasks after starting the Solr server.
Status
The status command displays basic JSON-formatted information for any Solr nodes found running on the local
system. The status command uses the SOLR_PID_DIR environment variable to locate Solr process ID files to
find running Solr instances; the SOLR_PID_DIR variable defaults to the bin directory.
bin/solr status
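If your installation keeps its PID files somewhere other than the bin directory, the SOLR_PID_DIR variable can be overridden for a single invocation (a sketch; the directory shown is a placeholder):
$ SOLR_PID_DIR=/var/run/solr bin/solr status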
The output will include a status of each node of the cluster, as in this example:
Found 2 Solr nodes:
Solr process 39920 running on port 7574
{
"solr_home":"/Applications/Solr/solr-5.0.0/example/cloud/node2/solr/",
"version":"5.0.0 1658469 - anshumgupta - 2015-02-09 09:54:36",
"startTime":"2015-02-10T17:19:54.739Z",
"uptime":"1 days, 23 hours, 55 minutes, 48 seconds",
"memory":"77.2 MB (%15.7) of 490.7 MB",
"cloud":{
"ZooKeeper":"localhost:9865",
"liveNodes":"2",
"collections":"2"}}
Solr process 39827 running on port 8865
{
"solr_home":"/Applications/Solr/solr-5.0.0/example/cloud/node1/solr/",
"version":"5.0.0 1658469 - anshumgupta - 2015-02-09 09:54:36",
"startTime":"2015-02-10T17:19:49.057Z",
"uptime":"1 days, 23 hours, 55 minutes, 54 seconds",
"memory":"94.2 MB (%19.2) of 490.7 MB",
"cloud":{
"ZooKeeper":"localhost:9865",
"liveNodes":"2",
"collections":"2"}}
Stop
The stop command sends a STOP request to a running Solr node, which allows it to shut down gracefully. The
command will wait up to 5 seconds for Solr to stop gracefully and then will forcefully kill the process (kill -9).
bin/solr stop [options]
bin/solr stop -help
Available Parameters
-p <port>
  Stop Solr running on the given port. If you are running more than one instance, or are running in SolrCloud mode, you either need to specify the ports in separate requests or use the -all option.
  Example: bin/solr stop -p 8983

-all
  Stop all running Solr instances that have a valid PID.
  Example: bin/solr stop -all

-k <key>
  Stop key used to protect from stopping Solr inadvertently; default is "solrrocks".
  Example: bin/solr stop -k solrrocks
Healthcheck
The healthcheck command generates a JSON-formatted health report for a collection when running in SolrCloud
mode. The health report provides information about the state of every replica for all shards in a collection,
including the number of committed documents and its current state.
bin/solr healthcheck [options]
bin/solr healthcheck -help
Available Parameters
-c <collection>
  Name of the collection to run a healthcheck against (required).
  Example: bin/solr healthcheck -c gettingstarted

-z <zkhost>
  ZooKeeper connection string, defaults to localhost:9983. If you are running Solr on a port other than 8983, you will have to specify the ZooKeeper connection string. By default, this will be the Solr port + 1000.
  Example: bin/solr healthcheck -z localhost:2181
Below is an example healthcheck request and response using a non-standard ZooKeeper connect string, with 2
nodes running:
$ bin/solr healthcheck -c gettingstarted -z localhost:9865
{
"collection":"gettingstarted",
"status":"healthy",
"numDocs":0,
"numShards":2,
"shards":[
{
"shard":"shard1",
"status":"healthy",
"replicas":[
{
"name":"core_node1",
"url":"http://10.0.1.10:8865/solr/gettingstarted_shard1_replica2/",
"numDocs":0,
"status":"active",
"uptime":"2 days, 1 hours, 18 minutes, 48 seconds",
"memory":"25.6 MB (%5.2) of 490.7 MB",
"leader":true},
{
"name":"core_node4",
"url":"http://10.0.1.10:7574/solr/gettingstarted_shard1_replica1/",
"numDocs":0,
"status":"active",
"uptime":"2 days, 1 hours, 18 minutes, 42 seconds",
"memory":"95.3 MB (%19.4) of 490.7 MB"}]},
{
"shard":"shard2",
"status":"healthy",
"replicas":[
{
"name":"core_node2",
"url":"http://10.0.1.10:8865/solr/gettingstarted_shard2_replica2/",
"numDocs":0,
"status":"active",
"uptime":"2 days, 1 hours, 18 minutes, 48 seconds",
"memory":"25.8 MB (%5.3) of 490.7 MB"},
{
"name":"core_node3",
"url":"http://10.0.1.10:7574/solr/gettingstarted_shard2_replica1/",
"numDocs":0,
"status":"active",
"uptime":"2 days, 1 hours, 18 minutes, 42 seconds",
"memory":"95.4 MB (%19.4) of 490.7 MB",
"leader":true}]}]}
Collections and Cores
The bin/solr script can also help you create new collections (in SolrCloud mode) or cores (in standalone mode),
or delete collections.
Create
The create command detects the mode that Solr is running in (standalone or SolrCloud) and then creates a core
or collection depending on the mode.
bin/solr create options
bin/solr create -help
Available Parameters
-c <name>
  Name of the core or collection to create (required).
  Example: bin/solr create -c mycollection

-d <confdir>
  The configuration directory. This defaults to data_driven_schema_configs.
  See the section Configuration Directories and SolrCloud below for more details about this option when running in SolrCloud mode.
  Example: bin/solr create -d basic_configs

-n <configName>
  The configuration name. This defaults to the same name as the core or collection.
  Example: bin/solr create -n basic

-p <port>
  Port of a local Solr instance to send the create command to; by default the script tries to detect the port by looking for running Solr instances.
  This option is useful if you are running multiple standalone Solr instances on the same host, thus requiring you to be specific about which instance to create the core in.
  Example: bin/solr create -p 8983

-s <shards> or -shards
  Number of shards to split a collection into, default is 1; only applies when Solr is running in SolrCloud mode.
  Example: bin/solr create -s 2

-rf <replicas> or -replicationFactor
  Number of copies of each document in the collection. The default is 1 (no replication).
  Example: bin/solr create -rf 2
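Putting several of these options together, a two-shard collection with two replicas per shard could be created in SolrCloud mode like this (a sketch; the collection name is a placeholder):
$ bin/solr create -c mycollection -d data_driven_schema_configs -s 2 -rf 2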
Configuration Directories and SolrCloud
Before creating a collection in SolrCloud, the configuration directory used by the collection must be uploaded to
ZooKeeper. The create command supports several use cases for how collections and configuration directories
work. The main decision you need to make is whether a configuration directory in ZooKeeper should be shared
across multiple collections. Let's work through a few examples to illustrate how configuration directories work in
SolrCloud.
First, if you don't provide the -d or -n options, then the default configuration
($SOLR_HOME/server/solr/configsets/data_driven_schema_configs/conf) is uploaded to ZooKeeper using the
same name as the collection. For example, the following command will result in the data_driven_schema_configs
configuration being uploaded to /configs/contacts in ZooKeeper: bin/solr create -c contacts. If you create
another collection by doing bin/solr create -c contacts2, then another copy of the data_driven_schema_configs
directory will be uploaded to ZooKeeper under /configs/contacts2. Any changes you make
to the configuration for the contacts collection will not affect the contacts2 collection. Put simply, the default
behavior creates a unique copy of the configuration directory for each collection you create.
You can override the name given to the configuration directory in ZooKeeper by using the -n option. For
instance, the command bin/solr create -c logs -d basic_configs -n basic will upload the
server/solr/configsets/basic_configs/conf directory to ZooKeeper as /configs/basic.
Notice that we used the -d option to specify a different configuration than the default. Solr provides several
built-in configurations under server/solr/configsets. However, you can also provide the path to your own
configuration directory using the -d option. For instance, the command bin/solr create -c mycoll -d
/tmp/myconfigs will upload /tmp/myconfigs into ZooKeeper under /configs/mycoll. To reiterate, the
configuration directory is named after the collection unless you override it using the -n option.
Other collections can share the same configuration by specifying the name of the shared configuration using the
-n option. For instance, the following command will create a new collection that shares the basic configuration
created previously: bin/solr create -c logs2 -n basic.
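As a sketch, the commands from the preceding paragraphs can be run in sequence to upload a named configuration and then create a second collection that shares it:
$ bin/solr create -c logs -d basic_configs -n basic
$ bin/solr create -c logs2 -n basic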
Data-driven schema and shared configurations
The data_driven_schema_configs schema can mutate as data is indexed. Consequently, we recommend that
you do not share data-driven configurations between collections unless you are certain that all collections should
inherit the changes made when indexing data into one of the collections.
Delete
The delete command detects the mode that Solr is running in (standalone or SolrCloud) and then deletes the
specified core (standalone) or collection (SolrCloud) as appropriate.
bin/solr delete [options]
bin/solr delete -help
If running in SolrCloud mode, the delete command checks if the configuration directory used by the collection
you are deleting is being used by other collections. If not, then the configuration directory is also deleted from
ZooKeeper. For example, if you created a collection by doing bin/solr create -c contacts, then the
delete command bin/solr delete -c contacts will check to see if the /configs/contacts configuration
directory is being used by any other collections. If not, then the /configs/contacts directory is removed
from ZooKeeper.
Available Parameters
-c <name>
  Name of the core / collection to delete (required).
  Example: bin/solr delete -c mycoll

-deleteConfig <true|false>
  Delete the configuration directory from ZooKeeper. The default is true.
  If the configuration directory is being used by another collection, then it will not be deleted even if you pass -deleteConfig true.
  Example: bin/solr delete -deleteConfig false

-p <port>
  The port of a local Solr instance to send the delete command to. By default the script tries to detect the port by looking for running Solr instances.
  This option is useful if you are running multiple standalone Solr instances on the same host, thus requiring you to be specific about which instance to delete the core from.
  Example: bin/solr delete -p 8983
Upgrading Solr
If you are already using Solr 5.0, Solr 5.1 should not present any major problems. However, you should review
the CHANGES.txt file found in your Solr package for changes and updates that may affect your existing
implementation.
Upgrading from 5.0.x
SolrClient query functions now declare themselves as throwing SolrServerException in addition to
IOException, to bring them in line with the update functions.
SolrRequest.process() is now final. Subclasses should instead be parameterized by their
corresponding SolrResponse type, and implement createResponse().
The signature of SolrDispatchFilter.createCoreContainer() has changed to take (String, Properties)
arguments.
Deprecated the 'lib' option added to create-requesthandler as part of SOLR-6801 in the 5.0 release.
Please use the add-runtimelib command.
Tika's runtime dependency of 'jhighlight' was removed as the latter was found to contain some LGPL-only
code. Until that's resolved by Tika, you can download the jar yourself and place it under
contrib/extraction/lib.
The catch-all _text field in data_driven_schema_configs has been renamed to _text_.
Upgrading from Older Versions of Solr
Users upgrading from older versions are strongly encouraged to consult CHANGES.txt for the details of all
changes since the version they are upgrading from.
For users upgrading from Solr 4.x, a summary of the significant changes can be found in the Major Changes
from Solr 4 to Solr 5 section.
Using the Solr Administration User Interface
This section discusses the Solr Administration User Interface ("Admin UI").
The Overview of the Solr Admin UI explains the basic features of the user interface, what's on the initial Admin UI
page, and how to configure the interface. In addition, there are pages describing each screen of the Admin UI:
Getting Assistance shows you how to get more information about the UI.
Logging explains the various logging levels available and how to invoke them.
Cloud Screens display information about nodes when running in SolrCloud mode.
Core Admin explains how to get management information about each core.
Java Properties shows the Java information about each core.
Thread Dump lets you see detailed information about each thread, along with state information.
Core-Specific Tools is a section explaining additional screens available for each named core.
Analysis - lets you analyze the data found in specific fields.
Dataimport - shows you information about the current status of the Data Import Handler.
Documents - provides a simple form allowing you to execute various Solr indexing commands
directly from the browser.
Files - shows the current core configuration files such as solrconfig.xml and schema.xml.
Ping - lets you ping a named core and determine whether the core is active.
Plugins/Stats - shows statistics for plugins and other installed components.
Query - lets you submit a structured query about various elements of a core.
Replication - shows you the current replication status for the core, and lets you enable/disable
replication.
Schema Browser - displays schema data in a browser window.
Segments Info - Provides a visualization of the underlying Lucene index segments.
Overview of the Solr Admin UI
Solr features a Web interface that makes it easy for Solr administrators and programmers to view Solr
configuration details, run queries and analyze document fields in order to fine-tune a Solr configuration, and
access online documentation and other help.
Accessing the URL http://hostname:8983/solr/ will show the main dashboard, which is divided into two
parts.
The left side of the screen is a menu under the Solr logo that provides navigation through the screens of the UI.
The first set of links are for system-level information and configuration and provide access to Logging, Core
Admin and Java Properties, among other things. At the end of this information is a list of Solr cores configured
for this instance. Clicking on a core name shows a secondary menu of information and configuration options for
the core specifically. Items in this list include the Schema, Config, Plugins, and an ability to perform Queries on
indexed data.
The center of the screen shows the detail of the option selected. This may include a sub-navigation for the option
or text or graphical representation of the requested data. See the sections in this guide for each screen for more
details.
Under the covers, the Solr Admin UI re-uses the same HTTP APIs available to all clients to access Solr-related
data to drive an external interface.
Configuring the Admin UI in solrconfig.xml
You can configure the Solr Admin UI by editing the file solrconfig.xml.
The <admin> block in the solrconfig.xml file determines the default query to be displayed in the Query
section of the core-specific pages. The default is *:*, which is to find all documents. In this example, we have
changed the default to the term solr.
<admin>
<defaultQuery>solr</defaultQuery>
</admin>
Related Topics
Configuring solrconfig.xml
Getting Assistance
At the bottom of each screen of the Admin UI is a set of links that can be used to get more assistance with
configuring and using Solr.
Assistance icons
These icons include the following links.
The path to the Solr Admin UI given above is http://hostname:port/solr, which redirects to
http://hostname:port/solr/#/ in the current version. A convenience redirect is also supported, so
simply accessing the Admin UI at http://hostname:port/ will also redirect to
http://hostname:port/solr/#/.
Documentation: Navigates to the Apache Solr documentation hosted on http://lucene.apache.org/solr/.
Issue Tracker: Navigates to the JIRA issue tracking server for the Apache Solr project. This server resides
at http://issues.apache.org/jira/browse/SOLR.
IRC Channel: Navigates to an Apache Wiki page describing how to join Solr's IRC live-chat room:
https://wiki.apache.org/solr/IRCChannels.
Community forum: Navigates to the Apache Solr web page http://lucene.apache.org/solr/resources.html#community,
which has further information about ways to engage.
Solr Query Syntax: Navigates to the section "Query Syntax and Parsing" in this reference guide.
These links cannot be modified without editing the admin.html in the solr.war that contains the Admin UI
files.
Logging
The Logging page shows messages from Solr's log files.
When you click the link for "Logging", a page similar to the one below will be displayed:
The Main Logging Screen, including an example of an error due to a bad document sent by a client
While this example shows logged messages for only one core, if you have multiple cores in a single instance,
they will each be listed, with the level for each.
Selecting a Logging Level
When you select the Level link on the left, you see the hierarchy of classpaths and classnames for your
instance. A row highlighted in yellow indicates that the class has logging capabilities. Click on a highlighted row,
and a menu will appear to allow you to change the log level for that class. Characters in boldface indicate that
the class will not be affected by level changes to root.
For an explanation of the various logging levels, see Configuring Logging.
Cloud Screens
When running in SolrCloud mode, an option will appear in the Admin UI between Logging and Core Admin for
Cloud. It's not possible at the current time to manage the nodes of the SolrCloud cluster from the Admin UI, but
you can view them and open the Solr Admin UI on each node to view the status and statistics for the node and
each core on each node.
Click on the Cloud option in the left-hand navigation, and a small sub-menu appears with options called "Tree",
"Graph", "Graph (Radial)" and "Dump". The default view ("Graph") shows a graph of each collection, the shards
that make up those collections, and the addresses of each replica for each shard. This example shows the very
simple two-node, two-shard, two-replica cluster you can get running the "bin/solr -e cloud" example:
Only Visible When using SolrCloud
The "Cloud" menu option is only available on Solr instances running in SolrCloud mode. Single node or
master/slave replication instances of Solr will not display this option.
The "Graph (Radial)" option provides a different visual view of each node. Using the same example cluster, the
radial graph view looks like:
The "Tree" option shows a directory structure of the files in ZooKeeper, including ,clusterstate.json
configuration files, and other status and information files. In this example, we show the leader definition for
"shard1" in the "gettingstarted" collection:
The final option is "Dump", which allows you to download an XML file with all the ZooKeeper configuration files.
Core Admin
The Core Admin screen lets you manage your cores.
The buttons at the top of the screen let you add a new core, unload the core displayed, rename the currently
displayed core, swap the existing core with one that you specify in a drop-down box, reload the current core, and
optimize the current core.
The main display and available actions correspond to the commands used with the CoreAdminHandler, but
provide another way of working with your cores.
Java Properties
The Java Properties screen provides easy access to one of the most essential components of a top-performing
Solr system. With the Java Properties screen, you can see all the properties of the JVM running Solr, including
the class paths, file encodings, JVM memory settings, operating system, and more.
Thread Dump
The Thread Dump screen lets you inspect the currently active threads on your server. Each thread is listed and
access to the stacktraces is available where applicable. Icons to the left indicate the state of the thread: for
example, threads with a green check-mark in a green circle are in a "RUNNABLE" state. On the right of the
thread name, a down-arrow means you can expand to see the stacktrace for that thread.
When you move your cursor over a thread name, a box floats over the name with the state for that thread.
Thread states can be:
NEW: A thread that has not yet started.
RUNNABLE: A thread executing in the Java virtual machine.
BLOCKED: A thread that is blocked waiting for a monitor lock.
WAITING: A thread that is waiting indefinitely for another thread to perform a particular action.
TIMED_WAITING: A thread that is waiting for another thread to perform an action for up to a specified
waiting time.
TERMINATED: A thread that has exited.
When you click on one of the threads that can be expanded, you'll see the stacktrace, as in the example below:
Inspecting a thread
You can also check the Show all Stacktraces button to automatically enable expansion for all threads.
Core-Specific Tools
In the left-hand navigation bar, you will see a pull-down menu titled "Core Selector". Clicking on the menu will
show a list of Solr cores, with a search box that can be used to find a specific core (handy if you have a lot of
cores). When you select a core, a secondary menu opens under the core name with the administration options
available for that particular core.
After selecting the core, the central part of the screen shows Statistics and other information about the core you
chose. You can define a file called admin-extra.html that includes links or other information you would like to
display in the "Admin Extra" part of this main screen.
On the left side, under the core name, are links to other screens that display information or provide options for
the specific core chosen. The core-specific options are listed below, with a link to the section of this Guide to find
out more:
Analysis - lets you analyze the data found in specific fields.
Dataimport - shows you information about the current status of the Data Import Handler.
Documents - provides a simple form allowing you to execute various Solr indexing commands directly
from the browser.
Files - shows the current core configuration files such as solrconfig.xml and schema.xml.
Ping - lets you ping a named core and determine whether the core is active.
Plugins/Stats - shows statistics for plugins and other installed components.
Query - lets you submit a structured query about various elements of a core.
Replication - shows you the current replication status for the core, and lets you enable/disable replication.
Schema Browser - displays schema data in a browser window.
Segments Info - Provides a visualization of the underlying Lucene index segments.
Analysis Screen
The Analysis screen lets you inspect how data will be handled according to the field, field type and dynamic rule
configurations found in schema.xml. You can analyze how content would be handled during indexing or during
query processing and view the results separately or at the same time. Ideally, you would want content to be
handled consistently, and this screen allows you to validate the settings in the field type or field analysis chains.
Enter content in one or both boxes at the top of the screen, and then choose the field or field type definitions to
use for analysis.
If you click the Verbose Output check box, you see more information, including transformations to the input
(such as converting to lower case, stripping extra characters, etc.) and the bytes, type and detailed position information.
The information displayed will vary depending on the settings of the field or field type. Each step of the process is
displayed in a separate section, with an abbreviation for the tokenizer or filter that is applied in that step. Hover or
click on the abbreviation, and you'll see the name and path of the tokenizer or filter.
In the example screenshot above, several transformations are applied to the input "Running is a sport." The words
"is" and "a" have been removed and the word "running" has been changed to its basic form, "run". This is
because we are using the text_en field type in this scenario, which is configured to remove stop words (small
words that usually do not provide a great deal of context) and "stem" terms when possible to find more possible
matches (this is particularly helpful with plural forms of words). If you click the question mark next to the Analyze
Fieldname/Field Type pull-down menu, the Schema Browser window will open, showing you the settings for the
field specified.
The Understanding Analyzers, Tokenizers, and Filters section describes in detail what each option is and how it
may transform your data, and the Running Your Analyzer section has specific examples for using the Analysis
screen.
Dataimport Screen
The Dataimport screen shows the configuration of the DataImportHandler (DIH) and allows you to start, and
monitor the status of, import commands as defined by the options selected on the screen and defined in the
configuration file.
This screen also lets you adjust various options to control how the data is imported to Solr, and view the data
import configuration file that controls the import. For more information about data importing with DIH, see the
section on Uploading Structured Data Store Data with the Data Import Handler.
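As a rough sketch only, a minimal DIH configuration file for a JDBC data source might look like the following; the driver, connection URL, entity query, and column names here are illustrative placeholders rather than part of any shipped example:
<dataConfig>
  <dataSource type="JdbcDataSource" driver="org.hsqldb.jdbcDriver"
              url="jdbc:hsqldb:./example-DIH/hsqldb/ex" user="sa"/>
  <document>
    <!-- each row returned by the query becomes one Solr document -->
    <entity name="item" query="SELECT ID, NAME FROM ITEM">
      <field column="ID" name="id"/>
      <field column="NAME" name="name"/>
    </entity>
  </document>
</dataConfig>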
Documents Screen
The Documents screen provides a simple form allowing you to execute various Solr indexing commands in a
variety of formats directly from the browser.
The screen allows you to:
Copy documents in JSON, CSV or XML and submit them to the index
Upload documents (in JSON, CSV or XML)
Construct documents by selecting fields and field values
The first step is to define the RequestHandler to use (aka, 'qt'). By default /update will be defined. To use Solr
Cell, for example, change the request handler to /update/extract.
Then choose the Document Type to define the type of document to load. The remaining parameters will change
depending on the document type selected.
JSON
When using the JSON document type, the functionality is similar to using a requestHandler on the command line.
Instead of putting the documents in a curl command, they can instead be input into the Document entry box. The
document structure should still be in proper JSON format.
Then you can choose when documents should be added to the index (Commit Within), whether existing
documents should be overwritten with incoming documents with the same id (if this is not true, then the incoming
documents will be dropped), and, finally, if a document boost should be applied.
This option will only add or overwrite documents to the index; for other update tasks, see the Solr Command
option.
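For example, a small batch of documents pasted into the Document(s) box might look like the following sketch; the id, name, and price field names are only assumptions and would need to exist in your schema:
[
  {"id": "book1", "name": "A Storm of Swords", "price": 7.99},
  {"id": "book2", "name": "The Black Company", "price": 6.99}
]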
CSV
When using the CSV document type, the functionality is similar to using a requestHandler on the command line.
Instead of putting the documents in a curl command, they can instead be input into the Document entry box. The
document structure should still be in proper CSV format, with columns delimited and one row per document.
Then you can choose when documents should be added to the index (Commit Within), and whether existing
documents should be overwritten with incoming documents with the same id (if this is not true, then the incoming
documents will be dropped).
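As an illustration, CSV content entered in the Document(s) box might look like this sketch, where the first line names the columns (assumed to match fields in your schema) and each following line is one document:
id,name,price
book1,A Storm of Swords,7.99
book2,The Black Company,6.99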
Document Builder
The Document Builder provides a wizard-like interface to enter fields of a document.
File Upload
The File Upload option allows choosing a prepared file and uploading it. If using only /update for the
Request-Handler option, you will be limited to XML, CSV, and JSON.
However, to use the ExtractingRequestHandler (aka Solr Cell), you can modify the Request-Handler to
/update/extract. You must have this defined in your solrconfig.xml file, with your desired defaults. You should
also update the &literal.id shown in the Extracting Req. Handler Params so the file chosen is given a
unique id.
Then you can choose when documents should be added to the index (Commit Within), and whether existing
documents should be overwritten with incoming documents with the same id (if this is not true, then the incoming
documents will be dropped).
Solr Command
The Solr Command option allows you to use XML or JSON to perform specific actions on documents, such as
defining documents to be added or deleted, updating only certain fields of documents, or commit and optimize
commands on the index.
The documents should be structured as they would be if using /update on the command line.
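As a hedged sketch, a delete-by-query command in XML could look like the following; a separate <commit/> command could then be issued to make the deletion visible (the query shown is purely illustrative):
<delete>
  <query>category:discontinued</query>
</delete>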
XML
When using the XML document type, the functionality is similar to using a requestHandler on the command line.
Instead of putting the documents in a curl command, they can instead be input into the Document entry box. The
document structure should still be in proper Solr XML format, with each document separated by <doc> tags and
each field defined.
Then you can choose when documents should be added to the index (Commit Within), and whether existing
documents should be overwritten with incoming documents with the same id (if this is not true, then the incoming
documents will be dropped).
This option will only add or overwrite documents to the index; for other update tasks, see the Solr Command
option.
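For instance, a single document in Solr XML format entered in the Document(s) box might look like this sketch (the field names are assumptions that would need to match your schema):
<add>
  <doc>
    <field name="id">book1</field>
    <field name="name">A Storm of Swords</field>
    <field name="price">7.99</field>
  </doc>
</add>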
Related Topics
Uploading Data with Index Handlers
Uploading Data with Solr Cell using Apache Tika
Files Screen
The Files screen lets you browse & view the various configuration files (such as solrconfig.xml and
schema.xml) for the core you selected.
While solrconfig.xml defines the behaviour of Solr as it indexes content and responds to queries, schema.xml
allows you to define the types of data in your content (field types), the fields your documents will be
broken into, and any dynamic fields that should be generated based on patterns of field names in the incoming
documents. Any other configuration files are used depending on how they are referenced in either solrconfig.xml
or schema.xml.
Configuration files cannot be edited with this screen, so a text editor of some kind must be used.
This screen is related to the Schema Browser Screen, in that they both can display information from the schema,
but the Schema Browser provides a way to drill into the analysis chain and displays linkages between field types,
fields, and dynamic field rules.
Many of the options defined in solrconfig.xml and schema.xml are described throughout the rest of this
Guide. In particular, you will want to review these sections:
Indexing and Basic Data Operations
Searching
The Well-Configured Solr Instance
Documents, Fields, and Schema Design
Ping
Choosing Ping under a core name issues a ping request to check whether a server is up.
Ping is configured using a requestHandler in the solrconfig.xml file:
<!-- ping/healthcheck -->
<requestHandler name="/admin/ping" class="solr.PingRequestHandler">
<lst name="invariants">
<str name="q">solrpingquery</str>
</lst>
<lst name="defaults">
<str name="echoParams">all</str>
</lst>
<!-- An optional feature of the PingRequestHandler is to configure the
handler with a "healthcheckFile" which can be used to enable/disable
the PingRequestHandler.
relative paths are resolved against the data dir
-->
<!-- <str name="healthcheckFile">server-enabled.txt</str> -->
</requestHandler>
The Ping option doesn't open a page, but the status of the request can be seen on the core overview page
shown when clicking on a collection name. The length of time the request has taken is displayed next to the Ping
option, in milliseconds.
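The same check can also be made directly over HTTP; for instance, assuming the "gettingstarted" collection used earlier and the handler definition above, a request along these lines should return a small response reporting a status of "OK" when the core is healthy:
http://localhost:8983/solr/gettingstarted/admin/ping?wt=json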
Plugins & Stats Screen
The Plugins screen shows information and statistics about Solr's status and performance. You can find
information about the performance of Solr's caches, the state of Solr's searchers, and the configuration of
searchHandlers and requestHandlers.
Choose an area of interest on the right, and then drill down into more specifics by clicking on one of the names
that appear in the central part of the window. In this example, we've chosen to look at the Searcher stats, from
the Core area:
Searcher Statistics
The display is a snapshot taken when the page is loaded. You can get updated status by choosing either Watch
Changes or Refresh Values. Watching the changes will highlight those areas that have changed, while
refreshing the values will reload the page with updated information.
Query Screen
You can use , shown under the name of each core, to submit a search query to a Solr server and analyzeQuery
the results. In the example in the screenshot, a query has been submitted, and the screen shows the query
results sent to the browser as JSON.
The query was sent to a core named "techproducts". We used Solr's default query for this screen (as defined in
solrconfig.xml), which is *:*. This query will find all records in the index for this core. We kept the other
defaults, but the table below explains these options, which are also covered in detail in later parts of this Guide.
The response is shown to the right of the form. Requests to Solr are simply HTTP requests, and the query
submitted is shown in light type above the results; if you click on this it will open a new browser window with just
this request and response (without the rest of the Solr Admin UI). The rest of the response is shown in JSON,
which is part of the request (see the wt=json part at the end).
The response has at least two sections, but may have several more depending on the options chosen. The two
sections it always has are the responseHeader and the response. The responseHeader includes the status
of the search (status), the processing time (QTime), and the parameters (params) that were used to process
the query.
The response includes the documents that matched the query, in doc sub-sections. The fields returned depend
on the parameters of the query (and the defaults of the request handler used). The number of results is also
included in this section.
This screen allows you to experiment with different query options, and inspect how your documents were
indexed. The query parameters available on the form are some basic options that most users want to have
available, but there are dozens more available which could be simply added to the basic request by hand (if
opened in a browser). The table below explains the parameters available:
Field Description
Request-handler
(qt)
Specifies the query handler for the request. If a query handler is not specified, Solr
processes the response with the standard query handler.
q The query event. See Searching for an explanation of this parameter.
fq The filter queries. See Common Query Parameters for more information on this parameter.
sort Sorts the response to a query in either ascending or descending order based on the
response's score or another specified characteristic.
start, rows start is the offset into the query result starting at which documents should be returned.
The default value is 0, meaning that the query should return results starting with the first
document that matches. This field accepts the same syntax as the start query parameter,
which is described in Searching. rows is the number of rows to return.
fl Defines the fields to return for each document. You can explicitly list the stored fields,
functions, and doc transformers you want to have returned by separating them with either a
comma or a space.
wt Specifies the Response Writer to be used to format the query response. Defaults to XML if
not specified.
indent Click this button to request that the Response Writer use indentation to make the
responses more readable.
debugQuery Click this button to augment the query response with debugging information, including
"explain info" for each document returned. This debugging information is intended to be
intelligible to the administrator or programmer.
dismax Click this button to enable the DisMax query parser. See The DisMax Query Parser for
further information.
edismax Click this button to enable the Extended query parser. See The Extended DisMax Query
Parser for further information.
hl Click this button to enable highlighting in the query response. See Highlighting for more
information.
facet Enables faceting, the arrangement of search results into categories based on indexed
terms. See Faceting for more information.
spatial Click to enable using location data for use in spatial or geospatial searches. See Spatial
Search for more information.
spellcheck Click this button to enable the Spellchecker, which provides inline query suggestions based
on other, similar, terms. See Spell Checking for more information.
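For reference, the same kind of request the form builds can also be typed by hand; assuming the "techproducts" core used in the example above, a query for all documents with an indented JSON response might look like this:
http://localhost:8983/solr/techproducts/select?q=*:*&wt=json&indent=true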
Related Topics
Searching
Replication Screen
The Replication screen shows you the current replication state for the named core you have specified. SolrCloud
has supplanted much of this functionality, but if you are still using Master-Slave index replication, you can use
this screen to:
View the replicatable index state (on a master node).
View the current replication status (on a slave node).
Disable replication (on a master node).
More details on how to configure replication are available in the section called Index Replication.
Schema Browser Screen
The Schema Browser screen lets you see schema data in a browser window. If you have accessed this window
from the Analysis screen, it will be opened to a specific field, dynamic field rule or field type. If there is nothing
chosen, use the pull-down menu to choose the field or field type.
Caution When Using SolrCloud
When using SolrCloud, do not attempt to disable replication via this screen.
The screen provides a great deal of useful information about each particular field. In the example above, we
have chosen the text field. On the right side of the center window, we see the field name, and a list of fields that
populate this field because they are defined to be copied to the text field. Click on one of those field names,
and you can see the definitions for that field. We can also see the field type, which would allow us to inspect the
type definitions as well.
In the left part of the center window, we see the field type again, and the defined properties for the field. We can
also see how many documents have populated this field. Then we see the analyzer used for indexing and query
processing. Click the icon to the left of either of those, and you'll see the definitions for the tokenizers and/or
filters that are used. The output of these processes is the information you see when testing how content is
handled for a particular field with the Analysis Screen.
Under the analyzer information is a button to Load Term Info. Clicking that button will show the top N terms that
are in the index for that field. Click on a term, and you will be taken to the Query Screen to see the results of a
query of that term in that field. If you want to always see the term information for a field, choose Autoload and it
will always appear when there are terms for a field. A histogram shows the number of terms with a given
frequency in the field.
Segments Info
The Segments Info screen lets you see a visualization of the various segments in the underlying Lucene index,
with information about the size of each segment – both in bytes and in number of documents – as well as other
basic metadata about those segments, notably the number of deleted documents.
This information may be useful for people to help make decisions about the optimal merge settings for their data.
Documents, Fields, and Schema Design
This section discusses how Solr organizes its data into documents and fields, as well as how to work with the
Solr schema file, schema.xml. It includes the following topics:
Overview of Documents, Fields, and Schema Design: An introduction to the concepts covered in this section.
Solr Field Types: Detailed information about field types in Solr, including the field types in the default Solr
schema.
Defining Fields: Describes how to define fields in Solr.
Copying Fields: Describes how to populate fields with data copied from another field.
Dynamic Fields: Information about using dynamic fields in order to catch and index fields that do not exactly
conform to other field definitions in your schema.
Schema API: Use curl commands to read various parts of a schema or create new fields and copyField rules.
Other Schema Elements: Describes other important elements in the Solr schema.
Putting the Pieces Together: A higher-level view of the Solr schema and how its elements work together.
DocValues: Describes how to create a docValues index for faster lookups.
Schemaless Mode: Automatically add previously unknown schema fields using value-based field type guessing.
Overview of Documents, Fields, and Schema Design
The fundamental premise of Solr is simple. You give it a lot of information, then later you can ask it questions
and find the piece of information you want. The part where you feed in all the information is called indexing or
updating. When you ask a question, it's called a query.
One way to understand how Solr works is to think of a loose-leaf book of recipes. Every time you add a recipe to
the book, you update the index at the back. You list each ingredient and the page number of the recipe you just
added. Suppose you add one hundred recipes. Using the index, you can very quickly find all the recipes that use
garbanzo beans, or artichokes, or coffee, as an ingredient. Using the index is much faster than looking through
each recipe one by one. Imagine a book of one thousand recipes, or one million.
Solr allows you to build an index with many different fields, or types of entries. The example above shows how to
build an index with just one field, ingredients. You could have other fields in the index for the recipe's cooking
style, like Asian, Cajun, or vegan, and you could have an index field for preparation times. Solr can answer
questions like "What Cajun-style recipes that have blood oranges as an ingredient can be prepared in fewer than
30 minutes?"
The schema is the place where you tell Solr how it should build indexes from input documents.
How Solr Sees the World
Solr's basic unit of information is a document, which is a set of data that describes something. A recipe document
would contain the ingredients, the instructions, the preparation time, the cooking time, the tools needed, and so
on. A document about a person, for example, might contain the person's name, biography, favorite color, and
shoe size. A document about a book could contain the title, author, year of publication, number of pages, and so
on.
In the Solr universe, documents are composed of fields, which are more specific pieces of information. Shoe size
could be a field. First name and last name could be fields.
Fields can contain different kinds of data. A name field, for example, is text (character data). A shoe size field
might be a floating point number so that it could contain values like 6 and 9.5. Obviously, the definition of fields is
flexible (you could define a shoe size field as a text field rather than a floating point number, for example), but if
you define your fields correctly, Solr will be able to interpret them correctly and your users will get better results
when they perform a query.
You can tell Solr about the kind of data a field contains by specifying its field type. The field type tells Solr how to
interpret the field and how it can be queried.
When you add a document, Solr takes the information in the document's fields and adds that information to an
index. When you perform a query, Solr can quickly consult the index and return the matching documents.
Field Analysis
Field analysis tells Solr what to do with incoming data when building an index. A more accurate name for this
process would be processing or even digestion, but the official name is analysis.
Consider, for example, a biography field in a person document. Every word of the biography must be indexed so
that you can quickly find people whose lives have had anything to do with ketchup, or dragonflies, or
cryptography.
However, a biography will likely contain lots of words you don't care about and don't want clogging up your
index—words like "the", "a", "to", and so forth. Furthermore, suppose the biography contains the word "Ketchup",
capitalized at the beginning of a sentence. If a user makes a query for "ketchup", you want Solr to tell you about
the person even though the biography contains the capitalized word.
The solution to both these problems is field analysis. For the biography field, you can tell Solr how to break apart
the biography into words. You can tell Solr that you want to make all the words lower case, and you can tell Solr
to remove accent marks.
Field analysis is an important part of a field type. Understanding Analyzers, Tokenizers, and Filters is a detailed
description of field analysis.
Solr Field Types
The field type defines how Solr should interpret data in a field and how the field can be queried. There are many
field types included with Solr by default, and they can also be defined locally.
Topics covered in this section:
Field Type Definitions and Properties
Field Types Included with Solr
Working with Currencies and Exchange Rates
Working with Dates
Working with Enum Fields
Working with External Files and Processes
Field Properties by Use Case
Related Topics
SchemaXML-DataTypes
FieldType Javadoc
Field Type Definitions and Properties
A field type definition can include four types of information:
The name of the field type (mandatory)
An implementation class name (mandatory)
If the field type is TextField, a description of the field analysis for the field type
Field type properties - depending on the implementation class, some properties may be mandatory.
Field Type Definitions in schema.xml
Field types are defined in schema.xml. Each field type is defined between fieldType elements. They can
optionally be grouped within a types element. Here is an example of a field type definition for a type called
text_general:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"
/>
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
ignoreCase="true" expand="false"/>
-->
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"
/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
The first line in the example above contains the field type name, text_general, and the name of the
implementing class, solr.TextField. The rest of the definition is about field analysis, described in
Understanding Analyzers, Tokenizers, and Filters.
The implementing class is responsible for making sure the field is handled correctly. In the class names in
schema.xml, the string solr is shorthand for org.apache.solr.schema or org.apache.solr.analysis.
Therefore, solr.TextField is really org.apache.solr.schema.TextField.
Field Type Properties
The field type class determines most of the behavior of a field type, but optional properties can also be defined.
For example, the following definition of a date field type defines two properties, sortMissingLast and
omitNorms.
<fieldType name="date" class="solr.TrieDateField"
sortMissingLast="true" omitNorms="true"/>
The properties that can be specified for a given field type fall into three major categories:
Properties specific to the field type's class.
General Properties Solr supports for any field type.
Field Default Properties that can be specified on the field type that will be inherited by fields that use this
type instead of the default behavior.
General Properties
Property Description Values
name The name of the fieldType. This value gets used in field definitions,
in the "type" attribute. It is strongly recommended that names
consist of alphanumeric or underscore characters only and not start
with a digit. This is not currently strictly enforced.
class The class name that gets used to store and index the data for this
type. Note that you may prefix included class names with "solr." and
Solr will automatically figure out which packages to search for the
class - so "solr.TextField" will work. If you are using a third-party
class, you will probably need to have a fully qualified class name.
The fully qualified equivalent for "solr.TextField" is
"org.apache.solr.schema.TextField".
positionIncrementGap For multivalued fields, specifies a distance between multiple values,
which prevents spurious phrase matches
integer
autoGeneratePhraseQueries For text fields. If true, Solr automatically generates phrase queries for
adjacent terms. If false, terms must be enclosed in double-quotes to
be treated as phrases.
true or
false
docValuesFormat Defines a custom DocValuesFormat to use for fields of this type.
This requires that a schema-aware codec, such as the SchemaCodecFactory,
has been configured in solrconfig.xml.
n/a
postingsFormat Defines a custom PostingsFormat to use for fields of this type.
This requires that a schema-aware codec, such as the SchemaCodecFactory,
has been configured in solrconfig.xml.
n/a
Field Default Properties
Lucene index back-compatibility is only supported for the default codec. If you choose to customize the
postingsFormat or docValuesFormat in your schema.xml, upgrading to a future version of Solr may
require you to either switch back to the default codec and optimize your index to rewrite it into the default
codec before upgrading, or re-build your entire index from scratch after upgrading.
Property Description Values
indexed If true, the value of the field can be used in queries to retrieve
matching documents
true or
false
stored If true, the actual value of the field can be retrieved by queries true or
false
docValues If true, the value of the field will be put in a column-oriented DocValues
structure
true or
false
sortMissingFirst
sortMissingLast
Control the placement of documents when a sort field is not present.
As of Solr 3.5, these work for all numeric fields, including Trie and date
fields.
true or
false
multiValued If true, indicates that a single document might contain multiple values
for this field type
true or
false
omitNorms If true, omits the norms associated with this field (this disables length
normalization and index-time boosting for the field, and saves some
memory). Defaults to true for all primitive (non-analyzed) field types,
such as int, float, date, bool, and string. Only full-text fields or fields
that need an index-time boost need norms.
true or
false
omitTermFreqAndPositions If true, omits term frequency, positions, and payloads from postings for
this field. This can be a performance boost for fields that don't require
that information. It also reduces the storage space required for the
index. Queries that rely on position that are issued on a field with this
option will silently fail to find documents. This property defaults to true
for all fields that are not text fields.
true or
false
omitPositions Similar to omitTermFreqAndPositions but preserves term
frequency information
true or
false
termVectors
termPositions
termOffsets
termPayloads
These options instruct Solr to maintain full term vectors for each
document, optionally including position, offset and payload information
for each term occurrence in those vectors. These can be used to
accelerate highlighting and other ancillary functionality, but impose a
substantial cost in terms of index size. They are not necessary for
typical uses of Solr.
true or
false
required Instructs Solr to reject any attempts to add a document which does not
have a value for this field. This property defaults to false.
true or
false
Field Types Included with Solr
The following table lists the field types that are available in Solr. The org.apache.solr.schema package
includes all the classes listed in this table.
Class Description
BinaryField Binary data.
BoolField Contains either true or false. Values of "1", "t", or "T" in the first
character are interpreted as true. Any other values in the first
character are interpreted as false.
CollationField Supports Unicode collation for sorting and range queries.
ICUCollationField is a better choice if you can use ICU4J. See the
section Unicode Collation.
CurrencyField Supports currencies and exchange rates. See the section Working
with Currencies and Exchange Rates.
DateRangeField Supports indexing date ranges, to include point in time date instances
as well (single-millisecond durations). See the section Working with
Dates. Consider using this even if it's just for date instances,
particularly when the queries typically fall on UTC
year/month/day/hour etc. boundaries.
ExternalFileField Pulls values from a file on disk. See the section Working with External
Files and Processes.
EnumField Allows defining an enumerated set of values which may not be easily
sorted by either alphabetic or numeric order (such as a list of
severities, for example). This field type takes a configuration file, which
lists the proper order of the field values. See the section Working with
Enum Fields for more information.
ICUCollationField Supports Unicode collation for sorting and range queries. See the
section Unicode Collation.
LatLonType Spatial Search: a latitude/longitude coordinate pair. The latitude is
specified first in the pair.
PointType Spatial Search: An arbitrary n-dimensional point, useful for searching
sources such as blueprints or CAD drawings.
PreAnalyzedField Provides a way to send to Solr serialized token streams, optionally
with independent stored values of a field, and have this information
stored and indexed without any additional text processing. Useful if
you want to submit field content that was already processed by some
existing external text processing pipeline (e.g. tokenized, annotated,
stemmed, inserted synonyms, etc.), while using all the rich attributes
that Lucene's TokenStream provides via token attributes.
RandomSortField Does not contain a value. Queries that sort on this field type will return
results in random order. Use a dynamic field to use this feature.
SpatialRecursivePrefixTreeFieldType (RPT for short) Spatial Search: Accepts latitude comma longitude
strings or other shapes in WKT format.
StrField String (UTF-8 encoded string or Unicode).
TextField Text, usually multiple words or tokens.
TrieDateField Date field. Represents a point in time with millisecond precision. See
the section Working with Dates. precisionStep="0" enables
efficient date sorting and minimizes index size; precisionStep="8"
(the default) enables efficient range queries.
TrieDoubleField Double field (64-bit IEEE floating point). precisionStep="0" enables
efficient numeric sorting and minimizes index size; precisionStep="8"
(the default) enables efficient range queries.
TrieField If this field type is used, a "type" attribute must also be specified, valid
values are: integer, long, float, double, date. Using this field is
the same as using any of the Trie fields. precisionStep="0" enables
efficient numeric sorting and minimizes index size; precisionStep="8"
(the default) enables efficient range queries.
TrieFloatField Floating point field (32-bit IEEE floating point). precisionStep="0"
enables efficient numeric sorting and minimizes index size;
precisionStep="8" (the default) enables efficient range queries.
TrieIntField Integer field (32-bit signed integer). precisionStep="0" enables
efficient numeric sorting and minimizes index size; precisionStep="8"
(the default) enables efficient range queries.
TrieLongField Long field (64-bit signed integer). precisionStep="0" enables
efficient numeric sorting and minimizes index size; precisionStep="8"
(the default) enables efficient range queries.
UUIDField Universally Unique Identifier (UUID). Pass in a value of "NEW" and
Solr will create a new UUID. Note: configuring a UUIDField instance
with a default value of "NEW" is not advisable for most users when
using SolrCloud (and not possible if the UUID value is configured as
the unique key field) since the result will be that each replica of each
document will get a unique UUID value. Using
UUIDUpdateProcessorFactory to generate UUID values when
documents are added is recommended instead.
The MultiTermAwareComponent has been added to relevant solr.TextField entries in schema.xml
(e.g. wildcards, regex, prefix, range, etc.) to allow automatic lowercasing for multi-term queries.
Further, you can optionally specify a multi-term analyzer in field types in your schema: <analyzer
type="multiterm">; if you don't do this, the analyzer will process the fields according to their specific
attributes.
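As a sketch of that option (not a configuration shipped with Solr), a field type might declare an explicit multi-term analyzer alongside its index and query analyzers; here only lowercasing is applied to wildcard, prefix, and other multi-term queries, and the type name text_mt is an arbitrary choice:
<fieldType name="text_mt" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- applied to wildcard, prefix, regex and range queries -->
  <analyzer type="multiterm">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>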
Working with Currencies and Exchange Rates
The currency FieldType provides support for monetary values to Solr/Lucene with query-time currency
conversion and exchange rates. The following features are supported:
Point queries
Range queries
Function range queries
Sorting
Currency parsing by either currency code or symbol
Symmetric & asymmetric exchange rates (asymmetric exchange rates are useful if there are fees
associated with exchanging the currency)
Configuring Currencies
The currency field type is defined in schema.xml. This is the default configuration of this type:
<fieldType name="currency" class="solr.CurrencyField" precisionStep="8"
defaultCurrency="USD" currencyConfig="currency.xml" />
In this example, we have defined the name and class of the field type, and defined the defaultCurrency as
"USD", for U.S. Dollars. We have also defined a currencyConfig to use a file called "currency.xml". This is a
file of exchange rates between our default currency and other currencies. There is an alternate implementation that
would allow regular downloading of currency data. See Exchange Rates below for more.
At indexing time, money fields can be indexed in a native currency. For example, if a product on an e-commerce
site is listed in Euros, indexing the price field as "1000,EUR" will index it appropriately. The price should be
separated from the currency by a comma, and the price must be encoded with a floating point value (a decimal
point).
During query processing, range and point queries are both supported.
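To illustrate, assuming a field named price that uses the currency type defined above (the field name itself is an assumption), a point query and a range query could look like the following; values expressed in other currencies are converted using the configured exchange rates at query time:
price:10.00,USD
price:[10.00,USD TO 100.00,USD]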
Exchange Rates
You configure exchange rates by specifying a provider. Natively, two provider types are supported:
FileExchangeRateProvider or OpenExchangeRatesOrgProvider.
FileExchangeRateProvider
This provider requires you to provide a file of exchange rates. It is the default, meaning that to use this provider
you only need to specify the file path and name as a value for currencyConfig in the definition for this type.
There is a sample currency.xml file included with Solr, found in the same directory as the schema.xml file.
Here is a small snippet from this file:
<currencyConfig version="1.0">
<rates>
<!-- Updated from http://www.exchangerate.com/ at 2011-09-27 -->
<rate from="USD" to="ARS" rate="4.333871" comment="ARGENTINA Peso" />
<rate from="USD" to="AUD" rate="1.025768" comment="AUSTRALIA Dollar" />
<rate from="USD" to="EUR" rate="0.743676" comment="European Euro" />
<rate from="USD" to="CAD" rate="1.030815" comment="CANADA Dollar" />
<!-- Cross-rates for some common currencies -->
<rate from="EUR" to="GBP" rate="0.869914" />
<rate from="EUR" to="NOK" rate="7.800095" />
<rate from="GBP" to="NOK" rate="8.966508" />
<!-- Asymmetrical rates -->
<rate from="EUR" to="USD" rate="0.5" />
</rates>
</currencyConfig>
OpenExchangeRatesOrgProvider
You can configure Solr to download exchange rates from OpenExchangeRates.Org, with rates between
USD and 158 currencies updated hourly. These rates are symmetrical only.
In this case, you need to specify the providerClass in the definitions for the field type. Here is an example:
<fieldType name="currency" class="solr.CurrencyField" precisionStep="8"
providerClass="solr.OpenExchangeRatesOrgProvider"
refreshInterval="60"
ratesFileLocation="http://www.openexchangerates.org/api/latest.json?app_id=yourPerso
nalAppIdKey"/>
The refreshInterval is in minutes, so the above example will download the newest rates every 60 minutes.
Working with Dates
Date Formatting
Solr's TrieDateField (and deprecated DateField) represents a point in time with millisecond precision. The
format used is a restricted form of the canonical representation of dateTime in the XML Schema specification:
YYYY-MM-DDThh:mm:ssZ
YYYY is the year.
MM is the month.
DD is the day of the month.
hh is the hour of the day as on a 24-hour clock.
mm is minutes.
ss is seconds.
Z is a literal 'Z' character indicating that this string representation of the date is in UTC
Note that no time zone can be specified; the String representation of dates is always expressed in Coordinated
Universal Time (UTC). Here is an example value:
1972-05-20T17:33:18Z
You can optionally include fractional seconds if you wish, although any precision beyond milliseconds will be
ignored. Here are example values with sub-seconds included:
1972-05-20T17:33:18.772Z
1972-05-20T17:33:18.77Z
1972-05-20T17:33:18.7Z
Date Range Formatting
Solr's DateRangeField supports the same point in time date syntax described above (with date math described
below) and more to express date ranges. One class of examples is truncated dates, which represent the entire
date span to the precision indicated. The other class uses the range syntax ([ TO ]). Here are some
examples:
2000-11 – The entire month of November, 2000.
2000-11T13 – Likewise but for the 13th hour of the day (1pm-2pm).
-0009 – The year 10 BC. A 0 in the year position is 0 AD, and is also considered 1 BC.
[2000-11-01 TO 2014-12-01] – The specified date range at a day resolution.
[2014 TO 2014-12-01] – From the start of 2014 till the end of the first day of December.
[* TO 2014-12-01] – From the earliest representable time thru till the end of 2014-12-01.
Limitations: The range syntax doesn't support embedded date math. If you specify a date instance supported by
TrieDateField with date math truncating it, like NOW/DAY, you still get the first millisecond of that day, not the
entire day's range. Exclusive ranges (using { & }) work in queries but not for indexing ranges.
Date Math
Solr's date field types also support date math expressions, which make it easy to create times relative to fixed
moments in time, including the current time, which can be represented using the special value of "NOW".
Date Math Syntax
Date math expressions consist of either adding some quantity of time in a specified unit, or rounding the current
time by a specified unit. Expressions can be chained and are evaluated left to right.
For example: this represents a point in time two months from now:
NOW+2MONTHS
This is one day ago:
NOW-1DAY
A slash is used to indicate rounding. This represents the beginning of the current hour:
NOW/HOUR
The following example computes (with millisecond precision) the point in time six months and three days into the
future and then rounds that time to the beginning of that day:
NOW+6MONTHS+3DAYS/DAY
Note that while date math is most commonly used relative to NOW, it can be applied to any fixed moment in time
as well:
1972-05-20T17:33:18.772Z+6MONTHS+3DAYS/DAY
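Date math expressions like these are typically used in query parameters; for example, a hypothetical filter query on a field named last_modified could restrict results to documents updated within the last day:
fq=last_modified:[NOW-1DAY TO NOW]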
Request Parameters That Affect Date Math
NOW
The NOW parameter is used internally by Solr to ensure consistent date math expression parsing across multiple
nodes in a distributed request. But it can be specified to instruct Solr to use an arbitrary moment in time (past or
future) to override the special value of "NOW" for all situations where it would impact date math expressions.
It must be specified as (long valued) milliseconds since epoch.
Example:
q=solr&fq=start_date:[* TO NOW]&NOW=1384387200000
TZ
By default, all date math expressions are evaluated relative to the UTC TimeZone, but the TZ parameter can be
specified to override this behaviour, by forcing all date based addition and rounding to be relative to the specified
time zone.
For example, the following request will use range faceting to facet over the current month, "per day", relative to
UTC:
http://localhost:8983/solr/my_collection/select?q=*:*&facet.range=my_date_field&face
t=true&facet.range.start=NOW/MONTH&facet.range.end=NOW/MONTH%2B1MONTH&facet.range.ga
p=%2B1DAY
<int name="2013-11-01T00:00:00Z">0</int>
<int name="2013-11-02T00:00:00Z">0</int>
<int name="2013-11-03T00:00:00Z">0</int>
<int name="2013-11-04T00:00:00Z">0</int>
<int name="2013-11-05T00:00:00Z">0</int>
<int name="2013-11-06T00:00:00Z">0</int>
<int name="2013-11-07T00:00:00Z">0</int>
...
While in this example, the "days" will be computed relative to the specified time zone - including any applicable
Daylight Savings Time adjustments:
http://localhost:8983/solr/my_collection/select?q=*:*&facet.range=my_date_field&face
t=true&facet.range.start=NOW/MONTH&facet.range.end=NOW/MONTH%2B1MONTH&facet.range.ga
p=%2B1DAY&TZ=America/Los_Angeles
<int name="2013-11-01T07:00:00Z">0</int>
<int name="2013-11-02T07:00:00Z">0</int>
<int name="2013-11-03T07:00:00Z">0</int>
<int name="2013-11-04T08:00:00Z">0</int>
<int name="2013-11-05T08:00:00Z">0</int>
<int name="2013-11-06T08:00:00Z">0</int>
<int name="2013-11-07T08:00:00Z">0</int>
...
More DateRangeField Details
DateRangeField is almost a drop-in replacement for places where TrieDateField is used. The only
difference is that Solr's XML or SolrJ response formats will expose the stored data as a String instead of a Date.
The underlying index data for this field will be a bit larger. Queries that align to units of time a second on up
should be faster than TrieDateField, especially if it's in UTC. But the main point of DateRangeField, as its name
suggests, is to allow indexing date ranges. To do that, simply supply strings in the format shown above. It also
supports specifying 3 different relational predicates between the indexed data and the query range: Intersects
(default), Contains, Within. You can specify the predicate by querying using the op local-params
parameter like so:
fq={!field f=dateRange op=Contains}[2013 TO 2018]
In that example, it would find documents with indexed ranges that contain (or equal) the range 2013 thru 2018.
Multi-valued overlapping indexed ranges in a document are effectively coalesced.
Working with Enum Fields
The EnumField type allows defining a field whose values are a closed set, and the sort order is pre-determined
but is not alphabetic nor numeric. Examples of this are severity lists, or risk definitions.
Defining an EnumField in schema.xml
The EnumField type definition is quite simple, as in this example defining field types for "priorityLevel" and
"riskLevel" enumerations:
<fieldType name="priorityLevel" class="solr.EnumField" enumsConfig="enumsConfig.xml"
enumName="severity"/>
<fieldType name="riskLevel" class="solr.EnumField" enumsConfig="enumsConfig.xml"
enumName="risk" />
Besides the name and the class, which are common to all field types, this type also takes two additional
parameters:
enumsConfig: the name of a configuration file that contains the <enum/> list of field values and their
order that you wish to use with this field type. If a path to the file is not specified, the file should be
in the conf directory for the collection.
enumName: the name of the specific enumeration in the enumsConfig file to use for this type.
Defining the EnumField configuration file
The file named with the enumsConfig parameter can contain multiple enumeration value lists with different
names if there are multiple uses for enumerations in your Solr schema.
In this example, there are two value lists defined. Each list is between enum opening and closing tags:
<?xml version="1.0" ?>
<enumsConfig>
<enum name="priority">
<value>Not Available</value>
<value>Low</value>
<value>Medium</value>
<value>High</value>
<value>Urgent</value>
</enum>
<enum name="risk">
<value>Unknown</value>
<value>Very Low</value>
<value>Low</value>
<value>Medium</value>
<value>High</value>
<value>Critical</value>
</enum>
</enumsConfig>
Changing Values
You cannot change the order, or remove, existing values in an <enum/> without reindexing.
You can however add new values to the end.
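Continuing this example, a field could then be declared with one of the enum-based types defined earlier, and sorting on it follows the order of the configuration file rather than alphabetic order; the field name severity here is only an illustration:
<field name="severity" type="priorityLevel" indexed="true" stored="true"/>
A request such as sort=severity desc would then order documents according to the configured enum order (for the priority list, from Urgent down to Not Available).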
Working with External Files and Processes
The ExternalFileField Type
The ExternalFileField type makes it possible to specify the values for a field in a file outside the Solr index.
For such a field, the file contains mappings from a key field to the field value. Another way to think of this is that,
instead of specifying the field in documents as they are indexed, Solr finds values for this field in the external file.
The ExternalFileField type is handy for cases where you want to update a particular field in many
documents more often than you want to update the rest of the documents. For example, suppose you have
implemented a document rank based on the number of views. You might want to update the rank of all the
documents daily or hourly, while the rest of the contents of the documents might be updated much less
frequently. Without ExternalFileField, you would need to update each document just to change the rank.
Using ExternalFileField is much more efficient because all document values for a particular field are stored
in an external file that can be updated as frequently as you wish.
External fields are not searchable. They can be used only for function queries or display. For more
information on function queries, see the section on Function Queries.
In schema.xml, the definition of this field type might look like this:
<fieldType name="entryRankFile" keyField="pkId" defVal="0" stored="false"
indexed="false" class="solr.ExternalFileField" valType="pfloat"/>
The keyField attribute defines the key that will be defined in the external file. It is usually the unique key for the
index, but it doesn't need to be as long as the keyField can be used to identify documents in the index. A
defVal defines a default value that will be used if there is no entry in the external file for a particular document.
The valType attribute specifies the actual type of values that will be found in the file. The type specified must be
either a float field type, so valid values for this attribute are pfloat, float or tfloat. This attribute can be
omitted.
Format of the External File
The file itself is located in Solr's index directory, which by default is $SOLR_HOME/data. The name of the file
should be external_fieldname or external_fieldname.*. For the example above, then, the file could be
named external_entryRankFile or external_entryRankFile.txt.
The file contains entries that map a key field, on the left of the equals sign, to a value, on the right. Here are a
few example entries:
doc33=1.414
doc34=3.14159
doc40=42
The keys listed in this file do not need to be unique. The file does not need to be sorted, but Solr will be able to
perform the lookup faster if it is.
If any files using the name pattern .* (such as .txt) appear, the last (after being sorted by name) will
be used and previous versions will be deleted. This behavior supports implementations on systems
where one may not be able to overwrite a file (for example, on Windows, if the file is in use).
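As a hedged illustration of how the external values might be consumed, the field can be referenced by name in function queries; for example, the following request scores documents by the entryRankFile values from the file above (it could equally be used in a boost function such as bf with the eDisMax parser):
q={!func}entryRankFile&fl=*,score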
Reloading an External File
It's possible to define an event listener to reload an external file when either a searcher is reloaded or when a
new searcher is started. See the section Query-Related Listeners for more information, but a sample definition in
solrconfig.xml might look like this:
<listener event="newSearcher"
class="org.apache.solr.schema.ExternalFileFieldReloader"/>
<listener event="firstSearcher"
class="org.apache.solr.schema.ExternalFileFieldReloader"/>
Pre-Analyzing a Field Type
The PreAnalyzedField type provides a way to send to Solr serialized token streams, optionally with
independent stored values of a field, and have this information stored and indexed without any additional text
processing applied in Solr. This is useful if the user wants to submit field content that was already processed by
some existing external text processing pipeline (e.g., it has been tokenized, annotated, stemmed, synonyms
inserted, etc.), while using all the rich attributes that Lucene's TokenStream provides (per-token attributes).
The serialization format is pluggable using implementations of PreAnalyzedParser interface. There are two
out-of-the-box implementations:
JsonPreAnalyzedParser: as the name suggests, it parses content that uses JSON to represent a field's
content. This is the default parser to use if the field type is not configured otherwise.
SimplePreAnalyzedParser: uses a simple strict plain text format, which in some situations may be easier
to create than JSON.
There is only one configuration parameter, parserImpl. The value of this parameter should be a fully qualified
class name of a class that implements the PreAnalyzedParser interface. The default value of this parameter is
org.apache.solr.schema.JsonPreAnalyzedParser.
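A field type using this class might therefore be declared along the following lines; the type name preanalyzed is an arbitrary choice, and the parserImpl attribute can be omitted entirely to accept the JSON default:
<fieldType name="preanalyzed" class="solr.PreAnalyzedField"
           parserImpl="org.apache.solr.schema.JsonPreAnalyzedParser"/>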
Field Properties by Use Case
Here is a summary of common use cases, and the attributes the fields or field types should have to support the
case. An entry of true or false in the table indicates that the option must be set to the given value for the use
case to function correctly. If no entry is provided, the setting of that attribute has no impact on the case.
Use Case and the properties that must be set (numbers in parentheses refer to the Notes below):
search within field: indexed=true
retrieve contents: stored=true
use as unique key: indexed=true, multiValued=false
sort on field: indexed=true (7), multiValued=false, omitNorms=true (1), docValues=true (7)
use field boosts (5): omitNorms=false
document boosts affect searches within field: omitNorms=false
highlighting: indexed=true (4), stored=true, termVectors=true (2), termPositions=true (3)
faceting (5): indexed=true (7), docValues=true (7)
add multiple values, maintaining order: multiValued=true
field length affects doc score: omitNorms=false
MoreLikeThis (5): termVectors=true (6)
Notes:
1. Recommended but not necessary.
2. Will be used if present, but not necessary.
3. (if termVectors=true)
4. A tokenizer must be defined for the field, but it doesn't need to be indexed.
5. Described in Understanding Analyzers, Tokenizers, and Filters.
6. Term vectors are not mandatory here. If not true, then a stored field is analyzed. So term vectors are
recommended, but only required if stored=false.
7. Either indexed or docValues must be true, but both are not required. DocValues can be more efficient in
many cases.
Defining Fields
Fields are defined in the fields element of schema.xml. Once you have the field types set up, defining the fields
themselves is simple.
Example
The following example defines a field named price with a type named float and a default value of 0.0; the
indexed and stored properties are explicitly set to true, while any other properties specified on the float
field type are inherited.
<field name="price" type="float" default="0.0" indexed="true" stored="true"/>
Field Properties
Property Description
name The name of the field. Field names should consist of alphanumeric or underscore characters only
and not start with a digit. This is not currently strictly enforced, but other field names will not have
first class support from all components and back compatibility is not guaranteed. Names with both
leading and trailing underscores (e.g. _version_) are reserved. Every field must have a name.
type The name of the fieldType for this field. This will be found in the "name" attribute on the fieldType
definition. Every field must have a type.
default A default value that will be added automatically to any document that does not have a value in this
field when it is indexed. If this property is not specified, there is no default.
Optional Field Type Override Properties
Fields can have the same options as field types. The field type options serve as defaults which can be
overridden by options defined per field. Included below is the table of field type properties from the section
Field Type Definitions and Properties:
Property Description Values
indexed If true, the value of the field can be used in queries to retrieve
matching documents
true or
false
stored If true, the actual value of the field can be retrieved by queries true or
false
docValues If true, the value of the field will be put in a column-oriented DocValues
structure
true or
false
sortMissingFirst
sortMissingLast
Control the placement of documents when a sort field is not present.
As of Solr 3.5, these work for all numeric fields, including Trie and date
fields.
true or
false
multiValued If true, indicates that a single document might contain multiple values
for this field type
true or
false
omitNorms If true, omits the norms associated with this field (this disables length
normalization and index-time boosting for the field, and saves some
memory). Defaults to true for all primitive (non-analyzed) field types,
such as int, float, date, bool, and string. Only full-text fields or fields
that need an index-time boost need norms.
true or
false
omitTermFreqAndPositions If true, omits term frequency, positions, and payloads from postings for
this field. This can be a performance boost for fields that don't require
that information. It also reduces the storage space required for the
index. Queries that rely on position that are issued on a field with this
option will silently fail to find documents. This property defaults to true
for all fields that are not text fields.
true or
false
omitPositions Similar to omitTermFreqAndPositions but preserves term
frequency information
true or
false
termVectors
termPositions
termOffsets
termPayloads
These options instruct Solr to maintain full term vectors for each
document, optionally including position, offset and payload information
for each term occurrence in those vectors. These can be used to
accelerate highlighting and other ancillary functionality, but impose a
substantial cost in terms of index size. They are not necessary for
typical uses of Solr.
true or
false
required Instructs Solr to reject any attempts to add a document which does not
have a value for this field. This property defaults to false.
true or
false
Related Topics
SchemaXML-Fields
Field Options by Use Case
Copying Fields
You might want to interpret some document fields in more than one way. Solr has a mechanism for making
copies of fields so that you can apply several distinct field types to a single piece of incoming information.
The name of the field you want to copy is the source, and the name of the copy is the destination. In schema.xml, it's very simple to make copies of fields:
<copyField source="cat" dest="text" maxChars="30000" />
If the text destination field has data of its own in the input documents, the contents of the cat field will be added as additional values, just as if all of the values had originally been specified by the client. Remember to configure your fields as multiValued="true" if they will ultimately get multiple values (either from a multivalued source, or from multiple copyField directives, etc.).
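For example, a minimal sketch of a multi-valued catch-all destination; the attributes on the text field here mirror the sample schemas shipped with Solr, but treat them as assumptions for your own schema:

<!-- the destination must accept multiple values if several sources copy into it -->
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
<copyField source="cat" dest="text"/>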
The maxChars parameter, an int parameter, establishes an upper limit for the number of characters to be copied from the source value when constructing the value added to the destination field. This limit is useful for situations in which you want to copy some data from the source field, but also control the size of index files.
Both the source and the destination of copyField can contain either leading or trailing asterisks, which will match anything. For example, the following line will copy the contents of all incoming fields that match the wildcard pattern *_t to the text field:
<copyField source="*_t" dest="text" maxChars="25000" />
Related Topics
SchemaXML-Copy Fields
Dynamic Fields
Dynamic fields allow Solr to index fields that you did not explicitly define in your schema. This is useful if you
discover you have forgotten to define one or more fields. Dynamic fields can make your application less brittle by
providing some flexibility in the documents you can add to Solr.
A dynamic field is just like a regular field except it has a name with a wildcard in it. When you are indexing
documents, a field that does not match any explicitly defined fields can be matched with a dynamic field.
For example, suppose your schema includes a dynamic field with a name of *_i. If you attempt to index a document with a cost_i field, but no explicit cost_i field is defined in the schema, then the cost_i field will have the field type and analysis defined for *_i.
Like regular fields, dynamic fields have a name, a field type, and options.
<dynamicField name="*_i" type="int" indexed="true" stored="true"/>
It is recommended that you include basic dynamic field mappings (like that shown above) in your schema.xml. The mappings can be very useful.
Related Topics
SchemaXML-Dynamic Fields
The copyField command can use a wildcard (*) character in the dest parameter only if the source parameter contains one as well. copyField uses the matching glob from the source field for the dest field name into which the source content is copied.
Other Schema Elements
This section describes several other important elements of schema.xml.
Unique Key
The uniqueKey element specifies which field is a unique identifier for documents. Although uniqueKey is not required, it is nearly always warranted by your application design. For example, uniqueKey should be used if you will ever update a document in the index.
You can define the unique key field by naming it:
<uniqueKey>id</uniqueKey>
Schema defaults and copyFields cannot be used to populate the uniqueKey field. You also can't use UUIDUpdateProcessorFactory to have uniqueKey values generated automatically.
Further, the operation will fail if the uniqueKey field is used, but is multivalued (or inherits the multivalueness from the fieldType). However, uniqueKey will continue to work, as long as the field is properly used.
Default Search Field
If you are using the Lucene query parser, queries that don't specify a field name will use the defaultSearchField. The DisMax and Extended DisMax query parsers will also fall back to this if qf is not specified.
For more information about query parsers, see the section on Query Syntax and Parsing.
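For illustration, a minimal sketch; the field name text is an assumption, and any indexed field could serve. The deprecated schema.xml form is:

<defaultSearchField>text</defaultSearchField>

The preferred equivalent is to pass the df request parameter instead, for example df=text, or to set it in your request handler defaults.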
Query Parser Default Operator
In queries with multiple terms, Solr can either return results where all conditions are met or where one or more
conditions are met. The operator controls this behavior. An operator of AND means that all conditions must be
fulfilled, while an operator of OR means that one or more conditions must be true.
In schema.xml, the solrQueryParser element controls what operator is used if an operator is not specified in
the query. The default operator setting only applies to the Lucene query parser, not the DisMax or Extended
DisMax query parsers, which internally hard-code their operators to OR.
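For illustration, a hedged sketch of both the (deprecated) schema.xml element and the preferred per-request setting; the choice of AND here is arbitrary:

<solrQueryParser defaultOperator="AND"/>

The per-request equivalent is the q.op parameter, e.g. q.op=AND appended to a query, or set in the request handler defaults.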
Similarity
Similarity is a Lucene class used to score a document in searching.
A global <similarity> declaration can be used to specify a custom similarity implementation that you want Solr to use when dealing with your index. A similarity can be specified either by referring directly to the name of a
Use of the defaultSearchField element is deprecated in Solr versions 3.6 and higher. Instead, you should use the df request parameter. At some point, the defaultSearchField element may be removed.
The query parser default operator parameter has been deprecated in Solr versions 3.6 and higher. You are instead encouraged to specify the query parser q.op parameter in your request handler.
class with a no-argument constructor:
<similarity class="solr.DefaultSimilarityFactory"/>
or by referencing a SimilarityFactory implementation, which may take optional initialization parameters:
<similarity class="solr.DFRSimilarityFactory">
<str name="basicModel">P</str>
<str name="afterEffect">L</str>
<str name="normalization">H2</str>
<float name="c">7</float>
</similarity>
A special SchemaSimilarityFactory is available, which allows individual field types to be configured with a similarity to override the default behavior:
<similarity class="solr.SchemaSimilarityFactory"/>
<fieldType name="text_ib">
<analyzer/>
<similarity class="solr.IBSimilarityFactory">
<str name="distribution">SPL</str>
<str name="lambda">DF</str>
<str name="normalization">H2</str>
</similarity>
</fieldType>
If a field type does not have an explicitly configured similarity factory, then a DefaultSimilarity instance is used.
The examples above show DFRSimilarityFactory (divergence from randomness) and IBSimilarityFactory (using the Information-Based model), but there are several other similarity implementations that can be used, such as SweetSpotSimilarityFactory, BM25SimilarityFactory, etc. For details, see the Solr Javadocs for the similarity factories.
Related Topics
SchemaXML-Miscellaneous Settings
UniqueKey
Schema API
The Schema API provides read and write access to the Solr schema for each collection (or core, when using
standalone Solr). Read access to all schema elements is supported. Fields, dynamic fields, field types and
copyField rules may be added, removed or replaced. Future Solr releases will extend write access to allow more
schema elements to be modified.
To enable schema modification with this API, the schema will need to be managed and mutable. See the section
Re-index after schema modifications!
If you modify your schema, you will likely need to re-index all documents. If you do not, you may lose
access to documents, or not be able to interpret them properly, e.g. after replacing a field type.
Managed Schema Definition in SolrConfig for more information.
The API allows two output modes for all calls: JSON or XML. When requesting the complete schema, there is
another output mode which is XML modeled after the schema.xml file itself.
When modifying the schema with the API, a core reload will automatically occur in order for the changes to be
available immediately.
The base address for the API is http://<host>:<port>/solr/<collection_name>. If, for example, you run Solr's "cloud" example (via the bin/solr command shown below), which creates a "gettingstarted" collection, then the base URL (as in all the sample URLs in this section) would be: http://localhost:8983/solr/gettingstarted.
bin/solr -e cloud -noprompt
API Entry Points
Modify the Schema
Add a New Field
Delete a Field
Replace a Field
Add a Dynamic Field Rule
Delete a Dynamic Field Rule
Replace a Dynamic Field Rule
Add a New Field Type
Delete a Field Type
Replace a Field Type
Add a New Copy Field Rule
Delete a Copy Field Rule
Multiple Commands in a Single POST
Schema Changes among Replicas
Retrieve Schema Information
Retrieve the Entire Schema
List Fields
List Dynamic Fields
List Field Types
List Copy Fields
Show Schema Name
Show the Schema Version
List UniqueKey
Show Global Similarity
Get the Default Query Operator
Manage Resource Data
API Entry Points
/schema: retrieve the schema, or modify the schema to add, remove, or replace fields, dynamic fields, copy fields, or field types
/schema/fields: retrieve information about all defined fields or a specific named field
/schema/dynamicfields: retrieve information about all dynamic field rules or a specific named dynamic rule
/schema/fieldtypes: retrieve information about all field types or a specific field type
/schema/copyfields: retrieve information about copy fields
/schema/name: retrieve the schema name
/schema/version: retrieve the schema version
/schema/uniquekey: retrieve the defined uniqueKey
/schema/similarity: retrieve the global similarity definition
/schema/solrqueryparser/defaultoperator: retrieve the default operator
Modify the Schema
POST /collection/schema
To add, remove or replace fields, dynamic field rules, copy field rules, or new field types, you can send a POST request to the /collection/schema/ endpoint with a sequence of commands to perform the requested actions. The following commands are supported:
add-field: add a new field with parameters you provide.
delete-field: delete a field.
replace-field: replace an existing field with one that is differently configured.
add-dynamic-field: add a new dynamic field rule with parameters you provide.
delete-dynamic-field: delete a dynamic field rule.
replace-dynamic-field: replace an existing dynamic field rule with one that is differently configured.
add-field-type: add a new field type with parameters you provide.
delete-field-type: delete a field type.
replace-field-type: replace an existing field type with one that is differently configured.
add-copy-field: add a new copy field rule.
delete-copy-field: delete a copy field rule.
These commands can be issued in separate POST requests or in the same POST request. Commands are
executed in the order in which they are specified.
In each case, the response will include the status and the time to process the request, but will not include the
entire schema.
When modifying the schema with the REST API, a core reload will automatically occur in order for the changes
to be available immediately.
Add a New Field
The add-field command adds a new field definition to your schema. If a field with the same name exists, an error is thrown.
All of the properties available when defining a field with manual schema.xml edits can be passed via the API. These request attributes are described in detail in the section Defining Fields.
For example, to define a new stored field named "sell-by", of type "tdate", you would POST the following request:
curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-field":{
"name":"sell-by",
"type":"tdate",
"stored":true }
}' http://localhost:8983/solr/gettingstarted/schema
Delete a Field
The delete-field command removes a field definition from your schema. If the field does not exist in the schema, or if the field is the source or destination of a copy field rule, an error is thrown.
For example, to delete a field named "sell-by", you would POST the following request:
curl -X POST -H 'Content-type:application/json' --data-binary '{
"delete-field" : { "name":"sell-by" }
}' http://localhost:8983/solr/gettingstarted/schema
Replace a Field
The replace-field command replaces a field's definition. Note that you must supply the full definition for a field - this command will not partially modify a field's definition. If the field does not exist in the schema an error is thrown.
All of the properties available when defining a field with manual schema.xml edits can be passed via the API. These request attributes are described in detail in the section Defining Fields.
For example, to replace the definition of an existing field "sell-by", to make it be of type "date" and to not be
stored, you would POST the following request:
curl -X POST -H 'Content-type:application/json' --data-binary '{
"replace-field":{
"name":"sell-by",
"type":"date",
"stored":false }
}' http://localhost:8983/solr/gettingstarted/schema
Add a Dynamic Field Rule
The add-dynamic-field command adds a new dynamic field rule to your schema.
All of the properties available when editing schema.xml can be passed with the POST request. The section Dynamic Fields has details on all of the attributes that can be defined for a dynamic field rule.
For example, to create a new dynamic field rule where all incoming fields ending with "_s" would be stored and
have field type "string", you can POST a request like this:
curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-dynamic-field":{
"name":"*_s",
"type":"string",
"stored":true }
}' http://localhost:8983/solr/gettingstarted/schema
Delete a Dynamic Field Rule
The delete-dynamic-field command deletes a dynamic field rule from your schema. If the dynamic field rule does not exist in the schema, or if the schema contains a copy field rule with a target or destination that matches only this dynamic field rule, an error is thrown.
For example, to delete a dynamic field rule matching "*_s", you can POST a request like this:
curl -X POST -H 'Content-type:application/json' --data-binary '{
"delete-dynamic-field":{ "name":"*_s" }
}' http://localhost:8983/solr/gettingstarted/schema
Replace a Dynamic Field Rule
The replace-dynamic-field command replaces a dynamic field rule in your schema. Note that you must supply the full definition for a dynamic field rule - this command will not partially modify a dynamic field rule's definition. If the dynamic field rule does not exist in the schema an error is thrown.
All of the properties available when editing schema.xml can be passed with the POST request. The section Dynamic Fields has details on all of the attributes that can be defined for a dynamic field rule.
For example, to replace the definition of the "*_s" dynamic field rule with one where the field type is
"text_general" and it's not stored, you can POST a request like this:
curl -X POST -H 'Content-type:application/json' --data-binary '{
"replace-dynamic-field":{
"name":"*_s",
"type":"text_general",
"stored":false }
}' http://localhost:8983/solr/gettingstarted/schema
Add a New Field Type
The add-field-type command adds a new field type to your schema.
All of the field type properties available when editing schema.xml by hand are available for use in a POST request. The structure of the command is a JSON mapping of the standard field type definition, including the name, class, index and query analyzer definitions, etc. Details of all of the available options are described in the section Solr Field Types.
For example, to create a new field type named "myNewTxtField", you can POST a request as follows:
curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-field-type" : {
"name":"myNewTxtField",
"class":"solr.TextField",
"positionIncrementGap":"100",
"analyzer" : {
"charFilters":[{
"class":"solr.PatternReplaceCharFilterFactory",
"replacement":"$1$1",
"pattern":"([a-zA-Z])\\\\1+" }],
"tokenizer":{
"class":"solr.WhitespaceTokenizerFactory" },
"filters":[{
"class":"solr.WordDelimiterFilterFactory",
"preserveOriginal":"0" }]}}
}' http://localhost:8983/solr/gettingstarted/schema
Note in this example that we have only defined a single analyzer section that will apply to index analysis and
query analysis. If we wanted to define separate analysis, we would replace the analyzer section in the above example with separate sections for indexAnalyzer and queryAnalyzer. As in this example:
curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-field-type":{
"name":"myNewTextField",
"class":"solr.TextField",
"indexAnalyzer":{
"tokenizer":{
"class":"solr.PathHierarchyTokenizerFactory",
"delimiter":"/" }},
"queryAnalyzer":{
"tokenizer":{
"class":"solr.KeywordTokenizerFactory" }}}
}' http://localhost:8983/solr/gettingstarted/schema
Delete a Field Type
The delete-field-type command removes a field type from your schema. If the field type does not exist in
the schema, or if any field or dynamic field rule in the schema uses the field type, an error is thrown.
For example, to delete the field type named "myNewTxtField", you can make a POST request as
follows:
curl -X POST -H 'Content-type:application/json' --data-binary '{
"delete-field-type":{ "name":"myNewTxtField" }
}' http://localhost:8983/solr/gettingstarted/schema
Replace a Field Type
The replace-field-type command replaces a field type in your schema. Note that you must supply the full definition for a field type - this command will not partially modify a field type's definition. If the field type does not exist in the schema an error is thrown.
All of the field type properties available when editing schema.xml by hand are available for use in a POST
request. The structure of the command is a JSON mapping of the standard field type definition, including the name, class, index and query analyzer definitions, etc. Details of all of the available options are described in the section Solr Field Types.
For example, to replace the definition of a field type named "myNewTxtField", you can make a POST request as
follows:
curl -X POST -H 'Content-type:application/json' --data-binary '{
"replace-field-type":{
"name":"myNewTxtField",
"class":"solr.TextField",
"positionIncrementGap":"100",
"analyzer":{
"tokenizer":{
"class":"solr.StandardTokenizerFactory" }}}
}' http://localhost:8983/solr/gettingstarted/schema
Add a New Copy Field Rule
The add-copy-field command adds a new copy field rule to your schema.
The attributes supported by the command are the same as when creating copy field rules by manually editing
the schema.xml, as below:
Name Required Description
source Yes The source field.
dest Yes A field or an array of fields to which the source field will be copied.
maxChars No The upper limit for the number of characters to be copied. The section Copying Fields
has more details.
For example, to define a rule to copy the field "shelf" to the "location" and "catchall" fields, you would POST the
following request:
curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-copy-field":{
"source":"shelf",
"dest":[ "location", "catchall" ]}
}' http://localhost:8983/solr/gettingstarted/schema
Delete a Copy Field Rule
The delete-copy-field command deletes a copy field rule from your schema. If the copy field rule does not exist in the schema an error is thrown.
The source and dest attributes are required by this command.
For example, to delete a rule to copy the field "shelf" to the "location" field, you would POST the following
request:
curl -X POST -H 'Content-type:application/json' --data-binary '{
"delete-copy-field":{ "source":"shelf", "dest":"location" }
}' http://localhost:8983/solr/gettingstarted/schema
Multiple Commands in a Single POST
It is possible to perform one or more add requests in a single command. The API is transactional and all
commands in a single call either succeed or fail together.
The commands are executed in the order in which they are specified. This means that if you want to create a
new field type and in the same request use the field type on a new field, the section of the request that creates
the field type must come before the section that creates the new field. Similarly, since a field must exist for it to
be used in a copy field rule, a request to add a field must come before a request for the field to be used as either
the source or the destination for a copy field rule.
The syntax for making multiple requests supports several approaches. First, the commands can simply be made
serially, as in this request to create a new field type and then a field that uses that type:
curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-field-type":{
"name":"myNewTxtField",
"class":"solr.TextField",
"positionIncrementGap":"100",
"analyzer":{
"charFilters":[{
"class":"solr.PatternReplaceCharFilterFactory",
"replacement":"$1$1",
"pattern":"([a-zA-Z])\\\\1+" }],
"tokenizer":{
"class":"solr.WhitespaceTokenizerFactory" },
"filters":[{
"class":"solr.WordDelimiterFilterFactory",
"preserveOriginal":"0" }]}},
"add-field" : {
"name":"sell-by",
"type":"myNewTxtField",
"stored":true }
}' http://localhost:8983/solr/gettingstarted/schema
Or, the same command can be repeated, as in this example:
curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-field":{
"name":"shelf",
"type":"myNewTxtField",
"stored":true },
"add-field":{
"name":"location",
"type":"myNewTxtField",
"stored":true },
"add-copy-field":{
"source":"shelf",
"dest":[ "location", "catchall" ]}
}' http://localhost:8983/solr/gettingstarted/schema
Finally, repeated commands can be sent as an array:
curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-field":[
{ "name":"shelf",
"type":"myNewTxtField",
"stored":true },
{ "name":"location",
"type":"myNewTxtField",
"stored":true }]
}' http://localhost:8983/solr/gettingstarted/schema
Schema Changes among Replicas
When running in SolrCloud mode, changes made to the schema on one node will propagate to all replicas in the
collection. You can pass the updateTimeoutSecs parameter with your request to set the number of seconds to wait until all replicas confirm they applied the schema updates. This helps your client application be more robust in that you can be sure that all replicas have a given schema change within a defined amount of time. If agreement is not reached by all replicas in the specified time, then the request fails and the error message will include information about which replicas had trouble. In most cases, the only option is to re-try the change after waiting a brief amount of time. If the problem persists, then you'll likely need to investigate the server logs on the replicas that had trouble applying the changes.
If you do not supply an updateTimeoutSecs parameter, the default behavior is for the receiving node to return immediately after persisting the updates to ZooKeeper. All other replicas will apply the updates asynchronously. Consequently, without supplying a timeout, your client application cannot be sure that all replicas have applied the changes.
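For illustration, a hedged sketch of passing updateTimeoutSecs as a request parameter on a schema modification; the 30-second value and the field definition are arbitrary examples:

curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field":{
    "name":"on_sale",
    "type":"boolean",
    "stored":true }
}' "http://localhost:8983/solr/gettingstarted/schema?updateTimeoutSecs=30"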
Retrieve Schema Information
The following endpoints allow you to read how your schema has been defined. You can GET the entire schema,
or only portions of it as needed.
To modify the schema, see the previous section Modify the Schema.
Retrieve the Entire Schema
GET /collection/schema
INPUT
Path Parameters
Key Description
collection The collection (or core) name.
Query Parameters
The query parameters should be added to the API request after '?'.
Key Type Required Default Description
wt string No json Defines the format of the response. The options are json, xml or schema.xml. If not specified, JSON will be returned by default.
OUTPUT
Output Content
The output will include all fields, field types, dynamic rules and copy field rules, in the format requested (JSON or
XML). The schema name and version are also included.
EXAMPLES
Get the entire schema in JSON.
curl http://localhost:8983/solr/gettingstarted/schema?wt=json
{
"responseHeader":{
"status":0,
"QTime":5},
"schema":{
"name":"example",
"version":1.5,
"uniqueKey":"id",
"fieldTypes":[{
"name":"alphaOnlySort",
"class":"solr.TextField",
"sortMissingLast":true,
"omitNorms":true,
"analyzer":{
"tokenizer":{
"class":"solr.KeywordTokenizerFactory"},
"filters":[{
"class":"solr.LowerCaseFilterFactory"},
{
"class":"solr.TrimFilterFactory"},
{
"class":"solr.PatternReplaceFilterFactory",
"replace":"all",
"replacement":"",
"pattern":"([^a-z])"}]}},
...
"fields":[{
"name":"_version_",
"type":"long",
"indexed":true,
"stored":true},
{
"name":"author",
"type":"text_general",
"indexed":true,
"stored":true},
{
"name":"cat",
"type":"string",
"multiValued":true,
"indexed":true,
"stored":true},
...
"copyFields":[{
"source":"author",
"dest":"text"},
{
"source":"cat",
"dest":"text"},
{
"source":"content",
"dest":"text"},
...
{
"source":"author",
"dest":"author_s"}]}}
Get the entire schema in XML.
curl http://localhost:8983/solr/gettingstarted/schema?wt=xml
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">5</int>
</lst>
<lst name="schema">
<str name="name">example</str>
<float name="version">1.5</float>
<str name="uniqueKey">id</str>
<arr name="fieldTypes">
<lst>
<str name="name">alphaOnlySort</str>
<str name="class">solr.TextField</str>
<bool name="sortMissingLast">true</bool>
<bool name="omitNorms">true</bool>
<lst name="analyzer">
<lst name="tokenizer">
<str name="class">solr.KeywordTokenizerFactory</str>
</lst>
<arr name="filters">
<lst>
<str name="class">solr.LowerCaseFilterFactory</str>
</lst>
<lst>
<str name="class">solr.TrimFilterFactory</str>
</lst>
<lst>
<str name="class">solr.PatternReplaceFilterFactory</str>
<str name="replace">all</str>
<str name="replacement"/>
<str name="pattern">([^a-z])</str>
</lst>
</arr>
</lst>
</lst>
...
<lst>
<str name="source">author</str>
<str name="dest">author_s</str>
</lst>
</arr>
</lst>
</response>
Get the entire schema in "schema.xml" format.
curl http://localhost:8983/solr/gettingstarted/schema?wt=schema.xml
<schema name="example" version="1.5">
<uniqueKey>id</uniqueKey>
<types>
<fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true"
omitNorms="true">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.TrimFilterFactory"/>
<filter class="solr.PatternReplaceFilterFactory" replace="all"
replacement="" pattern="([^a-z])"/>
</analyzer>
</fieldType>
...
<copyField source="url" dest="text"/>
<copyField source="price" dest="price_c"/>
<copyField source="author" dest="author_s"/>
</schema>
List Fields
GET /collection/schema/fields
GET /collection/schema/fields/fieldname
INPUT
Path Parameters
Key Description
collection The collection (or core) name.
fieldname The specific fieldname (if limiting request to a single field).
Query Parameters
The query parameters can be added to the API request after a '?'.
Key Type Required Default Description
wt string No json Defines the format of the response. The options are json or xml. If not specified, JSON will be returned by default.
OUTPUT
Output Content
The output will include each field and any defined configuration for each field. The defined configuration can vary for each field, but will minimally include the field name, the type, if it is indexed and if it is stored. If multiValued is defined as either true or false (most likely true), that will also be shown. See the section Defining Fields for more information about each parameter.
EXAMPLES
Get a list of all fields.
curl http://localhost:8983/solr/gettingstarted/schema/fields?wt=json
The sample output below has been truncated to only show a few fields.
{
"fields": [
{
"indexed": true,
"name": "_version_",
"stored": true,
"type": "long"
},
{
"indexed": true,
"name": "author",
"stored": true,
"type": "text_general"
},
{
"indexed": true,
"multiValued": true,
"name": "cat",
"stored": true,
"type": "string"
},
...
],
"responseHeader": {
"QTime": 1,
"status": 0
}
}
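To limit the response to a single field, append the field name to the path. For example, a request for the "author" field shown in the sample output above:

curl http://localhost:8983/solr/gettingstarted/schema/fields/author?wt=json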
List Dynamic Fields
GET /collection/schema/dynamicfields
GET /collection/schema/dynamicfields/name
INPUT
Path Parameters
Key Description
collection The collection (or core) name.
name The name of the dynamic field rule (if limiting request to a single dynamic field rule).
Query Parameters
The query parameters can be added to the API request after a '?'.
Key Type Required Default Description
wt string No json Defines the format of the response. The options are json or xml. If not specified, JSON will be returned by default.
OUTPUT
Output Content
The output will include each dynamic field rule and the defined configuration for each rule. The defined
configuration can vary for each rule, but will minimally include the dynamic field name, the type, if it is indexed and if it is stored. See the section Dynamic Fields for more information about each parameter.
EXAMPLES
Get a list of all dynamic field declarations:
curl http://localhost:8983/solr/gettingstarted/schema/dynamicfields?wt=json
The sample output below has been truncated.
{
"dynamicFields": [
{
"indexed": true,
"name": "*_coordinate",
"stored": false,
"type": "tdouble"
},
{
"multiValued": true,
"name": "ignored_*",
"type": "ignored"
},
{
"name": "random_*",
"type": "random"
},
{
"indexed": true,
"multiValued": true,
"name": "attr_*",
"stored": true,
"type": "text_general"
},
{
"indexed": true,
"multiValued": true,
"name": "*_txt",
"stored": true,
"type": "text_general"
}
...
],
"responseHeader": {
"QTime": 1,
"status": 0
}
}
List Field Types
GET /collection/schema/fieldtypes
GET /collection/schema/fieldtypes/name
INPUT
Path Parameters
Key Description
collection The collection (or core) name.
name The name of the field type (if limiting request to a single field type).
Query Parameters
The query parameters can be added to the API request after a '?'.
Key Type Required Default Description
wt string No json Defines the format of the response. The options are json or xml. If not specified, JSON will be returned by default.
OUTPUT
Output Content
The output will include each field type and any defined configuration for the type. The defined configuration can
vary for each type, but will minimally include the field type name and the class. If query or index analyzers, tokenizers, or filters are defined, those will also be shown with other defined parameters. See the section Solr Field Types for more information about how to configure various types of fields.
EXAMPLES
Get a list of all field types.
curl http://localhost:8983/solr/gettingstarted/schema/fieldtypes?wt=json
The sample output below has been truncated to show a few different field types from different parts of the list.
{
"fieldTypes": [
{
"analyzer": {
"class": "solr.TokenizerChain",
"filters": [
{
"class": "solr.LowerCaseFilterFactory"
},
{
"class": "solr.TrimFilterFactory"
},
{
"class": "solr.PatternReplaceFilterFactory",
"pattern": "([^a-z])",
"replace": "all",
"replacement": ""
}
],
"tokenizer": {
"class": "solr.KeywordTokenizerFactory"
}
},
"class": "solr.TextField",
"dynamicFields": [],
"fields": [],
"name": "alphaOnlySort",
"omitNorms": true,
"sortMissingLast": true
},
...
{
"class": "solr.TrieFloatField",
"dynamicFields": [
"*_fs",
"*_f"
],
"fields": [
"price",
"weight"
],
"name": "float",
"positionIncrementGap": "0",
"precisionStep": "0"
},
...
}
List Copy Fields
GET /collection/schema/copyfields
INPUT
Path Parameters
Key Description
collection The collection (or core) name.
Query Parameters
The query parameters can be added to the API request after a '?'.
Key Type Required Default Description
wt string No json Defines the format of the response. The options are json or xml. If not specified, JSON will be returned by default.
OUTPUT
Output Content
The output will include the source and destination of each copy field rule defined in schema.xml. For more information about copying fields, see the section Copying Fields.
EXAMPLES
Get a list of all copyfields.
curl http://localhost:8983/solr/gettingstarted/schema/copyfields?wt=json
The sample output below has been truncated to the first few copy definitions.
{
"copyFields": [
{
"dest": "text",
"source": "author"
},
{
"dest": "text",
"source": "cat"
},
{
"dest": "text",
"source": "content"
},
{
"dest": "text",
"source": "content_type"
},
...
],
"responseHeader": {
"QTime": 3,
"status": 0
}
}
Show Schema Name
GET /collection/schema/name
INPUT
Path Parameters
Key Description
collection The collection (or core) name.
Query Parameters
The query parameters can be added to the API request after a '?'.
Key Type Required Default Description
wt string No json Defines the format of the response. The options are json or xml. If not specified, JSON will be returned by default.
OUTPUT
Output Content
The output will be simply the name given to the schema.
EXAMPLES
Get the schema name.
curl http://localhost:8983/solr/gettingstarted/schema/name?wt=json
{
"responseHeader":{
"status":0,
"QTime":1},
"name":"example"}
Show the Schema Version
GET /collection/schema/version
INPUT
Path Parameters
Key Description
collection The collection (or core) name.
Query Parameters
The query parameters can be added to the API request after a '?'.
Key Type Required Default Description
wt string No json Defines the format of the response. The options are json or xml. If not specified, JSON will be returned by default.
OUTPUT
Output Content
The output will simply be the schema version in use.
EXAMPLES
Get the schema version
curl http://localhost:8983/solr/gettingstarted/schema/version?wt=json
{
"responseHeader":{
"status":0,
"QTime":2},
"version":1.5}
List UniqueKey
GET /collection/schema/uniquekey
INPUT
Path Parameters
Key Description
collection The collection (or core) name.
Query Parameters
The query parameters can be added to the API request after a '?'.
Key Type Required Default Description
wt string No json Defines the format of the response. The options are json or xml. If not specified, JSON will be returned by default.
OUTPUT
Output Content
The output will include simply the field name that is defined as the uniqueKey for the index.
EXAMPLES
List the uniqueKey.
curl http://localhost:8983/solr/gettingstarted/schema/uniquekey?wt=json
{
"responseHeader":{
"status":0,
"QTime":2},
"uniqueKey":"id"}
Show Global Similarity
GET /collection/schema/similarity
INPUT
Path Parameters
Key Description
collection The collection (or core) name.
Query Parameters
The query parameters can be added to the API request after a '?'.
Key Type Required Default Description
wt string No json Defines the format of the response. The options are json or xml. If not specified, JSON will be returned by default.
OUTPUT
Output Content
The output will include the class name of the global similarity defined (if any).
EXAMPLES
Get the similarity implementation.
curl http://localhost:8983/solr/gettingstarted/schema/similarity?wt=json
{
"responseHeader":{
"status":0,
"QTime":1},
"similarity":{
"class":"org.apache.solr.search.similarities.DefaultSimilarityFactory"}}
Get the Default Query Operator
GET /collection/schema/solrqueryparser/defaultoperator
INPUT
Path Parameters
Key Description
collection The collection (or core) name.
Query Parameters
The query parameters can be added to the API request after a '?'.
Key Type Required Default Description
wt string No json Defines the format of the response. The options are json or xml. If not specified, JSON will be returned by default.
OUTPUT
Output Content
The output will include simply the default operator if none is defined by the user.
EXAMPLES
Get the default operator.
curl
http://localhost:8983/solr/gettingstarted/schema/solrqueryparser/defaultoperator?wt=
json
{
"responseHeader":{
"status":0,
"QTime":2},
"defaultOperator":"OR"}
Manage Resource Data
The Managed Resources REST API provides a mechanism for any Solr plugin to expose resources that should support CRUD (Create, Read, Update, Delete) operations. Depending on what Field Types and Analyzers are configured in your Schema, additional /schema/ REST API paths may exist. See the section Managed Resources for more information and examples.
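For illustration, a hedged example: if your schema happens to use a managed stop filter whose resource is named "english" (an assumption about your configuration), that resource would be readable under /schema/analysis/, for example:

curl http://localhost:8983/solr/gettingstarted/schema/analysis/stopwords/english?wt=json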
Putting the Pieces Together
At the highest level, schema.xml is structured as follows. This example is not real XML, but it gives you an idea of the structure of the file.
<schema>
<types>
<fields>
<uniqueKey>
<copyField>
</schema>
Obviously, most of the excitement is in types and fields, where the field types and the actual field definitions live. These are supplemented by copyFields. The uniqueKey must always be defined. In older Solr versions you would find defaultSearchField and solrQueryParser tags as well, but although these still work they are deprecated and discouraged; see Other Schema Elements.
Choosing Appropriate Numeric Types
For general numeric needs, use TrieIntField, TrieLongField, TrieFloatField, and TrieDoubleField with precisionStep="0".
If you expect users to make frequent range queries on numeric types, use the default precisionStep (by not specifying it) or specify it as precisionStep="8" (which is the default). This offers faster speed for range queries at the expense of increasing index size.
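For illustration, a minimal sketch of the two variants; the type names int and tint follow the convention used in Solr's sample schemas, but treat them as assumptions for your own schema:

<!-- exact-value integer: smallest index, slower range queries -->
<fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
<!-- range-query-oriented integer: extra terms indexed for faster range queries -->
<fieldType name="tint" class="solr.TrieIntField" precisionStep="8" positionIncrementGap="0"/>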
Working With Text
Handling text properly will make your users happy by providing them with the best possible results for text
searches.
One technique is using a text field as a catch-all for keyword searching. Most users are not sophisticated about
their searches and the most common search is likely to be a simple keyword search. You can use copyField to take a variety of fields and funnel them all into a single text field for keyword searches. In the schema.xml file for the "techproducts" example included with Solr, copyField declarations are used to dump the contents of cat, name, manu, features, and includes into a single field, text. In addition, it could be a good idea to copy ID into text in case users wanted to search for a particular product by passing its product number to a keyword search.
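Those declarations look roughly like the following sketch (field names taken from the description above; confirm them against the actual techproducts schema before reuse):

<copyField source="cat" dest="text"/>
<copyField source="name" dest="text"/>
<copyField source="manu" dest="text"/>
<copyField source="features" dest="text"/>
<copyField source="includes" dest="text"/>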
Another technique is using copyField to use the same field in different ways. Suppose you have a field that is a list of authors, like this:
Schildt, Herbert; Wolpert, Lewis; Davies, P.
For searching by author, you could tokenize the field, convert to lower case, and strip out punctuation:
schildt / herbert / wolpert / lewis / davies / p
For sorting, just use an untokenized field, converted to lower case, with punctuation stripped:
Types and fields are optional tags
Note that the types and fields sections are optional, meaning you are free to mix field, dynamicField, copyField, and fieldType definitions on the top level. This allows for a more logical grouping of related tags in your schema.
schildt herbert wolpert lewis davies p
Finally, for faceting, use the primary author only via a StringField:
Schildt, Herbert
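A hedged sketch of the search and sort halves of this setup; the field and type names are illustrative assumptions, with alphaOnlySort standing in for the untokenized, lowercased, punctuation-stripped type shown in the example schema output earlier. A faceting field holding only the primary author would typically be populated by the client or an update processor, since copyField copies the whole value:

<field name="author" type="text_general" indexed="true" stored="true"/>
<field name="author_sort" type="alphaOnlySort" indexed="true" stored="false"/>
<copyField source="author" dest="author_sort"/>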
Related Topics
SchemaXML
DocValues
DocValues are a way of recording field values internally that is more efficient for some purposes, such as sorting and faceting, than traditional indexing.
Why DocValues?
The standard way that Solr builds the index is with an inverted index. This style builds a list of terms found in all the documents in the index and next to each term is a list of documents that the term appears in (as well as how many times the term appears in that document). This makes search very fast - since users search by terms, having a ready list of term-to-document values makes the query process faster.
For other features that we now commonly associate with search, such as sorting, faceting, and highlighting, this
approach is not very efficient. The faceting engine, for example, must look up each term that appears in each
document that will make up the result set and pull the document IDs in order to build the facet list. In Solr, this is
maintained in memory, and can be slow to load (depending on the number of documents, terms, etc.).
In Lucene 4.0, a new approach was introduced. DocValue fields are now column-oriented fields with a
document-to-value mapping built at index time. This approach promises to relieve some of the memory
requirements of the fieldCache and make lookups for faceting, sorting, and grouping much faster.
How to Use DocValues
To use docValues, you only need to enable it for a field that you will use it with. As with all schema design, you
need to define a field type and then define fields of that type with docValues enabled. All of these actions are
done in schema.xml.
Enabling a field for docValues only requires adding docValues="true" to the field (or field type) definition, as in this example from the schema.xml of Solr's sample_techproducts_configs config set:
<field name="manu_exact" type="string" indexed="false" stored="false"
docValues="true" />
DocValues are only available for specific field types. The types chosen determine the underlying Lucene
docValue type that will be used. The available Solr field types are:
StrField and UUIDField.
If the field is single-valued (i.e., multi-valued is false), Lucene will use the SORTED type.
If the field is multi-valued, Lucene will use the SORTED_SET type.
If you have already indexed data into your Solr index, you will need to completely re-index your content after changing your field definitions in schema.xml in order to successfully use docValues.
Any Trie* numeric fields and EnumField.
If the field is single-valued (i.e., multi-valued is false), Lucene will use the NUMERIC type.
If the field is multi-valued, Lucene will use the SORTED_SET type.
These Lucene types are related to how the values are sorted and stored.
There is an additional configuration option available, which is to modify the docValuesFormat used by the field type. The default implementation employs a mixture of loading some things into memory and keeping some on disk. In some cases, however, you may choose to specify an alternative DocValuesFormat implementation. For example, you could choose to keep everything in memory by specifying docValuesFormat="Memory" on a field type:
<fieldType name="string_in_mem_dv" class="solr.StrField" docValues="true"
docValuesFormat="Memory" />
Please note that the docValuesFormat option may change in future releases.
Schemaless Mode
Schemaless Mode is a set of Solr features that, when used together, allow users to rapidly construct an effective
schema by simply indexing sample data, without having to manually edit the schema. These Solr features, all
specified in solrconfig.xml, are:
Managed schema: Schema modifications are made through Solr APIs rather than manual edits - see Managed Schema Definition in SolrConfig.
Field value class guessing: Previously unseen fields are run through a cascading set of value-based
parsers, which guess the Java class of field values - parsers for Boolean, Integer, Long, Float, Double,
and Date are currently available.
Automatic schema field addition, based on field value class(es): Previously unseen fields are added to the schema, based on field value Java classes, which are mapped to schema field types - see Solr Field Types.
Using the Schemaless Example
The three features of schemaless mode are pre-configured in the data_driven_schema_configs config set in the Solr distribution. To start an example instance of Solr using these configs, run the following command:
bin/solr start -e schemaless
This will launch a Solr server, and automatically create a collection (named "gettingstarted") that contains only three fields in the initial schema: id, _version_, and _text_.
You can use the /schema/fields Schema API to confirm this: curl http://localhost:8983/solr/gettingstarted/schema/fields will output:
Lucene index back-compatibility is only supported for the default codec. If you choose to customize the docValuesFormat in your schema.xml, upgrading to a future version of Solr may require you to either switch back to the default codec and optimize your index to rewrite it into the default codec before upgrading, or re-build your entire index from scratch after upgrading.
{
"responseHeader":{
"status":0,
"QTime":1},
"fields":[{
"name":"_text_",
"type":"text_general",
"multiValued":true,
"indexed":true,
"stored":false},
{
"name":"_version_",
"type":"long",
"indexed":true,
"stored":true},
{
"name":"id",
"type":"string",
"multiValued":false,
"indexed":true,
"required":true,
"stored":true,
"uniqueKey":true}]}
Configuring Schemaless Mode
As described above, there are three configuration elements that need to be in place to use Solr in schemaless
mode. In the data_driven_schema_configs config set included with Solr these are already configured. If, however, you would like to implement schemaless on your own, you should make the following changes.
Enable Managed Schema
As described in the section Managed Schema Definition in SolrConfig, changing the schemaFactory will allow the schema to be modified by the Schema API. Your solrconfig.xml should have a section like the one below (and the ClassicIndexSchemaFactory should be commented out or removed).
<schemaFactory class="ManagedIndexSchemaFactory">
<bool name="mutable">true</bool>
<str name="managedSchemaResourceName">managed-schema</str>
</schemaFactory>
Define an UpdateRequestProcessorChain
The UpdateRequestProcessorChain allows Solr to guess field types, and you can define the default field type
classes to use. To start, you should define it as follows (see the javadoc links below for update processor factory
Because the data_driven_schema_configs config set includes a copyField directive that causes all content to be indexed in a predefined "catch-all" _text_ field, to enable single-field search that includes all fields' content, the index will be larger than it would be without the copyField. When you nail down your schema, consider removing the _text_ field and the corresponding copyField directive if you don't need it.
documentation):
<updateRequestProcessorChain name="add-unknown-fields-to-the-schema">
<!-- UUIDUpdateProcessorFactory will generate an id if none is present in the
incoming document -->
<processor class="solr.UUIDUpdateProcessorFactory" />
<processor class="solr.LogUpdateProcessorFactory"/>
<processor class="solr.DistributedUpdateProcessorFactory"/>
<processor class="solr.RemoveBlankFieldUpdateProcessorFactory"/>
<processor class="solr.FieldNameMutatingUpdateProcessorFactory">
<str name="pattern">[^\w-\.]</str>
<str name="replacement">_</str>
</processor>
<processor class="solr.ParseBooleanFieldUpdateProcessorFactory"/>
<processor class="solr.ParseLongFieldUpdateProcessorFactory"/>
<processor class="solr.ParseDoubleFieldUpdateProcessorFactory"/>
<processor class="solr.ParseDateFieldUpdateProcessorFactory">
<arr name="format">
<str>yyyy-MM-dd'T'HH:mm:ss.SSSZ</str>
<str>yyyy-MM-dd'T'HH:mm:ss,SSSZ</str>
<str>yyyy-MM-dd'T'HH:mm:ss.SSS</str>
<str>yyyy-MM-dd'T'HH:mm:ss,SSS</str>
<str>yyyy-MM-dd'T'HH:mm:ssZ</str>
<str>yyyy-MM-dd'T'HH:mm:ss</str>
<str>yyyy-MM-dd'T'HH:mmZ</str>
<str>yyyy-MM-dd'T'HH:mm</str>
<str>yyyy-MM-dd HH:mm:ss.SSSZ</str>
<str>yyyy-MM-dd HH:mm:ss,SSSZ</str>
<str>yyyy-MM-dd HH:mm:ss.SSS</str>
<str>yyyy-MM-dd HH:mm:ss,SSS</str>
<str>yyyy-MM-dd HH:mm:ssZ</str>
<str>yyyy-MM-dd HH:mm:ss</str>
<str>yyyy-MM-dd HH:mmZ</str>
<str>yyyy-MM-dd HH:mm</str>
<str>yyyy-MM-dd</str>
</arr>
</processor>
<processor class="solr.AddSchemaFieldsUpdateProcessorFactory">
<str name="defaultFieldType">strings</str>
<lst name="typeMapping">
<str name="valueClass">java.lang.Boolean</str>
<str name="fieldType">booleans</str>
</lst>
<lst name="typeMapping">
<str name="valueClass">java.util.Date</str>
<str name="fieldType">tdates</str>
</lst>
<lst name="typeMapping">
<str name="valueClass">java.lang.Long</str>
<str name="valueClass">java.lang.Integer</str>
<str name="fieldType">tlongs</str>
</lst>
<lst name="typeMapping">
<str name="valueClass">java.lang.Number</str>
<str name="fieldType">tdoubles</str>
</lst>
</processor>
<processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
Javadocs for update processor factories mentioned above:
UUIDUpdateProcessorFactory
RemoveBlankFieldUpdateProcessorFactory
FieldNameMutatingUpdateProcessorFactory
ParseBooleanFieldUpdateProcessorFactory
ParseLongFieldUpdateProcessorFactory
ParseDoubleFieldUpdateProcessorFactory
ParseDateFieldUpdateProcessorFactory
AddSchemaFieldsUpdateProcessorFactory
Make the UpdateRequestProcessorChain the Default for the UpdateRequestHandler
Once the UpdateRequestProcessorChain has been defined, you must instruct your UpdateRequestHandlers to
use it when working with index updates (i.e., adding, removing, replacing documents). Here is an example using InitParams to set the defaults on all /update request handlers:
<initParams path="/update/**">
<lst name="defaults">
<str name="update.chain">add-unknown-fields-to-the-schema</str>
</lst>
</initParams>
Examples of Indexed Documents
Once the schemaless mode has been enabled (whether you configured it manually or are using data_driven_schema_configs), documents that include fields that are not defined in your schema should be added to the index, and the new fields added to the schema.
For example, adding a CSV document will cause its fields that are not in the schema to be added, with
fieldTypes based on values:
curl "http://localhost:8983/solr/gettingstarted/update?commit=true" -H
"Content-type:application/csv" -d '
id,Artist,Album,Released,Rating,FromDistributor,Sold
44C,Old Shews,Mead for Walking,1988-08-13,0.01,14,0'
Output indicating success:
<response>
<lst name="responseHeader"><int name="status">0</int><int
name="QTime">106</int></lst>
</response>
The fields now in the schema (output from curl http://localhost:8983/solr/gettingstarted/schema/fields):
After each of these changes has been made, Solr should be restarted (or you can reload the cores to load the new solrconfig.xml definitions).
{
"responseHeader":{
"status":0,
"QTime":1},
"fields":[{
"name":"Album",
"type":"strings"}, // Field value guessed as String -> strings fieldType
{
"name":"Artist",
"type":"strings"}, // Field value guessed as String -> strings fieldType
{
"name":"FromDistributor",
"type":"tlongs"}, // Field value guessed as Long -> tlongs fieldType
{
"name":"Rating",
"type":"tdoubles"}, // Field value guessed as Double -> tdoubles fieldType
{
"name":"Released",
"type":"tdates"}, // Field value guessed as Date -> tdates fieldType
{
"name":"Sold",
"type":"tlongs"}, // Field value guessed as Long -> tlongs fieldType
{
"name":"_text_",
...
},
{
"name":"_version_",
...
},
{
"name":"id",
...
}]}
Once a field has been added to the schema, its field type is fixed. As a consequence, adding documents with
field value(s) that conflict with the previously guessed field type will fail. For example, after adding the above
document, the "Sold" field has the fieldType tlongs, but the document below has a non-integral decimal value in this field:
curl "http://localhost:8983/solr/gettingstarted/update?commit=true" -H
"Content-type:application/csv" -d '
id,Description,Sold
19F,Cassettes by the pound,4.93'
This document will fail, as shown in this output:
You Can Still Be Explicit
Even if you want to use schemaless mode for most fields, you can still use the Schema API to pre-emptively create some fields, with explicit types, before you index documents that use them. Internally, the Schema REST API and the Schemaless Update Processors both use the same Managed Schema functionality.
<response>
<lst name="responseHeader">
<int name="status">400</int>
<int name="QTime">7</int>
</lst>
<lst name="error">
<str name="msg">ERROR: [doc=19F] Error adding field 'Sold'='4.93' msg=For input
string: "4.93"</str>
<int name="code">400</int>
</lst>
</response>
Understanding Analyzers, Tokenizers, and Filters
The following sections describe how Solr breaks down and works with textual data. There are three main
concepts to understand: analyzers, tokenizers, and filters.
Field analyzers are used both during ingestion, when a document is indexed, and at query time. An analyzer
examines the text of fields and generates a token stream. Analyzers may be a single class or they may be
composed of a series of tokenizer and filter classes.
Tokenizers break field data into lexical units, or tokens.
Filters examine a stream of tokens and keep them, transform or discard them, or create new ones. Tokenizers
and filters may be combined to form pipelines, or chains, where the output of one is input to the next. Such a sequence of tokenizers and filters is called an analyzer and the resulting output of an analyzer is used to match query results or build indices.
Using Analyzers, Tokenizers, and Filters
Although the analysis process is used for both indexing and querying, the same analysis process need not be
used for both operations. For indexing, you often want to simplify, or normalize, words. For example, setting all
letters to lowercase, eliminating punctuation and accents, mapping words to their stems, and so on. Doing so
can increase recall because, for example, "ram", "Ram" and "RAM" would all match a query for "ram". To
increase query-time precision, a filter could be employed to narrow the matches by, for example, ignoring all-cap
acronyms if you're interested in male sheep, but not Random Access Memory.
The tokens output by the analysis process define the values, or terms, of that field and are used either to build an index of those terms when a new document is added, or to identify which documents contain the terms you are querying for.
For More Information
These sections will show you how to configure field analyzers and also serve as a reference for the details of configuring each of the available tokenizer and filter classes. They also serve as a guide so that you can configure your own analysis classes if you have special needs that cannot be met with the included filters or tokenizers.
For Analyzers, see:
Analyzers: Detailed conceptual information about Solr analyzers.
Running Your Analyzer: Detailed information about testing and running your Solr analyzer.
For Tokenizers, see:
About Tokenizers: Detailed conceptual information about Solr tokenizers.
Tokenizers: Information about configuring tokenizers, and about the tokenizer factory classes included in
this distribution of Solr.
For Filters, see:
About Filters: Detailed conceptual information about Solr filters.
Filter Descriptions: Information about configuring filters, and about the filter factory classes included in this
distribution of Solr.
CharFilterFactories: Information about filters for pre-processing input characters.
To find out how to use Tokenizers and Filters with various languages, see:
Language Analysis: Information about tokenizers and filters for character set conversion or for use with
specific languages.
Analyzers
An analyzer examines the text of fields and generates a token stream. Analyzers are specified as a child of the <fieldType> element in the schema.xml configuration file (in the same conf/ directory as solrconfig.xml).
In normal usage, only fields of type solr.TextField will specify an analyzer. The simplest way to configure an analyzer is with a single <analyzer> element whose class attribute is a fully qualified Java class name. The named class must derive from org.apache.lucene.analysis.Analyzer. For example:
<fieldType name="nametext" class="solr.TextField">
<analyzer class="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
</fieldType>
In this case a single class, WhitespaceAnalyzer, is responsible for analyzing the content of the named text
field and emitting the corresponding tokens. For simple cases, such as plain English prose, a single analyzer
class like this may be sufficient. But it's often necessary to do more complex analysis of the field content.
Even the most complex analysis requirements can usually be decomposed into a series of discrete, relatively
simple processing steps. As you will soon discover, the Solr distribution comes with a large selection of
tokenizers and filters that covers most scenarios you are likely to encounter. Setting up an analyzer chain is very
straightforward; you specify a simple <analyzer> element (no class attribute) with child elements that name
factory classes for the tokenizer and filters to use, in the order you want them to run.
For example:
<fieldType name="nametext" class="solr.TextField">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory"/>
</analyzer>
</fieldType>
Note that classes in the org.apache.solr.analysis package may be referred to here with the shorthand
solr. prefix.
In this case, no Analyzer class was specified on the <analyzer> element. Rather, a sequence of more
specialized classes are wired together and collectively act as the Analyzer for the field. The text of the field is
passed to the first item in the list (solr.StandardTokenizerFactory), and the tokens that emerge from the
last one (solr.EnglishPorterFilterFactory) are the terms that are used for indexing or querying any
fields that use the "nametext" fieldType.
Analysis Phases
Analysis takes place in two contexts. At index time, when a field is being created, the token stream that results
from analysis is added to an index and defines the set of terms (including positions, sizes, and so on) for the
field. At query time, the values being searched for are analyzed and the terms that result are matched against
those that are stored in the field's index.
In many cases, the same analysis should be applied to both phases. This is desirable when you want to query
for exact string matches, possibly with case-insensitivity, for example. In other cases, you may want to apply
slightly different analysis steps during indexing than those used at query time.
If you provide a simple <analyzer> definition for a field type, as in the examples above, then it will be used for
both indexing and queries. If you want distinct analyzers for each phase, you may include two <analyzer>
definitions distinguished with a type attribute. For example:
<fieldType name="nametext" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeepWordFilterFactory" words="keepwords.txt"/>
<filter class="solr.SynonymFilterFactory" synonyms="syns.txt"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
In this theoretical example, at index time the text is tokenized, the tokens are set to lowercase, any that are not
listed in keepwords.txt are discarded, and those that remain are mapped to alternate values as defined by the
synonym rules in the file syns.txt. This essentially builds an index from a restricted set of possible values and
then normalizes them to values that may not even occur in the original text.
At query time, the only normalization that happens is to convert the query terms to lowercase. The filtering and
mapping steps that occur at index time are not applied to the query terms. Queries must then, in this example, be
very precise, using only the normalized terms that were stored at index time.
About Tokenizers
The job of a tokenizer is to break up a stream of text into tokens, where each token is (usually) a sub-sequence
of the characters in the text. An analyzer is aware of the field it is configured for, but a tokenizer is not.
Tokenizers read from a character stream (a Reader) and produce a sequence of Token objects (a
TokenStream).
Field Values versus Indexed Terms
The output of an Analyzer affects the terms indexed in a given field (and the terms used when parsing
queries against those fields) but it has no impact on the stored value for the fields. For example: an
analyzer might split "Brown Cow" into two indexed terms "brown" and "cow", but the stored value will still
be a single String: "Brown Cow"
Characters in the input stream may be discarded, such as whitespace or other delimiters. They may also be
added to or replaced, such as mapping aliases or abbreviations to normalized forms. A token contains various
metadata in addition to its text value, such as the location at which the token occurs in the field. Because a
tokenizer may produce tokens that diverge from the input text, you should not assume that the text of the token is
the same text that occurs in the field, or that its length is the same as the original text. It's also possible for more
than one token to have the same position or refer to the same offset in the original text. Keep this in mind if you
use token metadata for things like highlighting search results in the field text.
<fieldType name="text" class="solr.TextField">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>
</fieldType>
The class named in the tokenizer element is not the actual tokenizer, but rather a class that implements the
org.apache.solr.analysis.TokenizerFactory interface. This factory class will be called upon to create new
tokenizer instances as needed. Objects created by the factory must derive from
org.apache.lucene.analysis.TokenStream, which indicates that they produce sequences of tokens. If the tokenizer produces tokens that
are usable as is, it may be the only component of the analyzer. Otherwise, the tokenizer's output tokens will
serve as input to the first filter stage in the pipeline.
A TypeTokenFilterFactory is available that creates a TypeTokenFilter that filters tokens based on their
TypeAttribute, which is set in factory.getStopTypes.
For a complete list of the available TokenFilters, see the section Tokenizers.
When To use a CharFilter vs. a TokenFilter
There are several pairs of CharFilters and TokenFilters that have related (ie: MappingCharFilter and
ASCIIFoldingFilter) or nearly identical (ie: PatternReplaceCharFilterFactory and
PatternReplaceFilterFactory) functionality and it may not always be obvious which is the best choice.
The decision about which to use depends largely on which Tokenizer you are using, and whether you need to
preprocess the stream of characters.
For example, suppose you have a tokenizer such as StandardTokenizer and although you are pretty happy
with how it works overall, you want to customize how some specific characters behave. You could modify the
rules and re-build your own tokenizer with JFlex, but it might be easier to simply map some of the characters
before tokenization with a CharFilter.
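For instance, a minimal sketch (the field type name and the mapping file name are illustrative, not part of the stock Solr configuration) that rewrites characters with a CharFilter before the Standard Tokenizer sees them might look like this:
<fieldType name="text_mapped" class="solr.TextField">
  <analyzer>
    <!-- the char filter rewrites raw characters before tokenization -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-chars.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>
See CharFilterFactories for the character filters included in this distribution of Solr.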
About Filters
Like tokenizers, filters consume input and produce a stream of tokens. Filters also derive from
org.apache.lucene.analysis.TokenStream. Unlike tokenizers, a filter's input is another TokenStream. The job of a filter is
usually easier than that of a tokenizer since in most cases a filter looks at each token in the stream sequentially
and decides whether to pass it along, replace it or discard it.
A filter may also do more complex analysis by looking ahead to consider multiple tokens at once, although this is
less common. One hypothetical use for such a filter might be to normalize state names that would be tokenized
as two words. For example, the single token "california" would be replaced with "CA", while the token pair
"rhode" followed by "island" would become the single token "RI".
Because filters consume one and produce a new , they can be chained one afterTokenStream TokenStream
another indefinitely. Each filter in the chain in turn processes the tokens produced by its predecessor. The order
in which you specify the filters is therefore significant. Typically, the most general filtering is done first, and later
filtering stages are more specialized.
<fieldType name="text" class="solr.TextField">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory"/>
</analyzer>
</fieldType>
This example starts with Solr's standard tokenizer, which breaks the field's text into tokens. Those tokens then
pass through Solr's standard filter, which removes dots from acronyms, and performs a few other common
operations. All the tokens are then set to lowercase, which will facilitate case-insensitive matching at query time.
The last filter in the above example is a stemmer filter that uses the Porter stemming algorithm. A stemmer is
basically a set of mapping rules that maps the various forms of a word back to the base, or stem, word from
which they derive. For example, in English the words "hugs", "hugging" and "hugged" are all forms of the stem
word "hug". The stemmer will replace all of these terms with "hug", which is what will be indexed. This means
that a query for "hug" will match the term "hugged", but not "huge".
Conversely, applying a stemmer to your query terms will allow queries containing non stem terms, like "hugging",
to match documents with different variations of the same stem word, such as "hugged". This works because both
the indexer and the query will map to the same stem ("hug").
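As a minimal sketch (the field type name is illustrative), a single <analyzer> definition containing a stemming filter applies the same stemming at both index time and query time, so "hugging" in a query and "hugged" in a document both reduce to the term "hug":
<fieldType name="text_stemmed" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- the same stemmer runs during indexing and during query analysis -->
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>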
Word stemming is, obviously, very language specific. Solr includes several language-specific stemmers created
by the Snowball generator that are based on the Porter stemming algorithm. The generic Snowball Porter
Stemmer Filter can be used to configure any of these language stemmers. Solr also includes a convenience
wrapper for the English Snowball stemmer. There are also several purpose-built stemmers for non-English
languages. These stemmers are described in Language Analysis.
Tokenizers
You configure the tokenizer for a text field type in schema.xml with a <tokenizer> element, as a child of
<analyzer>:
<fieldType name="text" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
</analyzer>
</fieldType>
The class attribute names a factory class that will instantiate a tokenizer object when needed. Tokenizer factory
classes implement the org.apache.solr.analysis.TokenizerFactory interface. A TokenizerFactory's create()
method accepts a Reader and returns a TokenStream. When Solr creates the tokenizer it passes a Reader
object that provides the content of the text field.
Tokenizers discussed in this section:
Standard Tokenizer
Classic Tokenizer
Keyword Tokenizer
Letter Tokenizer
Lower Case Tokenizer
N-Gram Tokenizer
Edge N-Gram Tokenizer
ICU Tokenizer
Path Hierarchy Tokenizer
Regular Expression Pattern Tokenizer
UAX29 URL Email Tokenizer
White Space Tokenizer
Related Topics
Arguments may be passed to tokenizer factories by setting attributes on the element.<tokenizer>
<fieldType name="semicolonDelimited" class="solr.TextField">
<analyzer type="query">
<tokenizer class="solr.PatternTokenizerFactory" pattern="; "/>
</analyzer>
</fieldType>
The following sections describe the tokenizer factory classes included in this release of Solr.
For more information about Solr's tokenizers, see .http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
Standard Tokenizer
This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Delimiter
characters are discarded, with the following exceptions:
Periods (dots) that are not followed by whitespace are kept as part of the token, including Internet domain
names.
The "@" character is among the set of token-splitting punctuation, so email addresses are preservednot
as single tokens.
Note that words are split at hyphens.
The Standard Tokenizer supports Unicode standard annex UAX#29 word boundaries with the following token
types: <ALPHANUM>, <NUM>, <SOUTHEAST_ASIAN>, <IDEOGRAPHIC>, and <HIRAGANA>.
Factory class: solr.StandardTokenizerFactory
Arguments:
maxTokenLength: (integer, default 255) Solr ignores tokens that exceed the number of characters specified by
maxTokenLength.
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>
In: "Please, email john.doe@foo.com by 03-09, re: m37-xq."
Out: "Please", "email", "john.doe", "foo.com", "by", "03", "09", "re", "m37", "xq"
Classic Tokenizer
The Classic Tokenizer preserves the same behavior as the Standard Tokenizer of Solr versions 3.1 and
previous. It does not use the Unicode standard annex UAX#29 word boundary rules that the Standard Tokenizer
uses. This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Delimiter
characters are discarded, with the following exceptions:
Periods (dots) that are not followed by whitespace are kept as part of the token.
Words are split at hyphens, unless there is a number in the word, in which case the token is not split and
the numbers and hyphen(s) are preserved.
Recognizes Internet domain names and email addresses and preserves them as a single token.
Factory class: solr.ClassicTokenizerFactory
Arguments:
maxTokenLength: (integer, default 255) Solr ignores tokens that exceed the number of characters specified by
maxTokenLength.
Example:
<analyzer>
<tokenizer class="solr.ClassicTokenizerFactory"/>
</analyzer>
In: "Please, email john.doe@foo.com by 03-09, re: m37-xq."
Out: "Please", "email", "john.doe@foo.com", "by", "03-09", "re", "m37-xq"
Keyword Tokenizer
This tokenizer treats the entire text field as a single token.
Factory class: solr.KeywordTokenizerFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>
In: "Please, email john.doe@foo.com by 03-09, re: m37-xq."
Out: "Please, email john.doe@foo.com by 03-09, re: m37-xq."
Letter Tokenizer
This tokenizer creates tokens from strings of contiguous letters, discarding all non-letter characters.
Factory class: solr.LetterTokenizerFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.LetterTokenizerFactory"/>
</analyzer>
In: "I can't."
Out: "I", "can", "t"
Lower Case Tokenizer
Tokenizes the input stream by delimiting at non-letters and then converting all letters to lowercase. Whitespace
and non-letters are discarded.
Factory class: solr.LowerCaseTokenizerFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.LowerCaseTokenizerFactory"/>
</analyzer>
In: "I just my iPhone!"LOVE
Out: "i", "just", "love", "my", "iphone"
N-Gram Tokenizer
Reads the field text and generates n-gram tokens of sizes in the given range.
Factory class: solr.NGramTokenizerFactory
Arguments:
minGramSize: (integer, default 1) The minimum n-gram size, must be > 0.
maxGramSize: (integer, default 2) The maximum n-gram size, must be >= minGramSize.
Example:
Default behavior. Note that this tokenizer operates over the whole field. It does not break the field at whitespace.
As a result, the space character is included in the encoding.
<analyzer>
<tokenizer class="solr.NGramTokenizerFactory"/>
</analyzer>
In: "hey man"
Out: "h", "e", "y", " ", "m", "a", "n", "he", "ey", "y ", " m", "ma", "an"
Example:
With an n-gram size range of 4 to 5:
<analyzer>
<tokenizer class="solr.NGramTokenizerFactory" minGramSize="4" maxGramSize="5"/>
</analyzer>
In: "bicycle"
Out: "bicy", "bicyc", "icyc", "icycl", "cycl", "cycle", "ycle"
Edge N-Gram Tokenizer
Reads the field text and generates edge n-gram tokens of sizes in the given range.
Factory class: solr.EdgeNGramTokenizerFactory
Arguments:
minGramSize: (integer, default is 1) The minimum n-gram size, must be > 0.
maxGramSize: (integer, default is 1) The maximum n-gram size, must be >= minGramSize.
side: ("front" or "back", default is "front") Whether to compute the n-grams from the beginning (front) of the text
or from the end (back).
Example:
Default behavior (min and max default to 1):
<analyzer>
<tokenizer class="solr.EdgeNGramTokenizerFactory"/>
</analyzer>
In: "babaloo"
Out: "b"
Example:
Edge n-gram range of 2 to 5
<analyzer>
<tokenizer class="solr.EdgeNGramTokenizerFactory" minGramSize="2"
maxGramSize="5"/>
</analyzer>
In: "babaloo"
Out:"ba", "bab", "baba", "babal"
Example:
Edge n-gram range of 2 to 5, from the back side:
<analyzer>
<tokenizer class="solr.EdgeNGramTokenizerFactory" minGramSize="2" maxGramSize="5"
side="back"/>
</analyzer>
In: "babaloo"
Out: "oo", "loo", "aloo", "baloo"
ICU Tokenizer
This tokenizer processes multilingual text and tokenizes it appropriately based on its script attribute.
You can customize this tokenizer's behavior by specifying per-script rule files. To add per-script rules, add a
rulefiles argument, which should contain a comma-separated list of code:rulefile pairs in the following
format: four-letter ISO 15924 script code, followed by a colon, then a resource path. For example, to specify rules
for Latin (script code "Latn") and Cyrillic (script code "Cyrl"), you would enter
Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi.
The default solr.ICUTokenizerFactory provides UAX#29 word break rules tokenization (like
solr.StandardTokenizer), but also includes custom tailorings for Hebrew (specializing handling of double and single
quotation marks), and for syllable tokenization for Khmer, Lao, and Myanmar.
Factory class: solr.ICUTokenizerFactory
Arguments:
rulefiles: a comma-separated list of code:rulefile pairs in the following format: four-letter ISO 15924 script
code, followed by a colon, then a resource path.
Example:
<analyzer>
<!-- no customization -->
<tokenizer class="solr.ICUTokenizerFactory"/>
</analyzer>
<analyzer>
<tokenizer class="solr.ICUTokenizerFactory"
rulefiles="Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi"
/>
</analyzer>
Path Hierarchy Tokenizer
This tokenizer creates synonyms from file path hierarchies.
Factory class: solr.PathHierarchyTokenizerFactory
Arguments:
delimiter: (character, no default) You can specify the file path delimiter and replace it with a delimiter you
provide. This can be useful for working with backslash delimiters.
replace: (character, no default) Specifies the delimiter character Solr uses in the tokenized output.
Example:
<fieldType name="text_path" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="\"
replace="/"/>
</analyzer>
</fieldType>
In: "c:\usr\local\apache"
Out: "c:", "c:/usr", "c:/usr/local", "c:/usr/local/apache"
Regular Expression Pattern Tokenizer
This tokenizer uses a Java regular expression to break the input text stream into tokens. The expression
provided by the pattern argument can be interpreted either as a delimiter that separates tokens, or to match
patterns that should be extracted from the text as tokens.
See the Javadocs for java.util.regex.Pattern for more information on Java regular expression syntax.
Factory class: solr.PatternTokenizerFactory
Arguments:
pattern: (Required) The regular expression, as defined by java.util.regex.Pattern.
group: (Optional, default -1) Specifies which regex group to extract as the token(s). The value -1 means the
regex should be treated as a delimiter that separates tokens. Non-negative group numbers (>= 0) indicate that
character sequences matching that regex group should be converted to tokens. Group zero refers to the entire
regex, groups greater than zero refer to parenthesized sub-expressions of the regex, counted from left to right.
Example:
A comma separated list. Tokens are separated by a sequence of zero or more spaces, a comma, and zero or
more spaces.
<analyzer>
<tokenizer class="solr.PatternTokenizerFactory" pattern="\s*,\s*"/>
</analyzer>
In: "fee,fie, foe , fum, foo"
Out: "fee", "fie", "foe", "fum", "foo"
Example:
Extract simple, capitalized words. A sequence of at least one capital letter followed by zero or more letters of
either case is extracted as a token.
<analyzer>
<tokenizer class="solr.PatternTokenizerFactory" pattern="[A-Z][A-Za-z]*"
group="0"/>
</analyzer>
In: "Hello. My name is Inigo Montoya. You killed my father. Prepare to die."
Out: "Hello", "My", "Inigo", "Montoya", "You", "Prepare"
Example:
Extract part numbers which are preceded by "SKU", "Part" or "Part Number", case sensitive, with an optional
semi-colon separator. Part numbers must be all numeric digits, with an optional hyphen. Regex capture groups
are numbered by counting left parenthesis from left to right. Group 3 is the subexpression "[0-9-]+", which
matches one or more digits or hyphens.
<analyzer>
<tokenizer class="solr.PatternTokenizerFactory"
pattern="(SKU|Part(\sNumber)?):?\s(\[0-9-\]+)" group="3"/>
</analyzer>
In: "SKU: 1234, Part Number 5678, Part: 126-987"
Out: "1234", "5678", "126-987"
UAX29 URL Email Tokenizer
This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Delimiter
characters are discarded, with the following exceptions:
Periods (dots) that are not followed by whitespace are kept as part of the token.
Words are split at hyphens, unless there is a number in the word, in which case the token is not split and
the numbers and hyphen(s) are preserved.
Recognizes top-level Internet domain names (validated against the white list in the IANA Root Zone
Database when the tokenizer was generated); email addresses; file://, http(s)://, and ftp://
addresses; IPv4 and IPv6 addresses; and preserves them as a single token.
The UAX29 URL Email Tokenizer supports Unicode standard annex UAX#29 word boundaries with the following
token types: <ALPHANUM>, <NUM>, <URL>, <EMAIL>, <SOUTHEAST_ASIAN>, <IDEOGRAPHIC>, and
<HIRAGANA>.
Factory class: solr.UAX29URLEmailTokenizerFactory
Arguments:
maxTokenLength: (integer, default 255) Solr ignores tokens that exceed the number of characters specified by
maxTokenLength.
Example:
<analyzer>
<tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
</analyzer>
In: "Visit http://accarol.com/contact.htm?from=external&a=10 or e-mail bob.cratchet@accarol.com"
Out: "Visit", "http://accarol.com/contact.htm?from=external&a=10", "or", "e", "mail", "bob.cratchet@accarol.com"
White Space Tokenizer
Simple tokenizer that splits the text stream on whitespace and returns sequences of non-whitespace characters
as tokens. Note that any punctuation will be included in the tokenization.
Factory class: solr.WhitespaceTokenizerFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
In: "To be, or what?"
Out: "To", "be,", "or", "what?"
Related Topics
TokenizerFactories
Filter Descriptions
You configure each filter with a <filter> element in schema.xml as a child of <analyzer>, following the
<tokenizer> element. Filter definitions should follow a tokenizer or another filter definition because they take a
TokenStream as input. For example:
<fieldType name="text" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>...
</analyzer>
</fieldType>
The class attribute names a factory class that will instantiate a filter object as needed. Filter factory classes must
implement the org.apache.solr.analysis.TokenFilterFactory interface. Like tokenizers, filters are
also instances of TokenStream and thus are producers of tokens. Unlike tokenizers, filters also consume tokens
from a TokenStream. This allows you to mix and match filters, in any order you prefer, downstream of a
tokenizer.
Arguments may be passed to filter factories to modify their behavior by setting attributes on the <filter>
element. For example:
<fieldType name="semicolonDelimited" class="solr.TextField">
<analyzer type="query">
<tokenizer class="solr.PatternTokenizerFactory" pattern="; " />
<filter class="solr.LengthFilterFactory" min="2" max="7"/>
</analyzer>
</fieldType>
The following sections describe the filter factories that are included in this release of Solr.
For more information about Solr's filters, see .http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
Filters discussed in this section:
ASCII Folding Filter
Beider-Morse Filter
Classic Filter
Common Grams Filter
Collation Key Filter
Daitch-Mokotoff Soundex Filter
Double Metaphone Filter
Edge N-Gram Filter
English Minimal Stem Filter
Hunspell Stem Filter
Hyphenated Words Filter
ICU Folding Filter
ICU Normalizer 2 Filter
ICU Transform Filter
Keep Words Filter
KStem Filter
Length Filter
Lower Case Filter
Managed Stop Filter
Managed Synonym Filter
N-Gram Filter
Numeric Payload Token Filter
Pattern Replace Filter
Phonetic Filter
Porter Stem Filter
Remove Duplicates Token Filter
Reversed Wildcard Filter
Shingle Filter
Snowball Porter Stemmer Filter
Standard Filter
Stop Filter
Suggest Stop Filter
Synonym Filter
Token Offset Payload Filter
Trim Filter
Type As Payload Filter
Type Token Filter
Word Delimiter Filter
Related Topics
ASCII Folding Filter
This filter converts alphabetic, numeric, and symbolic Unicode characters which are not in the Basic Latin
Unicode block (the first 128 ASCII characters) to their ASCII equivalents, if one exists. This filter converts
characters from the following Unicode blocks:
C1 Controls and Latin-1 Supplement (PDF)
Latin Extended-A (PDF)
Latin Extended-B (PDF)
Latin Extended Additional (PDF)
Latin Extended-C (PDF)
Latin Extended-D (PDF)
IPA Extensions (PDF)
Phonetic Extensions (PDF)
Phonetic Extensions Supplement (PDF)
General Punctuation (PDF)
Superscripts and Subscripts (PDF)
Enclosed Alphanumerics (PDF)
Dingbats (PDF)
Supplemental Punctuation (PDF)
Alphabetic Presentation Forms (PDF)
Halfwidth and Fullwidth Forms (PDF)
Factory class: solr.ASCIIFoldingFilterFactory
Arguments: None
Example:
<analyzer>
<filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
In: "á" (Unicode character 00E1)
Out: "a" (ASCII character 97)
Beider-Morse Filter
Implements the Beider-Morse Phonetic Matching (BMPM) algorithm, which allows identification of similar names,
even if they are spelled differently or in different languages. More information about how this works is available in
the section on Phonetic Matching.
Factory class: solr.BeiderMorseFilterFactory
Arguments:
BeiderMorseFilter changed its behavior in Solr 5.0 (version 3.04 of the BMPM algorithm is implemented,
while previous Solr versions implemented BMPM version 3.00 - see http://stevemorse.org/phoneticinfo.htm),
so any index built using this filter with earlier versions of Solr will need to be rebuilt.
nameType: Types of names. Valid values are GENERIC, ASHKENAZI, or SEPHARDIC. If not processing
Ashkenazi or Sephardic names, use GENERIC.
ruleType: Types of rules to apply. Valid values are APPROX or EXACT.
concat: Defines if multiple possible matches should be combined with a pipe ("|").
languageSet: The language set to use. The value "auto" will allow the Filter to identify the language, or a
comma-separated list can be supplied.
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.BeiderMorseFilterFactory" nameType="GENERIC" ruleType="APPROX"
concat="true" languageSet="auto">
</filter>
</analyzer>
Classic Filter
This filter takes the output of the Classic Tokenizer and strips periods from acronyms and "'s" from possessives.
Factory class: solr.ClassicFilterFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.ClassicTokenizerFactory"/>
<filter class="solr.ClassicFilterFactory"/>
</analyzer>
In: "I.B.M. cat's can't"
Tokenizer to Filter: "I.B.M", "cat's", "can't"
Out: "IBM", "cat", "can't"
Common Grams Filter
This filter creates word shingles by combining common tokens such as stop words with regular tokens. This is
useful for creating phrase queries containing common words, such as "the cat." Solr normally ignores stop words
in queried phrases, so searching for "the cat" would return all matches for the word "cat."
Factory class: solr.CommonGramsFilterFactory
Arguments:
words: (a common word file in .txt format) Provide the name of a common word file, such as stopwords.txt.
format: (optional) If the stopwords list has been formatted for Snowball, you can specify format="snowball"
so Solr can read the stopwords file.
ignoreCase: (boolean) If true, the filter ignores the case of words when comparing them to the common word
file. The default is false.
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.CommonGramsFilterFactory" words="stopwords.txt"
ignoreCase="true"/>
</analyzer>
In: "the Cat"
Tokenizer to Filter: "the", "Cat"
Out: "the_cat"
Collation Key Filter
Collation allows sorting of text in a language-sensitive way. It is usually used for sorting, but can also be used
with advanced searches. We've covered this in much more detail in the section on Unicode Collation.
Daitch-Mokotoff Soundex Filter
Implements the Daitch-Mokotoff Soundex algorithm, which allows identification of similar names, even if they are
spelled differently. More information about how this works is available in the section on Phonetic Matching.
Factory class: solr.DaitchMokotoffSoundexFilterFactory
Arguments:
inject : (true/false) If true (the default), then new phonetic tokens are added to the stream. Otherwise, tokens
are replaced with the phonetic equivalent. Setting this to false will enable phonetic matching, but the exact
spelling of the target word may not match.
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.DaitchMokotoffSoundexFilterFactory" inject="true"/>
</analyzer>
Double Metaphone Filter
This filter creates tokens using the DoubleMetaphone encoding algorithm from commons-codec. For more
information, see the Phonetic Matching section.
Factory class: solr.DoubleMetaphoneFilterFactory
Arguments:
inject: (true/false) If true (the default), then new phonetic tokens are added to the stream. Otherwise, tokens
are replaced with the phonetic equivalent. Setting this to false will enable phonetic matching, but the exact
spelling of the target word may not match.
maxCodeLength: (integer) The maximum length of the code to be generated.
Example:
Default behavior for inject (true): keep the original token and add phonetic token(s) at the same position.
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.DoubleMetaphoneFilterFactory"/>
</analyzer>
In: "four score and Kuczewski"
Tokenizer to Filter: "four"(1), "score"(2), "and"(3), "Kuczewski"(4)
Out: "four"(1), "FR"(1), "score"(2), "SKR"(2), "and"(3), "ANT"(3), "Kuczewski"(4), "KSSK"(4), "KXFS"(4)
The phonetic tokens have a position increment of 0, which indicates that they are at the same position as the
token they were derived from (immediately preceding). Note that "Kuczewski" has two encodings, which are
added at the same position.
Example:
Discard original token (inject="false").
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.DoubleMetaphoneFilterFactory" inject="false"/>
</analyzer>
In: "four score and Kuczewski"
Tokenizer to Filter: "four"(1), "score"(2), "and"(3), "Kuczewski"(4)
Out: "FR"(1), "SKR"(2), "ANT"(3), "KSSK"(4), "KXFS"(4)
Note that "Kuczewski" has two encodings, which are added at the same position.
Edge N-Gram Filter
This filter generates edge n-gram tokens of sizes within the given range.
Factory class: solr.EdgeNGramFilterFactory
Arguments:
minGramSize: (integer, default 1) The minimum gram size.
maxGramSize: (integer, default 1) The maximum gram size.
Example:
Default behavior.
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.EdgeNGramFilterFactory"/>
</analyzer>
In: "four score and twenty"
Tokenizer to Filter: "four", "score", "and", "twenty"
Out: "f", "s", "a", "t"
Example:
A range of 1 to 4.
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="4"/>
</analyzer>
In: "four score"
Tokenizer to Filter: "four", "score"
Out: "f", "fo", "fou", "four", "s", "sc", "sco", "scor"
Example:
A range of 4 to 6.
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="4" maxGramSize="6"/>
</analyzer>
In: "four score and twenty"
Tokenizer to Filter: "four", "score", "and", "twenty"
Out: "four", "scor", "score", "twen", "twent", "twenty"
English Minimal Stem Filter
This filter stems plural English words to their singular form.
Factory class: solr.EnglishMinimalStemFilterFactory
Arguments: None
Example:
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory "/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>
</analyzer>
In: "dogs cats"
Tokenizer to Filter: "dogs", "cats"
Out: "dog", "cat"
Hunspell Stem Filter
The Hunspell Stem Filter provides support for several languages. You must provide the dictionary (.dic) and
rules (.aff) files for each language you wish to use with the Hunspell Stem Filter. You can download those
language files here. Be aware that your results will vary widely based on the quality of the provided dictionary
and rules files. For example, some languages have only a minimal word list with no morphological information.
On the other hand, for languages that have no stemmer but do have an extensive dictionary file, the Hunspell
stemmer may be a good choice.
Factory class: solr.HunspellStemFilterFactory
Arguments:
dictionary: (required) The path of a dictionary file.
affix: (required) The path of a rules file.
ignoreCase: (boolean) controls whether matching is case sensitive or not. The default is false.
strictAffixParsing: (boolean) controls whether the affix parsing is strict or not. If true, an error while
reading an affix rule causes a ParseException, otherwise it is ignored. The default is true.
Example:
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.HunspellStemFilterFactory"
dictionary="en_GB.dic"
affix="en_GB.aff"
ignoreCase="true"
strictAffixParsing="true" />
</analyzer>
In: "jump jumping jumped"
Tokenizer to Filter: "jump", "jumping", "jumped"
Out: "jump", "jump", "jump"
Hyphenated Words Filter
This filter reconstructs hyphenated words that have been tokenized as two tokens because of a line break or
other intervening whitespace in the field text. If a token ends with a hyphen, it is joined with the following token
and the hyphen is discarded. Note that for this filter to work properly, the upstream tokenizer must not remove
trailing hyphen characters. This filter is generally only useful at index time.
Factory class: solr.HyphenatedWordsFilterFactory
Arguments: None
Example:
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.HyphenatedWordsFilterFactory"/>
</analyzer>
In: "A hyphen- ated word"
Tokenizer to Filter: "A", "hyphen-", "ated", "word"
Out: "A", "hyphenated", "word"
ICU Folding Filter
This filter is a custom Unicode normalization form that applies the foldings specified in Unicode Technical Report
30 in addition to the NFKC_Casefold normalization form as described in ICU Normalizer 2 Filter. This filter is a
better substitute for the combined behavior of the ASCII Folding Filter, Lower Case Filter, and ICU Normalizer 2
Filter.
To use this filter, see solr/contrib/analysis-extras/README.txt for instructions on which jars you
need to add to your solr_home/lib.
Factory class: solr.ICUFoldingFilterFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ICUFoldingFilterFactory"/>
</analyzer>
For detailed information on this normalization form, see http://www.unicode.org/reports/tr30/tr30-4.html.
ICU Normalizer 2 Filter
This filter factory normalizes text according to one of five Unicode Normalization Forms as described in Unicode
:Standard Annex #15
NFC: (name="nfc" mode="compose") Normalization Form C, canonical decomposition
NFD: (name="nfc" mode="decompose") Normalization Form D, canonical decomposition, followed by
canonical composition
NFKC: (name="nfkc" mode="compose") Normalization Form KC, compatibility decomposition
NFKD: (name="nfkc" mode="decompose") Normalization Form KD, compatibility decomposition, followed
by canonical composition
NFKC_Casefold: (name="nfkc_cf" mode="compose") Normalization Form KC, with additional Unicode
case folding. Using the ICU Normalizer 2 Filter is a better-performing substitution for the Lower Case Filter
and NFKC normalization.
Factory class: solr.ICUNormalizer2FilterFactory
Arguments:
name: (string) The name of the normalization form; nfc, nfd, nfkc, nfkd, or nfkc_cf
mode: (string) The mode of Unicode character composition and decomposition; compose or decompose
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ICUNormalizer2FilterFactory" name="nkc_cf" mode="compose"/>
</analyzer>
For detailed information about these Unicode Normalization Forms, see http://unicode.org/reports/tr15/.
To use this filter, see solr/contrib/analysis-extras/README.txt for instructions on which jars you
need to add to your solr_home/lib.
ICU Transform Filter
This filter applies ICU Transforms to text. This filter supports only ICU System Transforms. Custom rule sets are
not supported.
Factory class: solr.ICUTransformFilterFactory
Arguments:
id: (string) The identifier for the ICU System Transform you wish to apply with this filter. For a full list of ICU
System Transforms, see http://demo.icu-project.org/icu-bin/translit?TEMPLATE_FILE=data/translit_rule_main.html.
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
</analyzer>
For detailed information about ICU Transforms, see http://userguide.icu-project.org/transforms/general.
To use this filter, see solr/contrib/analysis-extras/README.txt for instructions on which jars you
need to add to your solr_home/lib.
Keep Words Filter
This filter discards all tokens except those that are listed in the given word list. This is the inverse of the Stop
Words Filter. This filter can be useful for building specialized indices for a constrained set of terms.
Factory class: solr.KeepWordFilterFactory
Arguments:
words: (required) Path of a text file containing the list of keep words, one per line. Blank lines and lines that
begin with "#" are ignored. This may be an absolute path, or a simple filename in the Solr config directory.
ignoreCase: (true/false) If true, then comparisons are done case-insensitively. If this argument is true, then the
words file is assumed to contain only lowercase words. The default is false.
enablePositionIncrements: if luceneMatchVersion is 4.3 or earlier and
enablePositionIncrements="false", no position holes will be left by this filter when it removes tokens. This argument is invalid if
luceneMatchVersion is 5.0 or later.
Example:
Where contains:keepwords.txt
happy
funny
silly
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.KeepWordFilterFactory" words="keepwords.txt"/>
</analyzer>
In: "Happy, sad or funny"
Tokenizer to Filter: "Happy", "sad", "or", "funny"
Out: "funny"
Example:
Same , case insensitive:keepwords.txt
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.KeepWordFilterFactory" words="keepwords.txt"
ignoreCase="true"/>
</analyzer>
In: "Happy, sad or funny"
Tokenizer to Filter: "Happy", "sad", "or", "funny"
Out: "Happy", "funny"
Example:
Using LowerCaseFilterFactory before filtering for keep words, no flag.ignoreCase
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeepWordFilterFactory" words="keepwords.txt"/>
</analyzer>
In: "Happy, sad or funny"
Tokenizer to Filter: "Happy", "sad", "or", "funny"
Filter to Filter: "happy", "sad", "or", "funny"
Out: "happy", "funny"
KStem Filter
KStem is an alternative to the Porter Stem Filter for developers looking for a less aggressive stemmer. KStem
was written by Bob Krovetz, ported to Lucene by Sergio Guzman-Lara (UMASS Amherst). This stemmer is only
appropriate for English language text.
Factory class: solr.KStemFilterFactory
Arguments: None
Example:
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory "/>
<filter class="solr.KStemFilterFactory"/>
</analyzer>
In: "jump jumping jumped"
Tokenizer to Filter: "jump", "jumping", "jumped"
Out: "jump", "jump", "jump"
Length Filter
This filter passes tokens whose length falls within the min/max limit specified. All other tokens are discarded.
Factory class: solr.LengthFilterFactory
Arguments:
min: (integer, required) Minimum token length. Tokens shorter than this are discarded.
max: (integer, required, must be >= min) Maximum token length. Tokens longer than this are discarded.
enablePositionIncrements: if luceneMatchVersion is 4.3 or earlier and
enablePositionIncrements="false", no position holes will be left by this filter when it removes tokens. This argument is invalid if
luceneMatchVersion is 5.0 or later.
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LengthFilterFactory" min="3" max="7"/>
</analyzer>
In: "turn right at Albuquerque"
Tokenizer to Filter: "turn", "right", "at", "Albuquerque"
Out: "turn", "right"
Lower Case Filter
Converts any uppercase letters in a token to the equivalent lowercase token. All other characters are left
unchanged.
Factory class: solr.LowerCaseFilterFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
In: "Down With CamelCase"
Tokenizer to Filter: "Down", "With", "CamelCase"
Out: "down", "with", "camelcase"
Managed Stop Filter
This is a specialized version of the Stop Words Filter Factory that uses a set of stop words that are managed from
a REST API.
Arguments:
managed: The name that should be used for this set of stop words in the managed REST API.
Example:
With this configuration the set of words is named "english" and can be managed via /solr/collection_name
/schema/analysis/stopwords/english
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ManagedStopFilterFactory" managed="english"/>
</analyzer>
See Stop Filter for example input/output.
Managed Synonym Filter
This is a specialized version of the Synonym Filter Factory that uses a mapping on synonyms that is managed
from a REST API.
Arguments:
managed: The name that should be used for this mapping on synonyms in the managed REST API.
Example:
With this configuration the set of mappings is named "english" and can be managed via /solr/collection_n
ame/schema/analysis/synonyms/english
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ManagedSynonymFilterFactory" managed="english"/>
</analyzer>
See Synonym Filter for example input/output.
N-Gram Filter
Generates n-gram tokens of sizes in the given range. Note that tokens are ordered by position and then by gram
size.
Factory class: solr.NGramFilterFactory
Arguments:
minGramSize: (integer, default 1) The minimum gram size.
maxGramSize: (integer, default 2) The maximum gram size.
Example:
Default behavior.
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.NGramFilterFactory"/>
</analyzer>
In: "four score"
Tokenizer to Filter: "four", "score"
Out: "f", "o", "u", "r", "fo", "ou", "ur", "s", "c", "o", "r", "e", "sc", "co", "or", "re"
Example:
A range of 1 to 4.
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="4"/>
</analyzer>
In: "four score"
Tokenizer to Filter: "four", "score"
Out: "f", "fo", "fou", "four", "s", "sc", "sco", "scor"
Example:
A range of 3 to 5.
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="5"/>
</analyzer>
In: "four score"
Tokenizer to Filter: "four", "score"
Out: "fou", "four", "our", "sco", "scor", "score", "cor", "core", "ore"
Numeric Payload Token Filter
This filter adds a numeric floating point payload value to tokens that match a given type. Refer to the Javadoc for
the org.apache.lucene.analysis.Token class for more information about token types and payloads.
Factory class: solr.NumericPayloadTokenFilterFactory
Arguments:
payload: (required) A floating point value that will be added to all matching tokens.
typeMatch: (required) A token type name string. Tokens with a matching type name will have their payload set
to the above floating point value.
Example:
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.NumericPayloadTokenFilterFactory" payload="0.75"
typeMatch="word"/>
</analyzer>
In: "bing bang boom"
Tokenizer to Filter: "bing", "bang", "boom"
Out: "bing"[0.75], "bang"[0.75], "boom"[0.75]
Pattern Replace Filter
This filter applies a regular expression to each token and, for those that match, substitutes the given replacement
string in place of the matched pattern. Tokens which do not match are passed through unchanged.
Factory class: solr.PatternReplaceFilterFactory
Arguments:
pattern: (required) The regular expression to test against each token, as per java.util.regex.Pattern.
replacement: (required) A string to substitute in place of the matched pattern. This string may contain
references to capture groups in the regex pattern. See the Javadoc for java.util.regex.Matcher.
replace: ("all" or "first", default "all") Indicates whether all occurrences of the pattern in the token should be
replaced, or only the first.
Example:
Simple string replace:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="cat" replacement="dog"/>
</analyzer>
In: "cat concatenate catycat"
Tokenizer to Filter: "cat", "concatenate", "catycat"
Out: "dog", "condogenate", "dogydog"
Example:
String replacement, first occurrence only:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="cat" replacement="dog"
replace="first"/>
</analyzer>
In: "cat concatenate catycat"
Tokenizer to Filter: "cat", "concatenate", "catycat"
Out: "dog", "condogenate", "dogycat"
Example:
More complex pattern with capture group reference in the replacement. Tokens that start with non-numeric
characters and end with digits will have an underscore inserted before the numbers. Otherwise the token is
passed through.
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="(\D+)(\d+)$"
replacement="$1_$2"/>
</analyzer>
In: "cat foo1234 9987 blah1234foo"
Tokenizer to Filter: "cat", "foo1234", "9987", "blah1234foo"
Out: "cat", "foo_1234", "9987", "blah1234foo"
Phonetic Filter
This filter creates tokens using one of the phonetic encoding algorithms in the
org.apache.commons.codec.language package. For more information, see the section on Phonetic Matching.
Factory class: solr.PhoneticFilterFactory
Arguments:
encoder: (required) The name of the encoder to use. The encoder name must be one of the following (case
insensitive): "DoubleMetaphone", "Metaphone", "Soundex", "RefinedSoundex", "Caverphone" (v2.0),
"ColognePhonetic", or "Nysiis".
inject: (true/false) If true (the default), then new phonetic tokens are added to the stream. Otherwise, tokens
are replaced with the phonetic equivalent. Setting this to false will enable phonetic matching, but the exact
spelling of the target word may not match.
maxCodeLength: (integer) The maximum length of the code to be generated by the Metaphone or Double
Metaphone encoders.
Example:
Default behavior for DoubleMetaphone encoding.
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone"/>
</analyzer>
In: "four score and twenty"
Tokenizer to Filter: "four"(1), "score"(2), "and"(3), "twenty"(4)
Out: "four"(1), "FR"(1), "score"(2), "SKR"(2), "and"(3), "ANT"(3), "twenty"(4), "TNT"(4)
The phonetic tokens have a position increment of 0, which indicates that they are at the same position as the
token they were derived from (immediately preceding).
Example:
Discard original token.
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone"
inject="false"/>
</analyzer>
In: "four score and twenty"
Tokenizer to Filter: "four"(1), "score"(2), "and"(3), "twenty"(4)
Out: "FR"(1), "SKR"(2), "ANT"(3), "TWNT"(4)
Example:
Default Soundex encoder.
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.PhoneticFilterFactory" encoder="Soundex"/>
</analyzer>
In: "four score and twenty"
Tokenizer to Filter: "four"(1), "score"(2), "and"(3), "twenty"(4)
Out: "four"(1), "F600"(1), "score"(2), "S600"(2), "and"(3), "A530"(3), "twenty"(4), "T530"(4)
Porter Stem Filter
This filter applies the Porter Stemming Algorithm for English. The results are similar to using the Snowball Porter
Stemmer with the language="English" argument. But this stemmer is coded directly in Java and is not based
on Snowball. It does not accept a list of protected words and is only appropriate for English language text.
However, it has been benchmarked as four times faster than the English Snowball stemmer, so can provide a
performance enhancement.
Factory class: solr.PorterStemFilterFactory
Arguments: None
Example:
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory "/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
In: "jump jumping jumped"
Tokenizer to Filter: "jump", "jumping", "jumped"
Out: "jump", "jump", "jump"
Remove Duplicates Token Filter
The filter removes duplicate tokens in the stream. Tokens are considered to be duplicates if they have the same
text and position values.
Factory class: solr.RemoveDuplicatesTokenFilterFactory
Arguments: None
Example:
One example of where RemoveDuplicatesTokenFilterFactory would be used is in situations where a synonym file is
being used in conjunction with a stemmer that causes some synonyms to be reduced to the same stem. Consider the
following entry from a synonyms.txt file:
Television, Televisions, TV, TVs
When used in the following configuration:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
In: "Watch TV"
Tokenizer to Synonym Filter: "Watch"(1) "TV"(2)
Synonym Filter to Stem Filter: "Watch"(1) "Television"(2) "Televisions"(2) "TV"(2) "TVs"(2)
Stem Filter to Remove Dups Filter: "Watch"(1) "Television"(2) "Television"(2) "TV"(2) "TV"(2)
Out: "Watch"(1) "Television"(2) "TV"(2)
Reversed Wildcard Filter
This filter reverses tokens to provide faster leading wildcard and prefix queries. Tokens without wildcards are not
reversed.
Factory class: solr.ReversedWildcardFilterFactory
Arguments:
withOriginal (boolean) If true, the filter produces both original and reversed tokens at the same positions. If
false, produces only reversed tokens.
maxPosAsterisk (integer, default = 2) The maximum position of the asterisk wildcard ('*') that triggers the
reversal of the query term. Terms with asterisks at positions above this value are not reversed.
maxPosQuestion (integer, default = 1) The maximum position of the question mark wildcard ('?') that triggers
the reversal of query term. To reverse only pure suffix queries (queries with a single leading asterisk), set this to
0 and maxPosAsterisk to 1.
maxFractionAsterisk (float, default = 0.0) An additional parameter that triggers the reversal if asterisk ('*')
position is less than this fraction of the query token length.
minTrailing (integer, default = 2) The minimum number of trailing characters in a query token after the last
wildcard character. For good performance this should be set to a value larger than 1.
Example:
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
maxPosAsterisk="2" maxPosQuestion="1" minTrailing="2" maxFractionAsterisk="0"/>
</analyzer>
In: "*foo *bar"
Tokenizer to Filter: "*foo", "*bar"
Out: "oof*", "rab*"
Shingle Filter
This filter constructs shingles, which are token n-grams, from the token stream. It combines runs of tokens into a
single token.
Factory class: solr.ShingleFilterFactory
Arguments:
minShingleSize: (integer, default 2) The minimum number of tokens per shingle.
maxShingleSize: (integer, must be >= 2, default 2) The maximum number of tokens per shingle.
outputUnigrams: (true/false) If true (the default), then each individual token is also included at its original
position.
outputUnigramsIfNoShingles: (true/false) If true, then individual tokens will be output if no shingles are
possible. The default is false.
tokenSeparator: (string, default is " ") The default string to use when joining adjacent tokens to form a shingle.
Example:
Default behavior.
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ShingleFilterFactory"/>
</analyzer>
In: "To be, or what?"
Tokenizer to Filter: "To"(1), "be"(2), "or"(3), "what"(4)
Out: "To"(1), "To be"(1), "be"(2), "be or"(2), "or"(3), "or what"(3), "what"(4)
Example:
A shingle size of four, do not include original token.
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ShingleFilterFactory" maxShingleSize="4"
outputUnigrams="false"/>
</analyzer>
In: "To be, or not to be."
Tokenizer to Filter: "To"(1), "be"(2), "or"(3), "not"(4), "to"(5), "be"(6)
Out: "To be"(1), "To be or"(1), "To be or not"(1), "be or"(2), "be or not"(2), "be or not to"(2), "or not"(3), "or not
to"(3), "or not to be"(3), "not to"(4), "not to be"(4), "to be"(5)
Snowball Porter Stemmer Filter
This filter factory instantiates a language-specific stemmer generated by Snowball. Snowball is a software
package that generates pattern-based word stemmers. This type of stemmer is not as accurate as a table-based
stemmer, but is faster and less complex. Table-driven stemmers are labor intensive to create and maintain and
so are typically commercial products.
Solr contains Snowball stemmers for Armenian, Basque, Catalan, Danish, Dutch, English, Finnish, French,
German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish and Turkish. For
more information on Snowball, visit .http://snowball.tartarus.org/
StopFilterFactory, CommonGramsFilterFactory, and CommonGramsQueryFilterFactory can
optionally read stopwords in Snowball format (specify format="snowball" in the configuration of those
FilterFactories).
Factory class: solr.SnowballPorterFilterFactory
Arguments:
language: (default "English") The name of a language, used to select the appropriate Porter stemmer to use.
Case is significant. This string is used to select a package name in the "org.tartarus.snowball.ext" class
hierarchy.
protected: Path of a text file containing a list of protected words, one per line. Protected words will not be
stemmed. Blank lines and lines that begin with "#" are ignored. This may be an absolute path, or a simple file
name in the Solr config directory.
Example:
Default behavior:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SnowballPorterFilterFactory"/>
</analyzer>
In: "flip flipped flipping"
Tokenizer to Filter: "flip", "flipped", "flipping"
Out: "flip", "flip", "flip"
Example:
French stemmer, English words:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="French"/>
</analyzer>
In: "flip flipped flipping"
Tokenizer to Filter: "flip", "flipped", "flipping"
Out: "flip", "flipped", "flipping"
Example:
Spanish stemmer, Spanish words:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>
</analyzer>
In: "cante canta"
Tokenizer to Filter: "cante", "canta"
Out: "cant", "cant"
Standard Filter
This filter removes dots from acronyms and the substring "'s" from the end of tokens. This filter depends on the
tokens being tagged with the appropriate term-type to recognize acronyms and words with apostrophes.
Factory class: solr.StandardFilterFactory
Arguments: None

This filter is no longer operational in Solr when the luceneMatchVersion (in solrconfig.xml) is
higher than "3.1".
Stop Filter
This filter discards, or stops analysis of, tokens that are on the given stop words list. A standard stop words list is
included in the Solr config directory, named stopwords.txt, which is appropriate for typical English language
text.
Factory class: solr.StopFilterFactory
Arguments:
words: (optional) The path to a file that contains a list of stop words, one per line. Blank lines and lines that
begin with "#" are ignored. This may be an absolute path, or path relative to the Solr config directory.
format: (optional) If the stopwords list has been formatted for Snowball, you can specify format="snowball"
so Solr can read the stopwords file.
ignoreCase: (true/false, default false) Ignore case when testing for stop words. If true, the stop list should
contain lowercase words.
enablePositionIncrements: if luceneMatchVersion is 4.4 or earlier and enablePositionIncrements="false",
no position holes will be left by this filter when it removes tokens. This argument is invalid if
luceneMatchVersion is 5.0 or later.
Example:
Case-sensitive matching, capitalized words not stopped. Token positions skip stopped words.
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt"/>
</analyzer>
In: "To be or what?"
Tokenizer to Filter: "To"(1), "be"(2), "or"(3), "what"(4)
Out: "To"(1), "what"(4)
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
</analyzer>
In: "To be or what?"
Tokenizer to Filter: "To"(1), "be"(2), "or"(3), "what"(4)
Out: "what"(4)
Suggest Stop Filter
Like Stop Filter, this filter discards, or stops analysis of, tokens that are on the given stop words list. Suggest
Stop Filter differs from Stop Filter in that it will not remove the last token unless it is followed by a token
separator. For example, a query "find the" would preserve the 'the' since it was not followed by a space,
punctuation etc., and mark it as a KEYWORD so that following filters will not change or remove it. By contrast, a
query like "find the popsicle" would remove "the" as a stopword, since it's followed by a space. When
using one of the analyzing suggesters, you would normally use the ordinary StopFilterFactory in your index
analyzer and then SuggestStopFilter in your query analyzer.
Factory class: solr.SuggestStopFilterFactory
Arguments:
words: (optional; default: StopAnalyzer#ENGLISH_STOP_WORDS_SET) The name of a stopwords file to
parse.
format: (optional; default: wordset) Defines how the words file will be parsed. If words is not specified, then
format must not be specified. The valid values for the format option are:
wordset: This is the default format, which supports one word per line (including any intra-word
whitespace) and allows whole line comments beginning with the "#" character. Blank lines are ignored.
snowball: This format allows for multiple words specified on each line, and trailing comments may be
specified using the vertical line ("|"). Blank lines are ignored.
ignoreCase: (optional; default: false) If true, matching is case-insensitive.
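For illustration only, here is the same hypothetical list of stop words written in each format (these snippets are not files shipped with Solr). A wordset file has one word per line with "#" whole-line comments:
# common English stop words
a
an
the
The equivalent snowball-format file allows several words per line with "|" trailing comments:
a an the | common English stop words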
Example:
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SuggestStopFilterFactory" ignoreCase="true"
words="stopwords.txt" format="wordset"/>
</analyzer>
In: "The The"
Tokenizer to Filter: "the"(1), "the"(2)
Out: "the"(2)
Synonym Filter
This filter does synonym mapping. Each token is looked up in the list of synonyms and if a match is found, then
the synonym is emitted in place of the token. The position value of the new tokens is set so that they all occur at
the same position as the original token.
Factory class: solr.SynonymFilterFactory
Arguments:
synonyms: (required) The path of a file that contains a list of synonyms, one per line. In the (default) solr
format - see the format argument below for alternatives - blank lines and lines that begin with "#" are ignored. This
may be an absolute path, or path relative to the Solr config directory. There are two ways to specify synonym
mappings:
A comma-separated list of words. If the token matches any of the words, then all the words in the list are
substituted, which will include the original token.
Two comma-separated lists of words with the symbol "=>" between them. If the token matches any word
on the left, then the list on the right is substituted. The original token will not be included unless it is also in
the list on the right.
ignoreCase: (optional; default: false) If true, synonyms will be matched case-insensitively.
expand: (optional; default: true) If true, a synonym will be expanded to all equivalent synonyms. If false, all
equivalent synonyms will be reduced to the first in the list.
format: (optional; default: solr) Controls how the synonyms will be parsed. The short names solr (for
SolrSynonymParser) and wordnet (for WordnetSynonymParser) are supported, or you may alternatively supply
the name of your own SynonymMap.Builder subclass.
tokenizerFactory: (optional; default: WhitespaceTokenizerFactory) The name of the tokenizer factory
to use when parsing the synonyms file. Arguments with the name prefix "tokenizerFactory." will be
supplied as init params to the specified tokenizer factory. Any arguments not consumed by the synonym filter
factory, including those without the "tokenizerFactory." prefix, will also be supplied as init params to the
tokenizer factory. If tokenizerFactory is specified, then analyzer may not be, and vice versa.
analyzer: (optional; default: WhitespaceTokenizerFactory) The name of the analyzer class to use when
parsing the synonyms file. If analyzer is specified, then tokenizerFactory may not be, and vice versa.
For the following examples, assume a synonyms file named mysynonyms.txt:
couch,sofa,divan
teh => the
huge,ginormous,humungous => large
small => tiny,teeny,weeny
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="mysynonyms.txt"/>
</analyzer>
In: "teh small couch"
Tokenizer to Filter: "teh"(1), "small"(2), "couch"(3)
Out: "the"(1), "tiny"(2), "teeny"(2), "weeny"(2), "couch"(3), "sofa"(3), "divan"(3)
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory "/>
<filter class="solr.SynonymFilterFactory" synonyms="mysynonyms.txt"/>
</analyzer>
In: "teh ginormous, humungous sofa"
Tokenizer to Filter: "teh"(1), "ginormous"(2), "humungous"(3), "sofa"(4)
Out: "the"(1), "large"(2), "large"(3), "couch"(4), "sofa"(4), "divan"(4)
Token Offset Payload Filter
This filter adds the numeric character offsets of the token as a payload value for that token.
Factory class: solr.TokenOffsetPayloadTokenFilterFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.TokenOffsetPayloadTokenFilterFactory"/>
</analyzer>
In: "bing bang boom"
Tokenizer to Filter: "bing", "bang", "boom"
Out: "bing"[0,4], "bang"[5,9], "boom"[10,14]
Trim Filter
This filter trims leading and/or trailing whitespace from tokens. Most tokenizers break tokens at whitespace, so
this filter is most often used for special situations.
Factory class: solr.TrimFilterFactory
Arguments:
updateOffsets: if luceneMatchVersion is 4.3 or earlier and updateOffsets="true", trimmed tokens'
start and end offsets will be updated to those of the first and last characters (plus one) remaining in the token.
This argument is invalid if luceneMatchVersion is 5.0 or later.
Example:
The PatternTokenizerFactory configuration used here splits the input on simple commas; it does not remove
whitespace.
<analyzer>
<tokenizer class="solr.PatternTokenizerFactory" pattern=","/>
<filter class="solr.TrimFilterFactory"/>
</analyzer>
In: "one, two , three ,four "
Tokenizer to Filter: "one", " two ", " three ", "four "
Out: "one", "two", "three", "four"
Type As Payload Filter
This filter adds the token's type, as an encoded byte sequence, as its payload.
Factory class: solr.TypeAsPayloadTokenFilterFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.TypeAsPayloadTokenFilterFactory"/>
</analyzer>
In: "Pay Bob's I.O.U."
Tokenizer to Filter: "Pay", "Bob's", "I.O.U."
Out: "Pay"[<ALPHANUM>], "Bob's"[<APOSTROPHE>], "I.O.U."[<ACRONYM>]
Type Token Filter
This filter blacklists or whitelists a specified list of token types, assuming the tokens have type metadata
associated with them. For example, the UAX29 URL Email Tokenizer emits "<URL>" and "<EMAIL>" typed
tokens, as well as other types. This filter would allow you to pull out only e-mail addresses from text as tokens, if
you wish.
Factory class: solr.TypeTokenFilterFactory
Arguments:
types: Defines the location of a file of types to filter.
useWhitelist: If true, the file defined in types should be used as an include list. If false, or undefined, the file
defined in types is used as a blacklist.
enablePositionIncrements: if luceneMatchVersion is 4.3 or earlier and enablePositionIncrements="false",
no position holes will be left by this filter when it removes tokens. This argument is invalid if
luceneMatchVersion is 5.0 or later.
Example:
<analyzer>
<filter class="solr.TypeTokenFilterFactory" types="stoptypes.txt"
useWhitelist="true"/>
</analyzer>
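For instance, to keep only e-mail addresses as suggested above, you could pair this filter with the UAX29 URL Email Tokenizer and a whitelist types file. The following is a sketch, not a configuration shipped with Solr; email_type.txt is a hypothetical file containing the single line <EMAIL>:
<analyzer>
  <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
  <filter class="solr.TypeTokenFilterFactory" types="email_type.txt" useWhitelist="true"/>
</analyzer>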
Word Delimiter Filter
This filter splits tokens at word delimiters. The rules for determining delimiters are as follows:
A change in case within a word: "CamelCase" -> "Camel", "Case". This can be disabled by setting
splitOnCaseChange="0".
A transition from alpha to numeric characters or vice versa: "Gonzo5000" -> "Gonzo", "5000"; "4500XL" ->
"4500", "XL". This can be disabled by setting splitOnNumerics="0".
Non-alphanumeric characters (discarded): "hot-spot" -> "hot", "spot"
A trailing "'s" is removed: "O'Reilly's" -> "O", "Reilly"
Any leading or trailing delimiters are discarded: "--hot-spot--" -> "hot", "spot"
Factory class: solr.WordDelimiterFilterFactory
Arguments:
generateWordParts: (integer, default 1) If non-zero, splits words at delimiters. For example: "CamelCase",
"hot-spot" -> "Camel", "Case", "hot", "spot"
generateNumberParts: (integer, default 1) If non-zero, splits numeric strings at delimiters: "1947-32" ->
"1947", "32"
splitOnCaseChange: (integer, default 1) If 0, words are not split on camel-case changes: "BugBlaster-XL" ->
"BugBlaster", "XL". Example 1 below illustrates the default (non-zero) splitting behavior.
splitOnNumerics: (integer, default 1) If 0, don't split words on transitions from alpha to numeric: "FemBot3000"
-> "Fem", "Bot3000"
catenateWords: (integer, default 0) If non-zero, maximal runs of word parts will be joined: "hot-spot-sensor's"
-> "hotspotsensor"
catenateNumbers: (integer, default 0) If non-zero, maximal runs of number parts will be joined: "1947-32" ->
"194732"
catenateAll: (0/1, default 0) If non-zero, runs of word and number parts will be joined: "Zap-Master-9000" ->
"ZapMaster9000"
preserveOriginal: (integer, default 0) If non-zero, the original token is preserved: "Zap-Master-9000" ->
"Zap-Master-9000", "Zap", "Master", "9000"
protected: (optional) The pathname of a file that contains a list of protected words that should be passed
through without splitting.
stemEnglishPossessive: (integer, default 1) If 1, strips the possessive "'s" from each subword.
Example:
Default behavior. The whitespace tokenizer is used here to preserve non-alphanumeric characters.
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory"/>
</analyzer>
In: "hot-spot RoboBlaster/9000 100XL"
Tokenizer to Filter: "hot-spot", "RoboBlaster/9000", "100XL"
Out: "hot", "spot", "Robo", "Blaster", "9000", "100", "XL"
Example:
Do not split on case changes, and do not generate number parts. Note that by not generating number parts,
tokens containing only numeric parts are ultimately discarded.
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateNumberParts="0"
splitOnCaseChange="0"/>
</analyzer>
In: "hot-spot RoboBlaster/9000 100-42"
Tokenizer to Filter: "hot-spot", "RoboBlaster/9000", "100-42"
Out: "hot", "spot", "RoboBlaster", "9000"
Example:
Concatenate word parts and number parts, but not word and number parts that occur in the same token.
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" catenateWords="1"
catenateNumbers="1"/>
</analyzer>
In: "hot-spot 100+42 XL40"
Tokenizer to Filter: "hot-spot"(1), "100+42"(2), "XL40"(3)
Out: "hot"(1), "spot"(2), "hotspot"(2), "100"(3), "42"(4), "10042"(4), "XL"(5), "40"(6)
Example:
Concatenate all. Word and/or number parts are joined together.
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" catenateAll="1"/>
</analyzer>
In: "XL-4000/ES"
Tokenizer to Filter: "XL-4000/ES"(1)
Out: "XL"(1), "4000"(2), "ES"(3), "XL4000ES"(3)
Example:
Using a protected words list that contains "AstroBlaster" and "XL-5000" (among others).
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" protected="protwords.txt"/>
</analyzer>
In: "FooBar AstroBlaster XL-5000 ==ES-34-"
Tokenizer to Filter: "FooBar", "AstroBlaster", "XL-5000", "==ES-34-"
Out: "FooBar", "FooBar", "AstroBlaster", "XL-5000", "ES", "34"
Related Topics
TokenFilterFactories
CharFilterFactories
A Char Filter is a component that pre-processes input characters. Char Filters can be chained like Token Filters
and placed in front of a Tokenizer. Char Filters can add, change, or remove characters while preserving the
original character offsets to support features like highlighting.
Topics discussed in this section:
solr.MappingCharFilterFactory
solr.HTMLStripCharFilterFactory
solr.ICUNormalizer2CharFilterFactory
solr.PatternReplaceCharFilterFactory
Related Topics
solr.MappingCharFilterFactory
This filter creates org.apache.lucene.analysis.MappingCharFilter, which can be used for changing
one string to another (for example, for normalizing é to e).
This filter requires specifying a mapping argument, which is the path and name of a file containing the mappings
to perform.
Example:
<analyzer>
<charFilter class="solr.MappingCharFilterFactory"
mapping="mapping-FoldToASCII.txt"/>
<tokenizer ...>
[...]
</analyzer>
Mapping file syntax:
Comment lines beginning with a hash mark (#), as well as blank lines, are ignored.
Each non-comment, non-blank line consists of a mapping of the form: "source" => "target"
Double-quoted source string, optional whitespace, an arrow (=>), optional whitespace,
double-quoted target string.
Trailing comments on mapping lines are not allowed.
The source string must contain at least one character, but the target string may be empty.
The following character escape sequences are recognized within source and target strings:
Escape sequence | Resulting character (ECMA-48 alias) | Unicode character | Example mapping line
\\ | \ | U+005C | "\\" => "/"
\" | " | U+0022 | "\"and\"" => "'and'"
\b | backspace (BS) | U+0008 | "\b" => " "
\t | tab (HT) | U+0009 | "\t" => ","
\n | newline (LF) | U+000A | "\n" => "<br>"
\f | form feed (FF) | U+000C | "\f" => "\n"
\r | carriage return (CR) | U+000D | "\r" => "/carriage-return/"
\uXXXX | Unicode char referenced by the 4 hex digits | U+XXXX | "\uFEFF" => ""
A backslash followed by any other character is interpreted as if the character were present without
the backslash.
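Putting the syntax together, a hypothetical mapping file (not the mapping-FoldToASCII.txt shipped with Solr) might contain lines such as:
# map a few accented characters to ASCII equivalents
"á" => "a"
"é" => "e"
"ß" => "ss"
# strip the zero-width no-break space entirely
"\uFEFF" => ""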
solr.HTMLStripCharFilterFactory
This filter creates org.apache.solr.analysis.HTMLStripCharFilter. This Char Filter strips HTML from
the input stream and passes the result to another Char Filter or a Tokenizer.
This filter:
Removes HTML/XML tags while preserving other content.
Removes attributes within tags and supports optional attribute quoting.
Removes XML processing instructions, such as: <?foo bar?>
Removes XML comments.
Removes XML elements starting with <!>.
Removes contents of <script> and <style> elements.
Handles XML comments inside these elements (normal comment processing will not always work).
Replaces numeric character entity references like &#65; or &#x7f; with the corresponding character.
The terminating ';' is optional if the entity reference is at the end of the input; otherwise the terminating ';' is
mandatory, to avoid false matches on something like "Alpha&Omega Corp".
Replaces all named character entity references with the corresponding character.
&nbsp; is replaced with a space instead of the 0xa0 character.
Newlines are substituted for block-level elements.
<CDATA> sections are recognized.
Inline tags, such as <b>, <i>, or <span>, will be removed.
Uppercase character entities like quot, gt, lt and amp are recognized and handled as lowercase.

The input need not be an HTML document. The filter removes only constructs that look like HTML. If the
input doesn't include anything that looks like HTML, the filter won't remove any input.
The table below presents examples of HTML stripping.
Input | Output
my <a href="www.foo.bar">link</a> | my link
<br>hello<!--comment--> | hello
hello<script><!-- f('<!--internal--></script>'); --></script> | hello
if a<b then print a; | if a<b then print a;
hello <td height=22 nowrap align="left"> | hello
a<b &#65 Alpha&Omega | a<b A Alpha&Omega
solr.ICUNormalizer2CharFilterFactory
This filter performs pre-tokenization Unicode normalization using ICU4J.
Arguments:
name: A Unicode Normalization Form, one of nfc, nfkc, nfkc_cf. Default is nfkc_cf.
mode: Either compose or decompose. Default is compose. Use decompose with name="nfc" or name="nfkc"
to get NFD or NFKD, respectively.
filter: A UnicodeSet pattern. Codepoints outside the set are always left unchanged. Default is [] (the null set,
no filtering - all codepoints are subject to normalization).
Example:
<analyzer>
<charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
<tokenizer ...>
[...]
</analyzer>
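A sketch with non-default arguments, using decompose mode together with name="nfkc" to get NFKD normalization as described above:
<analyzer>
  <charFilter class="solr.ICUNormalizer2CharFilterFactory" name="nfkc" mode="decompose"/>
  <tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>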
solr.PatternReplaceCharFilterFactory
This filter uses regular expressions to replace or change character patterns.
Arguments:
pattern: the regular expression pattern to apply to the incoming text.
replacement: the text to use to replace matching patterns.
You can configure this filter in schema.xml like this:
<analyzer>
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="([nN][oO]\.)\s*(\d+)" replacement="$1$2"/>
<tokenizer ...>
[...]
</analyzer>
The table below presents examples of regex-based pattern replacement:
Input | pattern | replacement | Output | Description
see-ing looking | (\w+)(ing) | $1 | see-ing look | Removes "ing" from the end of word.
see-ing looking | (\w+)ing | $1 | see-ing look | Same as above. 2nd parentheses can be omitted.
No.1 NO. no. 543 | [nN][oO]\.\s*(\d+) | #$1 | #1 NO. #543 | Replace some string literals
abc=1234=5678 | (\w+)=(\d+)=(\d+) | $3=$1=$2 | 5678=abc=1234 | Change the order of the groups.
Related Topics
CharFilterFactories
Language Analysis
This section contains information about tokenizers and filters related to character set conversion or for use with
specific languages. For the European languages, tokenization is fairly straightforward. Tokens are delimited by
white space and/or a relatively small set of punctuation characters. In other languages the tokenization rules are
often not so simple. Some European languages may require special tokenization rules as well, such as rules for
decompounding German words.
For information about language detection at index time, see .Detecting Languages During Indexing
Topics discussed in this section:
KeywordMarkerFilterFactory
StemmerOverrideFilterFactory
Dictionary Compound Word Token Filter
Unicode Collation
ASCII Folding Filter
Language-Specific Factories
Related Topics
KeywordMarkerFilterFactory
Protects words from being modified by stemmers. A customized protected word list may be specified with the
"protected" attribute in the schema. Any words in the protected word list will not be modified by any stemmer in
Solr.
A sample Solr protwords.txt with comments can be found in the sample_techproducts_configs config
set directory:
<fieldtype name="myfieldtype" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt" />
<filter class="solr.PorterStemFilterFactory" />
</analyzer>
</fieldtype>
StemmerOverrideFilterFactory
Overrides stemming algorithms by applying a custom mapping, then protecting these terms from being modified
by stemmers.
A customized mapping of words to stems, in a tab-separated file, can be specified to the "dictionary" attribute in
the schema. Words in this mapping will be stemmed to the stems from the file, and will not be further changed by
any stemmer.
A sample stemdict.txt with comments can be found in the Source Repository.
<fieldtype name="myfieldtype" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict.txt" />
<filter class="solr.PorterStemFilterFactory" />
</analyzer>
</fieldtype>
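For illustration, a hypothetical stemdict.txt might map words to stems with one tab-separated pair per line (this is not the sample file from the Source Repository):
# word <TAB> stem
monkeys	monkey
otters	otter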
Dictionary Compound Word Token Filter
This filter splits, or decompounds, compound words into individual words using a dictionary of the component
words. Each input token is passed through unchanged. If it can also be decompounded into subwords, each
subword is also added to the stream at the same logical position.
Compound words are most commonly found in Germanic languages.
Factory class: solr.DictionaryCompoundWordTokenFilterFactory
Arguments:
dictionary: (required) The path of a file that contains a list of simple words, one per line. Blank lines and lines
that begin with "#" are ignored. This path may be an absolute path, or path relative to the Solr config directory.
minWordSize: (integer, default 5) Any token shorter than this is not decompounded.
minSubwordSize: (integer, default 2) Subwords shorter than this are not emitted as tokens.
maxSubwordSize: (integer, default 15) Subwords longer than this are not emitted as tokens.
onlyLongestMatch: (true/false) If true (the default), only the longest matching subwords will generate new
tokens.
Example:
Assume that germanwords.txt contains at least the following words: dumm kopf donau dampf schiff
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.DictionaryCompoundWordTokenFilterFactory"
dictionary="germanwords.txt"/>
</analyzer>
In: "Donaudampfschiff dummkopf"
Tokenizer to Filter: "Donaudampfschiff"(1), "dummkopf"(2),
Out: "Donaudampfschiff"(1), "Donau"(1), "dampf"(1), "schiff"(1), "dummkopf"(2), "dumm"(2), "kopf"(2)
Unicode Collation
Unicode Collation is a language-sensitive method of sorting text that can also be used for advanced search
purposes.
Unicode Collation in Solr is fast, because all the work is done at index time.
Rather than specifying an analyzer within <fieldtype ... class="solr.TextField">, the
solr.CollationField and solr.ICUCollationField field type classes provide this functionality.
solr.ICUCollationField, which is backed by the ICU4J library, provides more flexible configuration, has more locales, is
significantly faster, and requires less memory and less index space, since its keys are smaller than those
produced by the JDK implementation that backs solr.CollationField.
solr.ICUCollationField is included in the Solr analysis-extras contrib - see
solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your SOLR_HOME/lib in order to
use it.
solr.ICUCollationField and solr.CollationField fields can be created in two ways:
Based upon a system collator associated with a Locale.
Based upon a tailored RuleBasedCollator ruleset.
Arguments for solr.ICUCollationField, specified as attributes within the <fieldtype> element:
Using a System collator:
locale: (required) RFC 3066 locale ID. See the ICU locale explorer for a list of supported locales.
strength: Valid values are primary, secondary, tertiary, quaternary, or identical. See Comparison
Levels in ICU Collation Concepts for more information.
decomposition: Valid values are no or canonical. See Normalization in ICU Collation Concepts for more
information.
Using a Tailored ruleset:
custom: (required) Path to a UTF-8 text file containing rules supported by the ICU RuleBasedCollator
strength: Valid values are primary, secondary, tertiary, quaternary, or identical. See Comparison
Levels in ICU Collation Concepts for more information.
decomposition: Valid values are no or canonical. See Normalization in ICU Collation Concepts for more
information.
Expert options:
alternate: Valid values are shifted or non-ignorable. Can be used to ignore punctuation/whitespace.
caseLevel: (true/false) If true, in combination with strength="primary", accents are ignored but case is
taken into account. The default is false. See CaseLevel in ICU Collation Concepts for more information.
caseFirst: Valid values are lower or upper. Useful to control which is sorted first when case is not ignored.
numeric: (true/false) If true, digits are sorted according to numeric value, e.g. foobar-9 sorts before foobar-10.
The default is false.
variableTop: Single character or contraction. Controls what is variable for alternate.
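As a sketch of the expert options, the following hypothetical field type sorts digits by numeric value and ignores punctuation and whitespace, per the argument descriptions above:
<fieldType name="collatedNUMERIC" class="solr.ICUCollationField"
           locale="en"
           strength="primary"
           numeric="true"
           alternate="shifted" />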
Sorting Text for a Specific Language
In this example, text is sorted according to the default German rules provided by ICU4J.
Locales are typically defined as a combination of language and country, but you can specify just the language if
you want. For example, if you specify "de" as the language, you will get sorting that works well for the German
language. If you specify "de" as the language and "CH" as the country, you will get German sorting specifically
tailored for Switzerland.
<!-- Define a field type for German collation -->
<fieldType name="collatedGERMAN" class="solr.ICUCollationField"
locale="de"
strength="primary" />
...
<!-- Define a field to store the German collated manufacturer names. -->
<field name="manuGERMAN" type="collatedGERMAN" indexed="false" stored="false"
docValues="true"/>
...
<!-- Copy the text to this field. We could create French, English, Spanish versions
too,
and sort differently for different users! -->
<copyField source="manu" dest="manuGERMAN"/>
In the example above, we defined the strength as "primary". The strength of the collation determines how strict
the sort order will be, but it also depends upon the language. For example, in English, "primary" strength ignores
differences in case and accents.
Another example:
<fieldType name="polishCaseInsensitive" class="solr.ICUCollationField"
locale="pl_PL"
strength="secondary" />
...
<field name="city" type="text_general" indexed="true" stored="true"/>
...
<field name="city_sort" type="polishCaseInsensitive" indexed="true" stored="false"/>
...
<copyField source="city" dest="city_sort"/>
The type will be used for the fields where the data contains Polish text. The "secondary" strength will ignore case
differences, but, unlike "primary" strength, a letter with diacritic(s) will be sorted differently from the same base
letter without diacritics.
An example using the "city_sort" field to sort:
q=*:*&fl=city&sort=city_sort+asc
Sorting Text for Multiple Languages
There are two approaches to supporting multiple languages: if there is a small list of languages you wish to
support, consider defining collated fields for each language and using copyField. However, adding a large
number of sort fields can increase disk and indexing costs. An alternative approach is to use the Unicode
default collator.
The Unicode default or ROOT locale has rules that are designed to work well for most languages. To use the
default locale, simply define the locale as the empty string. This Unicode default sort is still significantly more
advanced than the standard Solr sort.
<fieldType name="collatedROOT" class="solr.ICUCollationField"
locale=""
strength="primary" />
Sorting Text with Custom Rules
You can define your own set of sorting rules. It's easiest to take existing rules that are close to what you want
and customize them.
In the example below, we create a custom rule set for German called DIN 5007-2. This rule set treats umlauts in
German differently: it treats ö as equivalent to oe, ä as equivalent to ae, and ü as equivalent to ue. For more
information, see the ICU RuleBasedCollator javadocs.
This example shows how to create a custom rule set for solr.ICUCollationField and dump it to a file:
// imports needed by this snippet; IOUtils here is assumed to be Apache Commons IO
import java.io.File;
import java.io.FileOutputStream;
import com.ibm.icu.text.Collator;
import com.ibm.icu.text.RuleBasedCollator;
import com.ibm.icu.util.ULocale;
import org.apache.commons.io.IOUtils;

// get the default rules for Germany
// these are called DIN 5007-1 sorting
RuleBasedCollator baseCollator = (RuleBasedCollator) Collator.getInstance(new ULocale("de", "DE"));

// define some tailorings, to make it DIN 5007-2 sorting.
// For example, this makes ö equivalent to oe
String DIN5007_2_tailorings =
    "& ae , a\u0308 & AE , A\u0308"+
    "& oe , o\u0308 & OE , O\u0308"+
    "& ue , u\u0308 & UE , U\u0308";

// concatenate the default rules to the tailorings, and dump it to a String
RuleBasedCollator tailoredCollator = new RuleBasedCollator(baseCollator.getRules() + DIN5007_2_tailorings);
String tailoredRules = tailoredCollator.getRules();

// write these to a file, be sure to use UTF-8 encoding!!!
FileOutputStream os = new FileOutputStream(new File("/solr_home/conf/customRules.dat"));
IOUtils.write(tailoredRules, os, "UTF-8");
This rule set can now be used for custom collation in Solr:
<fieldType name="collatedCUSTOM" class="solr.ICUCollationField"
custom="customRules.dat"
strength="primary" />
JDK Collation
As mentioned above, ICU Unicode Collation is better in several ways than JDK Collation, but if you cannot use
ICU4J for some reason, you can use solr.CollationField.
The principles of JDK Collation are the same as those of ICU Collation; you just specify language, country,
and variant arguments instead of the combined locale argument.
Arguments for solr.CollationField, specified as attributes within the <fieldtype> element:
Using a System collator (see Oracle's list of locales supported in Java 7):
language: (required) ISO-639 language code
country: ISO-3166 country code
variant: Vendor or browser-specific code
strength: Valid values are primary, secondary, tertiary or identical. See Oracle Java 7 Collator
javadocs for more information.
decomposition: Valid values are no, canonical, or full. See Oracle Java 7 Collator javadocs for more
information.
Using a Tailored ruleset:
custom: (required) Path to a UTF-8 text file containing rules supported by the JDK RuleBasedCollator
strength: Valid values are primary, secondary, tertiary or identical. See Oracle Java 7 Collator
javadocs for more information.
decomposition: Valid values are no, canonical, or full. See Oracle Java 7 Collator javadocs for more
information.
A solr.CollationField example:
<fieldType name="collatedGERMAN" class="solr.CollationField"
language="de"
country="DE"
strength="primary" /> <!-- ignore Umlauts and letter case when sorting
-->
...
<field name="manuGERMAN" type="collatedGERMAN" indexed="false" stored="false"
docValues="true" />
...
<copyField source="manu" dest="manuGERMAN"/>
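A tailored-ruleset sketch for solr.CollationField, analogous to the ICU custom-rules example above; customRules.dat here would need to contain rules in the JDK RuleBasedCollator syntax, which differs from the ICU rule syntax:
<fieldType name="collatedCUSTOM" class="solr.CollationField"
           custom="customRules.dat"
           strength="primary" />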
ASCII Folding Filter
This filter converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII
characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if one exists. Only those characters with
reasonable ASCII alternatives are converted.
This can increase recall by causing more matches. On the other hand, it can reduce precision because
language-specific character differences may be lost.
Factory class: solr.ASCIIFoldingFilterFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
In: "Björn Ångström"
Tokenizer to Filter: "Björn", "Ångström"
Out: "Bjorn", "Angstrom"
Language-Specific Factories
These factories are each designed to work with specific languages. The languages covered here are:
Arabic
Brazilian Portuguese
Bulgarian
Catalan
Chinese
Simplified Chinese
CJK
Czech
Danish
Dutch
Finnish
French
Galician
German
Greek
Hebrew, Lao, Myanmar, Khmer
Hindi
Indonesian
Italian
Irish
Japanese
Latvian
Norwegian
Persian
Polish
Portuguese
Romanian
Russian
Scandinavian
Serbian
Spanish
Swedish
Thai
Turkish
Arabic
Solr provides support for the Light-10 (PDF) stemming algorithm, and Lucene includes an example stopword list.
This algorithm defines both character normalization and stemming, so these are split into two filters to provide
more flexibility.
Factory classes: , solr.ArabicStemFilterFactory solr.ArabicNormalizationFilterFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ArabicNormalizationFilterFactory"/>
<filter class="solr.ArabicStemFilterFactory"/>
</analyzer>
Brazilian Portuguese
This is a Java filter written specifically for stemming the Brazilian dialect of the Portuguese language. It uses the
Lucene class . Although that stemmer can beorg.apache.lucene.analysis.br.BrazilianStemmer
configured to use a list of protected words (which should not be stemmed), this factory does not accept any
arguments to specify such a list.
Factory class: solr.BrazilianStemFilterFactory
Arguments: None
Example:
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.BrazilianStemFilterFactory"/>
</analyzer>
In: "praia praias"
Tokenizer to Filter: "praia", "praias"
Out: "pra", "pra"
Bulgarian
Solr includes a light stemmer for Bulgarian, following this algorithm (PDF), and Lucene includes an example
stopword list.
Factory class: solr.BulgarianStemFilterFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.BulgarianStemFilterFactory"/>
</analyzer>
Catalan
Solr can stem Catalan using the Snowball Porter Stemmer with an argument of language="Catalan". Solr
includes a set of contractions for Catalan, which can be stripped using solr.ElisionFilterFactory.
Factory class: solr.SnowballPorterFilterFactory
Arguments:
language: (required) stemmer language, "Catalan" in this case
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ElisionFilterFactory"
articles="lang/contractions_ca.txt"/>
<filter class="solr.SnowballPorterFilterFactory" language="Catalan" />
</analyzer>
In: "llengües llengua"
Tokenizer to Filter: "llengües"(1) "llengua"(2),
Out: "llengu"(1), "llengu"(2)
Chinese
Chinese Tokenizer
The Chinese Tokenizer is deprecated as of Solr 3.4. Use the solr.StandardTokenizerFactory instead.
Factory class: solr.ChineseTokenizerFactory
Arguments: None
Example:
<analyzer type="index">
<tokenizer class="solr.ChineseTokenizerFactory"/>
</analyzer>
Chinese Filter Factory
The Chinese Filter Factory is deprecated as of Solr 3.4. Use the solr.StopFilterFactory instead.
Factory class: solr.ChineseFilterFactory
Arguments: None
Example:
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ChineseFilterFactory"/>
</analyzer>
Simplified Chinese
For Simplified Chinese, Solr provides support for Chinese sentence and word segmentation with the
solr.HMMChineseTokenizerFactory in the analysis-extras contrib module. This component includes a large
dictionary and segments Chinese text into words with the Hidden Markov Model. To use this filter, see
solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your
solr_home/lib.
Factory class: solr.HMMChineseTokenizerFactory
Arguments: None
Examples:
To use the default setup with fallback to English Porter stemmer for English words, use:
<analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>
Or to configure your own analysis setup, use the along with yoursolr.HMMChineseTokenizerFactory
custom filter setup.
<analyzer>
<tokenizer class="solr.HMMChineseTokenizerFactory"/>
<filter class="solr.StopFilterFactory"
words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
CJK
This tokenizer breaks Chinese, Japanese and Korean language text into tokens. These are not whitespace
delimited languages. The tokens generated by this tokenizer are "doubles", overlapping pairs of CJK characters
found in the field text.
Factory class: solr.CJKTokenizerFactory
Arguments: None
Example:
<analyzer type="index">
<tokenizer class="solr.CJKTokenizerFactory"/>
</analyzer>
Czech
Solr includes a light stemmer for Czech, following this algorithm, and Lucene includes an example stopword list.
Factory class: solr.CzechStemFilterFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.CzechStemFilterFactory"/>
</analyzer>
In: "prezidenští, prezidenta, prezidentského"
Tokenizer to Filter: "prezidenští", "prezidenta", "prezidentského"
Out: "preziden", "preziden", "preziden"
Danish
Solr can stem Danish using the Snowball Porter Stemmer with an argument of language="Danish".
Also relevant are the Scandinavian normalization filters.
Factory class: solr.SnowballPorterFilterFactory
Arguments:
language: (required) stemmer language, "Danish" in this case
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="Danish" />
</analyzer>
In: "undersøg undersøgelse"
Tokenizer to Filter: "undersøg"(1) "undersøgelse"(2),
Out: "undersøg"(1), "undersøg"(2)
Dutch
Solr can stem Dutch using the Snowball Porter Stemmer with an argument of language="Dutch".
Factory class: solr.SnowballPorterFilterFactory
Arguments:
language: (required) stemmer language, "Dutch" in this case
Example:
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="Dutch"/>
</analyzer>
In: "kanaal kanalen"
Tokenizer to Filter: "kanaal", "kanalen"
Out: "kanal", "kanal"
Finnish
Solr includes support for stemming Finnish, and Lucene includes an example stopword list.
Factory class: solr.FinnishLightStemFilterFactory
Arguments: None
Example:
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.FinnishLightStemFilterFactory"/>
</analyzer>
In: "kala kalat"
Tokenizer to Filter: "kala", "kalat"
Out: "kala", "kala"
French
Elision Filter
Removes article elisions from a token stream. This filter can be useful for languages such as French, Catalan,
Italian, and Irish.
Factory class: solr.ElisionFilterFactory
Arguments:
articles: The pathname of a file that contains a list of articles, one per line, to be stripped. Articles are words
such as "le", which are commonly abbreviated, such as in l'avion (the plane). This file should include the
abbreviated form, which precedes the apostrophe. In this case, simply "l". If no articles attribute is specified, a
default set of French articles is used.
ignoreCase: (boolean) If true, the filter ignores the case of words when comparing them to the common word
file. Defaults to false.
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ElisionFilterFactory"
ignoreCase="true"
articles="lang/contractions_fr.txt"/>
</analyzer>
In: "L'histoire d'art"
Tokenizer to Filter: "L'histoire", "d'art"
Out: "histoire", "art"
French Light Stem Filter
Solr includes three stemmers for French: one in the solr.SnowballPorterFilterFactory, a lighter
stemmer called solr.FrenchLightStemFilterFactory, and an even less aggressive stemmer called
solr.FrenchMinimalStemFilterFactory. Lucene includes an example stopword list.
Factory classes: , solr.FrenchLightStemFilterFactory solr.FrenchMinimalStemFilterFactory
Arguments: None
Examples:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ElisionFilterFactory"
articles="lang/contractions_fr.txt"/>
<filter class="solr.FrenchLightStemFilterFactory"/>
</analyzer>
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ElisionFilterFactory"
articles="lang/contractions_fr.txt"/>
<filter class="solr.FrenchMinimalStemFilterFactory"/>
</analyzer>
In: "le chat, les chats"
Tokenizer to Filter: "le", "chat", "les", "chats"
Out: "le", "chat", "le", "chat"
Galician
Solr includes a stemmer for Galician following this algorithm, and Lucene includes an example stopword list.
Factory class: solr.GalicianStemFilterFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.GalicianStemFilterFactory"/>
</analyzer>
In: "felizmente Luzes"
Tokenizer to Filter: "felizmente", "luzes"
Out: "feliz", "luz"
German
Solr includes four stemmers for German: one in the solr.SnowballPorterFilterFactory
language="German", a stemmer called solr.GermanStemFilterFactory, a lighter stemmer called
solr.GermanLightStemFilterFactory, and an even less aggressive stemmer called
solr.GermanMinimalStemFilterFactory. Lucene includes an example stopword list.
Factory classes: solr.GermanStemFilterFactory, solr.GermanLightStemFilterFactory,
solr.GermanMinimalStemFilterFactory
Arguments: None
Examples:
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory "/>
<filter class="solr.GermanStemFilterFactory"/>
</analyzer>
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.GermanLightStemFilterFactory"/>
</analyzer>
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory "/>
<filter class="solr.GermanMinimalStemFilterFactory"/>
</analyzer>
In: "haus häuser"
Tokenizer to Filter: "haus", "häuser"
Out: "haus", "haus"
Greek
This filter converts uppercase letters in the Greek character set to the equivalent lowercase character.
Factory class: solr.GreekLowerCaseFilterFactory
Arguments: None
Example:
Use of custom charsets is no longer supported as of Solr 3.1. If you need to index text in these
encodings, please use Java's character set conversion facilities (InputStreamReader, and so on) during
I/O, so that Lucene can analyze this text as Unicode instead.
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.GreekLowerCaseFilterFactory"/>
</analyzer>
Hindi
Solr includes support for stemming Hindi following this algorithm (PDF), support for common spelling differences
through the solr.HindiNormalizationFilterFactory, support for encoding differences through the
solr.IndicNormalizationFilterFactory following this algorithm, and Lucene includes an example stopword
list.
Factory classes: solr.IndicNormalizationFilterFactory, solr.HindiNormalizationFilterFactory,
solr.HindiStemFilterFactory
Arguments: None
Example:
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.IndicNormalizationFilterFactory"/>
<filter class="solr.HindiNormalizationFilterFactory"/>
<filter class="solr.HindiStemFilterFactory"/>
</analyzer>
Indonesian
Solr includes support for stemming Indonesian (Bahasa Indonesia) following this algorithm (PDF), and Lucene
includes an example stopword list.
Factory class: solr.IndonesianStemFilterFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.IndonesianStemFilterFactory" stemDerivational="true" />
</analyzer>
In: "sebagai sebagainya"
Tokenizer to Filter: "sebagai", "sebagainya"
Out: "bagai", "bagai"
Italian
Solr includes two stemmers for Italian: one in the solr.SnowballPorterFilterFactory
language="Italian", and a lighter stemmer called solr.ItalianLightStemFilterFactory. Lucene
includes an example stopword list.
Factory class: solr.ItalianStemFilterFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ElisionFilterFactory"
articles="lang/contractions_it.txt"/>
<filter class="solr.ItalianLightStemFilterFactory"/>
</analyzer>
In: "propaga propagare propagamento"
Tokenizer to Filter: "propaga", "propagare", "propagamento"
Out: "propag", "propag", "propag"
Irish
Solr can stem Irish using the Snowball Porter Stemmer with an argument of language="Irish". Solr includes
solr.IrishLowerCaseFilterFactory, which can handle Irish-specific constructs. Solr also includes a set
of contractions for Irish which can be stripped using solr.ElisionFilterFactory.
Factory class: solr.SnowballPorterFilterFactory
Arguments:
language: (required) stemmer language, "Irish" in this case
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ElisionFilterFactory"
articles="lang/contractions_ga.txt"/>
<filter class="solr.IrishLowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="Irish" />
</analyzer>
In: "siopadóireacht síceapatacha b'fhearr m'athair"
Tokenizer to Filter: "siopadóireacht", "síceapatacha", "b'fhearr", "m'athair"
Out: "siopadóir", "síceapaite", "fearr", "athair"
Japanese
Solr includes support for analyzing Japanese, via the Lucene Kuromoji morphological analyzer, which includes
several analysis components - more details on each below:
JapaneseIterationMarkCharFilter normalizes Japanese horizontal iteration marks (odoriji) to their
expanded form.
JapaneseTokenizer tokenizes Japanese using morphological analysis, and annotates each term with
part-of-speech, base form (a.k.a. lemma), reading and pronunciation.
JapaneseBaseFormFilter replaces original terms with their base forms (a.k.a. lemmas).
JapanesePartOfSpeechStopFilter removes terms that have one of the configured parts-of-speech.
JapaneseKatakanaStemFilter normalizes common katakana spelling variations ending in a long
sound character (U+30FC) by removing the long sound character.
Also useful for Japanese analysis, from lucene-analyzers-common:
CJKWidthFilter folds fullwidth ASCII variants into the equivalent Basic Latin forms, and folds halfwidth
Katakana variants into their equivalent fullwidth forms.
Japanese Iteration Mark CharFilter
Normalizes horizontal Japanese iteration marks (odoriji) to their expanded form. Vertical iteration marks are not
supported.
Factory class: JapaneseIterationMarkCharFilterFactory
Arguments:
normalizeKanji: set to false to not normalize kanji iteration marks (default is true)
normalizeKana: set to false to not normalize kana iteration marks (default is true)
Japanese Tokenizer
Tokenizer for Japanese that uses morphological analysis, and annotates each term with part-of-speech, base
form (a.k.a. lemma), reading and pronunciation.
JapaneseTokenizer has a mode (the default) that does segmentation useful for search: a heuristic issearch
used to segment compound terms into their constituent parts while also keeping the original compound terms as
synonyms.
Factory class: solr.JapaneseTokenizerFactory
Arguments:
mode: Use search mode to get a noun-decompounding effect useful for search. search mode improves
segmentation for search at the expense of part-of-speech accuracy. Valid values for mode are:
normal: default segmentation
search: segmentation useful for search (extra compound splitting)
extended: search mode plus unigramming of unknown words (experimental)
For some applications it might be good to use search mode for indexing and normal mode for queries to
increase precision and prevent parts of compounds from being matched and highlighted.
userDictionary: filename for a user dictionary, which allows overriding the statistical model with your own
entries for segmentation, part-of-speech tags and readings without a need to specify weights. See
lang/userdict_ja.txt for a sample user dictionary file.
userDictionaryEncoding: user dictionary encoding (default is UTF-8)
discardPunctuation: set to false to keep punctuation, true to discard (the default)
Japanese Base Form Filter
Replaces original terms' text with the corresponding base form (lemma). (JapaneseTokenizer annotates
each term with its base form.)
Factory class: JapaneseBaseFormFilterFactory
(no arguments)
Japanese Part Of Speech Stop Filter
Removes terms with one of the configured parts-of-speech. JapaneseTokenizer annotates terms with
parts-of-speech.
Factory class: JapanesePartOfSpeechStopFilterFactory
Arguments:
tags: filename for a list of parts-of-speech for which to remove terms; see conf/lang/stoptags_ja.txt in
the sample_techproducts_configs config set for an example.
enablePositionIncrements: if luceneMatchVersion is 4.3 or earlier and enablePositionIncrements="false",
no position holes will be left by this filter when it removes tokens. This argument is invalid if
luceneMatchVersion is 5.0 or later.
Japanese Katakana Stem Filter
Normalizes common katakana spelling variations ending in a long sound character (U+30FC) by removing the
long sound character.
CJKWidthFilterFactory should be specified prior to this filter to normalize half-width katakana to full-width.
Factory class: JapaneseKatakanaStemFilterFactory
Arguments:
minimumLength: terms below this length will not be stemmed. Default is 4, value must be 2 or more.
CJK Width Filter
Folds fullwidth ASCII variants into the equivalent Basic Latin forms, and folds halfwidth Katakana variants into
their equivalent fullwidth forms.
Factory class: CJKWidthFilterFactory
(no arguments)
Example:
<fieldType name="text_ja" positionIncrementGap="100"
autoGeneratePhraseQueries="false">
<analyzer>
<!-- Uncomment if you need to handle iteration marks: -->
<!-- <charFilter class="solr.JapaneseIterationMarkCharFilterFactory" /> -->
<tokenizer class="solr.JapaneseTokenizerFactory" mode="search"
userDictionary="lang/userdict_ja.txt"/>
<filter class="solr.JapaneseBaseFormFilterFactory"/>
<filter class="solr.JapanesePartOfSpeechStopFilterFactory"
tags="lang/stoptags_ja.txt"/>
<filter class="solr.CJKWidthFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="lang/stopwords_ja.txt"/>
<filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Hebrew, Lao, Myanmar, Khmer
Lucene provides support, in addition to UAX#29 word break rules, for Hebrew's use of the double and single
quote characters, and for segmenting Lao, Myanmar, and Khmer into syllables with the
solr.ICUTokenizerFactory in the analysis-extras contrib module. To use this tokenizer, see
solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your solr_home/lib.
See the ICUTokenizer for more information.
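A minimal field type sketch using this tokenizer, assuming the analysis-extras jars are in place as described above:
<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>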
Latvian
Solr includes support for stemming Latvian, and Lucene includes an example stopword list.
Factory class: solr.LatvianStemFilterFactory
Arguments: None
Example:
<fieldType name="text_lvstem" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.LatvianStemFilterFactory"/>
</analyzer>
</fieldType>
In: "tirgiem tirgus"
Tokenizer to Filter: "tirgiem", "tirgus"
Out: "tirg", "tirg"
Norwegian
Solr includes two classes for stemming Norwegian, NorwegianLightStemFilterFactory and
NorwegianMinimalStemFilterFactory. Lucene includes an example stopword list.
Another option is to use the Snowball Porter Stemmer with an argument of language="Norwegian".
Also relevant are the .Scandinavian normalization filters
Norwegian Light Stemmer
The NorwegianLightStemFilterFactory requires a "two-pass" sort for the -dom and -het endings. This
means that in the first pass the word "kristendom" is stemmed to "kristen", and then all the general rules apply so
it will be further stemmed to "krist". The effect of this is that "kristen," "kristendom," "kristendommen," and
"kristendommens" will all be stemmed to "krist."
The second pass is to pick up -dom and -het endings. Consider this example:
One pass | One pass | Two passes | Two passes
Before | After | Before | After
forlegen | forleg | forlegen | forleg
forlegenhet | forlegen | forlegenhet | forleg
forlegenheten | forlegen | forlegenheten | forleg
forlegenhetens | forlegen | forlegenhetens | forleg
firkantet | firkant | firkantet | firkant
firkantethet | firkantet | firkantethet | firkant
firkantetheten | firkantet | firkantetheten | firkant
Factory class: solr.NorwegianLightStemFilterFactory
Arguments:
variant: Choose the Norwegian language variant to use. Valid values are:
nb: Bokmål (default)
nn: Nynorsk
no: both
Example:
<fieldType name="text_no" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="lang/stopwords_no.txt" format="snowball"/>
<filter class="solr.NorwegianLightStemFilterFactory"/>
</analyzer>
</fieldType>
In: "Forelskelsen"
Tokenizer to Filter: "forelskelsen"
Out: "forelske"
Norwegian Minimal Stemmer
The NorwegianMinimalStemFilterFactory stems plural forms of Norwegian nouns only.
Factory class: solr.NorwegianMinimalStemFilterFactory
Arguments:
variant: Choose the Norwegian language variant to use. Valid values are:
nb: Bokmål (default)
nn: Nynorsk
no: both
Example:
<fieldType name="text_no" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="lang/stopwords_no.txt" format="snowball"/>
<filter class="solr.NorwegianMinimalStemFilterFactory"/>
</analyzer>
</fieldType>
In: "Bilens"
Tokenizer to Filter: "bilens"
Out: "bil"
Persian
Persian Filter Factories
Solr includes support for normalizing Persian, and Lucene includes an example stopword list.
Factory class: solr.PersianNormalizationFilterFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ArabicNormalizationFilterFactory"/>
<filter class="solr.PersianNormalizationFilterFactory">
</analyzer>
Polish
Solr provides support for Polish stemming with the solr.StempelPolishStemFilterFactory, and
solr.MorfologikFilterFactory for lemmatization, in the contrib/analysis-extras module. The
solr.StempelPolishStemFilterFactory component includes an algorithmic stemmer with tables for Polish. To use
either of these filters, see solr/contrib/analysis-extras/README.txt for instructions on which jars you
need to add to your solr_home/lib.
Factory classes: solr.StempelPolishStemFilterFactory and solr.MorfologikFilterFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StempelPolishStemFilterFactory"/>
</analyzer>
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.MorfologikFilterFactory" dictionary-resource="pl"/>
</analyzer>
In: ""studenta studenci"
Tokenizer to Filter: "studenta", "studenci"
Out: "student", "student"
More information about the Stempel stemmer is available in the Lucene javadocs.
The Morfologik dictionary-resource param value is a constant specifying which dictionary to choose. The dictionary resource must be named morfologik/dictionaries/{dictionaryResource}.dict and have an associated .info metadata file. See the Morfologik project for details.
Portuguese
Solr includes four stemmers for Portuguese: one in the solr.SnowballPorterFilterFactory, an alternative stemmer called solr.PortugueseStemFilterFactory, a lighter stemmer called solr.PortugueseLightStemFilterFactory, and an even less aggressive stemmer called solr.PortugueseMinimalStemFilterFactory. Lucene includes an example stopword list.
Factory classes: solr.PortugueseStemFilterFactory, solr.PortugueseLightStemFilterFactory, solr.PortugueseMinimalStemFilterFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PortugueseStemFilterFactory"/>
</analyzer>
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PortugueseLightStemFilterFactory"/>
</analyzer>
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PortugueseMinimalStemFilterFactory"/>
</analyzer>
In: "praia praias"
Tokenizer to Filter: "praia", "praias"
Out: "pra", "pra"
Romanian
Solr can stem Romanian using the Snowball Porter Stemmer with an argument of language="Romanian".
Factory class: solr.SnowballPorterFilterFactory
Arguments:
language: (required) stemmer language, "Romanian" in this case
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="Romanian" />
</analyzer>
Russian
Russian Stem Filter
Solr includes two stemmers for Russian: one in the solr.SnowballPorterFilterFactory with language="Russian", and a lighter stemmer called solr.RussianLightStemFilterFactory. Lucene includes an example stopword list.
Factory class: solr.RussianLightStemFilterFactory
Arguments: None
Use of custom charsets is no longer supported as of Solr 3.4. If you need to index text in these
encodings, please use Java's character set conversion facilities (InputStreamReader, and so on.) during
I/O, so that Lucene can analyze this text as Unicode instead.
Example:
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RussianLightStemFilterFactory"/>
</analyzer>
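If you prefer the more aggressive Snowball option mentioned above, a minimal sketch of an analyzer using it (stopword handling omitted) would be:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="Russian"/>
</analyzer>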
Scandinavian
Scandinavian is a language group spanning three languages, Norwegian, Swedish and Danish, which are very similar.
Swedish å,ä,ö are in fact the same letters as Norwegian and Danish å,æ,ø and thus interchangeable when used
between these languages. They are however folded differently when people type them on a keyboard lacking
these characters.
In that situation almost all Swedish people use a, a, o instead of å, ä, ö. Norwegians and Danes on the other
hand usually type aa, ae and oe instead of å, æ and ø. Some do however use a, a, o, oo, ao and sometimes
permutations of everything above.
There are two filters for helping with normalization between Scandinavian languages: one is solr.ScandinavianNormalizationFilterFactory, which tries to preserve the special characters (æäöå), and another is solr.ScandinavianFoldingFilterFactory, which folds these to the more broad ø/ö->o etc.
See also each language section for other relevant filters.
Scandinavian Normalization Filter
This filter normalizes use of the interchangeable Scandinavian characters æÆäÄöÖøØ and folded variants (aa, ao, ae, oe and oo) by transforming them to åÅæÆøØ.
It is a semantically less destructive solution than the ScandinavianFoldingFilter, most useful when a person with a Norwegian or Danish keyboard queries a Swedish index and vice versa. This filter does not perform the common Swedish folds of å and ä to a nor ö to o.
Factory class: solr.ScandinavianNormalizationFilterFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ScandinavianNormalizationFilterFactory"/>
</analyzer>
In: "blåbærsyltetøj blåbärsyltetöj blaabaarsyltetoej blabarsyltetoj"
Tokenizer to Filter: "blåbærsyltetøj", "blåbärsyltetöj", "blaabaersyltetoej", "blabarsyltetoj"
Out: "blåbærsyltetøj", "blåbærsyltetøj", "blåbærsyltetøj", "blabarsyltetoj"
Scandinavian Folding Filter
This filter folds Scandinavian characters åÅäæÄÆ->a and öÖøØ->o. It also discriminates against use of double vowels aa, ae, ao, oe and oo, leaving just the first one.
It is a semantically more destructive solution than the ScandinavianNormalizationFilter, but can in addition help with matching raksmorgas with räksmörgås.
Factory class: solr.ScandinavianFoldingFilterFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ScandinavianFoldingFilterFactory"/>
</analyzer>
In: "blåbærsyltetøj blåbärsyltetöj blaabaarsyltetoej blabarsyltetoj"
Tokenizer to Filter: "blåbærsyltetøj", "blåbärsyltetöj", "blaabaersyltetoej", "blabarsyltetoj"
Out: "blabarsyltetoj", "blabarsyltetoj", "blabarsyltetoj", "blabarsyltetoj"
Serbian
Serbian Normalization Filter
Solr includes a filter that normalizes Serbian Cyrillic and Latin characters to "bald" Latin. Cyrillic characters are first converted to Latin; then, Latin characters have their diacritics removed, with the exception of "đ", which is converted to "dj". Note that this filter expects lowercased input.
Factory class: solr.SerbianNormalizationFilterFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SerbianNormalizationFilterFactory"/>
</analyzer>
In: "ura () "
Tokenizer to Filter: "ura" "" ""
Out: "djura", "djura", "srpski"
Spanish
Solr includes two stemmers for Spanish: one in the solr.SnowballPorterFilterFactory with language="Spanish", and a lighter stemmer called solr.SpanishLightStemFilterFactory. Lucene includes an example stopword list.
Factory class: solr.SpanishLightStemFilterFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SpanishLightStemFilterFactory"/>
</analyzer>
In: "torear toreara torearlo"
Tokenizer to Filter: "torear", "toreara", "torearlo"
Out: "tor", "tor", "tor"
Swedish
Swedish Stem Filter
Solr includes two stemmers for Swedish: one in the solr.SnowballPorterFilterFactory with language="Swedish", and a lighter stemmer called solr.SwedishLightStemFilterFactory. Lucene includes an example stopword list.
Also relevant are the Scandinavian normalization filters.
Factory class: solr.SwedishLightStemFilterFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SwedishLightStemFilterFactory"/>
</analyzer>
In: "kloke klokhet klokheten"
Tokenizer to Filter: "kloke", "klokhet", "klokheten"
Out: "klok", "klok", "klok"
Thai
This tokenizer converts sequences of Thai characters into individual Thai words. Unlike European languages, Thai
does not use whitespace to delimit words.
Factory class: solr.ThaiTokenizerFactory
Arguments: None
Example:
<analyzer type="index">
<tokenizer class="solr.ThaiTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
Turkish
Solr includes support for stemming Turkish through the solr.SnowballPorterFilterFactory; support for case-insensitive search through the solr.TurkishLowerCaseFilterFactory; support for stripping apostrophes and following suffixes through solr.ApostropheFilterFactory (see Role of Apostrophes in Turkish Information Retrieval); support for a form of stemming that truncates tokens at a configurable maximum length through the solr.TruncateTokenFilterFactory (see Information Retrieval on Turkish Texts); and Lucene includes an example stopword list.
Factory class: solr.TurkishLowerCaseFilterFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ApostropheFilterFactory"/>
<filter class="solr.TurkishLowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="Turkish"/>
</analyzer>
Another example, illustrating diacritics-insensitive search:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ApostropheFilterFactory"/>
<filter class="solr.TurkishLowerCaseFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
<filter class="solr.KeywordRepeatFilterFactory"/>
<filter class="solr.TruncateTokenFilterFactory" prefixLength="5"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
Related Topics
LanguageAnalysis
Phonetic Matching
Phonetic matching algorithms may be used to encode tokens so that two different spellings that are pronounced
similarly will match.
For overviews of and comparisons between algorithms, see http://en.wikipedia.org/wiki/Phonetic_algorithm and http://ntz-develop.blogspot.com/2011/03/phonetic-algorithms.html
Algorithms discussed in this section:
Beider-Morse Phonetic Matching (BMPM)
Daitch-Mokotoff Soundex
Double Metaphone
Metaphone
Soundex
Refined Soundex
Caverphone
Kölner Phonetik a.k.a. Cologne Phonetic
NYSIIS
Beider-Morse Phonetic Matching (BMPM)
To use this encoding in your analyzer, see Beider Morse Filter in the Filter Descriptions section.
Beider-Morse Phonetic Matching (BMPM) is a "soundalike" tool that lets you search using a new phonetic
matching system. BMPM helps you search for personal names (or just surnames) in a Solr/Lucene index, and is
far superior to the existing phonetic codecs, such as regular soundex, metaphone, caverphone, etc.
In general, phonetic matching lets you search a name list for names that are phonetically equivalent to the
desired name. BMPM is similar to a soundex search in that an exact spelling is not required. Unlike soundex, it
does not generate a large quantity of false hits.
From the spelling of the name, BMPM attempts to determine the language. It then applies phonetic rules for that
particular language to transliterate the name into a phonetic alphabet. If it is not possible to determine the
language with a fair degree of certainty, it uses generic phonetic rules instead. Finally, it applies language-independent
rules regarding such things as voiced and unvoiced consonants and vowels to further ensure the reliability of the
matches.
For example, assume that the matches found when searching for Stephen in a database are "Stefan", "Steph",
"Stephen", "Steve", "Steven", "Stove", and "Stuffin". "Stefan", "Stephen", and "Steven" are probably relevant, and
are names that you want to see. "Stuffin", however, is probably not relevant. Also rejected were "Steph", "Steve",
and "Stove". Of those, "Stove" is probably not one that we would have wanted. But "Steph" and "Steve" are
possibly ones that you might be interested in.
For Solr, BMPM searching is available for the following languages:
English
French
German
Greek
Hebrew written in Hebrew letters
Hungarian
Italian
Polish
Romanian
Russian written in Cyrillic letters
Russian transliterated into English letters
Spanish
Turkish
The name matching is also applicable to non-Jewish surnames from the countries in which those languages are
spoken.
For more information, see here: http://stevemorse.org/phoneticinfo.htm and http://stevemorse.org/phonetics/bmpm.htm.
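As a hedged sketch only (the nameType, ruleType, concat, and languageSet values shown here are the commonly documented defaults for the Beider Morse Filter; check the Filter Descriptions section for the authoritative attribute list), an analyzer using BMPM might look like:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.BeiderMorseFilterFactory" nameType="GENERIC" ruleType="APPROX" concat="true" languageSet="auto"/>
</analyzer>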
Daitch-Mokotoff Soundex
To use this encoding in your analyzer, see Daitch-Mokotoff Soundex Filter in the Filter Descriptions section.
The Daitch-Mokotoff Soundex algorithm is a refinement of the Russel and American Soundex algorithms,
yielding greater accuracy in matching especially Slavic and Yiddish surnames with similar pronunciation but
differences in spelling.
The main differences compared to the other soundex variants are:
coded names are 6 digits long
initial character of the name is coded
rules to encode multi-character n-grams
multiple possible encodings for the same name (branching)
Note: the implementation used by Solr (commons-codec's DaitchMokotoffSoundex) has additional branching rules compared to the original description of the algorithm.
For more information, see http://en.wikipedia.org/wiki/Daitch%E2%80%93Mokotoff_Soundex and http://www.avotaynu.com/soundex.htm
Double Metaphone
To use this encoding in your analyzer, see Double Metaphone Filter in the Filter Descriptions section.
Alternatively, you may specify encoding="DoubleMetaphone" with the Phonetic Filter, but note that the Phonetic Filter version will not provide the second ("alternate") encoding that is generated by the Double Metaphone Filter for some tokens.
Encodes tokens using the double metaphone algorithm by Lawrence Philips. See the original article at http://www.drdobbs.com/the-double-metaphone-search-algorithm/184401251?pgno=2
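For illustration, a minimal sketch of the Phonetic Filter alternative described above (inject="true" keeps the original tokens alongside the encoded forms; treat the exact attribute set as an assumption and verify it against the Phonetic Filter documentation):
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.PhoneticFilterFactory" encoding="DoubleMetaphone" inject="true"/>
</analyzer>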
Metaphone
To use this encoding in your analyzer, specify encoding="Metaphone" with the Phonetic Filter.
Encodes tokens using the Metaphone algorithm by Lawrence Philips, described in "Hanging on the Metaphone" in Computer Language, Dec. 1990.
See http://en.wikipedia.org/wiki/Metaphone
Soundex
To use this encoding in your analyzer, specify encoding="Soundex" with the Phonetic Filter.
Encodes tokens using the Soundex algorithm, which is used to relate similar names, but can also be used as
a general purpose scheme to find words with similar phonemes.
See http://en.wikipedia.org/wiki/Soundex
Refined Soundex
To use this encoding in your analyzer, specify encoding="RefinedSoundex" with the Phonetic Filter.
Encodes tokens using an improved version of the Soundex algorithm.
See http://en.wikipedia.org/wiki/Soundex
Caverphone
To use this encoding in your analyzer, specify encoding="Caverphone" with the Phonetic Filter.
Caverphone is an algorithm created by the Caversham Project at the University of Otago. The algorithm is
optimised for accents present in the southern part of the city of Dunedin, New Zealand.
See http://en.wikipedia.org/wiki/Caverphone and the Caverphone 2.0 specification at http://caversham.otago.ac.nz/files/working/ctp150804.pdf
Kölner Phonetik a.k.a. Cologne Phonetic
To use this encoding in your analyzer, specify encoding="ColognePhonetic" with the Phonetic Filter.
The Kölner Phonetik, an algorithm published by Hans Joachim Postel in 1969, is optimized for the German
language.
See http://de.wikipedia.org/wiki/K%C3%B6lner_Phonetik
NYSIIS
To use this encoding in your analyzer, specify encoding="Nysiis" with the Phonetic Filter.
NYSIIS is an encoding used to relate similar names, but can also be used as a general purpose scheme to find
words with similar phonemes.
See http://en.wikipedia.org/wiki/NYSIIS and http://www.dropby.com/NYSIIS.html
Running Your Analyzer
Once you've defined a field type in schema.xml and specified the analysis steps that you want applied to it, you should test it out to make sure that it behaves the way you expect it to. Luckily, there is a very handy page in the Solr admin interface that lets you do just that. You can invoke the analyzer for any text field, provide sample input, and display the resulting token stream.
For example, let's look at some of the "Text" field types available in the "bin/solr -e techproducts" example configuration, and use the Analysis Screen (http://localhost:8983/solr/#/techproducts/analysis) to compare how the tokens produced at index time for the sentence "Running an Analyzer" match up with a slightly different query text of "run my analyzers".
We can begin with "text_ws" - one of the most simplified Text field types available:
We can see very clearly that the only thing this field type does is tokenize text on whitespace.
If our objective is to allow queries like "run my analyzer" to match indexed text like "Running an Analyzer", then we will evidently need to pick a different field type with index & query time text analysis that does more processing of the inputs.
In particular we will want:
Case insensitivity, so "Analyzer" and "analyzer" match.
Stemming, so words like "Run" and "Running" are considered equivalent terms.
Stop Word Pruning, so small words like "an" and "my" don't affect the query.
For our next attempt, let's try the "text_general" field type:
With the verbose output enabled, we can see how each stage of our new analyzers modifies the tokens it receives before passing them on to the next stage. As we scroll down to the final output, we can see that we do start to get a match on "analyzer" from each input string, thanks to the "LCF" stage -- which, if you hover over it with your mouse, you'll see is the "LowerCaseFilter":
" " is designed to be generally useful for any language, and it has definitely gotten us closer totext_general
our objective then " " by solving the problem of case sensitivity, by but it's still not quite what we aretext_ws
looking for.
So now let us try the "text_en" field type:
Now we can see the "SF" (StopFilter) stage of the analyzers solving the problem of removing Stop Words, and as we scroll down, we also see the "PSF" (PorterStemFilter) stage apply stemming rules suitable for our English language input, such that the terms produced by our "index analyzer" and the terms produced by our "query analyzer" match the way we expect.
At this point, we can continue to experiment with additional inputs, verifying that our analyzers produce matching
tokens when we expect them to match, and disparate tokens when we do not expect them to match, as we
iterate and tweak our field type configuration.
Indexing and Basic Data Operations
This section describes how Solr adds data to its index. It covers the following topics:
Introduction to Solr Indexing: An overview of Solr's indexing process.
Post Tool: Information about using post.jar to quickly upload some content to your system.
Uploading Data with Index Handlers: Information about using Solr's Index Handlers to upload
XML/XSLT, JSON and CSV data.
Uploading Data with Solr Cell using Apache Tika: Information about using the Solr Cell framework to
upload data for indexing.
Uploading Structured Data Store Data with the Data Import Handler: Information about uploading and
indexing data from a structured data store.
Updating Parts of Documents: Information about how to use atomic updates and optimistic concurrency
with Solr.
Detecting Languages During Indexing: Information about using language identification during the
indexing process.
De-Duplication: Information about configuring Solr to mark duplicate documents as they are indexed.
Content Streams: Information about streaming content to Solr Request Handlers.
UIMA Integration: Information about integrating Solr with Apache's Unstructured Information
Management Architecture (UIMA). UIMA lets you define custom pipelines of Analysis Engines that
incrementally add metadata to your documents as annotations.
Indexing Using Client APIs
Using client APIs, such as SolrJ, from your applications is an important option for updating Solr indexes. See the Client APIs section for more information.
Introduction to Solr Indexing
This section describes the process of indexing: adding content to a Solr index and, if necessary, modifying that
content or deleting it. By adding content to an index, we make it searchable by Solr.
A Solr index can accept data from many different sources, including XML files, comma-separated value (CSV)
files, data extracted from tables in a database, and files in common file formats such as Microsoft Word or PDF.
Here are the three most common ways of loading data into a Solr index:
Using the Solr Cell framework built on Apache Tika for ingesting binary files or structured files such as Office, Word, PDF, and other proprietary formats.
Uploading XML files by sending HTTP requests to the Solr server from any environment where such
requests can be generated.
Writing a custom Java application to ingest data through Solr's Java Client API (which is described in more detail in Client APIs). Using the Java API may be the best choice if you're working with an application, such as a Content Management System (CMS), that offers a Java API.
Regardless of the method used to ingest data, there is a common basic data structure for data being fed into a Solr index: a document containing multiple fields, each with a name and containing content, which may be empty. One of the fields is usually designated as a unique ID field (analogous to a primary key in a database), although the use of a unique ID field is not strictly required by Solr.
If the field name is defined in the schema.xml file that is associated with the index, then the analysis steps associated with that field will be applied to its content when the content is tokenized. Fields that are not explicitly defined in the schema will either be ignored or mapped to a dynamic field definition (see Documents, Fields, and Schema Design), if one matching the field name exists.
For more information on indexing in Solr, see the Solr Wiki.
The Solr Example Directory
When starting Solr with the "-e" option, the example/ directory will be used as base directory for the example Solr instances that are created. This directory also includes an example/exampledocs/ subdirectory containing sample documents in a variety of formats that you can use to experiment with indexing into the various examples.
The curl Utility for Transferring Files
Many of the instructions and examples in this section make use of the curl utility for transferring content through a URL. curl posts and retrieves data over HTTP, FTP, and many other protocols. Most Linux distributions include a copy of curl. You'll find curl downloads for Linux, Windows, and many other operating systems at http://curl.haxx.se/download.html. Documentation for curl is available here: http://curl.haxx.se/docs/manpage.html.
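For example, a simple query against a locally running Solr instance (the collection name here is illustrative) can be issued with:
curl "http://localhost:8983/solr/techproducts/select?q=*:*&wt=json"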
Post Tool
Solr includes a simple command line tool for POSTing various types of content to a Solr server. The tool is bin/post. The bin/post tool is a Unix shell script; for Windows (non-Cygwin) usage, see the Windows section below.
To run it, open a window and enter:
bin/post -c gettingstarted example/films/films.json
This will contact the server at localhost:8983. Specifying the collection/core name is mandatory. The '-help' (or simply '-h') option will output information on its usage (i.e., bin/post -help).
Using the bin/post Tool
Specifying either the collection/core name or the full update url is mandatory when using bin/post.
Using curl or other command line tools for posting data is just fine for examples or tests, but it's not the recommended method for achieving the best performance for updates in production environments. You will achieve better performance with Solr Cell or the other methods described in this section.
Instead of curl, you can use utilities such as GNU wget (http://www.gnu.org/software/wget/) or manage GETs and POSTs with Perl, although the command line options will differ.
The basic usage of bin/post is:
$ bin/post -help
Usage: post -c <collection> [OPTIONS] <files|directories|urls|-d ["...",...]>
or post -help
collection name defaults to DEFAULT_SOLR_COLLECTION if not specified
OPTIONS
=======
Solr options:
-url <base Solr update URL> (overrides collection, host, and port)
-host <host> (default: localhost)
-port <port> (default: 8983)
-commit yes|no (default: yes)
Web crawl options:
-recursive <depth> (default: 1)
-delay <seconds> (default: 10)
Directory crawl options:
-delay <seconds> (default: 0)
stdin/args options:
-type <content/type> (default: application/xml)
Other options:
-filetypes <type>[,<type>,...] (default:
xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt
,log)
-params "<key>=<value>[&<key>=<value>...]" (values must be URL-encoded; these
pass through to Solr update request)
-out yes|no (default: no; yes outputs Solr response to console)
...
Examples
There are several ways to use bin/post. This section presents several examples.
Indexing XML
Add all documents with file extension .xml to the collection or core named gettingstarted.
bin/post -c gettingstarted *.xml
Add all documents with file extension .xml to the gettingstarted collection/core on Solr running on port 8984.
bin/post -c gettingstarted -port 8984 *.xml
Send XML arguments to delete a document from gettingstarted.
bin/post -c gettingstarted -d '<delete><id>42</id></delete>'
Indexing CSV
Index all CSV files into gettingstarted:
bin/post -c gettingstarted *.csv
Index a tab-separated file into gettingstarted:
bin/post -c signals -params "separator=%09" -type text/csv data.tsv
The content type (-type) parameter is required to treat the file as the proper type, otherwise it will be ignored and a WARNING logged as it does not know what type of content a .tsv file is. The CSV handler supports the separator parameter, and it is passed through using the -params setting.
Indexing JSON
Index all JSON files into gettingstarted.
bin/post -c gettingstarted *.json
Indexing rich documents (PDF, Word, HTML, etc)
Index a PDF file into gettingstarted.
bin/post -c gettingstarted a.pdf
Automatically detect content types in a folder, and recursively scan it for documents for indexing into gettingstarted.
bin/post -c gettingstarted afolder/
Automatically detect content types in a folder, but limit it to PPT and HTML files and index into gettingstarted.
bin/post -c gettingstarted -filetypes ppt,html afolder/
Windows support
bin/post currently exists only as a Unix shell script; however, it delegates its work to a cross-platform capable Java program. The SimplePostTool can be run directly in supported environments, including Windows.
SimplePostTool
The bin/post script currently delegates to a standalone Java program called SimplePostTool. This tool, bundled into an executable JAR, can be run directly using java -jar example/exampledocs/post.jar. See the help output and take it from there to post files, recurse a website or file system folder, or send direct commands to a Solr server.
$ java -jar example/exampledocs/post.jar -h
SimplePostTool version 5.0.0
Usage: java [SystemProperties] -jar post.jar [-h|-] [<file|folder|url|arg>
[<file|folder|url|arg>...]]
.
.
.
Uploading Data with Index Handlers
Index Handlers are Request Handlers designed to add, delete and update documents to the index. In addition to having plugins for importing rich documents using Tika or from structured data sources using the Data Import Handler, Solr natively supports indexing structured documents in XML, CSV and JSON.
The recommended way to configure and use request handlers is with path based names that map to paths in the request url. However, request handlers can also be specified with the qt (query type) parameter if the request Dispatcher is appropriately configured. It is possible to access the same handler using more than one name, which can be useful if you wish to specify different sets of default options.
A single unified update request handler supports XML, CSV, JSON, and javabin update requests, delegating to the appropriate ContentStreamLoader based on the Content-Type of the ContentStream.
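To illustrate this delegation (the collection name and documents below are placeholders), the same /update endpoint accepts different formats depending on the Content-Type header:
curl http://localhost:8983/solr/my_collection/update -H "Content-Type: text/xml" --data-binary '<add><doc><field name="id">1</field></doc></add>'
curl http://localhost:8983/solr/my_collection/update -H "Content-Type: application/json" --data-binary '[{"id":"2"}]'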
Topics covered in this section:
UpdateRequestHandler Configuration
XML Formatted Index Updates
Adding Documents
XML Update Commands
Using curl to Perform Updates
Using XSLT to Transform XML Index Updates
JSON Formatted Index Updates
Solr-Style JSON
JSON Update Convenience Paths
Transforming and Indexing Custom JSON
CSV Formatted Index Updates
CSV Update Parameters
Indexing Tab-Delimited files
CSV Update Convenience Paths
Nested Child Documents
UpdateRequestHandler Configuration
The default configuration file has the update request handler configured by default.
<requestHandler name="/update" class="solr.UpdateRequestHandler" />
XML Formatted Index Updates
Index update commands can be sent as XML messages to the update handler using Content-type: application/xml or Content-type: text/xml.
Adding Documents
The XML schema recognized by the update handler for adding documents is very straightforward:
The <add> element introduces one or more documents to be added.
The <doc> element introduces the fields making up a document.
The <field> element presents the content for a specific field.
For example:
<add>
<doc>
<field name="authors">Patrick Eagar</field>
<field name="subject">Sports</field>
<field name="dd">796.35</field>
<field name="numpages">128</field>
<field name="desc"></field>
<field name="price">12.40</field>
<field name="title" boost="2.0">Summer of the all-rounder: Test and championship
cricket in England 1982</field>
<field name="isbn">0002166313</field>
<field name="yearpub">1982</field>
<field name="publisher">Collins</field>
</doc>
<doc boost="2.5">
...
</doc>
</add>
Each element has certain optional attributes which may be specified.
Command | Optional Parameter | Parameter Description
<add> | commitWithin=number | Add the document within the specified number of milliseconds
<add> | overwrite=boolean | Default is true. Indicates if the unique key constraints should be checked to overwrite previous versions of the same document (see below)
<doc> | boost=float | Default is 1.0. Sets a boost value for the document. To learn more about boosting, see Searching.
<field> | boost=float | Default is 1.0. Sets a boost value for the field.
If the document schema defines a unique key, then by default an /update operation to add a document will overwrite (i.e., replace) any document in the index with the same unique key. If no unique key has been defined, indexing performance is somewhat faster, as no check has to be made for an existing document to replace.
If you have a unique key field, but you feel confident that you can safely bypass the uniqueness check (e.g., you build your indexes in batch, and your indexing code guarantees it never adds the same document more than once) you can specify the overwrite="false" option when adding your documents.
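As an illustrative sketch only (the values are arbitrary), these optional attributes can be combined on a single add request:
<add commitWithin="10000" overwrite="false">
<doc boost="2.5">
<field name="id">0002166313</field>
<field name="title" boost="2.0">Summer of the all-rounder</field>
</doc>
</add>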
XML Update Commands
Commit and Optimize Operations
The <commit> operation writes all documents loaded since the last commit to one or more segment files on the disk. Before a commit has been issued, newly indexed content is not visible to searches. The commit operation opens a new searcher, and triggers any event listeners that have been configured.
Commits may be issued explicitly with a <commit/> message, and can also be triggered from <autoCommit> parameters in solrconfig.xml.
The <optimize> operation requests Solr to merge internal data structures in order to improve search performance. For a large index, optimization will take some time to complete, but by merging many small segment files into a larger one, search performance will improve. If you are using Solr's replication mechanism to distribute searches across many systems, be aware that after an optimize, a complete index will need to be transferred. In contrast, post-commit transfers are usually much smaller.
The <commit> and <optimize> elements accept these optional attributes:
Optional Attribute | Description
waitSearcher | Default is true. Blocks until a new searcher is opened and registered as the main query searcher, making the changes visible.
expungeDeletes | (commit only) Default is false. Merges segments that have more than 10% deleted docs, expunging them in the process.
maxSegments | (optimize only) Default is 1. Merges the segments down to no more than this number of segments.
Here are examples of <commit> and <optimize> using optional attributes:
<commit waitSearcher="false"/>
<commit waitSearcher="false" expungeDeletes="true"/>
<optimize waitSearcher="false"/>
Delete Operations
Documents can be deleted from the index in two ways. "Delete by ID" deletes the document with the specified ID, and can be used only if a UniqueID field has been defined in the schema. "Delete by Query" deletes all documents matching a specified query, although commitWithin is ignored for a Delete by Query. A single delete message can contain multiple delete operations.
<delete>
<id>0002166313</id>
<id>0031745983</id>
<query>subject:sport</query>
<query>publisher:penguin</query>
</delete>
Rollback Operations
The rollback command rolls back all adds and deletes made to the index since the last commit. It neither calls any event listeners nor creates a new searcher. Its syntax is simple: <rollback/>.
Using curl to Perform Updates
You can use the curl utility to perform any of the above commands, using its --data-binary option to append the XML message to the curl command, generating an HTTP POST request. For example:
curl http://localhost:8983/solr/my_collection/update -H "Content-Type: text/xml"
--data-binary '
<add>
<doc>
<field name="authors">Patrick Eagar</field>
<field name="subject">Sports</field>
<field name="dd">796.35</field>
<field name="isbn">0002166313</field>
<field name="yearpub">1982</field>
<field name="publisher">Collins</field>
</doc>
</add>'
For posting XML messages contained in a file, you can use the alternative form:
curl http://localhost:8983/solr/my_collection/update -H "Content-Type: text/xml"
--data-binary @myfile.xml
Short requests can also be sent using a HTTP GET command, URL-encoding the request, as in the following.
Note the escaping of "<" and ">":
curl http://localhost:8983/solr/my_collection/update?stream.body=%3Ccommit/%3E
Responses from Solr take the form shown here:
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">127</int>
</lst>
</response>
The status field will be non-zero in case of failure.
Using XSLT to Transform XML Index Updates
The UpdateRequestHandler allows you to index any arbitrary XML by using the tr parameter to apply an XSL transformation. You must have an XSLT stylesheet in the conf/xslt directory of your config set that can transform the incoming data to the expected <add><doc/></add> format, and use the tr parameter to specify the name of that stylesheet.
Here is an example XSLT stylesheet:
<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
<xsl:output media-type="text/xml" method="xml" indent="yes"/>
<xsl:template match='/'>
<add>
<xsl:apply-templates select="response/result/doc"/>
</add>
</xsl:template>
<!-- Ignore score (makes no sense to index) -->
<xsl:template match="doc/*[@name='score']" priority="100">
</xsl:template>
<xsl:template match="doc">
<xsl:variable name="pos" select="position()"/>
<doc>
<xsl:apply-templates>
<xsl:with-param name="pos"><xsl:value-of select="$pos"/></xsl:with-param>
</xsl:apply-templates>
</doc>
</xsl:template>
<!-- Flatten arrays to duplicate field lines -->
<xsl:template match="doc/arr" priority="100">
<xsl:variable name="fn" select="@name"/>
<xsl:for-each select="*">
<xsl:element name="field">
<xsl:attribute name="name"><xsl:value-of
select="$fn"/></xsl:attribute>
<xsl:value-of select="."/>
</xsl:element>
</xsl:for-each>
</xsl:template>
<xsl:template match="doc/*">
<xsl:variable name="fn" select="@name"/>
<xsl:element name="field">
<xsl:attribute name="name"><xsl:value-of select="$fn"/></xsl:attribute>
<xsl:value-of select="."/>
</xsl:element>
</xsl:template>
<xsl:template match="*"/>
</xsl:stylesheet>
This stylesheet transforms Solr's XML search result format into Solr's Update XML syntax. One example usage
would be to copy a Solr 1.3 index (which does not have the CSV response writer) into a format which can be indexed into another Solr instance (provided that all fields are stored):
http://localhost:8983/solr/my_collection/select?q=*:*&wt=xslt&tr=updateXml.xsl&rows=
1000
You can also use the stylesheet in the XsltUpdateRequestHandler to transform an index when updating:
curl "http://localhost:8983/solr/my_collection/update?commit=true&tr=updateXml.xsl"
-H "Content-Type: text/xml" --data-binary @myexporteddata.xml
For more information about the XML Update Request Handler, see https://wiki.apache.org/solr/UpdateXmlMessages.
JSON Formatted Index Updates
Solr can accept JSON that conforms to a defined structure, or can accept arbitrary JSON-formatted documents.
If sending arbitrarily formatted JSON, there are some additional parameters that need to be sent with the update
request, described below in the section Transforming and Indexing Custom JSON.
Solr-Style JSON
JSON formatted update requests may be sent to Solr's /update handler using Content-Type: application/json or Content-Type: text/json.
JSON formatted updates can take 3 basic forms, described in depth below:
A single document to add, expressed as a top level JSON Object. To differentiate this from a set of commands, the json.command=false request parameter is required.
A list of documents to add, expressed as a top level JSON Array containing a JSON Object per document.
A sequence of update commands, expressed as a top level JSON Object (aka: Map).
Adding a Single JSON Document
The simplest way to add Documents via JSON is to send each document individually as a JSON Object, using the /update/json/docs path:
curl -X POST -H 'Content-Type: application/json'
'http://localhost:8983/solr/my_collection/update/json/docs' --data-binary '
{
"id": "1",
"title": "Doc 1"
}'
Adding Multiple JSON Documents
Adding multiple documents at one time via JSON can be done via a JSON Array of JSON Objects, where each
object represents a document:
curl -X POST -H 'Content-Type: application/json'
'http://localhost:8983/solr/my_collection/update' --data-binary '
[
{
"id": "1",
"title": "Doc 1"
},
{
"id": "2",
"title": "Doc 2"
}
]'
A sample JSON file is provided at example/exampledocs/books.json and contains an array of objects that you can add to the Solr techproducts example:
curl 'http://localhost:8983/solr/techproducts/update?commit=true' --data-binary
@example/exampledocs/books.json -H 'Content-type:application/json'
Sending JSON Update Commands
In general, the JSON update syntax accepts all of the update commands that the XML update handler supports, through a straightforward mapping. Multiple commands, adding and deleting documents, may be contained in one message:
curl -X POST -H 'Content-Type: application/json'
'http://localhost:8983/solr/my_collection/update' --data-binary '
{
"add": {
"doc": {
"id": "DOC1",
"my_boosted_field": { /* use a map with boost/value for a boosted field
*/
"boost": 2.3,
"value": "test"
},
"my_multivalued_field": [ "aaa", "bbb" ] /* Can use an array for a
multi-valued field */
}
},
"add": {
"commitWithin": 5000, /* commit this document within 5 seconds */
"overwrite": false, /* don't check for existing documents with the
same uniqueKey */
"boost": 3.45, /* a document boost */
"doc": {
"f1": "v1", /* Can use repeated keys for a multi-valued field
*/
"f1": "v2"
}
},
"commit": {},
"optimize": { "waitSearcher":false },
"delete": { "id":"ID" }, /* delete by ID */
"delete": { "query":"QUERY" } /* delete by query */
}'
As with other update handlers, parameters such as commit, commitWithin, optimize, and overwrite may be specified in the URL instead of in the body of the message.
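For example (the collection name here is a placeholder), a commit can be requested through the URL rather than the message body:
curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/my_collection/update?commit=true' --data-binary '[{"id":"1","title":"Doc 1"}]'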
The JSON update format allows for a simple delete-by-id. The value of a delete can be an array which contains a list of zero or more specific document id's (not a range) to be deleted. For example, a single document:
Comments are not allowed in JSON, but duplicate names are.
The comments in the above example are for illustrative purposes only, and can not be included in actual
commands sent to Solr.
{ "delete":"myid" }
Or a list of document IDs:
{ "delete":["id1","id2"] }
The value of a "delete" can be an array which contains a list of zero or more id's to be deleted. It is not a range
(start and end).
You can also specify _version_ with each "delete":
{
  "delete": { "id":50, "_version_":12345 }
}
You can specify the version of deletes in the body of the update request as well.
JSON Update Convenience Paths
In addition to the /update handler, there are a few additional JSON-specific request handler paths available by default in Solr, which implicitly override the behavior of some request parameters:
Path | Default Parameters
/update/json | stream.contentType=application/json
/update/json/docs | stream.contentType=application/json, json.command=false
The /update/json path may be useful for clients sending in JSON formatted update commands from applications where setting the Content-Type proves difficult, while the /update/json/docs path can be particularly convenient for clients that always want to send in documents – either individually or as a list – without needing to worry about the full JSON command syntax.
Transforming and Indexing Custom JSON
If you have JSON documents that you would like to index without transforming them into Solr's structure, you can
add them to Solr by including some parameters with the update request. These parameters provide information
on how to split a single JSON file into multiple Solr documents and how to map fields to Solr's schema. One or
more valid JSON documents can be sent to the /update/json/docs path with the configuration params.
Mapping Parameters
These parameters allow you to define how a JSON file should be read for multiple Solr documents.
split: Defines the path at which to split the input JSON into multiple Solr documents and is required if you have multiple documents in a single JSON file. If the entire JSON makes a single Solr document, the path must be "/".
f: This is a multivalued mapping parameter. At least one field mapping must be provided. The format of the parameter is target-field-name:json-path. The json-path is required. The target-field-name is the Solr document field name, and is optional. If not specified, it is automatically derived from the input JSON. Wildcards can be used here; see the Wildcards section below for more information.
mapUniqueKeyOnly (boolean): This parameter is particularly convenient when the fields in the input JSON are not available in the schema and schemaless mode is not enabled. This will index all the fields into the default search field (using the df parameter, below) and only the uniqueKey field is mapped to the corresponding field in the schema. If the input JSON does not have a value for the uniqueKey field then a UUID is generated for the same.
df: If the mapUniqueKeyOnly flag is used, the update handler needs a field where the data should be indexed to. This is the same field that other handlers use as a default search field.
srcField: This is the name of the field into which the JSON source will be stored. This can only be used if split=/ (i.e., you want your JSON input file to be indexed as a single Solr document). Note that atomic updates will cause the field to be out-of-sync with the document.
echo: This is for debugging purposes only. Set it to true if you want the docs to be returned as a response.
Nothing will be indexed.
For example, if we have a JSON file that includes two documents, we could define an update request like this:
curl 'http://localhost:8983/solr/my_collection/update/json/docs'\
'?split=/exams'\
'&f=first:/first'\
'&f=last:/last'\
'&f=grade:/grade'\
'&f=subject:/exams/subject'\
'&f=test:/exams/test'\
'&f=marks:/exams/marks'\
-H 'Content-type:application/json' -d '
{
"first": "John",
"last": "Doe",
"grade": 8,
"exams": [
{
"subject": "Maths",
"test" : "term1",
"marks" : 90},
{
"subject": "Biology",
"test" : "term1",
"marks" : 86}
]
}'
With this request, we have defined that "exams" contains multiple documents. In addition, we have mapped
several fields from the input document to Solr fields.
When the update request is complete, the following two documents will be added to the index:
{
"first":"John",
"last":"Doe",
"marks":90,
"test":"term1",
"subject":"Maths",
"grade":8
}
{
"first":"John",
"last":"Doe",
"marks":86,
"test":"term1",
"subject":"Biology",
"grade":8
}
In the prior example, all of the fields we wanted to use in Solr had the same names as they did in the input
JSON. When that is the case, we can simplify the request as follows:
curl 'http://localhost:8983/solr/my_collection/update/json/docs'\
'?split=/exams'\
'&f=/first'\
'&f=/last'\
'&f=/grade'\
'&f=/exams/subject'\
'&f=/exams/test'\
'&f=/exams/marks'\
-H 'Content-type:application/json' -d '
{
"first": "John",
"last": "Doe",
"grade": 8,
"exams": [
{
"subject": "Maths",
"test" : "term1",
"marks" : 90},
{
"subject": "Biology",
"test" : "term1",
"marks" : 86}
]
}'
In this example, we simply named the field paths (such as /exams/test). Solr will automatically attempt to add the content of the field from the JSON input to the index in a field with the same name.
Wildcards
Note that if you are not working in Schemaless Mode, where fields that don't exist will be created on the fly with Solr's best guess for the field type, documents may get rejected if the fields do not exist in the schema before indexing.
Instead of specifying all the field names explicitly, it is possible to specify wildcards to map fields automatically.
There are two restrictions: wildcards can only be used at the end of the json-path, and the split path cannot
use wildcards. A single asterisk "*" maps only to direct children, and a double asterisk "**" maps recursively to all
descendants. The following are example wildcard path mappings:
f=/docs/*: maps all the fields under docs and in the name as given in json
f=/docs/**: maps all the fields under docs and its children in the name as given in json
f=searchField:/docs/*: maps all fields under /docs to a single field called ‘searchField’
f=searchField:/docs/**: maps all fields under /docs and its children to searchField
f=$FQN:/**: maps all fields to the fully qualified name ($FQN) of the JSON field. The fully qualified name is obtained by concatenating all the keys in the hierarchy with a period (.) as a delimiter.
(Note: The default value of f is '$FQN:/**' from Solr 5.0. It used to be '/**' in 4.10.x releases. This breaks backward compatibility. If you wish to have the old behavior, please specify f=/** explicitly.)
With wildcards we can further simplify our previous example as follows:
curl 'http://localhost:8983/solr/my_collection/update/json/docs'\
'?split=/exams'\
'&f=/**'\
-H 'Content-type:application/json' -d '
{
"first": "John",
"last": "Doe",
"grade": 8,
"exams": [
{
"subject": "Maths",
"test" : "term1",
"marks" : 90},
{
"subject": "Biology",
"test" : "term1",
"marks" : 86}
]
}'
Because we want the fields to be indexed with the field names as they are found in the JSON input, the double wildcard in f=/** will map all fields and their descendants to the same fields in Solr.
It is also possible to send all the values to a single field and do a full text search on that. This is a good option to
blindly index and query JSON documents without worrying about fields and schema.
curl 'http://localhost:8983/solr/my_collection/update/json/docs'\
'?split=/'\
'&f=txt:/**'\
-H 'Content-type:application/json' -d '
{
"first": "John",
"last": "Doe",
"grade": 8,
"exams": [
{
"subject": "Maths",
"test" : "term1",
"marks" : 90},
{
"subject": "Biology",
"test" : "term1",
"marks" : 86}
]
}'
In the above example, we've said all of the fields should be added to a field in Solr named 'txt'. This will add
multiple fields to a single field, so whatever field you choose should be multi-valued.
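A hedged sketch of a matching schema.xml definition for such a catch-all field (the field name 'txt' and the field type shown are illustrative; adjust them to your own schema):
<field name="txt" type="text_general" indexed="true" stored="true" multiValued="true"/>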
The default behavior is to use the fully qualified name (FQN) of the node. So, if we don't define any field
mappings, like this:
curl 'http://localhost:8983/solr/my_collection/update/json/docs?split=/exams'\
-H 'Content-type:application/json' -d '
{
"first": "John",
"last": "Doe",
"grade": 8,
"exams": [
{
"subject": "Maths",
"test" : "term1",
"marks" : 90},
{
"subject": "Biology",
"test" : "term1",
"marks" : 86}
]
}'
The indexed documents would be added to the index with fields that look like this:
{
"first":"John",
"last":"Doe",
"grade":8,
"exams.subject":"Maths",
"exams.test":"term1",
"exams.marks":90},
{
"first":"John",
"last":"Doe",
"grade":8,
"exams.subject":"Biology",
"exams.test":"term1",
"exams.marks":86}
Setting JSON Defaults
It is possible to send any json to the /update/json/docs endpoint, and the default configuration of the component is as follows:
<initParams path="/update/json/docs">
<lst name="defaults">
<!-- this ensures that the entire json doc will be stored verbatim into one
field -->
<str name="srcField">_src_</str>
<!-- This means the uniqueKey field will be extracted from the fields and
all fields go into the 'df' field. In this config df is already configured to
be 'text'
-->
<str name="mapUniqueKeyOnly">true</str>
<!-- The default search field where all the values are indexed to
-->
<str name="df">text</str>
</lst>
</initParams>
So, if no params are passed, the entire json file would get indexed to the _src_ field and all the values in the input JSON would go to a field named text. If there is a value for the uniqueKey it is stored, and if no value could be obtained from the input JSON, a UUID is created and used as the uniqueKey field value.
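As an illustration of these defaults (the collection name and content are placeholders, and 'id' is assumed to be the schema's uniqueKey), arbitrary JSON can be posted with no extra parameters:
curl 'http://localhost:8983/solr/my_collection/update/json/docs?commit=true' -H 'Content-type:application/json' -d '
{
  "id": "1",
  "details": { "color": "red", "size": "XL" }
}'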
CSV Formatted Index Updates
CSV formatted update requests may be sent to Solr's /update handler using Content-Type: application/csv or Content-Type: text/csv.
A sample CSV file is provided at example/exampledocs/books.csv that you can use to add some documents to the Solr techproducts example:
curl 'http://localhost:8983/solr/techproducts/update?commit=true' --data-binary
@example/exampledocs/books.csv -H 'Content-type:application/csv'
CSV Update Parameters
The CSV handler allows the specification of many parameters in the URL in the form: f.parameter.optional_fieldname=value.
The table below describes the parameters for the update handler.
Parameter | Usage | Global (g) or Per Field (f) | Example
separator | Character used as field separator; default is "," | g, (f: see split) | separator=%09
trim | If true, remove leading and trailing whitespace from values. Default=false. | g,f | f.isbn.trim=true, trim=false
header | Set to true if first line of input contains field names. These will be used if the fieldnames parameter is absent. | g |
fieldnames | Comma separated list of field names to use when adding documents. | g | fieldnames=isbn,price,title
literal.<field_name> | Comma separated list of field names to use when processing literal values. | g | literal.color=red,blue,black
skip | Comma separated list of field names to skip. | g | skip=uninteresting,shoesize
skipLines | Number of lines to discard in the input stream before the CSV data starts, including the header, if present. Default=0. | g | skipLines=5
encapsulator | The character optionally used to surround values to preserve characters such as the CSV separator or whitespace. This standard CSV format handles the encapsulator itself appearing in an encapsulated value by doubling the encapsulator. | g, (f: see split) | encapsulator="
escape | The character used for escaping CSV separators or other reserved characters. If an escape is specified, the encapsulator is not used unless also explicitly specified since most formats use either encapsulation or escaping, not both. | g | escape=\
keepEmpty | Keep and index zero length (empty) fields. Default=false. | g,f | f.price.keepEmpty=true
map | Map one value to another. Format is value:replacement (which can be empty.) | g,f | map=left:right, f.subject.map=history:bunk
split | If true, split a field into multiple values by a separate parser. | f |
overwrite | If true (the default), check for and overwrite duplicate documents, based on the uniqueKey field declared in the Solr schema. If you know the documents you are indexing do not contain any duplicates then you may see a considerable speed up setting this to false. | g |
commit | Issues a commit after the data has been ingested. | g |
commitWithin | Add the document within the specified number of milliseconds. | g | commitWithin=10000
rowid | Map the rowid (line number) to a field specified by the value of the parameter, for instance if your CSV doesn't have a unique key and you want to use the row id as such. | g | rowid=id
rowidOffset | Add the given offset (as an int) to the rowid before adding it to the document. Default is 0. | g | rowidOffset=10
Indexing Tab-Delimited files
The same feature used to index CSV documents can also be easily used to index tab-delimited files (TSV files)
and even handle backslash escaping rather than CSV encapsulation.
For example, one can dump a MySQL table to a tab delimited file with:
SELECT * INTO OUTFILE '/tmp/result.txt' FROM mytable;
This file could then be imported into Solr by setting the separator to tab (%09) and the escape to backslash (%5c).
curl 'http://localhost:8983/solr/update/csv?commit=true&separator=%09&escape=%5c'
--data-binary @/tmp/result.txt
CSV Update Convenience Paths
In addition to the /update handler, there is an additional CSV-specific request handler path available by default in Solr, which implicitly overrides the behavior of some request parameters:
Path | Default Parameters
/update/csv | stream.contentType=application/csv
The /update/csv path may be useful for clients sending in CSV formatted update commands from applications where setting the Content-Type proves difficult.
For more information on the CSV Update Request Handler, see https://wiki.apache.org/solr/UpdateCSV.
Nested Child Documents
Solr supports nested documents, indexed using a "Block Join", as a way to model documents containing other documents, such as a blog post parent document and comments as child documents -- or products as parent documents and sizes, colors, or other variations as child documents. At query time, the Block Join Query Parsers can be used to search against these relationships. In terms of performance, indexing the relationships between documents may be more efficient than attempting to do joins only at query time, since the relationships are already stored in the index and do not need to be computed.
Nested documents may be indexed via either the XML or JSON data syntax (or using SolrJ) - but regardless of syntax, you must include a field that identifies the parent document as a parent; it can be any field that suits this purpose, and it will be used as input for the block join query parsers.
XML Examples
For example, here are two documents and their child documents:
<add>
<doc>
<field name="id">1</field>
<field name="title">Solr adds block join support</field>
<field name="content_type">parentDocument</field>
<doc>
<field name="id">2</field>
<field name="comments">SolrCloud supports it too!</field>
</doc>
</doc>
<doc>
<field name="id">3</field>
<field name="title">New Lucene and Solr release is out</field>
<field name="content_type">parentDocument</field>
<doc>
<field name="id">4</field>
<field name="comments">Lots of new features</field>
</doc>
</doc>
</add>
In this example, we have indexed the parent documents with the field content_type, which has the value "parentDocument". We could have also used a boolean field, such as isParent, with a value of "true", or any other similar approach.
JSON Examples
This example is equivalent to the XML example above; note the special _childDocuments_ key needed to indicate the nested documents in JSON.
[
{
"id": "1",
"title": "Solr adds block join support",
"content_type": "parentDocument",
"_childDocuments_": [
{
"id": "2",
"comments": "SolrCloud supports it too!"
}
]
},
{
"id": "3",
"title": "New Lucene and Solr release is out",
"content_type": "parentDocument",
"_childDocuments_": [
{
"id": "4",
"comments": "Lots of new features"
}
]
}
]
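Once the documents are indexed, the relationships can be searched with the block join query parsers. As an illustrative sketch (the exact parser options are covered in the section on block join query parsers), a query such as the following returns the parent documents whose child comments mention SolrCloud:
q={!parent which="content_type:parentDocument"}comments:SolrCloud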
Uploading Data with Solr Cell using Apache Tika
Solr uses code from the Apache Tika project to provide a framework for incorporating many different file-format parsers such as Apache PDFBox and Apache POI into Solr itself. Working with this framework, Solr's ExtractingRequestHandler can use Tika to support uploading binary files, including files in popular formats such as Word and PDF, for data extraction and indexing.
When this framework was under development, it was called the Solr Content Extraction Library or CEL; from that abbreviation came this framework's name: Solr Cell.
If you want to supply your own ContentHandler for Solr to use, you can extend the ExtractingRequestHandler and override the createFactory() method. This factory is responsible for constructing the SolrContentHandler that interacts with Tika, and allows literals to override Tika-parsed values. Set the literalsOverride parameter, which normally defaults to "true", to "false" to append Tika-parsed values to literal values.
For more information on Solr's Extracting Request Handler, see https://wiki.apache.org/solr/ExtractingRequestHandler.
Topics covered in this section:
Key Concepts
Trying out Tika with the Solr techproducts Example
Input Parameters
Order of Operations
Configuring the Solr ExtractingRequestHandler
Indexing Encrypted Documents with the ExtractingUpdateRequestHandler
Examples
Sending Documents to Solr with a POST
Sending Documents to Solr with Solr Cell and SolrJ
Related Topics
Key Concepts
When using the Solr Cell framework, it is helpful to keep the following in mind:
Tika will automatically attempt to determine the input document type (Word, PDF, HTML) and extract the content appropriately. If you like, you can explicitly specify a MIME type for Tika with the stream.type parameter.
Tika works by producing an XHTML stream that it feeds to a SAX ContentHandler. SAX is a common interface implemented for many different XML parsers. For more information, see http://www.saxproject.org/quickstart.html.
Solr then responds to Tika's SAX events and creates the fields to index.
Tika produces metadata such as Title, Subject, and Author according to specifications such as the DublinCore. See http://tika.apache.org/1.7/formats.html for the file types supported.
Tika adds all the extracted text to the content field.
You can map Tika's metadata fields to Solr fields. You can also boost these fields.
You can pass in literals for field values. Literals will override Tika-parsed values, including fields in the
Tika metadata object, the Tika content field, and any "captured content" fields.
You can apply an XPath expression to the Tika XHTML to restrict the content that is produced.
Trying out Tika with the Solr techproducts Example
You can try out the Tika framework using the techproducts example included in Solr.
Start the example:
bin/solr -e techproducts
While Apache Tika is quite powerful, it is not perfect and fails on some files. PDF files are particularly problematic, mostly due to the PDF format itself. In case of a failure processing any file, the ExtractingRequestHandler does not have a secondary mechanism to try to extract some text from the file; it will throw an exception and fail.
You can now use curl to send a sample PDF file via HTTP POST:
curl
'http://localhost:8983/solr/techproducts/update/extract?literal.id=doc1&commit=true'
-F "myfile=@example/exampledocs/solr-word.pdf"
The URL above calls the Extracting Request Handler, uploads the file solr-word.pdf and assigns it the unique ID doc1. Here's a closer look at the components of this command:
The parameter literal.id=doc1 provides the necessary unique ID for the document being indexed.
The commit=true parameter causes Solr to perform a commit after indexing the document, making it immediately searchable. For optimum performance when loading many documents, don't call the commit command until you are done.
The -F flag instructs curl to POST data using the Content-Type multipart/form-data and supports the uploading of binary files. The @ symbol instructs curl to upload the attached file.
The argument myfile=@tutorial.html needs a valid path, which can be absolute or relative.
You can also use bin/post to send a PDF file into Solr (without the params, the literal.id parameter would be set to the absolute path to the file):
bin/post -c techproducts example/exampledocs/solr-word.pdf -params "literal.id=a"
Now you should be able to execute a query and find that document. You can make a request like http://localhost:8983/solr/techproducts/select?q=pdf.
You may notice that although the content of the sample document has been indexed and stored, there are not a lot of metadata fields associated with this document. This is because unknown fields are ignored according to the default parameters configured for the /update/extract handler in solrconfig.xml, and this behavior can be easily changed or overridden. For example, to store and see all metadata and content, execute the following:
bin/post -c techproducts example/exampledocs/solr-word.pdf -params
"literal.id=doc1&uprefix=attr_"
In this command, the uprefix=attr_ parameter causes all generated fields that aren't defined in the schema to be prefixed with attr_, which is a dynamic field that is stored and indexed.
This command allows you to query the document using an attribute, as in: http://localhost:8983/solr/techproducts/select?q=attr_meta:microsoft.
Input Parameters
The table below describes the parameters accepted by the Extracting Request Handler.
Parameter Description
boost.<fieldname> Boosts the specified field by the defined float amount. (Boosting a field alters its importance in a query response. To learn about boosting fields, see Searching.)
capture Captures XHTML elements with the specified name for a supplementary addition
to the Solr document. This parameter can be useful for copying chunks of the
XHTML into a separate field. For instance, it could be used to grab paragraphs (<p>) and index them into a separate field. Note that content is still also captured into the overall "content" field.
captureAttr Indexes attributes of the Tika XHTML elements into separate fields, named after
the element. If set to true, for example, when extracting from HTML, Tika can
return the href attributes in <a> tags as fields named "a". See the examples below.
commitWithin Add the document within the specified number of milliseconds.
date.formats Defines the date format patterns to identify in the documents.
defaultField If the uprefix parameter (see below) is not specified and a field cannot be
determined, the default field will be used.
extractOnly Default is false. If true, returns the extracted content from Tika without indexing
the document. This literally includes the extracted XHTML as a string in the
response. When viewing manually, it may be useful to use a response format
other than XML to aid in viewing the embedded XHTML tags. For an example, see http://wiki.apache.org/solr/TikaExtractOnlyExampleOutput.
extractFormat Default is "xml", but the other option is "text". Controls the serialization format of the extracted content. The xml format is actually XHTML, the same format that results from passing the -x command to the Tika command line application, while the text format is like that produced by Tika's -t command. This parameter is valid only if extractOnly is set to true.
fmap.<source_field> Maps (moves) one field name to another. The source_field must be a field in incoming documents, and the value is the Solr field to map to. Example: fmap.content=text causes the data in the content field generated by Tika to be moved to Solr's text field.
literal.<fieldname> Populates a field with the name supplied with the specified value for each document. The data can be multivalued if the field is multivalued.
literalsOverride If true (the default), literal field values will override other values with the same field name. If false, literal values defined with literal.<fieldname> will be appended to data already in the fields extracted from Tika. If setting literalsOverride to "false", the field must be multivalued.
lowernames Values are "true" or "false". If true, all field names will be mapped to lowercase
with underscores, if needed. For example, "Content-Type" would be mapped to
"content_type."
multipartUploadLimitInKB Useful if uploading very large documents, this defines the KB size of documents to
allow.
passwordsFile Defines a file path and name for a file of file name to password mappings.
resource.name Specifies the optional name of the file. Tika can use it as a hint for detecting a
file's MIME type.
resource.password Defines a password to use for a password-protected PDF or OOXML file
tika.config Defines a file path and name to a customized Tika configuration file. This is only
required if you have customized your Tika implementation.
uprefix Prefixes all fields that are not defined in the schema with the given prefix. This is very useful when combined with dynamic field definitions. Example: uprefix=ignored_ would effectively ignore all unknown fields generated by Tika given the example schema contains <dynamicField name="ignored_*" type="ignored"/>
xpath When extracting, only return Tika XHTML content that satisfies the given XPath expression. See http://tika.apache.org/1.7/index.html for details on the format of Tika XHTML. See also http://wiki.apache.org/solr/TikaExtractOnlyExampleOutput.
Order of Operations
Here is the order in which the Solr Cell framework, using the Extracting Request Handler and Tika, processes its
input.
1. Tika generates fields or passes them in as literals specified by literal.<fieldname>=<value>. If literalsOverride=false, literals will be appended as multi-value to the Tika-generated field.
2. If lowernames=true, Tika maps fields to lowercase.
3. Tika applies the mapping rules specified by fmap.source=target parameters.
4. If uprefix is specified, any unknown field names are prefixed with that value, else if defaultField is specified, any unknown fields are copied to the default field.
Configuring the Solr ExtractingRequestHandler
If you are not working with the supplied sample_techproducts_configs or data_driven_schema_configs config set, you must configure your own solrconfig.xml to know about the JARs containing the ExtractingRequestHandler and its dependencies:
<lib dir="${solr.install.dir:../../..}/contrib/extraction/lib" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../..}/dist/" regex="solr-cell-\d.*\.jar" />
You can then configure the ExtractingRequestHandler in solrconfig.xml.
<requestHandler name="/update/extract"
class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
<lst name="defaults">
<str name="fmap.Last-Modified">last_modified</str>
<str name="uprefix">ignored_</str>
</lst>
<!--Optional. Specify a path to a tika configuration file. See the Tika docs for
details.-->
<str name="tika.config">/my/path/to/tika.config</str>
<!-- Optional. Specify one or more date formats to parse. See
DateUtil.DEFAULT_DATE_FORMATS
for default date formats -->
<lst name="date.formats">
<str>yyyy-MM-dd</str>
</lst>
</requestHandler>
In the defaults section, we are mapping Tika's Last-Modified Metadata attribute to a field named last_modified. We are also telling it to ignore undeclared fields. These default parameters can be overridden at request time.
The tika.config entry points to a file containing a Tika configuration. The date.formats entry allows you to specify various java.text.SimpleDateFormat patterns for transforming extracted input to a Date. Solr comes configured with the following date formats (see the DateUtil class in Solr):
yyyy-MM-dd'T'HH:mm:ss'Z'
yyyy-MM-dd'T'HH:mm:ss
yyyy-MM-dd
yyyy-MM-dd hh:mm:ss
yyyy-MM-dd HH:mm:ss
EEE MMM d hh:mm:ss z yyyy
EEE, dd MMM yyyy HH:mm:ss zzz
EEEE, dd-MMM-yy HH:mm:ss zzz
EEE MMM d HH:mm:ss yyyy
You may also need to adjust the multipartUploadLimitInKB attribute as follows if you are submitting very large documents.
<requestDispatcher handleSelect="true" >
<requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="20480" />
...
Multi-Core Configuration
For a multi-core configuration, you can specify sharedLib='lib' in the <solr/> section of solr.xml and place the necessary jar files there.
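As a minimal sketch, assuming the solr.xml format used by Solr 5, the shared library directory could be declared like this (the directory name "lib" is only an example, resolved relative to the Solr home):
<solr>
  <str name="sharedLib">lib</str>
</solr>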
For more information about Solr cores, see .The Well-Configured Solr Instance
Indexing Encrypted Documents with the ExtractingUpdateRequestHandler
The ExtractingRequestHandler will decrypt encrypted files and index their content if you supply a password in either resource.password on the request, or in a passwordsFile file.
In the case of , the file supplied must be formatted so there is one line per rule. Each rulepasswordsFile
contains a file name regular expression, followed by "=", then the password in clear-text. Because the passwords
are in clear-text, the file should have strict access restrictions.
# This is a comment
myFileName = myPassword
.*\.docx$ = myWordPassword
.*\.pdf$ = myPdfPassword
Examples
Metadata
As mentioned before, Tika produces metadata about the document. Metadata describes different aspects of a
document, such as the author's name, the number of pages, the file size, and so on. The metadata produced
depends on the type of document submitted. For instance, PDFs have different metadata than Word documents
do.
In addition to Tika's metadata, Solr adds the following metadata (defined in ExtractingMetadataConstants):
Solr Metadata Description
stream_name The name of the Content Stream as uploaded to Solr. Depending on how the file is
uploaded, this may or may not be set
stream_source_info Any source info about the stream. (See the section on Content Streams later in this
section.)
stream_size The size of the stream in bytes.
stream_content_type The content type of the stream, if available.
We recommend that you try using the extractOnly option to discover which values Solr is setting for these metadata elements.
Examples of Uploads Using the Extracting Request Handler
Capture and Mapping
The command below captures <div> tags separately, and then maps all the instances of that field to a dynamic field named foo_t.
bin/post -c techproducts example/exampledocs/sample.html -params
"literal.id=doc2&captureAttr=true&defaultField=_text_&fmap.div=foo_t&capture=div"
Capture, Mapping, and Boosting
The command below captures <div> tags separately, maps the field to a dynamic field named foo_t, then boosts foo_t by 3.
bin/post -c techproducts example/exampledocs/sample.html -params
"literal.id=doc3&captureAttr=true&defaultField=_text_&capture=div&fmap.div=foo_t&boo
st.foo_t=3"
Using Literals to Define Your Own Metadata
To add in your own metadata, pass in the literal parameter along with the file:
bin/post -c techproducts -params
"literal.id=doc4&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&boost
.foo_t=3&literal.blah_s=Bah" example/exampledocs/sample.html
XPath
The example below passes in an XPath expression to restrict the XHTML returned by Tika:
bin/post -c techproducts -params
"literal.id=doc5&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&boost
.foo_t=3&xpath=/xhtml:html/xhtml:body/xhtml:div//node()"
example/exampledocs/sample.html
Extracting Data without Indexing It
Solr allows you to extract data without indexing. You might want to do this if you're using Solr solely as an
extraction server or if you're interested in testing Solr extraction.
The example below sets the extractOnly=true parameter to extract data without indexing it.
curl "http://localhost:8983/solr/techproducts/update/extract?&extractOnly=true"
--data-binary @example/exampledocs/sample.html -H 'Content-type:text/html'
The output includes XML generated by Tika (and further escaped by Solr's XML) using a different output format
to make it more readable (-out yes instructs the tool to echo Solr's output to the console):
bin/post -c techproducts -params "extractOnly=true&wt=ruby&indent=true" -out yes
example/exampledocs/sample.html
Sending Documents to Solr with a POST
The example below streams the file as the body of the POST, which does not, then, provide information to Solr
about the name of the file.
curl
"http://localhost:8983/solr/techproducts/update/extract?literal.id=doc6&defaultField
=text&commit=true" --data-binary @example/exampledocs/sample.html -H
'Content-type:text/html'
Sending Documents to Solr with Solr Cell and SolrJ
SolrJ is a Java client that you can use to add documents to the index, update the index, or query the index. You'll
find more information on SolrJ in .Client APIs
Here's an example of using Solr Cell and SolrJ to add documents to a Solr index.
First, let's use SolrJ to create a new SolrClient, then we'll construct a request containing a ContentStream (essentially a wrapper around a file) and send it to Solr:
import java.io.File;
import java.io.IOException;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.handler.extraction.ExtractingParams; // from the solr-cell jar

public class SolrCellRequestDemo {
  public static void main (String[] args) throws IOException, SolrServerException {
    SolrClient server = new HttpSolrClient("http://localhost:8983/solr/my_collection");
    ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
    // The second argument is the content type of the uploaded file.
    req.addFile(new File("my-file.pdf"), "application/pdf");
    req.setParam(ExtractingParams.EXTRACT_ONLY, "true");
    NamedList<Object> result = server.request(req);
    System.out.println("Result: " + result);
  }
}
This operation streams the file my-file.pdf into the Solr index for my_collection.
The sample code above calls the extract command, but you can easily substitute other commands that are supported by Solr Cell. The key class to use is the ContentStreamUpdateRequest, which makes sure the ContentStreams are set properly. SolrJ takes care of the rest.
Note that the ContentStreamUpdateRequest is not just specific to Solr Cell. You can send CSV to the CSV Update handler and to any other Request Handler that works with Content Streams for updates.
Related Topics
ExtractingRequestHandler
Uploading Structured Data Store Data with the Data Import
Handler
Many search applications store the content to be indexed in a structured data store, such as a relational
database. The Data Import Handler (DIH) provides a mechanism for importing content from a data store and
indexing it. In addition to relational databases, DIH can index content from HTTP based data sources such as
RSS and ATOM feeds, e-mail repositories, and structured XML where an XPath processor is used to generate
fields.
The example/example-DIH directory contains several collections that demonstrate many of the features of the Data Import Handler. To run this "dih" example:
bin/solr -e dih
For more information about the Data Import Handler, see .https://wiki.apache.org/solr/DataImportHandler
Topics covered in this section:
Concepts and Terminology
Configuration
Data Import Handler Commands
Property Writer
Data Sources
Entity Processors
Transformers
Special Commands for the Data Import Handler
Concepts and Terminology
Descriptions of the Data Import Handler use several familiar terms, such as entity and processor, in specific
ways, as explained in the table below.
Term Definition
Datasource As its name suggests, a datasource defines the location of the data of interest. For a database,
it's a DSN. For an HTTP datasource, it's the base URL.
Entity Conceptually, an entity is processed to generate a set of documents, containing multiple fields,
which (after optionally being transformed in various ways) are sent to Solr for indexing. For a
RDBMS data source, an entity is a view or table, which would be processed by one or more
SQL statements to generate a set of rows (documents) with one or more columns (fields).
Processor An entity processor does the work of extracting content from a data source, transforming it, and
adding it to the index. Custom entity processors can be written to extend or replace the ones
supplied.
Transformer Each set of fields fetched by the entity may optionally be transformed. This process can modify
the fields, create new fields, or generate multiple rows/documents from a single row. There are
several built-in transformers in the DIH, which perform functions such as modifying dates and
stripping HTML. It is possible to write custom transformers using the publicly available interface.
Configuration
Configuring solrconfig.xml
The Data Import Handler has to be registered in . For example:solrconfig.xml
<requestHandler name="/dataimport"
class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">/path/to/my/DIHconfigfile.xml</str>
</lst>
</requestHandler>
The only required parameter is the config parameter, which specifies the location of the DIH configuration file
that contains specifications for the data source, how to fetch data, what data to fetch, and how to process it to
generate the Solr documents to be posted to the index.
You can have multiple DIH configuration files. Each file would require a separate definition in the solrconfig.xml file, specifying a path to the file.
Configuring the DIH Configuration File
An annotated configuration file, based on the "db" collection in the dih example server, is shown below (example/example-DIH/solr/db/conf/db-data-config.xml). It extracts fields from the four tables defining a simple product database, with this schema. More information about the parameters and options shown here is described in the sections following.
<dataConfig>
<!-- The first element is the dataSource, in this case an HSQLDB database.
The path to the JDBC driver and the JDBC URL and login credentials are all
specified here.
Other permissible attributes include whether or not to autocommit to Solr,the
batchsize
used in the JDBC connection, a 'readOnly' flag -->
<dataSource driver="org.hsqldb.jdbcDriver"
url="jdbc:hsqldb:./example-DIH/hsqldb/ex" user="sa" />
<!-- a 'document' element follows, containing multiple 'entity' elements.
Note that 'entity' elements can be nested, and this allows the entity
relationships in the sample database to be mirrored here, so that we can
generate a denormalized Solr record which may include multiple features
for one item, for instance -->
<document>
<!-- The possible attributes for the entity element are described below.
Entity elements may contain one or more 'field' elements, which map
the data source field names to Solr fields, and optionally specify
per-field transformations -->
<!-- this entity is the 'root' entity. -->
<entity name="item" query="select * from item"
deltaQuery="select id from item where last_modified >
'${dataimporter.last_index_time}'">
<field column="NAME" name="name" />
<!-- This entity is nested and reflects the one-to-many relationship between an item
and its multiple features.
Note the use of variables; ${item.ID} is the value of the column 'ID' for the
current item
('item' referring to the entity name) -->
<entity name="feature"
query="select DESCRIPTION from FEATURE where ITEM_ID='${item.ID}'"
deltaQuery="select ITEM_ID from FEATURE where last_modified >
'${dataimporter.last_index_time}'"
parentDeltaQuery="select ID from item where ID=${feature.ITEM_ID}">
<field name="features" column="DESCRIPTION" />
</entity>
<entity name="item_category"
query="select CATEGORY_ID from item_category where
ITEM_ID='${item.ID}'"
deltaQuery="select ITEM_ID, CATEGORY_ID from item_category where
last_modified > '${dataimporter.last_index_time}'"
parentDeltaQuery="select ID from item where
ID=${item_category.ITEM_ID}">
<entity name="category"
query="select DESCRIPTION from category where ID =
'${item_category.CATEGORY_ID}'"
deltaQuery="select ID from category where last_modified >
'${dataimporter.last_index_time}'"
parentDeltaQuery="select ITEM_ID, CATEGORY_ID from item_category
where CATEGORY_ID=${category.ID}">
<field column="description" name="cat" />
</entity>
</entity>
</entity>
</document>
</dataConfig>
Datasources can still be specified in solrconfig.xml. These must be specified in the defaults section of the handler in solrconfig.xml. However, these are not parsed until the main configuration is loaded.
The entire configuration itself can be passed as a request parameter using the dataConfig parameter rather than using a file. When configuration errors are encountered, the error message is returned in XML format.
A reload-config command is also supported, which is useful for validating a new configuration file, or if you want to specify a file, load it, and not have it reloaded again on import. If there is a mistake in the xml configuration, a user-friendly message is returned in xml format. You can then fix the problem and do a reload-config.
Data Import Handler Commands
DIH commands are sent to Solr via an HTTP request. The following operations are supported.
Command Description
abort Aborts an ongoing operation. The URL is http://<host>:<port>/solr/<collection_name>/dataimport?command=abort.
delta-import For incremental imports and change detection. The command is of the form http://<host>:<port>/solr/<collection_name>/dataimport?command=delta-import. It supports the same clean, commit, optimize and debug parameters as the full-import command.
full-import A Full Import operation can be started with a URL of the form http://<host>:<port>/solr/<collection_name>/dataimport?command=full-import. The command returns immediately. The operation will be started in a new thread and the status attribute in the response should be shown as busy. The operation may take some time depending on the size of dataset. Queries to Solr are not blocked during full-imports.
When a full-import command is executed, it stores the start time of the operation in a file located at conf/dataimport.properties. This stored timestamp is used when a delta-import operation is executed.
For a list of parameters that can be passed to this command, see below.
reload-config If the configuration file has been changed and you wish to reload it without restarting Solr, run the command http://<host>:<port>/solr/<collection_name>/dataimport?command=reload-config.
status The URL is http://<host>:<port>/solr/<collection_name>/dataimport?command=status. It returns statistics on the number of documents created, deleted, queries run, rows fetched, status, and so on.
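For example, a full import that keeps the existing index and commits when finished could be triggered as follows (the host, port and collection name are illustrative):
curl 'http://localhost:8983/solr/my_collection/dataimport?command=full-import&clean=false&commit=true'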
You can also view the DIH configuration in the Solr Admin UI, and there is an interface to import content.
Parameters for the full-import Command
The full-import command accepts the following parameters:
Parameter Description
clean Default is true. Tells whether to clean up the index before the indexing is started.
commit Default is true. Tells whether to commit after the operation.
debug Default is false. Runs the command in debug mode. It is used by the interactive development mode. Note that in debug mode, documents are never committed automatically. If you want to run debug mode and commit the results too, add commit=true as a request parameter.
entity The name of an entity directly under the <document> tag in the configuration file. Use this to execute one or more entities selectively. Multiple "entity" parameters can be passed on to run
execute one or more entities selectively. Multiple "entity" parameters can be passed on to run
multiple entities at once. If nothing is passed, all entities are executed.
optimize Default is true. Tells Solr whether to optimize after the operation.
Property Writer
The propertyWriter element defines the date format and locale for use with delta queries. It is an optional configuration. Add the element to the DIH configuration file, directly under the dataConfig element.
<propertyWriter dateFormat="yyyy-MM-dd HH:mm:ss" type="SimplePropertiesWriter"
directory="data" filename="my_dih.properties" locale="en_US" />
The parameters available are:
Parameter Description
dateFormat A java.text.SimpleDateFormat to use when converting the date to text. The default is
"yyyy-MM-dd HH:mm:ss".
type The implementation class. Use SimplePropertiesWriter for non-SolrCloud installations. If using SolrCloud, use ZKPropertiesWriter. If this is not specified, it will default to the appropriate class depending on if SolrCloud mode is enabled.
directory Used with the SimplePropertiesWriter only. The directory for the properties file. If not specified, the default is "conf".
filename Used with the SimplePropertiesWriter only. The name of the properties file. If not specified, the default is the requestHandler name (as defined in solrconfig.xml), appended by ".properties" (i.e., "dataimport.properties").
locale The locale. If not defined, the ROOT locale is used. It must be specified as language-country. For example, en-US.
Data Sources
A data source specifies the origin of data and its type. Somewhat confusingly, some data sources are configured
within the associated entity processor. Data sources can also be specified in solrconfig.xml, which is useful
when you have multiple environments (for example, development, QA, and production) differing only in their data
sources.
You can create a custom data source by writing a class that extends org.apache.solr.handler.dataimport.DataSource.
The mandatory attributes for a data source definition are its name and type. The name identifies the data source
to an Entity element.
The types of data sources available are described below.
ContentStreamDataSource
This takes the POST data as the data source. This can be used with any EntityProcessor that uses a DataSource<Reader>.
FieldReaderDataSource
This can be used where a database field contains XML which you wish to process using the
XPathEntityProcessor. You would set up a configuration with both JDBC and FieldReader data sources, and two
entities, as follows:
<dataSource name="a1" driver="org.hsqldb.jdbcDriver" ... />
<dataSource name="a2" type=FieldReaderDataSource" />
<document>
<!-- processor for database -->
<entity name ="e1" dataSource="a1" processor="SqlEntityProcessor" pk="docid"
query="select * from t1 ...">
<!-- nested XpathEntity; the field in the parent which is to be used for
Xpath is set in the "datafield" attribute in place of the "url" attribute
-->
<entity name="e2" dataSource="a2" processor="XPathEntityProcessor"
dataField="e1.fieldToUseForXPath">
<!-- Xpath configuration follows -->
...
</entity>
</entity>
The FieldReaderDataSource can take an encoding parameter, which will default to "UTF-8" if not specified. It must be specified as language-country. For example, en-US.
FileDataSource
This can be used like an URLDataSource, but is used to fetch content from files on disk. The only difference from URLDataSource, when accessing disk files, is how a pathname is specified. A minimal declaration is shown after the table of attributes below.
This data source accepts these optional attributes.
Optional Attribute Description
basePath The base path relative to which the value is evaluated if it is not absolute.
encoding Defines the character encoding to use. If not defined, UTF-8 is used.
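As a sketch, a declaration might look like the following (the name is only an example):
<dataSource name="fds" type="FileDataSource" encoding="UTF-8" />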
JdbcDataSource
This is the default datasource. It's used with the SqlEntityProcessor. See the example in the FieldReaderDataSource section for details on configuration.
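As an illustrative sketch, a JdbcDataSource declaration typically supplies the driver class, JDBC URL and credentials (all of the values below are placeholders for your own database):
<dataSource name="ds1" type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://localhost/my_database" user="db_user" password="db_password" />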
URLDataSource
This data source is often used with XPathEntityProcessor to fetch content from an underlying file:// or http:// location. Here's an example:
<dataSource name="a"
type="URLDataSource"
baseUrl="http://host:port/"
encoding="UTF-8"
connectionTimeout="5000"
readTimeout="10000"/>
The URLDataSource type accepts these optional parameters:
Optional
Parameter
Description
baseURL Specifies a new baseURL for pathnames. You can use this to specify host/port changes
between Dev/QA/Prod environments. Using this attribute isolates the changes to be
made to the solrconfig.xml
connectionTimeout Specifies the length of time in milliseconds after which the connection should time out.
The default value is 5000ms.
encoding By default the encoding in the response header is used. You can use this property to
override the default encoding.
readTimeout Specifies the length of time in milliseconds after which a read operation should time out.
The default value is 10000ms.
Entity Processors
Entity processors extract data, transform it, and add it to a Solr index. Examples of entities include views or
tables in a data store.
Each processor has its own set of attributes, described in its own section below. In addition, there are
non-specific attributes common to all entities which may be specified.
Attribute Use
dataSource The name of a data source. If there are multiple data sources defined, use this
attribute with the name of the data source for this entity.
name Required. The unique name used to identify an entity.
pk The primary key for the entity. It is optional, and required only when using
delta-imports. It has no relation to the uniqueKey defined in schema.xml but they can both be the same. It is mandatory if you do delta-imports and then refers to the column name in ${dataimporter.delta.<column-name>} which is used as the primary key.
processor Default is SqlEntityProcessor. Required only if the datasource is not RDBMS.
onError Permissible values are (abort|skip|continue) . The default value is 'abort'. 'Skip' skips
the current document. 'Continue' ignores the error and processing continues.
preImportDeleteQuery Before a full-import command, use this query to clean up the index instead of using '*:*'. This is honored only on an entity that is an immediate sub-child of <document>.
postImportDeleteQuery Similar to the above, but executed after the import has completed.
rootEntity By default the entities immediately under the <document> are root entities. If this
attribute is set to false, the entity directly falling under that entity will be treated as
the root entity (and so on). For every row returned by the root entity, a document is
created in Solr.
transformer Optional. One or more transformers to be applied on this entity.
cacheImpl Optional. A class (which must implement DIHCache) to use for caching this entity when doing lookups from an entity which wraps it. Provided implementation is "SortedMapBackedCache".
cacheKey The name of a property of this entity to use as a cache key if cacheImpl is specified.
cacheLookup An entity + property name that will be used to lookup cached instances of this entity if cacheImpl is specified.
Caching of entities in DIH is provided to avoid repeated lookups for same entities again and again. The default SortedMapBackedCache is a HashMap where a key is a field in the row and the value is a bunch of rows for that same key.
In the example below, each manufacturer entity is cached using the 'id' property as a cache key. Cache lookups will be performed for each product entity based on the product's "manu" property. When the cache has no data for a particular key, the query is run and the cache is populated.
<entity name="product" query="select description,sku, manu from product" >
<entity name="manufacturer" query="select id, name from manufacturer"
cacheKey="id" cacheLookup="product.manu" cacheImpl="SortedMapBackedCache"/>
</entity>
The SQL Entity Processor
The SqlEntityProcessor is the default processor. The associated data source should be a JDBC URL.
The entity attributes specific to this processor are shown in the table below.
Attribute Use
query Required. The SQL query used to select rows.
deltaQuery SQL query used if the operation is delta-import. This query selects the primary keys of the rows which will be parts of the delta-update. The pks will be available to the deltaImportQuery through the variable ${dataimporter.delta.<column-name>}.
parentDeltaQuery SQL query used if the operation is delta-import.
deletedPkQuery SQL query used if the operation is delta-import.
deltaImportQuery SQL query used if the operation is delta-import. If this is not present, DIH tries to construct the import query by (after identifying the delta) modifying the 'query' (this is error prone). There is a namespace ${dataimporter.delta.<column-name>} which can be used in this query. For example, select * from tbl where id=${dataimporter.delta.id}. A combined sketch follows this table.
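Putting these together, a hedged sketch of an entity configured for delta imports might look like the following (the table and column names are assumptions for illustration only):
<entity name="item" pk="ID"
        query="select * from item"
        deltaQuery="select ID from item where last_modified > '${dataimporter.last_index_time}'"
        deltaImportQuery="select * from item where ID='${dataimporter.delta.ID}'">
  <field column="NAME" name="name" />
</entity>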
The XPathEntityProcessor
This processor is used when indexing XML formatted data. The data source is typically URLDataSource or FileDataSource. Xpath can also be used with the FileListEntityProcessor described below, to generate a document from each file.
The entity attributes unique to this processor are shown below.
Attribute Use
Processor Required. Must be set to "XpathEntityProcessor".
url Required. HTTP URL or file location.
stream Optional: Set to true for a large file or download.
forEach Required unless you define useSolrAddSchema. The Xpath expression which demarcates each record. This will be used to set up the processing loop.
xsl Optional: Its value (a URL or filesystem path) is the name of a resource used as a
preprocessor for applying the XSL transformation.
useSolrAddSchema Set this to true if the content is in the form of the standard Solr update XML schema.
flatten Optional: If set true, then text from under all the tags is extracted into one field.
Each field element in the entity can have the following attributes as well as the default ones.
Attribute Use
xpath Required. The XPath expression which will extract the content from the record for this field.
Only a subset of Xpath syntax is supported.
commonField Optional. If true, then when this field is encountered in a record it will be copied to future
records when creating a Solr document.
Here is an example from the "rss" collection in the dih example (example/example-DIH/solr/rss/conf/rss-data-config.xml):
<!-- slashdot RSS Feed -->
<dataConfig>
<dataSource type="HttpDataSource" />
<document>
<entity name="slashdot"
pk="link"
url="http://rss.slashdot.org/Slashdot/slashdot"
processor="XPathEntityProcessor"
<!-- forEach sets up a processing loop ; here there are two
expressions-->
forEach="/RDF/channel | /RDF/item"
transformer="DateFormatTransformer">
<field column="source" xpath="/RDF/channel/title" commonField="true" />
<field column="source-link" xpath="/RDF/channel/link" commonField="true"/>
<field column="subject" xpath="/RDF/channel/subject" commonField="true" />
<field column="title" xpath="/RDF/item/title" />
<field column="link" xpath="/RDF/item/link" />
<field column="description" xpath="/RDF/item/description" />
<field column="creator" xpath="/RDF/item/creator" />
<field column="item-subject" xpath="/RDF/item/subject" />
<field column="date" xpath="/RDF/item/date"
dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss" />
<field column="slash-department" xpath="/RDF/item/department" />
<field column="slash-section" xpath="/RDF/item/section" />
<field column="slash-comments" xpath="/RDF/item/comments" />
</entity>
</document>
</dataConfig>
The MailEntityProcessor
The MailEntityProcessor uses the Java Mail API to index email messages using the IMAP protocol. The
MailEntityProcessor works by connecting to a specified mailbox using a username and password, fetching the
email headers for each message, and then fetching the full email contents to construct a document (one
document for each mail message).
Here is an example from the "mail" collection of the dih example (example/example-DIH/mail/conf/mail-data-config.xml):
<dataConfig>
<document>
<entity processor="MailEntityProcessor"
user="email@gmail.com"
password="password"
host="imap.gmail.com"
protocol="imaps"
fetchMailsSince="2009-09-20 00:00:00"
batchSize="20"
folders="inbox"
processAttachement="false"
name="sample_entity"/>
</document>
</dataConfig>
The entity attributes unique to the MailEntityProcessor are shown below.
Attribute Use
processor Required. Must be set to "MailEntityProcessor".
user Required. Username for authenticating to the IMAP server; this is typically the email
address of the mailbox owner
password Required. Password for authenticating to the IMAP server
host Required. The IMAP server to connect to
protocol Required. The IMAP protocol to use, valid values are: imap, imaps, gimap, and gimaps
fetchMailsSince Optional. Date/time used to set a filter to import messages that occur after the specified
date; expected format is: yyyy-MM-dd HH:mm:ss
folders Required. Comma-delimited list of folder names to pull messages from, such as "inbox"
recurse Optional (default is true). Flag to indicate if the processor should recurse all child
folders when looking for messages to import
include Optional. Comma-delimited list of folder patterns to include when processing folders
(can be a literal value or regular expression)
exclude Optional. Comma-delimited list of folder patterns to exclude when processing folders
(can be a literal value or regular expression);
excluded folder patterns take precedence over include folder patterns.
processAttachement
or
processAttachments
Optional (default is true). Use Tika to process message attachments.
includeContent Optional (default is true). Include the message body when constructing Solr documents
for indexing
Importing New Emails Only
After running a full import, the MailEntityProcessor keeps track of the timestamp of the previous import so that
subsequent imports can use the fetchMailsSince filter to only pull new messages from the mail server. This
occurs automatically using the Data Import Handler dataimport.properties file (stored in conf). For instance, if you
set fetchMailsSince=2014-08-22 00:00:00 in your mail-data-config.xml, then all mail messages that occur after
this date will be imported on the first run of the importer. Subsequent imports will use the date of the previous
import as the fetchMailsSince filter, so that only new emails since the last import are indexed each time.
GMail Extensions
When connecting to a GMail account, you can improve the efficiency of the MailEntityProcessor by setting the
protocol to gimap or gimaps. This allows the processor to send the fetchMailsSince filter to the GMail server to
have the date filter applied on the server, which means the processor only receives new messages from the
server. However, GMail only supports date granularity, so the server-side filter may return previously seen
messages if run more than once a day.
The TikaEntityProcessor
The TikaEntityProcessor uses Apache Tika to process incoming documents. This is similar to Uploading Data with Solr Cell using Apache Tika, but using the DataImportHandler options instead.
Here is an example from the "tika" collection of the dih example (example/example-DIH/tika/conf/tika-data-config.xml):
<dataConfig>
<dataSource type="BinFileDataSource" />
<document>
<entity name="tika-test" processor="TikaEntityProcessor"
url="../contrib/extraction/src/test-files/extraction/solr-word.pdf"
format="text">
<field column="Author" name="author" meta="true"/>
<field column="title" name="title" meta="true"/>
<field column="text" name="text"/>
</entity>
</document>
</dataConfig>
The parameters for this processor are described in the table below:
Attribute Use
dataSource This parameter defines the data source and an optional name which can be referred to in
later parts of the configuration if needed. This is the same dataSource explained in the
description of general entity processor attributes above.
The available data source types for this processor are:
BinURLDataSource: used for HTTP resources, but can also be used for files.
BinContentStreamDataSource: used for uploading content as a stream.
BinFileDataSource: used for content on the local filesystem.
url The path to the source file(s), as a file path or a traditional internet URL. This parameter is
required.
htmlMapper Allows control of how Tika parses HTML. The "default" mapper strips much of the HTML
from documents while the "identity" mapper passes all HTML as-is with no modifications.
If this parameter is defined, it must be either default or identity; if it is absent, "default" is assumed.
format The output format. The options are text, xml, html or none. The default is "text" if not
defined. The format "none" can be used if metadata only should be indexed and not the
body of the documents.
parser The default parser is org.apache.tika.parser.AutoDetectParser. If a custom or
other parser should be used, it should be entered as a fully-qualified name of the class
and path.
fields The list of fields from the input documents and how they should be mapped to Solr fields.
If the meta attribute is defined as "true", the field will be obtained from the metadata of the
document and not parsed from the body of the main text.
extractEmbedded Instructs the TikaEntityProcessor to extract embedded documents or attachments when true. If false, embedded documents and attachments will be ignored.
onError By default, the TikaEntityProcessor will stop processing documents if it finds one that generates an error. If you define onError to "skip", the TikaEntityProcessor will instead skip documents that fail processing and log a message that the document was skipped.
The FileListEntityProcessor
This processor is basically a wrapper, and is designed to generate a set of files satisfying conditions specified in
the attributes which can then be passed to another processor, such as the XPathEntityProcessor. The entity information for this processor would be nested within the FileListEntity entry. It generates four implicit fields: fileAbsolutePath, fileSize, fileLastModified, and fileName, which can be used in the nested processor.
This processor does not use a data source.
The attributes specific to this processor are described in the table below:
Attribute Use
fileName Required. A regular expression pattern to identify files to be included.
basedir Required. The base directory (absolute path).
recursive Whether to search directories recursively. Default is 'false'.
excludes A regular expression pattern to identify files which will be excluded.
newerThan A date in the format yyyy-MM-ddHH:mm:ss or a date math expression (NOW - 2YEARS).
olderThan A date, using the same formats as newerThan.
rootEntity This should be set to false. This ensures that each row (filepath) emitted by this processor is
considered to be a document.
dataSource Must be set to null.
The example below shows the combination of the FileListEntityProcessor with another processor which will
generate a set of fields from each file found.
<dataConfig>
<dataSource type="FileDataSource"/>
<document>
<!-- this outer processor generates a list of files satisfying the conditions
specified in the attributes -->
<entity name="f" processor="FileListEntityProcessor"
fileName=".*xml"
newerThan="'NOW-30DAYS'"
recursive="true"
rootEntity="false"
dataSource="null"
baseDir="/my/document/directory">
<!-- this processor extracts content using Xpath from each file found -->
<entity name="nested" processor="XPathEntityProcessor"
forEach="/rootelement" url="${f.fileAbsolutePath}" >
<field column="name" xpath="/rootelement/name"/>
<field column="number" xpath="/rootelement/number"/>
</entity>
</entity>
</document>
</dataConfig>
LineEntityProcessor
This EntityProcessor reads all content from the data source on a line by line basis and returns a field called rawLine for each line read. The content is not parsed in any way; however, you may add transformers to manipulate the data within the rawLine field, or to create other additional fields.
The lines read can be filtered by two regular expressions specified with the acceptLineRegex and omitLineRegex attributes. The table below describes the LineEntityProcessor's attributes:
Attribute Description
url A required attribute that specifies the location of the input file in a way that is compatible
with the configured data source. If this value is relative and you are using FileDataSource or URLDataSource, it is assumed to be relative to baseLoc.
acceptLineRegex An optional attribute that if present discards any line which does not match the regExp.
omitLineRegex An optional attribute that is applied after any acceptLineRegex and that discards any line
which matches this regExp.
For example:
<entity name="jc"
processor="LineEntityProcessor"
acceptLineRegex="^.*\.xml$"
omitLineRegex="/obsolete"
url="file:///Volumes/ts/files.lis"
rootEntity="false"
dataSource="myURIreader1"
transformer="RegexTransformer,DateFormatTransformer">
...
While there are use cases where you might need to create a Solr document for each line read from a file, it is
expected that in most cases the lines read by this processor will consist of a pathname, which in turn will be
consumed by another EntityProcessor, such as XPathEntityProcessor.
PlainTextEntityProcessor
This EntityProcessor reads all content from the data source into a single implicit field called plainText. The content is not parsed in any way; however, you may add transformers to manipulate the data within the plainText field as needed, or to create other additional fields.
For example:
<entity processor="PlainTextEntityProcessor" name="x" url="http://abc.com/a.txt"
dataSource="data-source-name">
<!-- copies the text to a field called 'text' in Solr-->
<field column="plainText" name="text"/>
</entity>
Ensure that the dataSource is of type DataSource<Reader> (FileDataSource, URLDataSource).
Transformers
Transformers manipulate the fields in a document returned by an entity. A transformer can create new fields or
modify existing ones. You must tell the entity which transformers your import operation will be using, by adding
an attribute containing a comma separated list to the <entity> element.
<entity name="abcde" transformer="org.apache.solr....,my.own.transformer,..." />
Specific transformation rules are then added to the attributes of a <field> element, as shown in the examples
below. The transformers are applied in the order in which they are specified in the transformer attribute.
The Data Import Handler contains several built-in transformers. You can also write your own custom
transformers, as described in the Solr Wiki (see http://wiki.apache.org/solr/DIHCustomTransformer). The
ScriptTransformer (described below) offers an alternative method for writing your own transformers.
Solr includes the following built-in transformers:
Transformer Name Use
ClobTransformer Used to create a String out of a Clob type in database.
DateFormatTransformer Parse date/time instances.
HTMLStripTransformer Strip HTML from a field.
LogTransformer Used to log data to log files or a console.
NumberFormatTransformer Uses the NumberFormat class in java to parse a string into a number.
RegexTransformer Use regular expressions to manipulate fields.
ScriptTransformer Write transformers in Javascript or any other scripting language supported by
Java.
TemplateTransformer Transform a field using a template.
These transformers are described below.
ClobTransformer
You can use the ClobTransformer to create a string out of a CLOB in a database. A CLOB is a character large
object: a collection of character data typically stored in a separate location that is referenced in the database.
See http://en.wikipedia.org/wiki/Character_large_object. Here's an example of invoking the ClobTransformer.
<entity name="e" transformer="ClobTransformer" ...>
<field column="hugeTextField" clob="true" />
...
</entity>
The ClobTransformer accepts these attributes:
Attribute Description
clob Boolean value to signal if ClobTransformer should process this field or not. If this attribute is
omitted, then the corresponding field is not transformed.
sourceColName The source column to be used as input. If this is absent, the source and target are the same.
The DateFormatTransformer
This transformer converts dates from one format to another. This would be useful, for example, in a situation
where you wanted to convert a field with a fully specified date/time into a less precise date format, for use in
faceting.
DateFormatTransformer applies only on the fields with an attribute dateTimeFormat. Other fields are not modified.
This transformer recognizes the following attributes:
Attribute Description
dateTimeFormat The format used for parsing this field. This must comply with the syntax of the Java SimpleDateFormat class.
sourceColName The column on which the dateFormat is to be applied. If this is absent source and target
are same.
locale The locale to use for date transformations. If not specified, the ROOT locale will be used. It
must be specified as language-country. For example, en-US.
Here is example code that returns the date rounded up to the month "2007-JUL":
<entity name="en" pk="id" transformer="DateFormatTransformer" ... >
...
<field column="date" sourceColName="fulldate" dateTimeFormat="yyyy-MMM"/>
</entity>
The HTMLStripTransformer
You can use this transformer to strip HTML out of a field. For example:
<entity name="e" transformer="HTMLStripTransformer" ... >
<field column="htmlText" stripHTML="true" />
...
</entity>
There is one attribute for this transformer, stripHTML, which is a boolean value (true/false) to signal if the
HTMLStripTransformer should process the field or not.
The LogTransformer
You can use this transformer to log data to the console or log files. For example:
<entity ...
transformer="LogTransformer"
logTemplate="The name is ${e.name}" logLevel="debug">
....
</entity>
Unlike other transformers, the LogTransformer does not apply to any field, so the attributes are applied on the
entity itself.
The NumberFormatTransformer
Use this transformer to parse a number from a string, converting it into the specified format, and optionally using
a different locale.
NumberFormatTransformer will be applied only to fields with an attribute formatStyle.
This transformer recognizes the following attributes:
Attribute Description
formatStyle The format used for parsing this field. The value of the attribute must be one of (number|percent|integer|currency). This uses the semantics of the Java NumberFormat class.
sourceColName The column on which the NumberFormat is to be applied. If this attribute is absent, the source column and the target column are the same.
locale The locale to be used for parsing the strings. If this is absent, the ROOT locale is used. It
must be specified as language-country. For example, en-US.
For example:
<entity name="en" pk="id" transformer="NumberFormatTransformer" ...>
...
<!-- treat this field as UK pounds -->
<field name="price_uk" column="price" formatStyle="currency" locale="en-UK"/>
</entity>
The RegexTransformer
The regex transformer helps in extracting or manipulating values from fields (from the source) using Regular
Expressions. The actual class name is org.apache.solr.handler.dataimport.RegexTransformer. But as it belongs to the default package the package-name can be omitted.
The table below describes the attributes recognized by the regex transformer.
Attribute Description
regex The regular expression that is used to match against the column or sourceColName's
value(s). If replaceWith is absent, each regex group is taken as a value and a list of values is returned.
sourceColName The column on which the regex is to be applied. If not present, then the source and target
are identical.
splitBy Used to split a string. It returns a list of values.
groupNames A comma separated list of field column names, used where the regex contains groups and
each group is to be saved to a different field. If some groups are not to be named leave a
space between commas.
replaceWith Used along with regex. It is equivalent to the method new String(<sourceColVal>).replaceAll(<regex>, <replaceWith>).
Here is an example of configuring the regex transformer:
<entity name="foo" transformer="RegexTransformer"
query="select full_name, emailids from foo">
<field column="full_name"/>
<field column="firstName" regex="Mr(\w*)\b.*" sourceColName="full_name"/>
<field column="lastName" regex="Mr.*?\b(\w*)" sourceColName="full_name"/>
<!-- another way of doing the same -->
<field column="fullName" regex="Mr(\w*)\b(.*)" groupNames="firstName,lastName"/>
<field column="mailId" splitBy="," sourceColName="emailids"/>
</entity>
In this example, regex and sourceColName are custom attributes used by the transformer. The transformer
reads the full_name field from the resultset and transforms it to two new target fields, firstName and lastName.
Even though the query returned only one column, full_name, in the result set, the Solr document gets two
extra fields, firstName and lastName, which are "derived" fields. These new fields are only created if the
regex matches.
The emailids field in the table can be a comma-separated value. It ends up producing one or more email IDs,
and we expect the mailId to be a multivalued field in Solr.
Note that this transformer can either be used to split a string into tokens based on a splitBy pattern, or to perform
a string substitution as per replaceWith, or it can assign groups within a pattern to a list of groupNames. It
decides what to do based upon the attributes splitBy, replaceWith, and groupNames, which are
looked for in that order. The first one found is acted upon and the other, unrelated attributes are ignored.
The ScriptTransformer
The script transformer allows arbitrary transformer functions to be written in any scripting language supported by
Java, such as Javascript, JRuby, Jython, Groovy, or BeanShell. Javascript is integrated into Java 7; you'll need
to integrate other languages yourself.
Each function you write must accept a row variable (which corresponds to a Java Map<String,Object>, thus
permitting get, put, and remove operations). You can modify the value of an existing field or add new fields.
The return value of the function is the returned object.
The script is inserted into the DIH configuration file at the top level and is called once for each row.
Here is a simple example.
<dataConfig>
<!-- simple script to generate a new row, converting a temperature from Fahrenheit
to Centigrade -->
<script><![CDATA[
function f2c(row) {
var tempf, tempc;
tempf = row.get('temp_f');
if (tempf != null) {
tempc = (tempf - 32.0)*5.0/9.0;
row.put('temp_c', tempc);
}
return row;
}
]]>
</script>
<document>
<!-- the function is specified as an entity attribute -->
<entity name="e1" pk="id" transformer="script:f2c" query="select * from X">
....
</entity>
</document>
</dataConfig>
The TemplateTransformer
You can use the template transformer to construct or modify a field value, perhaps using the value of other fields.
You can insert extra text into the template.
<entity name="en" pk="id" transformer="TemplateTransformer" ...>
...
<!-- generate a full address from fields containing the component parts -->
<field column="full_address" template="${en.street},${en.city},${en.zip}" />
</entity>
Special Commands for the Data Import Handler
You can pass special commands to the DIH by adding any of the variables listed below to any row returned by
any component:
Variable Description
$skipDoc Skip the current document; that is, do not add it to Solr. The value can be the string true|false.
$skipRow Skip the current row. The document will be added with rows from other entities. The value can be the string true|false.
$docBoost Boost the current document. The boost value can be a number or the toString conversion of a number.
$deleteDocById Delete a document from Solr with this ID. The value has to be the uniqueKey value of the document.
$deleteDocByQuery Delete documents from Solr using this query. The value must be a Solr Query.
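For example — a hypothetical sketch only (the entity, query, and column names below are invented for illustration) — a delta query could select rows that an external system has flagged as deleted and use the TemplateTransformer described above to emit $deleteDocById for them:
<entity name="item" pk="id" transformer="TemplateTransformer"
        query="select id from item where deleted = 1">
  <!-- emit the special command so DIH deletes this document by its uniqueKey -->
  <field column="$deleteDocById" template="${item.id}"/>
</entity>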
Updating Parts of Documents
Once you have indexed the content you need in your Solr index, you will want to start thinking about your
strategy for dealing with changes to those documents. Solr supports two approaches to updating documents that
have only partially changed.
The first is atomic updates. This approach allows changing only one or more fields of a document without having
to re-index the entire document.
The second approach is known as optimistic concurrency or optimistic locking. It is a feature of many NoSQL
databases, and allows conditionally updating a document based on its version. This approach includes semantics
and rules for how to deal with version matches or mismatches.
Atomic Updates and Optimistic Concurrency may be used as independent strategies for managing changes to
documents, or they may be combined: you can use optimistic concurrency to conditionally apply an atomic
update.
Atomic Updates
Solr supports several modifiers that atomically update values of a document. This allows updating only specific
fields, which can help speed indexing processes in an environment where speed of index additions is critical to
the application.
To use atomic updates, add a modifier to the field that needs to be updated. The content can be updated, added
to, or incrementally increased if it is a number.
Modifier Usage
set Set or replace the field value(s) with the specified value(s), or remove the values if 'null' or
empty list is specified as the new value.
May be specified as a single value, or as a list for multivalued fields
add Adds the specified values to a multivalued field.
May be specified as a single value, or as a list.
remove Removes (all occurrences of) the specified values from a multivalued field.
May be specified as a single value, or as a list.
removeregex Removes all occurrences of the specified regex from a multiValued field.
May be specified as a single value, or as a list.
inc Increments a numeric value by a specific amount.
Must be specified as a single numeric value.
For example, if the following document exists in our collection:
{"id":"mydoc",
"price":10,
"popularity":42,
"categories":["kids"],
"promo_ids":["a123x"],
"tags":["free_to_try","buy_now","clearance","on_sale"]
}
And we apply the following update command:
{"id":"mydoc",
"price":{"set":99},
"popularity":{"inc":20},
"categories":{"add":["toys","games"]},
"promo_ids":{"remove":"a123x"},
"tags":{"remove":["free_to_try","on_sale"]}
}
All original source fields must be stored for field modifiers to work correctly, which is the Solr default.
The resulting document in our collection will be:
{"id":"mydoc",
"price":99,
"popularity":62,
"categories":["kids","toys","games"],
"tags":["buy_now","clearance"]
}
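As a rough sketch of how the update command above could be submitted over HTTP (assuming the JSON update handler and a collection named techproducts; the document and fields are the illustrative ones from this example, not part of the techproducts schema):
curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/techproducts/update?commit=true' --data-binary '
[{"id"         : "mydoc",
  "price"      : {"set": 99},
  "popularity" : {"inc": 20},
  "categories" : {"add": ["toys","games"]},
  "promo_ids"  : {"remove": "a123x"},
  "tags"       : {"remove": ["free_to_try","on_sale"]} }]'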
Optimistic Concurrency
Optimistic Concurrency is a feature of Solr that can be used by client applications which update/replace
documents to ensure that the document they are replacing/updating has not been concurrently modified by
another client application. This feature works by requiring a _version_ field on all documents in the index, and
comparing that to a _version_ specified as part of the update command. By default, Solr's schema.xml includes
a _version_ field, and this field is automatically added to each new document.
In general, using optimistic concurrency involves the following work flow:
1. A client reads a document. In Solr, one might retrieve the document with the /get handler to be sure to have the latest version.
2. A client changes the document locally.
3. The client resubmits the changed document to Solr, for example, perhaps with the /update handler.
4. If there is a version conflict (HTTP error code 409), the client starts the process over.
When the client resubmits a changed document to Solr, the _version_ can be included with the update to
invoke optimistic concurrency control. Specific semantics are used to define when the document should be
updated or when to report a conflict.
If the content in the _version_ field is greater than '1' (i.e., '12345'), then the _version_ in the
document must match the _version_ in the index.
If the content in the _version_ field is equal to '1', then the document must simply exist. In this case, no
version matching occurs, but if the document does not exist, the updates will be rejected.
If the content in the _version_ field is less than '0' (i.e., '-1'), then the document must not exist. In this
case, no version matching occurs, but if the document exists, the updates will be rejected.
If the content in the _version_ field is equal to '0', then it doesn't matter if the versions match or if the
document exists or not. If it exists, it will be overwritten; if it does not exist, it will be added.
If the document being updated does not include the _version_ field, and atomic updates are not being used,
the document will be treated by normal Solr rules, which is usually to discard the previous version.
When using Optimistic Concurrency, clients can include an optional versions=true request parameter to
indicate that the new versions of the documents being added should be included in the response. This allows
clients to immediately know what the _version_ is of every document added without needing to make a
redundant /get request.
For example...
$ curl -X POST -H 'Content-Type: application/json'
'http://localhost:8983/solr/techproducts/update?versions=true' --data-binary '
[ { "id" : "aaa" },
{ "id" : "bbb" } ]'
{"responseHeader":{"status":0,"QTime":6},
"adds":["aaa",1498562471222312960,
"bbb",1498562471225458688]}
$ curl -X POST -H 'Content-Type: application/json'
'http://localhost:8983/solr/techproducts/update?_version_=999999&versions=true'
--data-binary '
[{ "id" : "aaa",
"foo_s" : "update attempt with wrong existing version" }]'
{"responseHeader":{"status":409,"QTime":3},
"error":{"msg":"version conflict for aaa expected=999999
actual=1498562471222312960",
"code":409}}
$ curl -X POST -H 'Content-Type: application/json'
'http://localhost:8983/solr/techproducts/update?_version_=1498562471222312960&versio
ns=true&commit=true' --data-binary '
[{ "id" : "aaa",
"foo_s" : "update attempt with correct existing version" }]'
{"responseHeader":{"status":0,"QTime":5},
"adds":["aaa",1498562624496861184]}
$ curl 'http://localhost:8983/solr/techproducts/query?q=*:*&fl=id,_version_'
{
"responseHeader":{
"status":0,
"QTime":5,
"params":{
"fl":"id,_version_",
"q":"*:*"}},
"response":{"numFound":2,"start":0,"docs":[
{
"id":"bbb",
"_version_":1498562471225458688},
{
"id":"aaa",
"_version_":1498562624496861184}]
}}
For more information, please also see Yonik Seeley's presentation on NoSQL features in Solr 4 from Apache
Lucene EuroCon 2012.
Document Centric Versioning Constraints
Optimistic Concurrency is extremely powerful, and works very efficiently because it uses internally assigned,
globally unique values for the _version_ field. However, in some situations users may want to configure their
own document-specific version field, where the version values are assigned on a per-document basis by an
external system, and have Solr reject updates that attempt to replace a document with an "older" version. In
situations like this the DocBasedVersionConstraintsProcessorFactory can be useful.
The basic usage of DocBasedVersionConstraintsProcessorFactory is to configure it in solrconfig.xml
as part of the UpdateRequestProcessorChain and specify the name of your custom versionField in your
schema that should be checked when validating updates:
<processor class="solr.DocBasedVersionConstraintsProcessorFactory">
<str name="versionField">my_version_l</str>
</processor>
Once configured, this update processor will reject (HTTP error code 409) any attempt to update an existing
document where the value of the my_version_l field in the "new" document is not greater than the value of
that field in the existing document.
DocBasedVersionConstraintsProcessorFactory supports two additional configuration params which are
optional:
ignoreOldUpdates - A boolean option which defaults to false. If set to true then instead of rejecting
updates where the versionField is too low, the update will be silently ignored (and return a status 200
to the client).
deleteVersionParam - A String parameter that can be specified to indicate that this processor should
also inspect Delete By Id commands. The value of this configuration option should be the name of a
request parameter that the processor will now consider mandatory for all attempts to Delete By Id, and
must be used by clients to specify a value for the versionField which is greater than the existing
value of the document to be deleted. When using this request param, any Delete By Id command with a
high enough document version number to succeed will be internally converted into an Add Document
command that replaces the existing document with a new one which is empty except for the Unique Key
and versionField, keeping a record of the deleted version so future Add Document commands will
fail if their "new" version is not high enough.
Power Tip
The _version_ field is by default stored in the inverted index (indexed="true"). However, for some
systems with a very large number of documents, the increase in FieldCache memory requirements may
be too costly. A solution can be to declare the _version_ field as DocValues:
<field name="_version_" type="long" indexed="false" stored="true"
required="true" docValues="true"/>
Sample field definition
versionField vs _version_
The _version_ field used by Solr for its normal optimistic concurrency also has important semantics in
how updates are distributed to replicas in SolrCloud, and MUST be assigned internally by Solr. Users
can not re-purpose that field and specify it as the versionField for use in the
DocBasedVersionConstraintsProcessorFactory configuration.
Please consult the processor javadocs and test configs for additional information and example usages.
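For example — a sketch only, assuming the processor above is additionally configured with <str name="deleteVersionParam">del_version</str> and that the existing document currently has my_version_l=100 — a Delete By Id command would have to carry a higher version to succeed:
curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/techproducts/update?del_version=101&commit=true' --data-binary '{"delete":{"id":"mydoc"}}'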
De-Duplication
Preventing duplicate or near duplicate documents from entering an index or tagging documents with a
signature/fingerprint for duplicate field collapsing can be efficiently achieved with a low collision or fuzzy hash
algorithm. Solr natively supports de-duplication techniques of this type via the <Signature> class and allows
for the easy addition of new hash/signature implementations. A Signature can be implemented several ways:
Method Description
MD5Signature 128 bit hash used for exact duplicate detection.
Lookup3Signature 64 bit hash used for exact duplicate detection, much faster than MD5 and smaller to
index
TextProfileSignature Fuzzy hashing implementation from nutch for near duplicate detection. It's tunable but
works best on longer text.
Other, more sophisticated algorithms for fuzzy/near hashing can be added later.
Configuration Options
There are two places in Solr to configure de-duplication: in solrconfig.xml and in schema.xml.
In solrconfig.xml
The SignatureUpdateProcessorFactory has to be registered in the solrconfig.xml as part of the
UpdateRequestProcessorChain:
Adding in the de-duplication process will change the allowDups setting so that it applies to an update
Term (with signatureField in this case) rather than the unique field Term. Of course the signatureField
could be the unique field, but generally you want the unique field to be unique. When a document
is added, a signature will automatically be generated and attached to the document in the specified
signatureField.
<updateRequestProcessorChain name="dedupe">
<processor class="solr.processor.SignatureUpdateProcessorFactory">
<bool name="enabled">true</bool>
<str name="signatureField">id</str>
<bool name="overwriteDupes">false</bool>
<str name="fields">name,features,cat</str>
<str name="signatureClass">solr.processor.Lookup3Signature</str>
</processor>
</updateRequestProcessorChain>
Setting Default Description
signatureClass org.apache.solr.update.processor.Lookup3Signature A Signature implementation for
generating a signature hash.
fields all fields The fields to use to generate the
signature hash in a comma separated
list. By default, all fields on the
document will be used.
signatureField signatureField The name of the field used to hold
the fingerprint/signature. Be sure the
field is defined in schema.xml.
enabled true Enable/disable de-duplication
processing.
In schema.xml
If you are using a separate field for storing the signature you must have it indexed:
<field name="signature" type="string" stored="true" indexed="true"
multiValued="false" />
Be sure to change your update handlers to use the defined chain, i.e.
<requestHandler name="/update" class="solr.UpdateRequestHandler" >
<lst name="defaults">
<str name="update.chain">dedupe</str>
</lst>
</requestHandler>
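With the dedupe chain attached to the /update handler as shown above, adding a document will automatically compute and store its signature. A sketch, assuming the techproducts example collection with this chain configured:
curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/techproducts/update?commit=true' --data-binary '[{"id":"dedupe-test-1","name":"Test product","cat":["electronics"]}]'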
The update processor can also be specified per request with a parameter of update.chain=dedupe.
Detecting Languages During Indexing
Solr can identify languages and map text to language-specific fields during indexing using the langid
UpdateRequestProcessor. Solr supports two implementations of this feature:
Tika's language detection feature: http://tika.apache.org/0.10/detection.html
LangDetect language detection: http://code.google.com/p/language-detection/
You can see a comparison between the two implementations here:
http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html. In general, the LangDetect implementation supports more languages
with higher performance.
For specific information on each of these language identification implementations, including a list of supported
languages for each, see the relevant project websites. For more information about the langid
UpdateRequestProcessor, see the Solr wiki: http://wiki.apache.org/solr/LanguageDetection. For more information about language
analysis in Solr, see Language Analysis.
Configuring Language Detection
You can configure the langid UpdateRequestProcessor in solrconfig.xml. Both implementations take the
same parameters, which are described in the following section. At a minimum, you must specify the fields for
language identification and a field for the resulting language code.
Configuring Tika Language Detection
Here is an example of a minimal Tika langid configuration in solrconfig.xml:
<processor
class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory
">
<lst name="defaults">
<str name="langid.fl">title,subject,text,keywords</str>
<str name="langid.langField">language_s</str>
</lst>
</processor>
Configuring LangDetect Language Detection
Here is an example of a minimal LangDetect langid configuration in solrconfig.xml:
<processor
class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorF
actory">
<lst name="defaults">
<str name="langid.fl">title,subject,text,keywords</str>
<str name="langid.langField">language_s</str>
</lst>
</processor>
langid Parameters
As previously mentioned, both implementations of the langid UpdateRequestProcessor take the same
parameters.
Parameter Type Default Required Description
langid Boolean true no Enables and disables language
detection.
langid.fl string none yes A comma- or space-delimited list of fields to be processed by langid.
langid.langField string none yes Specifies the field for the returned
language code.
langid.langsField multivalued string none no Specifies the field for a list of returned language codes. If you use langid.map.individual, each detected language will be added to this field.
langid.overwrite Boolean false no Specifies whether the content of the langField and langsField fields will be overwritten if they already contain values.
langid.lcmap string none no A space-separated list specifying colon-delimited language code mappings to apply to the detected languages. For example, you might use this to map Chinese, Japanese, and Korean to a common code cjk, and map both American and British English to a single code en, by using langid.lcmap=ja:cjk zh:cjk ko:cjk en_GB:en en_US:en. This affects both the values put into the langField and langsField fields, as well as the field suffixes when using langid.map, unless overridden by langid.map.lcmap.
langid.threshold float 0.5 no Specifies a threshold value between 0 and 1 that the language identification score must reach before langid accepts it. With longer text fields, a high threshold such as 0.8 will give good results. For shorter text fields, you may need to lower the threshold for language identification, though you will be risking somewhat lower quality results. We recommend experimenting with your data to tune your results.
langid.whitelist string none no Specifies a list of allowed language identification codes. Use this in combination with langid.map to ensure that you only index documents into fields that are in your schema.
langid.map Boolean false no Enables field name mapping. If true, Solr will map field names for all fields listed in langid.fl.
langid.map.fl string none no A comma-separated list of fields for langid.map that is different than the fields specified in langid.fl.
langid.map.keepOrig Boolean false no If true, Solr will copy the field during the field name mapping process, leaving the original field in place.
langid.map.individual Boolean false no If true, Solr will detect and map languages for each field individually.
langid.map.individual.fl string none no A comma-separated list of fields for use with langid.map.individual that is different than the fields specified in langid.fl.
langid.fallbackFields string none no If no language is detected that meets the langid.threshold score, or if the detected language is not on the langid.whitelist, this field specifies language codes to be used as fallback values. If no appropriate fallback languages are found, Solr will use the language code specified in langid.fallback.
langid.fallback string none no Specifies a language code to use if no language is detected or specified in langid.fallbackFields.
langid.map.lcmap string determined by langid.lcmap no A space-separated list specifying colon-delimited language code mappings to use when mapping field names. For example, you might use this to make Chinese, Japanese, and Korean language fields use a common *_cjk suffix, and map both American and British English fields to a single *_en by using langid.map.lcmap=ja:cjk zh:cjk ko:cjk en_GB:en en_US:en.
langid.map.pattern Java regular expression none no By default, fields are mapped as <field>_<language>. To change this pattern, you can specify a Java regular expression in this parameter.
langid.map.replace Java replace none no By default, fields are mapped as <field>_<language>. To change this pattern, you can specify a Java replace in this parameter.
langid.enforceSchema Boolean true no If false, the langid processor does not validate field names against your schema. This may be useful if you plan to rename or delete fields later in the UpdateChain.
Content Streams
When Solr RequestHandlers are accessed using path-based URLs, the SolrQueryRequest object containing
the parameters of the request may also contain a list of ContentStreams containing bulk data for the request.
(The name SolrQueryRequest is a bit misleading: it is involved in all requests, regardless of whether it is a query
request or an update request.)
Stream Sources
Currently RequestHandlers can get content streams in a variety of ways:
For multipart file uploads, each file is passed as a stream.
For POST requests where the content-type is not application/x-www-form-urlencoded, the raw
POST body is passed as a stream. The full POST body is parsed as parameters and included in the Solr
parameters.
The contents of the stream.body parameter is passed as a stream.
If remote streaming is enabled and URL content is called for during request handling, the contents of each
stream.url and stream.file parameter are fetched and passed as a stream.
By default, curl sends a contentType="application/x-www-form-urlencoded" header. If you need to
test a SolrContentHeader content stream, you will need to set the content type with the "-H" flag.
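For example, a sketch of posting a raw XML update body with an explicit content type (assuming the techproducts example collection):
curl -H 'Content-Type: text/xml' 'http://localhost:8983/solr/techproducts/update?commit=true' --data-binary '<add><doc><field name="id">content-stream-test</field></doc></add>'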
RemoteStreaming
Remote streaming lets you send the contents of a URL as a stream to a given SolrRequestHandler. You could
use remote streaming to send a remote or local file to an update plugin. For convenience, remote streaming is
enabled in most of the example solrconfig.xml files included with Solr; however, it is not recommended in a
production situation without additional security between you and untrusted remote clients.
<!-- *** WARNING ***
The settings below authorize Solr to fetch remote files, You
should make sure your system has some authentication before
using enableRemoteStreaming="true"
-->
<requestParsers enableRemoteStreaming="true" />
The default behavior, when enableRemoteStreaming is not specified in solrconfig.xml, is to not allow
remote streaming (i.e., enableRemoteStreaming="false").
Debugging Requests
The example solrconfig.xml files include a "dump" RequestHandler:
<requestHandler name="/debug/dump" class="solr.DumpRequestHandler" />
This handler simply outputs the contents of the SolrQueryRequest using the specified writer type wt. This is a
useful tool to help understand what streams are available to the RequestHandlers.
If enableRemoteStreaming="true" is used, be aware that this allows anyone to send a request
to any URL or local file. If DumpRequestHandler is enabled, it will allow anyone to view any file on your
system.
UIMA Integration
You can integrate the Apache Unstructured Information Management Architecture (UIMA) with Solr. UIMA lets
you define custom pipelines of Analysis Engines that incrementally add metadata to your documents as
annotations.
For more information about Solr UIMA integration, see https://wiki.apache.org/solr/SolrUIMA.
Configuring UIMA
The SolrUIMA UpdateRequestProcessor is a custom update request processor that takes documents being
indexed, sends them to a UIMA pipeline, and then returns the documents enriched with the specified metadata.
To configure UIMA for Solr, follow these steps:
1. Copy solr-uima-4.x.y.jar (under /solr-4.x.y/dist/) and its libraries (under contrib/uima/lib)
to a Solr libraries directory, or set <lib/> tags in solrconfig.xml appropriately to point to those
jar files:
<lib dir="../../contrib/uima/lib" />
<lib dir="../../dist/" regex="solr-uima-\d.*\.jar" />
2. Modify schema.xml, adding your desired metadata fields specifying proper values for type, indexed,
stored, and multiValued options. For example:
<field name="language" type="string" indexed="true" stored="true" required="false"/>
<field name="concept" type="string" indexed="true" stored="true" multiValued="true" required="false"/>
<field name="sentence" type="text" indexed="true" stored="true" multiValued="true" required="false" />
3. Add the following snippet to solrconfig.xml:
<updateRequestProcessorChain name="uima">
<processor class="org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory">
<lst name="uimaConfig">
<lst name="runtimeParameters">
<str name="keyword_apikey">VALID_ALCHEMYAPI_KEY</str>
<str name="concept_apikey">VALID_ALCHEMYAPI_KEY</str>
<str name="lang_apikey">VALID_ALCHEMYAPI_KEY</str>
<str name="cat_apikey">VALID_ALCHEMYAPI_KEY</str>
<str name="entities_apikey">VALID_ALCHEMYAPI_KEY</str>
<str name="oc_licenseID">VALID_OPENCALAIS_KEY</str>
</lst>
<str name="analysisEngine">/org/apache/uima/desc/OverridingParamsExtServicesAE.xml</str>
<!-- Set to true if you want to continue indexing even if text
processing fails. Default is false; that is, Solr throws a
RuntimeException and the document is not indexed when text
processing fails. -->
<bool name="ignoreErrors">true</bool>
<!-- This is optional. It is used for logging when text processing
fails.
If logField is not specified, uniqueKey will be used as logField.
<str name="logField">id</str>
-->
<lst name="analyzeFields">
<bool name="merge">false</bool>
<arr name="fields">
<str>text</str>
</arr>
</lst>
<lst name="fieldMappings">
<lst name="type">
<str name="name">org.apache.uima.alchemy.ts.concept.ConceptFS</str>
<lst name="mapping">
<str name="feature">text</str>
<str name="field">concept</str>
</lst>
</lst>
<lst name="type">
<str
name="name">org.apache.uima.alchemy.ts.language.LanguageFS</str>
<lst name="mapping">
<str name="feature">language</str>
<str name="field">language</str>
</lst>
</lst>
<lst name="type">
<str name="name">org.apache.uima.SentenceAnnotation</str>
<lst name="mapping">
<str name="feature">coveredText</str>
<str name="field">sentence</str>
</lst>
</lst>
</lst>
</lst>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
4. In your solrconfig.xml, replace the existing default UpdateRequestHandler or create a new
UpdateRequestHandler:
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
<lst name="defaults">
<str name="update.processor">uima</str>
</lst>
</requestHandler>
Once you are done with the configuration your documents will be automatically enriched with the specified fields
when you index them.
VALID_ALCHEMYAPI_KEY is your AlchemyAPI Access Key. You need to register an AlchemyAPI
Access key to use AlchemyAPI services: http://www.alchemyapi.com/api/register.html.
VALID_OPENCALAIS_KEY is your Calais Service Key. You need to register a Calais Service key
to use the Calais services: http://www.opencalais.com/apikey.
analysisEngine must contain an AE descriptor inside the specified path in the classpath.
analyzeFields must contain the input fields that need to be analyzed by UIMA. If merge=true
then their content will be merged and analyzed only once.
Field mapping describes which features of which types should go in a field.
Searching
This section describes how Solr works with search requests. It covers the following topics:
Overview of Searching in Solr: An introduction to searching with Solr.
Velocity Search UI: A simple search UI using the VelocityResponseWriter.
Relevance: Conceptual information about understanding relevance in search results.
Query Syntax and Parsing: A brief conceptual overview of query syntax and parsing. It also contains the
following sub-sections:
Common Query Parameters: No matter the query parser, there are several parameters that are
common to all of them.
The Standard Query Parser: Detailed information about the standard Lucene query parser.
The DisMax Query Parser: Detailed information about Solr's DisMax query parser.
The Extended DisMax Query Parser: Detailed information about Solr's Extended DisMax (eDisMax)
Query Parser.
Function Queries: Detailed information about parameters for generating relevancy scores using
values from one or more numeric fields.
Local Parameters in Queries: How to add local arguments to queries.
Other Parsers: More parsers designed for use in specific situations.
Faceting: Detailed information about categorizing search results based on indexed terms.
Highlighting: Detailed information about Solr's highlighting utilities. Sub-sections cover the different types
of highlighters:
Standard Highlighter: Uses the most sophisticated and fine-grained query representation of the
three highlighters.
FastVector Highlighter: Optimized for term vector options on fields, and good for large documents
and multiple languages.
Postings Highlighter: Uses similar options as the FastVector highlighter, but is more compact and
efficient.
Spell Checking: Detailed information about Solr's spelling checker.
Query Re-Ranking: Detailed information about re-ranking top scoring documents from simple queries
using more complex scores.
Transforming Result Documents: Detailed information about using DocTransformers to add computed
information to individual documents.
Suggester: Detailed information about Solr's powerful autosuggest component.
MoreLikeThis: Detailed information about Solr's similar results query component.
Pagination of Results: Detailed information about fetching paginated results for display in a UI, or for
fetching all documents matching a query.
Result Grouping: Detailed information about grouping results based on common field values.
Result Clustering: Detailed information about grouping search results based on cluster analysis applied to
text fields. A bit like "unsupervised" faceting.
Spatial Search: How to use Solr's spatial search capabilities.
The Terms Component: Detailed information about accessing indexed terms and the documents that
include them.
The Term Vector Component: How to get term information about specific documents.
The Stats Component: How to return information from numeric fields within a document set.
The Query Elevation Component: How to force documents to the top of the results for certain queries.
Response Writers: Detailed information about configuring and using Solr's response writers.
Near Real Time Searching: How to include documents in search results nearly immediately after they are
indexed.
RealTime Get: How to get the latest version of a document without opening a searcher.
Overview of Searching in Solr
Solr offers a rich, flexible set of features for search. To understand the extent of this flexibility, it's helpful to begin
with an overview of the steps and components involved in a Solr search.
When a user runs a search in Solr, the search query is processed by a request handler. A request handler is a
Solr plug-in that defines the logic to be used when Solr processes a request. Solr supports a variety of request
handlers. Some are designed for processing search queries, while others manage tasks such as index
replication.
Search applications select a particular request handler by default. In addition, applications can be configured to
allow users to override the default selection in preference of a different request handler.
To process a search query, a request handler calls a query parser, which interprets the terms and parameters
of a query. Different query parsers support different syntax. The default query parser is the standard ("lucene")
query parser. Solr also includes the DisMax query parser and an Extended DisMax (eDisMax) query parser.
The standard query parser's syntax allows for greater precision in searches, but the DisMax query parser is
much more tolerant of errors. The DisMax query parser is designed to provide an experience similar to that of
popular search engines such as Google, which rarely display syntax errors to users. The Extended DisMax query
parser is an improved version of DisMax that handles the full Lucene query syntax while still tolerating syntax
errors. It also includes several additional features.
In addition, there are common query parameters that are accepted by all query parsers.
Input to a query parser can include:
search strings---that is, terms to search for in the index
parameters for fine-tuning the query by increasing the importance of particular strings or fields, by
applying Boolean logic among the search terms, or by excluding content from the search results
parameters for controlling the presentation of the query response, such as specifying the order in which
results are to be presented or limiting the response to particular fields of the search application's schema.
Search parameters may also specify a query filter. As part of a search response, a query filter runs a query
against the entire index and caches the results. Because Solr allocates a separate cache for filter queries, the
strategic use of filter queries can improve search performance. (Despite their similar names, query filters are not
related to analysis filters. Query filters perform queries at search time against data already in the index, while
analysis filters, such as Tokenizers, parse content for indexing, following specified rules).
A search query can request that certain terms be highlighted in the search response; that is, the selected terms
will be displayed in colored boxes so that they "jump out" on the screen of search results. Highlighting can
make it easier to find relevant passages in long documents returned in a search. Solr supports multi-term
highlighting. Solr includes a rich set of search parameters for controlling how terms are highlighted.
Search responses can also be configured to include snippets (document excerpts) featuring highlighted text.
Popular search engines such as Google and Yahoo! return snippets in their search results: 3-4 lines of text
offering a description of a search result.
To help users zero in on the content they're looking for, Solr supports two special ways of grouping search
results to aid further exploration: faceting and clustering.
Faceting is the arrangement of search results into categories (which are based on indexed terms). Within each
category, Solr reports on the number of hits for each relevant term, which is called a facet constraint. Faceting makes
it easy for users to explore search results on sites such as movie sites and product review sites, where there are
many categories and many items within a category.
The image below shows an example of faceting from the CNET Web site, which was the first site to use Solr.
Faceting makes use of fields defined when the search applications were indexed. In the example above, these
fields include categories of information that are useful for describing digital cameras: manufacturer, resolution,
and zoom range.
Clustering groups search results by similarities discovered when a search is executed, rather than when content
is indexed. The results of clustering often lack the neat hierarchical organization found in faceted search results,
but clustering can be useful nonetheless. It can reveal unexpected commonalities among search results, and it
can help users rule out content that isn't pertinent to what they're really searching for.
Solr also supports a feature called MoreLikeThis, which enables users to submit new queries that focus on
particular terms returned in an earlier query. MoreLikeThis queries can make use of faceting or clustering to
provide additional aid to users.
A Solr component called a response writer manages the final presentation of the query response. Solr includes
a variety of response writers, including an XML Response Writer and a JSON Response Writer.
The diagram below summarizes some key elements of the search process.
Velocity Search UI
Solr includes a sample search UI based on the VelocityResponseWriter (also known as Solritas) that
demonstrates several useful features, such as searching, faceting, highlighting, autocomplete, and geospatial
searching.
When using the sample_techproducts_configs config set, you can access the Velocity sample Search UI
here: http://localhost:8983/solr/techproducts/browse
The Velocity Search UI
For more information about the Velocity Response Writer, see the Response Writer page.
Relevance
Relevance is the degree to which a query response satisfies a user who is searching for information.
The relevance of a query response depends on the context in which the query was performed. A single search
application may be used in different contexts by users with different needs and expectations. For example, a
search engine of climate data might be used by a university researcher studying long-term climate trends, a
farmer interested in calculating the likely date of the last frost of spring, a civil engineer interested in rainfall
patterns and the frequency of floods, and a college student planning a vacation to a region and wondering what
to pack. Because the motivations of these users vary, the relevance of any particular response to a query will
vary as well.
How comprehensive should query responses be? Like relevance in general, the answer to this question depends
on the context of a search. The cost of not finding a particular document in response to a query is high in some
contexts, such as a legal e-discovery search in response to a subpoena, and quite low in others, such as a
search for a cake recipe on a Web site with dozens or hundreds of cake recipes. When configuring Solr, you
should weigh comprehensiveness against other factors such as timeliness and ease-of-use.
The e-discovery and recipe examples demonstrate the importance of two concepts related to relevance:
Precision is the percentage of documents in the returned results that are relevant.
Recall is the percentage of relevant results returned out of all relevant results in the system. Obtaining
perfect recall is trivial: simply return every document in the collection for every query.
Returning to the examples above, it's important for an e-discovery search application to have 100% recall
returning all the documents that are relevant to a subpoena. It's far less important that a recipe application offer
this degree of precision, however. In some cases, returning too many results in casual contexts could overwhelm
users. In some contexts, returning fewer results that have a higher likelihood of relevance may be the best
approach.
Using the concepts of precision and recall, it's possible to quantify relevance across users and queries for a
collection of documents. A perfect system would have 100% precision and 100% recall for every user and every
query. In other words, it would retrieve all the relevant documents and nothing else. In practical terms, when
talking about precision and recall in real systems, it is common to focus on precision and recall at a certain
number of results, the most common (and useful) being ten results.
Through faceting, query filters, and other search components, a Solr application can be configured with the
flexibility to help users fine-tune their searches in order to return the most relevant results for users. That is, Solr
can be configured to balance precision and recall to meet the needs of a particular user community.
The configuration of a Solr application should take into account:
the needs of the application's various users (which can include ease of use and speed of response, in
addition to strictly informational needs)
the categories that are meaningful to these users in their various contexts (e.g., dates, product categories,
or regions)
any inherent relevance of documents (e.g., it might make sense to ensure that an official product
description or FAQ is always returned near the top of the search results)
whether or not the age of documents matters significantly (in some contexts, the most recent documents
might always be the most important)
Keeping all these factors in mind, it's often helpful in the planning stages of a Solr deployment to sketch out the
types of responses you think the search application should return for sample queries. Once the application is up
and running, you can employ a series of testing methodologies, such as focus groups, in-house testing, TREC
tests and A/B testing to fine-tune the configuration of the application to best meet the needs of its users.
For more information about relevance, see Grant Ingersoll's tech article Debugging Search Application
Relevance Issues, which is available on SearchHub.org.
Query Syntax and Parsing
Solr supports several query parsers, offering search application designers great flexibility in controlling how
queries are parsed.
This section explains how to specify the query parser to be used. It also describes the syntax and features
supported by the main query parsers included with Solr and describes some other parsers that may be useful for
particular situations. There are some query parameters common to all Solr parsers; these are discussed in the
section Common Query Parameters.
The parsers discussed in this Guide are:
The Standard Query Parser
The DisMax Query Parser
The Extended DisMax Query Parser
Other Parsers
The query parser plugins are all subclasses of QParserPlugin. If you have custom parsing needs, you may want
to extend that class to create your own query parser.
For more detailed information about the many query parsers available in Solr, see
https://wiki.apache.org/solr/SolrQuerySyntax.
Common Query Parameters
The table below summarizes Solr's common query parameters, which are supported by the Standard, DisMax,
and eDisMax Request Handlers.
Parameter Description
defType Selects the query parser to be used to process the query.
sort Sorts the response to a query in either ascending or descending order based on the
response's score or another specified characteristic.
start Specifies an offset (by default, 0) into the responses at which Solr should begin displaying
content.
rows Controls how many rows of responses are displayed at a time (default value: 10)
fq Applies a filter query to the search results.
fl Limits the information included in a query response to a specified list of fields. The fields need
to have been indexed as stored for this parameter to work correctly.
debug Request additional debugging information in the response. Specifying the debug=timing parameter returns just the timing information; specifying the debug=results parameter returns "explain" information for each of the documents returned; specifying the debug=query parameter returns debug information about the query only.
explainOther Allows clients to specify a Lucene query to identify a set of documents. If non-blank, the
explain info of each document which matches this query, relative to the main query (specified
by the q parameter) will be returned along with the rest of the debugging information.
timeAllowed Defines the time allowed for the query to be processed. If the time elapses before the query
response is complete, partial information may be returned.
omitHeader Excludes the header from the returned results, if set to true. The header contains information
about the request, such as the time the request took to complete. The default is false.
wt Specifies the Response Writer to be used to format the query response.
logParamsList By default, Solr logs all parameters. Set this parameter to restrict which parameters are
logged. Valid entries are the parameters to be logged, separated by commas (i.e.,
logParamsList=param1,param2). An empty list will log no parameters, so if logging all parameters
is desired, do not define this additional parameter at all.
The following sections describe these parameters in detail.
The defType Parameter
The defType parameter selects the query parser that Solr should use to process the main query parameter (q) in
the request. For example:
defType=dismax
If no defType param is specified, then by default, The Standard Query Parser is used (e.g.,
defType=lucene).
The sort Parameter
The sort parameter arranges search results in either ascending (asc) or descending (desc) order. The
parameter can be used with either numerical or alphabetical content. The directions can be entered in either all
lowercase or all uppercase letters (i.e., both asc or ASC).
Solr can sort query responses according to document scores or the value of any indexed field with a single value
(that is, any field whose attributes in schema.xml include multiValued="false" and indexed="true"),
provided that:
the field is non-tokenized (that is, the field has no analyzer; if its contents were parsed into multiple tokens,
sorting would be inconsistent), or
the field uses an analyzer (such as the KeywordTokenizer) that produces only a single term.
If you want to be able to sort on a field whose contents you want to tokenize to facilitate searching, use the
<copyField> directive in the schema.xml file to clone the field. Then search on the field and sort on its clone.
The table explains how Solr responds to various settings of the sort parameter.
Example Result
If the sort parameter is omitted, sorting is performed as though the parameter were set to
score desc.
score desc Sorts in descending order from the highest score to the lowest score.
price asc Sorts in ascending order of the price field
inStock desc,
price asc
Sorts by the contents of the inStock field in descending order, then within those results
sorts in ascending order by the contents of the price field.
Regarding the sort parameter's arguments:
A sort ordering must include a field name (or score as a pseudo field), followed by whitespace (escaped
as + or %20 in URL strings), followed by a sort direction (asc or desc).
Multiple sort orderings can be separated by a comma, using this syntax: sort=<field
name>+<direction>,<field name>+<direction>,...
When more than one sort criteria is provided, the second entry will only be used if the first entry
results in a tie. If there is a third entry, it will only be used if the first AND second entries are tied.
This pattern continues with further entries.
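For example, the last row of the table above corresponds to a request like this (a sketch against the techproducts example collection; adjust the field names to your schema):
http://localhost:8983/solr/techproducts/select?q=*:*&sort=inStock+desc,price+asc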
The start Parameter
When specified, the parameter specifies an offset into a query's result set and instructs Solr to beginstart
displaying results from this offset.
The default value is "0". In other words, by default, Solr returns results without an offset, beginning where the
results themselves begin.
Setting the parameter to some other number, such as 3, causes Solr to skip over the preceding recordsstart
and start at the document identified by the offset.
You can use the parameter this way for paging. For example, if the parameter is set to 10, youstart rows
could display three successive pages of results by setting start to 0, then re-issuing the same query and setting
start to 10, then issuing the query again and setting start to 20.
The rows Parameter
You can use the rows parameter to paginate results from a query. The parameter specifies the maximum
number of documents from the complete result set that Solr should return to the client at one time.
The default value is 10. That is, by default, Solr returns 10 documents at a time in response to a query.
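For example, with the default rows=10, three successive pages of results can be fetched by re-issuing the same query while advancing start (a sketch against the techproducts example collection):
http://localhost:8983/solr/techproducts/select?q=*:*&rows=10&start=0
http://localhost:8983/solr/techproducts/select?q=*:*&rows=10&start=10
http://localhost:8983/solr/techproducts/select?q=*:*&rows=10&start=20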
The fq (Filter Query) Parameter
The fq parameter defines a query that can be used to restrict the superset of documents that can be returned,
without influencing score. It can be very useful for speeding up complex queries, since the queries specified with
fq are cached independently of the main query. When a later query uses the same filter, there's a cache hit, and
filter results are returned quickly from the cache.
When using the fq parameter, keep in mind the following:
The fq parameter can be specified multiple times in a query. Documents will only be included in the result
if they are in the intersection of the document sets resulting from each instance of the parameter. In the
example below, only documents which have a popularity greater than 10 and have a section of 0 will
match.
fq=popularity:[10 TO *]&fq=section:0
Filter queries can involve complicated Boolean queries. The above example could also be written as a
single fq with two mandatory clauses like so:
fq=+popularity:[10 TO *] +section:0
The document sets from each filter query are cached independently. Thus, concerning the previous
examples: use a single fq containing two mandatory clauses if those clauses appear together often, and
use two separate fq parameters if they are relatively independent. (To learn about tuning cache sizes and
making sure a filter cache actually exists, see The Well-Configured Solr Instance.)
As with all parameters: special characters in a URL need to be properly escaped and encoded as hex
values. Online tools are available to help you with URL-encoding. For example:
http://meyerweb.com/eric/tools/dencoder/.
The fl (Field List) Parameter
The fl parameter limits the information included in a query response to a specified list of fields. The fields need
to have been indexed as stored for this parameter to work correctly.
The field list can be specified as a space-separated or comma-separated list of field names. The string "score"
can be used to indicate that the score of each document for the particular query should be returned as a field.
The wildcard character "*" selects all the stored fields in a document. You can also add pseudo-fields, functions
and transformers to the field list request.
This table shows some basic examples of how to use fl:
Field List Result
id name price Return only the id, name, and price fields.
id,name,price Return only the id, name, and price fields.
id name, price Return only the id, name, and price fields.
id score Return the id field and the score.
* Return all the fields in each document. This is the default value of the fl parameter.
* score Return all the fields in each document, along with each field's score.
Function Values
Functions can be computed for each document in the result and returned as a pseudo-field:
fl=id,title,product(price,popularity)
Document Transformers
Document Transformers can be used to modify the information returned about each document in the results of
a query:
fl=id,title,[explain]
Field Name Aliases
You can change the key used in the response for a field, function, or transformer by prefixing it with a
"displayName:". For example:
fl=id,sales_price:price,secret_sauce:prod(price,popularity),why_score:[explain
style=nl]
"response":{"numFound":2,"start":0,"docs":[
{
"id":"6H500F0",
"secret_sauce":2100.0,
"sales_price":350.0,
"why_score":{
"match":true,
"value":1.052226,
"description":"weight(features:cache in 2) [DefaultSimilarity], result
of:",
"details":[{
...
The debug Parameter
The debug parameter can be specified multiple times and supports the following arguments:
debug=query: return debug information about the query only.
debug=timing: return debug information about how long the query took to process.
debug=results: return debug information about the score results (also known as "explain")
debug=all: return all available debug information about the request (alternative usage: debug=true).
For backwards compatibility with older versions of Solr, debugQuery=true may instead be specified as an
alternative way to indicate debug=all.
The default behavior is not to include debugging information.
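For example, the following request asks for both query and timing debug output (a sketch against the techproducts example collection):
http://localhost:8983/solr/techproducts/select?q=memory&debug=query&debug=timing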
The explainOther Parameter
The explainOther parameter specifies a Lucene query in order to identify a set of documents. If this
parameter is included and is set to a non-blank value, the query will return debugging information, along with the
"explain info" of each document that matches the Lucene query, relative to the main query (which is specified by
the q parameter). For example:
q=supervillians&debugQuery=on&explainOther=id:juggernaut
The query above allows you to examine the scoring explain info of the top matching documents, compare it to
the explain info for documents matching id:juggernaut, and determine why the rankings are not as you
expect.
The default value of this parameter is blank, which causes no extra "explain info" to be returned.
The timeAllowed Parameter
This parameter specifies the amount of time, in milliseconds, allowed for a search to complete. If this time
expires before the search is complete, any partial results will be returned.
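For example, to allow at most one second (1000 milliseconds) for the search (a sketch against the techproducts example collection):
http://localhost:8983/solr/techproducts/select?q=*:*&timeAllowed=1000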
The omitHeader Parameter
This parameter may be set to either true or false.
If set to true, this parameter excludes the header from the returned results. The header contains information
about the request, such as the time it took to complete. The default value for this parameter is false.
The wt Parameter
The wt parameter selects the Response Writer that Solr should use to format the query's response. For detailed
descriptions of Response Writers, see Response Writers.
The cache=false Parameter
Solr caches the results of all queries and filter queries by default. To disable result caching, set the
cache=false parameter.
You can also use the cost option to control the order in which non-cached filter queries are evaluated. This
allows you to order less expensive non-cached filters before expensive non-cached filters.
For very high cost filters, if cache=false and cost>=100 and the query implements the PostFilter interface,
a Collector will be requested from that query and used to filter documents after they have matched the main
query and all other filter queries. There can be multiple post filters; they are also ordered by cost.
For example:
// normal function range query used as a filter, all matching documents
// generated up front and cached
fq={!frange l=10 u=100}mul(popularity,price)
// function range query run in parallel with the main query like a traditional
// lucene filter
fq={!frange l=10 u=100 cache=false}mul(popularity,price)
// function range query checked after each document that already matches the query
// and all other filters. Good for really expensive function queries.
fq={!frange l=10 u=100 cache=false cost=100}mul(popularity,price)
The logParamsList Parameter
By default, Solr logs all parameters of requests. From version 4.7, set this parameter to restrict which parameters
of a request are logged. This may help control logging to only those parameters considered important to your
organization.
For example, you could define this like:
logParamsList=q,fq
And only the 'q' and 'fq' parameters will be logged.
If no parameters should be logged, you can send logParamsList as empty (i.e., logParamsList=).
This parameter does not only apply to query requests, but to any kind of request to Solr.
The Standard Query Parser
Solr's default Query Parser is also known as the "lucene" parser.
The key advantage of the standard query parser is that it supports a robust and fairly intuitive syntax allowing
you to create a variety of structured queries. The largest disadvantage is that it's very intolerant of syntax errors,
as compared with something like the DisMax query parser which is designed to throw as few errors as possible.
Topics covered in this section:
Standard Query Parser Parameters
The Standard Query Parser's Response
Specifying Terms for the Standard Query Parser
Specifying Fields in a Query to the Standard Query Parser
Boolean Operators Supported by the Standard Query Parser
Grouping Terms to Form Sub-Queries
Differences between Lucene Query Parser and the Solr Standard Query Parser
Related Topics
Standard Query Parser Parameters
In addition to the Common Query Parameters, Faceting Parameters, Highlighting Parameters, and MoreLikeThis
Parameters, the standard query parser supports the parameters described in the table below.
Parameter Description
q Defines a query using standard query syntax. This parameter is mandatory.
q.op Specifies the default operator for query expressions, overriding the default operator specified in
the schema.xml file. Possible values are "AND" or "OR".
df Specifies a default field, overriding the definition of a default field in the schema.xml file.
Default parameter values are specified in solrconfig.xml, or overridden by query-time values in the request.
The Standard Query Parser's Response
By default, the response from the standard query parser contains one block, which is unnamed. If the<result>
is used, then an additional block will be returned, using the name "debug". This will parameterdebug <lst>
contain useful debugging info, including the original query string, the parsed query string, and explain info for
each document in the <result> block. If the is also used, then additional explain info parameterexplainOther
will be provided for all the documents matching that query.
Sample Responses
This section presents examples of responses from the standard query parser.
The URL below submits a simple query and requests the XML Response Writer to use indentation to make the
XML response more readable.
http://localhost:8983/solr/techproducts/select?q=id:SP2514N
Results:
<?xml version="1.0" encoding="UTF-8"?>
<response>
<responseHeader><status>0</status><QTime>1</QTime></responseHeader>
<result numFound="1" start="0">
<doc>
<arr name="cat"><str>electronics</str><str>hard drive</str></arr>
<arr name="features"><str>7200RPM, 8MB cache, IDE Ultra ATA-133</str>
<str>NoiseGuard, SilentSeek technology, Fluid Dynamic Bearing (FDB)
motor</str></arr>
<str name="id">SP2514N</str>
<bool name="inStock">true</bool>
<str name="manu">Samsung Electronics Co. Ltd.</str>
<str name="name">Samsung SpinPoint P120 SP2514N - hard drive - 250 GB -
ATA-133</str>
<int name="popularity">6</int>
<float name="price">92.0</float>
<str name="sku">SP2514N</str>
</doc>
</result>
</response>
Here's an example of a query with a limited field list.
http://localhost:8983/solr/techproducts/select?q=id:SP2514N&fl=id+name
Results:
<?xml version="1.0" encoding="UTF-8"?>
<response>
<responseHeader><status>0</status><QTime>2</QTime></responseHeader>
<result numFound="1" start="0">
<doc>
<str name="id">SP2514N</str>
<str name="name">Samsung SpinPoint P120 SP2514N - hard drive - 250 GB -
ATA-133</str>
</doc>
</result>
</response>
Specifying Terms for the Standard Query Parser
A query to the standard query parser is broken up into terms and operators. There are two types of terms: single
terms and phrases.
A single term is a single word such as "test" or "hello"
A phrase is a group of words surrounded by double quotes such as "hello dolly"
Multiple terms can be combined together with Boolean operators to form more complex queries (as described
below).
Term Modifiers
Solr supports a variety of term modifiers that add flexibility or precision, as needed, to searches. These modifiers
include wildcard characters, characters for making a search "fuzzy" or more general, and so on. The sections
below describe these modifiers in detail.
Wildcard Searches
Solr's standard query parser supports single and multiple character wildcard searches within single terms.
Wildcard characters can be applied to single terms, but not to search phrases.
Single character (matches a single character): the special character is ?. For example, the search string te?t
would match both test and text.
Multiple characters (matches zero or more sequential characters): the special character is *. For example, the
wildcard search tes* would match test, testing, and tester. You can also use wildcard characters in the middle
of a term. For example, te*t would match test and text, and *est would match pest and test.
Fuzzy Searches
Solr's standard query parser supports fuzzy searches based on the Damerau-Levenshtein Distance or Edit
Distance algorithm. Fuzzy searches discover terms that are similar to a specified term without necessarily being
an exact match. To perform a fuzzy search, use the tilde ~ symbol at the end of a single-word term. For example,
to search for a term similar in spelling to "roam," use the fuzzy search:
roam~
This search will match terms like roams, foam, & foams. It will also match the word "roam" itself.
An optional distance parameter specifies the maximum number of edits allowed, between 0 and 2, defaulting to
2. For example:
roam~1
This will match terms like roams & foam - but not foams since it has an edit distance of "2".
It is important that the analyzer used for queries parses terms and phrases in a way that is consistent
with the way the analyzer used for indexing parses terms and phrases; otherwise, searches may
produce unexpected results.
Proximity Searches
A proximity search looks for terms that are within a specific distance from one another.
To perform a proximity search, add the tilde character ~ and a numeric value to the end of a search phrase. For
example, to search for "apache" and "jakarta" within 10 words of each other in a document, use the search:
"jakarta apache"~10
The distance referred to here is the number of term movements needed to match the specified phrase. In the
example above, if "apache" and "jakarta" were 10 spaces apart in a field, but "apache" appeared before "jakarta",
more than 10 term movements would be required to move the terms together and position "apache" to the right
of "jakarta" with a space in between.
Range Searches
A range search specifies a range of values for a field (a range with an upper bound and a lower bound). The
query matches documents whose values for the specified field or fields fall within the range. Range queries can
be inclusive or exclusive of the upper and lower bounds. Sorting is done lexicographically, except on numeric
fields. For example, the range query below matches all documents whose mod_date field has a value between
20020101 and 20030101, inclusive.
mod_date:[20020101 TO 20030101]
Range queries are not limited to date fields or even numerical fields. You could also use range queries with
non-date fields:
title:{Aida TO Carmen}
This will find all documents whose titles are between Aida and Carmen, but not including Aida and Carmen.
The brackets around a query determine its inclusiveness.
Square brackets [ ] denote an inclusive range query that matches values including the upper and lower
bound.
Curly brackets { } denote an exclusive range query that matches values between the upper and lower
bounds, but excluding the upper and lower bounds themselves.
You can mix these types so one end of the range is inclusive and the other is exclusive. Here's an
example: count:{1 TO 10]
Boosting a Term with ^
Lucene/Solr provides the relevance level of matching documents based on the terms found. To boost a term, use
the caret symbol ^ with a boost factor (a number) at the end of the term you are searching. The higher the boost
factor, the more relevant the term will be.
In many cases, stemming (reducing terms to a common stem) can produce similar effects to fuzzy
searches and wildcard searches.
Boosting allows you to control the relevance of a document by boosting its term. For example, if you are
searching for
"jakarta apache" and you want the term "jakarta" to be more relevant, you can boost it by adding the ^ symbol
along with the boost factor immediately after the term. For example, you could type:
jakarta^4 apache
This will make documents with the term jakarta appear more relevant. You can also boost Phrase Terms as in
the example:
"jakarta apache"^4 "Apache Lucene"
By default, the boost factor is 1. Although the boost factor must be positive, it can be less than 1 (for example, it
could be 0.2).
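For example, to make matches on "apache" count for less than matches on "jakarta" (a hypothetical query), you
could type:
jakarta apache^0.2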
Constant Score with ^=
Constant score queries are created with <query_clause>^=<score>, which sets the entire clause to the
specified score for any documents matching that clause. This is desirable when you only care about matches for
a particular clause and don't want other relevancy factors such as term frequency (the number of times the term
appears in the field) or inverse document frequency (a measure across the whole index for how rare a term is in
a field).
Example:
(description:blue OR color:blue)^=1.0 text:shoes
Specifying Fields in a Query to the Standard Query Parser
Data indexed in Solr is organized in fields, which are defined in the Solr schema.xml file. Searches can take
advantage of fields to add precision to queries. For example, you can search for a term only in a specific field,
such as a title field.
The schema.xml file defines one field as a default field. If you do not specify a field in a query, Solr searches
only the default field. Alternatively, you can specify a different field or a combination of fields in a query.
To specify a field, type the field name followed by a colon ":" and then the term you are searching for within the
field.
For example, suppose an index contains two fields, title and text, and that text is the default field. If you want to
find a document called "The Right Way" which contains the text "don't go this way," you could include either of
the following terms in your search query:
title:"The Right Way" AND text:go
title:"Do it right" AND go
Since text is the default field, the field indicator is not required; hence the second query above omits it.
The field is only valid for the term that it directly precedes, so the query title:Do it right will find only "Do"
in the title field. It will find "it" and "right" in the default field (in this case the text field).
Boolean Operators Supported by the Standard Query Parser
Boolean operators allow you to apply Boolean logic to queries, requiring the presence or absence of specific
terms or conditions in fields in order to match documents. The table below summarizes the Boolean operators
supported by the standard query parser.
AND (alternative symbol: &&): Requires both terms on either side of the Boolean operator to be present for a
match.
NOT (alternative symbol: !): Requires that the following term not be present.
OR (alternative symbol: ||): Requires that either term (or both terms) be present for a match.
+: Requires that the following term be present.
-: Prohibits the following term (that is, matches on fields or documents that do not include that term). The -
operator is functionally similar to the Boolean operator !. Because it's used by popular search engines such as
Google, it may be more familiar to some user communities.
Boolean operators allow terms to be combined through logic operators. Lucene supports AND, " ", OR, NOT and+
" " as Boolean operators.-
The OR operator is the default conjunction operator. This means that if there is no Boolean operator between
two terms, the OR operator is used. The OR operator links two terms and finds a matching document if either of
the terms exist in a document. This is equivalent to a union using sets. The symbol || can be used in place of the
word OR.
In the schema.xml file, you can specify which symbols can take the place of Boolean operators such as OR. To
search for documents that contain either "jakarta apache" or just "jakarta," use the query:
"jakarta apache" jakarta
or
"jakarta apache" OR jakarta
The Boolean Operator +
The + symbol (also known as the "required" operator) requires that the term after the + symbol exist somewhere
in a field in at least one document in order for the query to return a match.
For example, to search for documents that must contain "jakarta" and that may or may not contain "lucene," use
the following query:
+jakarta lucene
When specifying Boolean operators with keywords such as AND or NOT, the keywords must appear in
all uppercase.
The standard query parser supports all the Boolean operators listed in the table above. The DisMax
query parser supports only + and -.
The Boolean Operator AND (&&)
The AND operator matches documents where both terms exist anywhere in the text of a single document. This is
equivalent to an intersection using sets. The && symbol can be used in place of the word AND.
To search for documents that contain "jakarta apache" and "Apache Lucene," use either of the following queries:
"jakarta apache" AND "Apache Lucene"
"jakarta apache" && "Apache Lucene"
The Boolean Operator NOT (!)
The NOT operator excludes documents that contain the term after NOT. This is equivalent to a difference using
sets. The ! symbol can be used in place of the word NOT.
The following queries search for documents that contain the phrase "jakarta apache" but do not contain the
phrase "Apache Lucene":
"jakarta apache" NOT "Apache Lucene"
"jakarta apache" ! "Apache Lucene"
The Boolean Operator -
The - symbol, or "prohibit" operator, excludes documents that contain the term after the - symbol.
For example, to search for documents that contain "jakarta apache" but not "Apache Lucene," use the following
query:
"jakarta apache" -"Apache Lucene"
This operator is supported by both the standard query parser and the DisMax query parser.
Escaping Special Characters
Solr gives the following characters special meaning when they appear in a query:
+ - && || ! ( ) { } [ ] ^ " ~ * ? : /
To make Solr interpret any of these characters literally, rather than as a special character, precede the character
with a backslash character \. For example, to search for (1+1):2 without having Solr interpret the plus sign and
a backslash character \. For example, to search for (1+1):2 without having Solr interpret the plus sign and
parentheses as special characters for formulating a sub-query with two terms, escape the characters by
preceding each one with a backslash:
\(1\+1\)\:2
Grouping Terms to Form Sub-Queries
Lucene/Solr supports using parentheses to group clauses to form sub-queries. This can be very useful if you
want to control the Boolean logic for a query.
The query below searches for either "jakarta" or "apache" and "website":
(jakarta OR apache) AND website
This adds precision to the query, requiring that the term "website" exist, along with either of the terms "jakarta"
or "apache."
Grouping Clauses within a Field
To apply two or more Boolean operators to a single field in a search, group the Boolean clauses within
parentheses. For example, the query below searches for a title field that contains both the word "return" and the
phrase "pink panther":
title:(+return +"pink panther")
Differences between Lucene Query Parser and the Solr Standard Query Parser
Solr's standard query parser differs from the Lucene Query Parser in the following ways:
A * may be used for either or both endpoints to specify an open-ended range query
field:[* TO 100] finds all field values less than or equal to 100
field:[100 TO *] finds all field values greater than or equal to 100
field:[* TO *] matches all documents with the field
Pure negative queries (all clauses prohibited) are allowed (only as a top-level clause)
-inStock:false finds all field values where inStock is not false
-field:[* TO *] finds all documents without a value for field
A hook into FunctionQuery syntax. You'll need to use quotes to encapsulate the function if it includes
parentheses, as shown in the second example below:
_val_:myfield
_val_:"recip(rord(myfield),1,2,3)"
Support for using any type of query parser as a nested clause.
inStock:true OR {!dismax qf='name manu' v='ipod'}
Range queries ("[a TO z]"), prefix queries ("a*"), and wildcard queries ("a*b") are constant-scoring (all
matching documents get an equal score). The scoring factors TF, IDF, index boost, and "coord" are not
used. There is no limitation on the number of terms that match (as there was in past versions of Lucene).
Specifying Dates and Times
Queries against fields using the TrieDateField type (typically range queries) should use the appropriate date
syntax:
timestamp:[* TO NOW]
createdate:[1976-03-06T23:59:59.999Z TO *]
createdate:[1995-12-31T23:59:59.999Z TO 2007-03-06T00:00:00Z]
pubdate:[NOW-1YEAR/DAY TO NOW/DAY+1DAY]
createdate:[1976-03-06T23:59:59.999Z TO 1976-03-06T23:59:59.999Z+1YEAR]
createdate:[1976-03-06T23:59:59.999Z/YEAR TO 1976-03-06T23:59:59.999Z]
Related Topics
Local Parameters in Queries
Other Parsers
The DisMax Query Parser
The DisMax query parser is designed to process simple phrases (without complex syntax) entered by users and
to search for individual terms across several fields using different weighting (boosts) based on the significance of
each field. Additional options enable users to influence the score based on rules specific to each use case
(independent of user input).
In general, the DisMax query parser's interface is more like that of Google than the interface of the 'standard' Solr
request handler. This similarity makes DisMax the appropriate query parser for many consumer applications. It
accepts a simple syntax, and it rarely produces error messages.
The DisMax query parser supports an extremely simplified subset of the Lucene QueryParser syntax. As in
Lucene, quotes can be used to group phrases, and +/- can be used to denote mandatory and optional clauses.
All other Lucene query parser special characters (except AND and OR) are escaped to simplify the user
experience. The DisMax query parser takes responsibility for building a good query from the user's input using
Boolean clauses containing DisMax queries across fields and boosts specified by the user. It also lets the Solr
administrator provide additional boosting queries, boosting functions, and filtering queries to artificially affect the
outcome of all searches. These options can all be specified as default parameters for the request handler in the
solrconfig.xml file or overridden in the Solr query URL.
Interested in the technical concept behind the DisMax name? DisMax stands for Maximum Disjunction. Here's a
definition of a Maximum Disjunction or "DisMax" query:
A query that generates the union of documents produced by its subqueries, and that scores each
document with the maximum score for that document as produced by any subquery, plus a tie
breaking increment for any additional matching subqueries.
Whether or not you remember this explanation, do remember that the DisMax request handler was primarily
designed to be easy to use and to accept almost any input without returning an error.
DisMax Parameters
In addition to the common request parameter, highlighting parameters, and simple facet parameters, the DisMax
query parser supports the parameters described below. Like the standard query parser, the DisMax query parser
allows default parameter values to be specified in , or overridden by query-time values in thesolrconfig.xml
request.
Parameter Description
q Defines the raw input strings for the query.
q.alt Calls the standard query parser and defines query input strings, when the q parameter is not
used.
qf Query Fields: specifies the fields in the index on which to perform the query. If absent, defaults to
df.
mm Minimum "Should" Match: specifies a minimum number of clauses that must match in a query. If
no 'mm' parameter is specified in the query, or as a default in solrconfig.xml, the effective
value of the q.op parameter (either in the query, as a default in solrconfig.xml, or from the
'defaultOperator' option in schema.xml) is used to influence the behavior. If q.op is effectively
AND'ed, then mm=100%; if q.op is OR'ed, then mm=1. Users who want to force the legacy
behavior should set a default value for the 'mm' parameter in their solrconfig.xml file. Users
should add this as a configured default for their request handlers. This parameter tolerates
miscellaneous white spaces in expressions (e.g., " 3 < -25% 10 < -3\n", " \n-25%\n",
" \n3\n ").
pf Phrase Fields: boosts the score of documents in cases where all of the terms in the q parameter
appear in close proximity.
ps Phrase Slop: specifies the number of positions two terms can be apart in order to match the
specified phrase.
qs Query Phrase Slop: specifies the number of positions two terms can be apart in order to match
the specified phrase. Used specifically with the qf parameter.
tie Tie Breaker: specifies a float value (which should be something much less than 1) to use as
tiebreaker in DisMax queries.
bq Boost Query: specifies a factor by which a term or phrase should be "boosted" in importance
when considering a match.
bf Boost Functions: specifies functions to be applied to boosts. (See Function Queries below for details about
function queries.)
The sections below explain these parameters in detail.
The q Parameter
The q parameter defines the main "query" constituting the essence of the search. The parameter supports raw
input strings provided by users with no special escaping. The + and - characters are treated as "mandatory" and
"prohibited" modifiers for terms. Text wrapped in balanced quote characters (for example, "San Jose") is treated
as a phrase. Any query containing an odd number of quote characters is evaluated as if there were no quote
characters at all.
The q.alt Parameter
If specified, the q.alt parameter defines a query (which by default will be parsed using standard query parsing
syntax) when the main q parameter is not specified or is blank. The q.alt parameter comes in handy when you
need something like a query to match all documents (don't forget &rows=0 for that one!) in order to get
collection-wise faceting counts.
The qf (Query Fields) Parameter
The qf parameter introduces a list of fields, each of which is assigned a boost factor to increase or decrease that
particular field's importance in the query. For example, the query below:
The q parameter does not support wildcard characters such as *.
qf="fieldOne^2.3 fieldTwo fieldThree^0.4"
assigns fieldOne a boost of 2.3, leaves fieldTwo with the default boost (because no boost factor is
specified), and gives fieldThree a boost of 0.4. These boost factors make matches in fieldOne much more
significant than matches in fieldTwo, which in turn are much more significant than matches in fieldThree.
The mm (Minimum Should Match) Parameter
When processing queries, Lucene/Solr recognizes three types of clauses: mandatory, prohibited, and "optional"
(also known as "should" clauses). By default, all words or phrases specified in the q parameter are treated as
"optional" clauses unless they are preceded by a "+" or a "-". When dealing with these "optional" clauses, the mm
parameter makes it possible to say that a certain minimum number of those clauses must match. The DisMax
query parser offers great flexibility in how the minimum number can be specified.
The table below explains the various ways that mm values can be specified.
Positive integer (example: 3): Defines the minimum number of clauses that must match, regardless of how many
clauses there are in total.
Negative integer (example: -2): Sets the minimum number of matching clauses to the total number of optional
clauses, minus this value.
Percentage (example: 75%): Sets the minimum number of matching clauses to this percentage of the total
number of optional clauses. The number computed from the percentage is rounded down and used as the
minimum.
Negative percentage (example: -25%): Indicates that this percent of the total number of optional clauses can be
missing. The number computed from the percentage is rounded down, before being subtracted from the total to
determine the minimum number.
An expression beginning with a positive integer followed by a > or < sign and another value (example: 3<90%):
Defines a conditional expression indicating that if the number of optional clauses is equal to (or less than) the
integer, they are all required, but if it's greater than the integer, the specification applies. In this example: if there
are 1 to 3 clauses they are all required, but for 4 or more clauses only 90% are required.
Multiple conditional expressions involving > or < signs (example: 2<-25% 9<-3): Defines multiple conditions,
each one being valid only for numbers greater than the one before it. In this example, if there are 1 or 2 clauses,
then both are required. If there are 3-9 clauses all but 25% are required. If there are more than 9 clauses, all but
three are required.
When specifying mm values, keep in mind the following:
When dealing with percentages, negative values can be used to get different behavior in edge cases. 75%
and -25% mean the same thing when dealing with 4 clauses, but when dealing with 5 clauses 75% means
3 are required, but -25% means 4 are required.
If the calculations based on the parameter arguments determine that no optional clauses are needed, the
usual rules about Boolean queries still apply at search time. (That is, a Boolean query containing no
required clauses must still match at least one optional clause).
No matter what number the calculation arrives at, Solr will never use a value greater than the number of
optional clauses, or a value less than 1. In other words, no matter how low or how high the calculated
result, the minimum number of required matches will never be less than 1 or greater than the number of
clauses.
When searching across multiple fields that are configured with different query analyzers, the number of
optional clauses may differ between the fields. In such a case, the value specified by mm applies to the
maximum number of optional clauses. For example, if a query clause is treated as stopword for one of the
fields, the number of optional clauses for that field will be smaller than for the other fields. A query with
such a stopword clause would not return a match in that field if mm is set to 100% because the removed
clause does not count as matched.
The default value of mm is 100% (meaning that all clauses must match).
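For example, with the hypothetical specification mm=2<-1, a two-word query requires both clauses to match,
while the three-word query below only requires two of its three clauses to match:
http://localhost:8983/solr/techproducts/select?defType=dismax&q=belkin+ipod+gibberish&mm=2<-1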
The pf (Phrase Fields) Parameter
Once the list of matching documents has been identified using the fq and qf parameters, the pf parameter can
be used to "boost" the score of documents in cases where all of the terms in the q parameter appear in close
proximity.
The format is the same as that used by the qf parameter: a list of fields and "boosts" to associate with each of
them when making phrase queries out of the entire q parameter.
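For example, with hypothetical boosts on the name and features fields from the techproducts example:
pf=name^20 features^5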
The ps (Phrase Slop) Parameter
The ps parameter specifies the amount of "phrase slop" to apply to queries specified with the pf parameter.
Phrase slop is the number of positions one token needs to be moved in relation to another token in order to
match a phrase specified in a query.
The qs (Query Phrase Slop) Parameter
The qs parameter specifies the amount of slop permitted on phrase queries explicitly included in the user's query
string with the qf parameter. As explained above, slop refers to the number of positions one token needs to be
moved in relation to another token in order to match a phrase specified in a query.
The tie (Tie Breaker) Parameter
The tie parameter specifies a float value (which should be something much less than 1) to use as tiebreaker in
DisMax queries.
When a term from the user's input is tested against multiple fields, more than one field may match. If so, each
field will generate a different score based on how common that word is in that field (for each document relative to
all other documents). The tie parameter lets you control how much the final score of the query will be
influenced by the scores of the lower scoring fields compared to the highest scoring field.
A value of "0.0" makes the query a pure "disjunction max query": that is, only the maximum scoring subquery
contributes to the final score. A value of "1.0" makes the query a pure "disjunction sum query" where it doesn't
matter what the maximum scoring sub query is, because the final score will be the sum of the subquery scores.
Typically a low value, such as 0.1, is useful.
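For example, the following hypothetical request searches two fields and lets the lower scoring field contribute a
small amount to the final score:
http://localhost:8983/solr/techproducts/select?defType=dismax&q=ipod&qf=name^2+features&tie=0.1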
The bq (Boost Query) Parameter
The bq parameter specifies an additional, optional, query clause that will be added to the user's main query to
influence the score. For example, if you wanted to add a relevancy boost for recent documents:
q=cheese
bq=date:[NOW/DAY-1YEAR TO NOW/DAY]
You can specify multiple bq parameters. If you want your query to be parsed as separate clauses with separate
boosts, use multiple bq parameters.
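For example, the following hypothetical pair of parameters adds one boost for electronics documents and a
separate boost for documents that are in stock:
bq=cat:electronics^5.0
bq=inStock:true^10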
The bf (Boost Functions) Parameter
The bf parameter specifies functions (with optional boosts) that will be used to construct FunctionQueries which
will be added to the user's main query as optional clauses that will influence the score. Any function supported
natively by Solr can be used, along with a boost value. For example:
recip(rord(myfield),1,2,3)^1.5
Specifying functions with the bf parameter is essentially just shorthand for using the bq param combined with
the {!func} parser.
For example, if you want to show the most recent documents first, you could use either of the following:
bf=recip(rord(creationDate),1,1000,1000)
...or...
bq={!func}recip(rord(creationDate),1,1000,1000)
Examples of Queries Submitted to the DisMax Query Parser
All of the sample URLs in this section assume you are running Solr's "techproducts" example:
bin/solr -e techproducts
Normal results for the word "video" using the StandardRequestHandler with the default search field:
http://localhost:8983/solr/techproducts/select?q=video&fl=name+score
The "dismax" handler is configured to search across the text, features, name, sku, id, manu, and cat fields all
with varying boosts designed to ensure that "better" matches appear first, specifically: documents which match
on the name and cat fields get higher scores.
http://localhost:8983/solr/techproducts/select?defType=dismax&q=video
Note that this instance is also configured with a default field list, which can be overridden in the URL.
http://localhost:8983/solr/techproducts/select?defType=dismax&q=video&fl=*,scor
e
You can also override which fields are searched on and how much boost each field gets.
http://localhost:8983/solr/techproducts/select?defType=dismax&q=video&qf=featur
es^20.0+text^0.3
You can boost results that have a field that matches a specific value.
http://localhost:8983/solr/techproducts/select?defType=dismax&q=video&bq=cat:el
ectronics^5.0
Another instance of the handler is registered using the qt "instock" and has slightly different configuration
options, notably: a filter for (you guessed it) inStock:true.
http://localhost:8983/solr/techproducts/select?defType=dismax&q=video&fl=name,s
core,inStock
http://localhost:8983/solr/techproducts/select?defType=dismax&q=video&qt=instoc
k&fl=name,score,inStock
One of the other really cool features in this handler is robust support for specifying the
"BooleanQuery.minimumNumberShouldMatch" you want to be used based on how many terms are in your user's
query. This allows flexibility for typos and partial matches. For the dismax handler, one and two word queries
require that all of the optional clauses match, but for three to five word queries one missing word is allowed.
http://localhost:8983/solr/techproducts/select?defType=dismax&q=belkin+ipod
http://localhost:8983/solr/techproducts/select?defType=dismax&q=belkin+ipod+gib
berish
http://localhost:8983/solr/techproducts/select?defType=dismax&q=belkin+ipod+app
le
Just like the StandardRequestHandler, it supports the debugQuery option for viewing the parsed query and the
score explanations for each document.
http://localhost:8983/solr/techproducts/select?defType=dismax&q=belkin+ipod+gib
berish&debugQuery=true
http://localhost:8983/solr/techproducts/select?defType=dismax&q=video+card&debu
gQuery=true
The Extended DisMax Query Parser
The Extended DisMax (eDisMax) query parser is an improved version of the DisMax query parser. In addition to
supporting all the DisMax query parser parameters, Extended Dismax:
supports the full Lucene query parser syntax.
supports queries such as AND, OR, NOT, -, and +.
treats "and" and "or" as "AND" and "OR" in Lucene syntax mode.
respects the 'magic field' names _val_ and _query_. These are not real fields in schema.xml, but if
used they help do special things (like a function query in the case of _val_ or a nested query in the case of
_query_). If _val_ is used in a term or phrase query, the value is parsed as a function.
includes improved smart partial escaping in the case of syntax errors; fielded queries, +/-, and phrase
queries are still supported in this mode.
improves proximity boosting by using word shingles; you do not need the query to match all words in the
document before proximity boosting is applied.
includes advanced stopword handling: stopwords are not required in the mandatory part of the query but
are still used in the proximity boosting part. If a query consists of all stopwords, such as "to be or not to
be", then all words are required.
includes an improved boost function: in Extended DisMax, the boost function is a multiplier rather than an
addend, improving your boost results; the additive boost functions of DisMax (bf and bq) are also
supported.
supports pure negative nested queries: queries such as +foo (-foo) will match all documents.
lets you specify which fields the end user is allowed to query, and to disallow direct fielded searches.
Extended DisMax Parameters
In addition to all the DisMax parameters, Extended DisMax includes these query parameters:
The boost Parameter
A multivalued list of strings parsed as queries with scores multiplied by the score from the main query for all
matching documents. This parameter is shorthand for wrapping the query produced by eDisMax using the
BoostQParserPlugin.
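For example, the following hypothetical request multiplies each document's score by a function of the popularity
field from the techproducts example:
http://localhost:8983/solr/techproducts/select?defType=edismax&q=ipod&boost=sum(popularity,1)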
The lowercaseOperators Parameter
A Boolean parameter indicating if lowercase "and" and "or" should be treated the same as operators "AND" and
"OR".
The ps Parameter
Default amount of slop on phrase queries built with pf, pf2, and/or pf3 fields (affects boosting).
The pf2 Parameter
A multivalued list of fields with optional weights, based on pairs of word shingles.
The ps2 Parameter
This is similar to ps but overrides the slop factor used for pf2. If not specified, ps is used.
The pf3 Parameter
A multivalued list of fields with optional weights, based on triplets of word shingles. Similar to pf, except that
instead of building a phrase per field out of all the words in the input, it builds a set of phrases for each field out
of each triplet of word shingles.
The ps3 Parameter
This is similar to ps but overrides the slop factor used for pf3. If not specified, ps is used.
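As a sketch (the field and boost choices are hypothetical), the following parameters boost documents where
adjacent word pairs from the query appear in the name field within one position of each other, and word triples
appear within two positions:
defType=edismax
q=samsung hard drive
pf2=name^10
ps2=1
pf3=name^5
ps3=2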
The stopwords Parameter
A Boolean parameter indicating if the StopFilterFactory configured in the query analyzer should be
respected when parsing the query: if it is false, then the StopFilterFactory in the query analyzer is ignored.
The uf Parameter
Specifies which schema fields the end user is allowed to explicitly query. This parameter supports wildcards. The
default is to allow all fields, equivalent to uf=*. To allow only the title field, use uf=title. To allow the title field
and all fields ending with _s, use uf=title,*_s. To allow all fields except title, use uf=*-title. To disallow all
fielded searches, use uf=-*.
Field aliasing using per-field qf overrides
Per-field overrides of the qf parameter may be specified to provide 1-to-many aliasing from field names specified
in the query string, to field names used in the underlying query. By default, no aliasing is used and field names
specified in the query string are treated as literal field names in the index.
Examples of Queries Submitted to the Extended DisMax Query Parser
All of the sample URLs in this section assume you are running Solr's "techproducts" example:
bin/solr -e techproducts
Boost the result of the query term "hello" based on the document's popularity:
http://localhost:8983/solr/techproducts/select?defType=edismax&q=hello&pf=text&qf=te
xt&boost=popularity
Search for iPods OR video:
http://localhost:8983/solr/techproducts/select?defType=edismax&q=ipod+OR+video
Search across multiple fields, specifying (via boosts) how important each field is relative to the others:
http://localhost:8983/solr/techproducts/select?q=video&defType=edismax&qf=features^2
0.0+text^0.3
You can boost results that have a field that matches a specific value:
http://localhost:8983/solr/techproducts/select?q=video&defType=edismax&qf=features^2
0.0+text^0.3&bq=cat:electronics^5.0
Using the "mm" param, 1 and 2 word queries require that all of the optional clauses match, but for queries with
three or more clauses one missing clause is allowed:
http://localhost:8983/solr/techproducts/select?q=belkin+ipod&defType=edismax&mm=2
http://localhost:8983/solr/techproducts/select?q=belkin+ipod+gibberish&defType=edism
ax&mm=2
http://localhost:8983/solr/techproducts/select?q=belkin+ipod+apple&defType=edismax&m
m=2
In the example below, we see a per-field override of the qf parameter being used to alias "name" in the query
string to either the last_name or first_name fields:
defType=edismax
q=sysadmin name:Mike
qf=title text last_name first_name
f.name.qf=last_name first_name
Using negative boost
Negative query boosts have been supported at the "Query" object level for a long time (resulting in negative
scores for matching documents). Now the QueryParsers have been updated to handle this too.
Using 'slop'
Dismax and Edismax can run queries against all query fields, and also run a query in the form of a phrase
against the phrase fields. (This will work only for boosting documents, not actually for matching.) However, that
phrase query can have a 'slop,' which is the distance between the terms of the query while still considering it a
phrase match. For example:
q=foo bar
qf=field1^5 field2^10
pf=field1^50 field2^20
defType=dismax
With these parameters, the Dismax Query Parser generates a query that looks something like this:
(+(field1:foo^5 OR field2:foo^10) AND (field1:bar^5 OR field2:bar^10))
But it also generates another query that will only be used for boosting results:
field1:"foo bar"^50 OR field2:"foo bar"^20
Thus, any document that has the terms "foo" and "bar" will match; however if some of those documents have
both of the terms as a phrase, it will score much higher because it's more relevant.
If you add the ps (phrase slop) parameter, the second query will instead be:
ps=10
field1:"foo bar"~10^50 OR field2:"foo bar"~10^20
This means that if the terms "foo" and "bar" appear in the document with less than 10 terms between each other,
the phrase will match. For example the doc that says:
*Foo* term1 term2 term3 *bar*
will match the phrase query.
How does one use phrase slop? Usually it is configured in the request handler (in solrconfig).
With query slop (qs) the concept is similar, but it applies to explicit phrase queries from the user. For example, if
you want to search for a name, you could enter:
q="Hans Anderson"
A document that contains "Hans Anderson" will match, but a document that contains the middle name "Christian"
or where the name is written with the last name first ("Anderson, Hans") won't. For those cases one could
configure the query field qs, so that even if the user searches for an explicit phrase query, a slop is applied.
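For instance (a hypothetical configuration), adding a query slop of 1 would let the query above also match the
document containing the middle name:
q="Hans Anderson"
qs=1
defType=edismax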
Finally, edismax contains not only the phrase fields (pf) parameter, but also phrase and query fields 2 and 3.
You can use those fields for setting different fields or boosts. Each of those can use a different phrase slop.
Using the 'magic fields' _val_ and _query_
If the 'magic field' name _val_ is used in a term or phrase query, the value is parsed as a function.
The Solr Query Parser's use of _val_ and _query_ differs from the Lucene Query Parser in the following ways:
If the magic field name _val_ is used in a term or phrase query, the value is parsed as a function.
It provides a hook into FunctionQuery syntax. Quotes are necessary to encapsulate the function when
it includes parentheses. For example:
_val_:myfield
_val_:"recip(rord(myfield),1,2,3)"
The Solr Query Parser offers nested query support for any type of query parser (via QParserPlugin).
Quotes are often necessary to encapsulate the nested query if it contains reserved characters. For
example:
_query_:"{!dismax qf=myfield}how now brown cow"
Although not technically a syntax difference, note that if you use the Solr TrieDateField type (or the
deprecated DateField type), any queries on those fields (typically range queries) should use either the
Complete ISO 8601 Date syntax that field supports, or the DateMath Syntax to get relative dates. For example:
timestamp:[* TO NOW]
createdate:[1976-03-06T23:59:59.999Z TO *]
createdate:[1995-12-31T23:59:59.999Z TO 2007-03-06T00:00:00Z]
pubdate:[NOW-1YEAR/DAY TO NOW/DAY+1DAY]
createdate:[1976-03-06T23:59:59.999Z TO 1976-03-06T23:59:59.999Z+1YEAR]
createdate:[1976-03-06T23:59:59.999Z/YEAR TO 1976-03-06T23:59:59.999Z]
Function Queries
Function queries enable you to generate a relevancy score using the actual value of one or more numeric fields.
Function queries are supported by the DisMax, Extended DisMax, and standard query parsers.
Function queries use functions. The functions can be a constant (numeric or string literal), a field, another
function or a parameter substitution argument. You can use these functions to modify the ranking of results for
users. These could be used to change the ranking of results based on a user's location, or some other
calculation.
Function query topics covered in this section:
Using Function Query
Available Functions
Example Function Queries
Sort By Function
TO must be uppercase, or Solr will report a 'Range Group' error.
Related Topics
Using Function Query
Functions must be expressed as function calls (for example, sum(a,b) instead of simply a+b).
There are several ways of using function queries in a Solr query:
Via an explicit QParser that expects function arguments, such as func or frange. For example:
q={!func}div(popularity,price)&fq={!frange l=1000}customer_ratings
In a Sort expression. For example:
sort=div(popularity,price) desc, score desc
Add the results of functions as pseudo-fields to documents in query results. For instance, for:
&fl=sum(x, y),id,a,b,c,score
the output would be:
...
<str name="id">foo</str>
<float name="sum(x,y)">40</float>
<float name="score">0.343</float>
...
Use in a parameter that is explicitly for specifying functions, such as the EDisMax query parser's boost
param, or the DisMax query parser's bf (boost function) parameter. (Note that the bf parameter actually takes
a list of function queries separated by white space, each with an optional boost. Make sure you eliminate any
internal white space in single function queries when using bf.) For example:
q=dismax&bf="ord(popularity)^0.5 recip(rord(price),1,1000,1000)^0.3"
Introduce a function query inline in the lucene QParser with the _val_ keyword. For example:
q=_val_:mynumericfield _val_:"recip(rord(myfield),1,2,3)"
Only functions with fast random access are recommended.
Available Functions
The table below summarizes the functions available for function queries.
Function Description Syntax Examples
abs Returns the absolute value of the specified
value or function.
abs(x)
abs(-5)
and Returns a value of true if and only if all of its operands evaluate to true.
and(not(exists(popularity)),exists(price)): returns true for any document which has a value in the
price field, but does not have a value in the popularity field
"constant" Specifies a floating point constant. 1.5
def def is short for default. Returns the value of field "field", or if the field does not exist, returns the default
value specified, and yields the first value where exists()==true.
def(rating,5): This def() function returns the rating, or if no rating is specified in the doc, returns 5.
def(myfield, 1.0): equivalent to if(exists(myfield),myfield,1.0)
div Divides one value or function by another.
div(x,y) divides x by y.
div(1,y)
div(sum(x,100),max(y,1))
dist Return the distance between two vectors
(points) in an n-dimensional space. Takes in
the power, plus two or more ValueSource
instances and calculates the distances
between the two vectors. Each ValueSource
must be a number. There must be an even
number of ValueSource instances passed in
and the method assumes that the first half
represent the first vector and the second half
represent the second vector.
dist(2, x, y, 0, 0): calculates the Euclidean distance between (0,0) and (x,y) for each document
dist(1, x, y, 0, 0): calculates the Manhattan (taxicab) distance between (0,0) and (x,y) for each document
dist(2, x,y,z,0,0,0): Euclidean distance between (0,0,0) and (x,y,z) for each document
dist(1,x,y,z,e,f,g): Manhattan distance between (x,y,z) and (e,f,g) where each letter is a field name
docfreq(field,val) Returns the number of documents that
contain the term in the field. This is a
constant (the same value for all documents
in the index).
You can quote the term if it's more complex,
or do parameter substitution for the term
value.
docfreq(text,'solr')
...&defType=func
&q=docfreq(text,$myterm)
&myterm=solr
exists Returns TRUE if any member of the field
exists.
exists(author) returns TRUE for any document that has a value in the "author" field.
exists(query(price:5.00)) returns TRUE if "price" matches "5.00".
field Returns the numeric field value of an
indexed (not multi-valued) field with a
maximum of one value per document. The field() function can be called using the name of the field as a
string, or for most conventional field names simply use the field name by itself.
0 is returned for documents without a value
in the field.
myFloatFieldName
field("my complex float
fieldName")
hsin The Haversine distance calculates the
distance between two points on a sphere
when traveling along the sphere. The values
must be in radians. hsin also takes a Boolean argument to specify whether the function should convert its
output to radians.
hsin(2, true, x, y, 0, 0)
idf Inverse document frequency; a measure of
whether the term is common or rare across
all documents. Obtained by dividing the total
number of documents by the number of
documents containing the term, and then
taking the logarithm of that quotient. See
also .tf
idf(fieldName,'solr'): measures the inverse of the frequency of the occurrence of the term 'solr' in
fieldName.
if Enables conditional function queries. In if(test,value1,value2):
test is or refers to a logical value or
expression that returns a logical value
(TRUE or FALSE).
value1 is the value that is returned by the function if test yields TRUE.
value2 is the value that is returned by the function if test yields FALSE.
An expression can be any function which
outputs boolean values, or even functions
returning numeric values, in which case
value 0 will be interpreted as false, or
strings, in which case empty string is
interpreted as false.
if(termfreq(cat,'electronics'),popularity,42):
This function checks each document to see if it contains the term "electronics" in the cat field. If it does, then
the value of the popularity field is returned, otherwise the value of 42 is returned.
linear Implements m*x+c where m and c are constants and x is an arbitrary function. This is equivalent to
sum(product(m,x),c), but slightly more efficient as it is implemented as a single function.
linear(x,m,c)
linear(x,2,4) returns 2*x+4
log Returns the log base 10 of the specified
function.
log(x)
log(sum(x,100))
map Maps any values of an input function x that
fall within min and max inclusive to the
specified target. The arguments min and
max must be constants. The arguments target and default can be constants or functions. If the value of x
does not fall between min and max, then either the value of x is returned, or a default value is returned if
specified as a 5th argument.
map(x,min,max,target)
map(x,0,0,1) - changes any values of 0 to 1. This can be useful in handling default 0 values.
map(x,min,max,target,default)
map(x,0,100,1,-1) - changes any values between 0 and 100 to 1, and all other values to -1.
map(x,0,100,sum(x,599),docfreq(text,solr)) - changes any values between 0 and 100 to x+599, and
all other values to the frequency of the term 'solr' in the field text.
max Returns the max of another function and a
constant, which are specified as arguments:
. The max function is useful formax(x,c)
"bottoming out" another function at some
constant.
max(myfield,0)
maxdoc Returns the number of documents in the
index, including those that are marked as
deleted but have not yet been purged. This
is a constant (the same value for all
documents in the index).
maxdoc()
ms Returns milliseconds of difference between
its arguments. Dates are relative to the Unix
or POSIX time epoch, midnight, January 1,
1970 UTC. Arguments may be the name of
an indexed TrieDateField, or date math based on a constant date or NOW.
ms(): Equivalent to ms(NOW), the number of milliseconds since the epoch.
ms(a): Returns the number of
milliseconds since the epoch that the
argument represents.
ms(a,b) : Returns the number of
milliseconds that b occurs before a (that
is, a - b)
ms(NOW/DAY)
ms(2000-01-01T00:00:00Z)
ms(mydatefield)
ms(NOW,mydatefield)
ms(mydatefield,2000-01-01T00:00
:00Z)
ms(datefield1,datefield2)
norm(field) Returns the "norm" stored in the index for the specified field. This is the product of the index time
boost and the length normalization factor, according to the Similarity for the field.
norm(fieldName)
not The logically negated value of the wrapped
function.
not(exists(author)): TRUE only when exists(author) is false.
numdocs Returns the number of documents in the
index, not including those that are marked as
deleted but have not yet been purged. This
is a constant (the same value for all
documents in the index).
numdocs()
or A logical disjunction. or(value1,value2): TRUE if either value1 or value2 is true.
ord Returns the ordinal of the indexed field value
within the indexed list of terms for that field
in Lucene index order (lexicographically
ordered by unicode value), starting at 1. In
other words, for a given field, all values are
ordered lexicographically; this function then
returns the offset of a particular value in that
ordering. The field must have a maximum of
one value per document (not multi-valued). 0
is returned for documents without a value in
the field.
See also rord below.
ord(myIndexedField)
Example: If there were only three values ("apple","banana","pear") for a particular field X, then ord(X) would
be 1 for documents containing "apple", 2 for documents containing "banana", etc.
pow Raises the specified base to the specified power. pow(x,y) raises x to the power of y.
pow(x,y)
pow(x,log(y))
pow(x,0.5): the same as sqrt
product Returns the product of multiple values or
functions, which are specified in a
comma-separated list. mul(...) may also be used as an alias for this function.
product(x,y,...)
product(x,2)
product(x,y)
mul(x,y)
ord() depends on the position in an
index and can change when other
documents are inserted or deleted.
query Returns the score for the given subquery, or
the default value for documents not
matching the query. Any type of subquery is
supported through either parameter de-referencing $otherparam or direct specification of the query string in
the Local Parameters through the v key.
query(subquery, default)
q=product(popularity, query({!dismax v='solr rocks'})): returns the product of the popularity
and the score of the DisMax query.
q=product(popularity, query($qq))&qq={!dismax}solr rocks: equivalent to the previous query,
using parameter de-referencing.
q=product(popularity, query($qq,0.1))&qq={!dismax}solr rocks: specifies a default score of 0.1
for documents that don't match the DisMax query.
recip Performs a reciprocal function with recip(x,m,a,b) implementing a/(m*x+b) where m,a,b are
constants and x is any arbitrarily complex function.
When a and b are equal, and x>=0, this
function has a maximum value of 1 that
drops as x increases. Increasing the value of
a and b together results in a movement of
the entire function to a flatter part of the
curve. These properties can make this an
ideal function for boosting more recent
documents when x is rord(datefield).
recip(myfield,m,a,b)
recip(rord(creationDate),1,1000
,1000)
rord Returns the reverse ordering of that returned
by ord.
rord(myDateField)
scale Scales values of the function x such that
they fall between the specified minTarget
and maxTarget inclusive. The current
implementation traverses all of the function
values to obtain the min and max, so it can
pick the correct scale.
The current implementation cannot
distinguish when documents have been
deleted or documents that have no value. It
uses 0.0 values for these cases. This means
that if values are normally all greater than
0.0, one can still end up with 0.0 as the min
value to map from. In these cases, an
appropriate map() function could be used as
a workaround to change 0.0 to a value in the
real range, as shown here:
scale(map(x,0,0,5),1,2)
scale(x,minTarget,maxTarget)
scale(x,1,2): scales the values of x such that all values will be between 1 and 2 inclusive.
sqedist The Square Euclidean distance calculates
the 2-norm (Euclidean distance) but does
not take the square root, thus saving a fairly
expensive operation. It is often the case that
applications that care about Euclidean
distance do not need the actual distance, but
instead can use the square of the distance.
There must be an even number of
ValueSource instances passed in and the
method assumes that the first half represent
the first vector and the second half represent
the second vector.
sqedist(x_td, y_td, 0, 0)
sqrt Returns the square root of the specified
value or function.
sqrt(x) sqrt(100) sqrt(sum(x,100))
strdist Calculate the distance between two strings.
Uses the Lucene spell checker StringDistance interface and supports all of the implementations available in
that package, plus allows applications to plug in their own via Solr's resource loading capabilities. strdist
takes (string1, string2, distance measure). Possible values for distance measure are:
jw: Jaro-Winkler
edit: Levenshtein or Edit distance
ngram: The NGramDistance, if specified,
can optionally pass in the ngram size too.
Default is 2.
FQN: Fully Qualified class Name for an
implementation of the StringDistance
interface. Must have a no-arg constructor.
strdist("SOLR",id,edit)
sub Returns x-y from sub(x,y). sub(myfield,myfield2)
sub(100,sqrt(myfield))
sum Returns the sum of multiple values or
functions, which are specified in a
comma-separated list. add(...) may be used as an alias for this function.
sum(x,y,...) sum(x,1)
sum(x,y)
sum(sqrt(x),log(y),z,0.5)
add(x,y)
sumtotaltermfreq Returns the sum of totaltermfreq values for all terms in the field in the entire index (i.e.,
the number of indexed tokens for that field). (Aliases sumtotaltermfreq to sttf.)
If doc1:(fieldX:A B C) and doc2:(fieldX:A A A A):
docFreq(fieldX:A) = 2 (A appears in 2 docs)
freq(doc2, fieldX:A) = 4 (A appears 4 times in doc 2)
totalTermFreq(fieldX:A) = 5 (A appears 5 times across all docs)
sumTotalTermFreq(fieldX) = 7 (in fieldX, there are 5 As, 1 B, 1 C)
termfreq: Returns the number of times the term appears in the field for that document.
Syntax example: termfreq(text,'memory')
tf: Term frequency; returns the term frequency factor for the given term, using the Similarity for the field. The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word across the index, which helps to control for the fact that some words are generally more common than others. See also idf.
Syntax example: tf(text,'solr')
top: Causes the function query argument to derive its values from the top-level IndexReader containing all parts of an index. For example, the ordinal of a value in a single segment will be different from the ordinal of that same value in the complete index.
The ord() and rord() functions implicitly use top(), and hence ord(foo) is equivalent to top(ord(foo)).
totaltermfreq: Returns the number of times the term appears in the field in the entire index. (Aliases totaltermfreq to ttf.)
Syntax example: ttf(text,'memory')
xor(): Logical exclusive disjunction, or one or the other but not both.
xor(field1,field2) returns TRUE if either field1 or field2 is true; FALSE if both are true.
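A hypothetical syntax example (the field names are made up; the exists function is used to produce boolean inputs): xor(exists(popularity),exists(price))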
Example Function Queries
To give you a better understanding of how function queries can be used in Solr, suppose an index stores the dimensions in meters x,y,z of some hypothetical boxes with arbitrary names stored in field boxname. Suppose we want to search for boxes matching name findbox but ranked according to volumes of boxes. The query parameters would be:
q=boxname:findbox _val_:"product(x,y,z)"
This query will rank the results based on volumes. In order to get the computed volume, you will need to request the score, which will contain the resultant volume:
&fl=*, score
Suppose that you also have a field storing the weight of the box as weight. To sort by the density of the box and return the value of the density in score, you would submit the following query:
http://localhost:8983/solr/collection_name/select?q=boxname:findbox
_val_:"div(weight,product(x,y,z))"&fl=boxname x y z weight score
Sort By Function
You can sort your query results by the output of a function. For example, to sort results by distance, you could
enter:
http://localhost:8983/solr/collection_name/select?q=*:*&sort=dist(2, point1, point2)
desc
Sort by function also supports pseudo-fields: fields can be generated dynamically and return results as though they were normal fields in the index. For example,
&fl=id,sum(x, y),score
would return:
<str name="id">foo</str>
<float name="sum(x,y)">40</float>
<float name="score">0.343</float>
Related Topics
FunctionQuery
Local Parameters in Queries
Local parameters are arguments in a Solr request that are specific to a query parameter. Local parameters
provide a way to add meta-data to certain argument types such as query strings. (In Solr documentation, local
parameters are sometimes referred to as LocalParams.)
Local parameters are specified as prefixes to arguments. Take the following query argument, for example:
q=solr rocks
We can prefix this query string with local parameters to provide more information to the Standard Query Parser.
For example, we can change the default operator type to "AND" and the default field to "title":
q={!q.op=AND df=title}solr rocks
These local parameters would change the query to require a match on both "solr" and "rocks" while searching
the "title" field by default.
Basic Syntax of Local Parameters
To specify a local parameter, insert the following before the argument to be modified:
Begin with {!
Insert any number of key=value pairs separated by white space
End with } and immediately follow with the query argument
You may specify only one local parameters prefix per argument. Values in the key-value pairs may be quoted via
single or double quotes, and backslash escaping works within quoted strings.
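For example, a quoted value can be written as follows (a minimal illustration; the quotes here are optional since the value contains no spaces):
q={!q.op=AND df='title'}solr rocks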
Query Type Short Form
If a local parameter value appears without a name, it is given the implicit name of "type". This allows short-form
representation for the type of query parser to use when parsing a query string. Thus
q={!dismax qf=myfield}solr rocks
is equivalent to:
q={!type=dismax qf=myfield}solr rocks
If no "type" is specified (either explicitly or implicitly) then the is used by default. Thuslucene parser
fq={!df=summary}solr rocks
is equivilent to:
fq={!type=lucene df=summary}solr rocks
Specifying the Parameter Value with the 'v' Key
A special key of v within local parameters is an alternate way to specify the value of that parameter.
q={!dismax qf=myfield}solr rocks
is equivalent to
q={!type=dismax qf=myfield v='solr rocks'}
Parameter Dereferencing
Parameter dereferencing or indirection lets you use the value of another argument rather than specifying it
directly. This can be used to simplify queries, decouple user input from query parameters, or decouple front-end
GUI parameters from defaults set in solrconfig.xml.
q={!dismax qf=myfield}solr rocks
is equivalent to:
q={!type=dismax qf=myfield v=$qq}&qq=solr rocks
Other Parsers
In addition to the main query parsers discussed earlier, there are several other query parsers that can be used
instead of or in conjunction with the main parsers for specific purposes. This section details the other parsers,
and gives examples for how they might be used.
Many of these parsers are expressed the same way as Local Parameters in Queries.
Query parsers discussed in this section:
Block Join Query Parsers
Boost Query Parser
Collapsing Query Parser
Complex Phrase Query Parser
Field Query Parser
Function Query Parser
Function Range Query Parser
Join Query Parser
Lucene Query Parser
Max Score Query Parser
More Like This Query Parser
Nested Query Parser
Old Lucene Query Parser
Prefix Query Parser
Raw Query Parser
Re-Ranking Query Parser
Simple Query Parser
Spatial Filter Query Parser
Surround Query Parser
Switch Query Parser
Term Query Parser
Terms Query Parser
Block Join Query Parsers
There are two query parsers that support block joins. These parsers allow indexing and searching for relational content that has been indexed as nested documents.
The example usage of the query parsers below assumes these two documents and each of their child
documents have been indexed:
<add>
<doc>
<field name="id">1</field>
<field name="title">Solr has block join support</field>
<field name="content_type">parentDocument</field>
<doc>
<field name="id">2</field>
<field name="comments">SolrCloud supports it too!</field>
</doc>
</doc>
<doc>
<field name="id">3</field>
<field name="title">New Lucene and Solr release</field>
<field name="content_type">parentDocument</field>
<doc>
<field name="id">4</field>
<field name="comments">Lots of new features</field>
</doc>
</doc>
</add>
Block Join Children Query Parser
This parser takes a query that matches some parent documents and returns their children. The syntax for this parser is: q={!child of=<allParents>}<someParents>. The parameter allParents is a filter that matches only parent documents; here you would define the field and value that you used to identify a document as a parent. The parameter someParents identifies a query that will match some or all of the parent documents. The output is the children.
Using the example documents above, we can construct a query such as q={!child of="content_type:parentDocument"}title:lucene. We only get one document in response:
<result name="response" numFound="1" start="0">
<doc>
<str name="id">4</str>
<str name="comments">Lots of new features</str>
</doc>
</result>
Block Join Parent Query Parser
This parser takes a query that matches child documents and returns their parents. The syntax for this parser is similar: q={!parent which=<allParents>}<someChildren>. Again, the parameter allParents is a filter that matches only parent documents; here you would define the field and value that you used to identify a document as a parent. The parameter someChildren is a query that matches some or all of the child documents. Note that the query for someChildren should match only child documents or you may get an exception.
Again using the example documents above, we can construct a query such as q={!parent which="content_type:parentDocument"}comments:SolrCloud. We get this document in response:
<result name="response" numFound="1" start="0">
<doc>
<str name="id">1</str>
<arr name="title"><str>Solr has block join support</str></arr>
<arr name="content_type"><str>parentDocument</str></arr>
</doc>
</result>
Boost Query Parser
BoostQParser extends the QParserPlugin and creates a boosted query from the input value. The main value is the query to be boosted. Parameter b is the function query to use as the boost. The query to be boosted may be of any type.
Examples:
Creates a query "foo" which is boosted (scores are multiplied) by the function query log(popularity):
{!boost b=log(popularity)}foo
Creates a query "foo" which is boosted by the date boosting function referenced in ReciprocalFloatFunction:
{!boost b=recip(ms(NOW,mydatefield),3.16e-11,1,1)}foo
Collapsing Query Parser
The CollapsingQParser is really a post filter that provides more performant field collapsing than Solr's standard approach when the number of distinct groups in the result set is high. This parser collapses the result set to a single document per group before it forwards the result set to the rest of the search components. So all downstream components (faceting, highlighting, etc...) will work with the collapsed result set.
Details about using the CollapsingQParser can be found in the section Collapse and Expand Results.
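As a minimal sketch (the field name group_id is hypothetical), the parser is typically applied as a filter query that collapses results on a single-valued field:
fq={!collapse field=group_id}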
Complex Phrase Query Parser
The ComplexPhraseQParser provides support for wildcards, ORs, etc., inside phrase queries using Lucene's ComplexPhraseQueryParser. Under the covers, this query parser makes use of the Span group of queries, e.g., spanNear, spanOr, etc., and is subject to the same limitations as that family of parsers.
Parameter Description
inOrder Set to true to force phrase queries to match terms in the order specified. Default: true
df The default search field.
Examples:
{!complexphrase inOrder=true}name:"Jo* Smith"
{!complexphrase inOrder=false}name:"(john jon jonathan~) peters*"
A mix of ordered and unordered complex phrase queries:
+_query_:"{!complexphrase inOrder=true}manu:\"a* c*\"" +_query_:"{!complexphrase
inOrder=false df=name}\"bla* pla*\""
Limitations
Performance is sensitive to the number of unique terms that are associated with a pattern. For instance,
searching for "a*" will form a large OR clause (technically a SpanOr with many terms) for all of the terms in your
index for the indicated field that start with the single letter 'a'. It may be prudent to restrict wildcards to at least
two or preferably three letters as a prefix. Allowing very short prefixes may result in too many low-quality
documents being returned.
MaxBooleanClauses
You may need to increase MaxBooleanClauses in solrconfig.xml as a result of the term expansion above:
<maxBooleanClauses>4096</maxBooleanClauses>
This property is described in more detail in the section Query Sizing and Warming.
Stopwords
It is recommended not to use stopword elimination with this query parser. Let's say we add the words up and to to the stopwords.txt file for your collection, and index a document containing the text "Stores up to 15,000 songs, 25,000 photos, or 150 hours of video" in a field named "features".
While the query below does not use this parser:
q=features:"Stores up to 15,000"
the document is returned. The next query, which does use the Complex Phrase Query Parser, as in this query:
q=features:"sto* up to 15*"&defType=complexphrase
does not return that document because SpanNearQuery has no good way to handle stopwords in a way analogous to PhraseQuery. If you must remove stopwords for your use case, use a custom filter factory or perhaps a customized synonyms filter that reduces given stopwords to some impossible token.
Field Query Parser
The FieldQParser extends the QParserPlugin and creates a field query from the input value, applying text analysis and constructing a phrase query if appropriate. The parameter f is the field to be queried.
Example:
{!field f=myfield}Foo Bar
This example creates a phrase query with "foo" followed by "bar" (assuming the analyzer for myfield is a text field with an analyzer that splits on whitespace and lowercases terms). This is generally equivalent to the Lucene query parser expression myfield:"Foo Bar".
Function Query Parser
The FunctionQParser extends the QParserPlugin and creates a function query from the input value. This is only one way to use function queries in Solr; for another, more integrated, approach, see the section on Function Queries.
Example:
{!func}log(foo)
Function Range Query Parser
The FunctionRangeQParser extends the QParserPlugin and creates a range query over a function. This is also referred to as frange, as seen in the examples below.
Other parameters:
Parameter Description
l The lower bound, optional
u The upper bound, optional
incl Include the lower bound: true/false, optional, default=true
incu Include the upper bound: true/false, optional, default=true
Examples:
{!frange l=1000 u=50000}myfield
fq={!frange l=0 u=2.2} sum(user_ranking,editor_ranking)
Both of these examples are restricting the results by a range of values found in a declared field or a function query. In the second example, we're doing a sum calculation, and then defining that only values between 0 and 2.2 should be returned to the user.
For more information about range queries over functions, see Yonik Seeley's introductory blog post Ranges over Functions in Solr 1.4, hosted at SearchHub.org.
Join Query Parser
JoinQParser extends the QParserPlugin. It allows normalizing relationships between documents with a join operation. This is different from the concept of a join in a relational database because no information is being truly joined. An appropriate SQL analogy would be an "inner query".
Examples:
Find all products containing the word "ipod", join them against manufacturer docs and return the list of
manufacturers:
{!join from=manu_id_s to=id}ipod
Find all manufacturer docs named "belkin", join them against product docs, and filter the list to only products with
a price less than $12:
q = {!join from=id to=manu_id_s}compName_s:Belkin
fq = price:[* TO 12]
Joining Across Collections
You can also specify a fromIndex parameter to join with a field from another core or collection. If running in SolrCloud mode, then the collection specified in the fromIndex parameter must have a single shard and a replica on all Solr nodes where the collection you're joining to has a replica.
Let's consider an example where you want to use a Solr join query to filter movies by directors that have won an
Oscar. Specifically, imagine we have two collections with the following fields:
movies: id, title, director_id, ...
movie_directors: id, name, has_oscar, ...
To filter movies by directors that have won an Oscar using a Solr join on the movie_directors collection, you can send the following filter query to the movies collection:
fq={!join from=id fromIndex=movie_directors to=director_id}has_oscar:true
Notice that the query criteria of the filter (has_oscar:true) is based on a field in the collection specified using fromIndex. Keep in mind that you cannot return fields from the fromIndex collection using join queries; you can only use the fields for filtering results in the "to" collection (movies).
Next, let's understand how these collections need to be deployed in your cluster. Imagine the movies collection is deployed to a four node SolrCloud cluster and has two shards with a replication factor of two. Specifically, the movies collection has replicas on the following four nodes:
node 1: movies_shard1_replica1
node 2: movies_shard1_replica2
node 3: movies_shard2_replica1
node 4: movies_shard2_replica2
To use the movie_directors collection in Solr join queries with the movies collection, it needs to have a replica on each of the four nodes. In other words, movie_directors must have one shard and replication factor of four:
node 1: movie_directors_shard1_replica1
node 2: movie_directors_shard1_replica2
node 3: movie_directors_shard1_replica3
node 4: movie_directors_shard1_replica4
At query time, the JoinQParser will access the local replica of the movie_directors collection to perform the join. If a local replica is not available or active, then the query will fail. At this point, it should be clear that since you're limited to a single shard and the data must be replicated across all nodes where it is needed, this approach works better with smaller data sets where there is a one-to-many relationship between the from collection and the to collection. Moreover, if you add a replica to the to collection, then you also need to add a replica for the from collection.
For more information about join queries, see the Solr Wiki page on Joins. Erick Erickson has also written a blog post about join performance called Solr and Joins, hosted by SearchHub.org.
Lucene Query Parser
The LuceneQParser extends the QParserPlugin by parsing Solr's variant on the Lucene QueryParser syntax. This is effectively the same query parser that is used in Lucene. It uses the operators q.op, the default operator ("OR" or "AND"), and df, the default field name.
Example:
{!lucene q.op=AND df=text}myfield:foo +bar -baz
For more information about the syntax for the Lucene Query Parser, see the Classic QueryParser javadocs.
Max Score Query Parser
The MaxScoreQParser extends the LuceneQParser but returns the Max score from the clauses. It does this by wrapping all SHOULD clauses in a DisjunctionMaxQuery with tie=1.0. Any MUST or PROHIBITED clauses are passed through as-is. Non-boolean queries, e.g. NumericRange, fall through to the LuceneQParser parser behavior.
Example:
{!maxscore tie=0.01}C OR (D AND E)
More Like This Query Parser
MLTQParser enables retrieving documents that are similar to a given document. It uses Lucene's existing MoreLikeThis logic and also works in SolrCloud mode. The document identifier used here is the unique id value and not the Lucene internal document id.
This query parser takes the following parameters:
Parameter Description
qf Specifies the fields to use for similarity.
mintf Specifies the Minimum Term Frequency, the frequency below which terms will be ignored in the
source document.
mindf Specifies the Minimum Document Frequency, the frequency at which words will be ignored if they do not occur in at least this many documents.
minwl Sets the minimum word length below which words will be ignored.
maxwl Sets the maximum word length above which words will be ignored.
Examples:
Find documents like the document with id=1, using the name field for similarity.
{!mlt qf=name}1
Adding more constraints to what qualifies as similar using mintf and mindf.
{!mlt qf=name mintf=2 mindf=3}1
Nested Query Parser
The NestedParser extends the QParserPlugin and creates a nested query, with the ability for that query to redefine its type via local parameters. This is useful in specifying defaults in configuration and letting clients indirectly reference them.
Example:
{!query defType=func v=$q1}
If the q1 parameter is price, then the query would be a function query on the price field. If the q1 parameter is {!lucene}inStock:true then a term query is created from the Lucene syntax string that matches documents with inStock=true. These parameters would be defined in solrconfig.xml, in the defaults section:
<lst name="defaults">
<str name="q1">{!lucene}inStock:true</str>
</lst>
For more information about the possibilities of nested queries, see Yonik Seeley's blog post Nested Queries in Solr, hosted by SearchHub.org.
Old Lucene Query Parser
OldLuceneQParser extends the QParserPlugin by parsing Solr's variant of Lucene's QueryParser syntax, including the deprecated sort specification after the query.
Example:
{!lucenePlusSort} myfield:foo +bar -baz;price asc
Prefix Query Parser
PrefixQParser extends the QParserPlugin by creating a prefix query from the input value. Currently no analysis or value transformation is done to create this prefix query. The parameter is f, the field. The string after the prefix declaration is treated as a wildcard query.
Example:
{!prefix f=myfield}foo
This would be generally equivalent to the Lucene query parser expression myfield:foo*.
Raw Query Parser
RawQParser extends the QParserPlugin by creating a term query from the input value without any text analysis or transformation. This is useful in debugging, or when raw terms are returned from the terms component (this is not the default). The only parameter is f, which defines the field to search.
Example:
{!raw f=myfield}Foo Bar
This example constructs the query: TermQuery(Term("myfield","Foo Bar")).
For easy filter construction to drill down in faceting, the TermQParserPlugin is recommended. For full analysis on all fields, including text fields, you may want to use the FieldQParserPlugin.
Re-Ranking Query Parser
The ReRankQParserPlugin is a special purpose parser for Re-Ranking the top results of a simple query using a more complex ranking query.
Details about using the ReRankQParserPlugin can be found in the Other Parsers section.
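A sketch of typical usage (the query strings here are hypothetical): the rq parameter supplies the re-ranking query, which rescores the top documents returned by the main query.
q=greetings&rq={!rerank reRankQuery=$rqq reRankDocs=1000 reRankWeight=3}&rqq=(hi hello hey)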
Simple Query Parser
The Simple query parser in Solr is based on Lucene's SimpleQueryParser. This query parser is designed to
allow users to enter queries however they want, and it will do its best to interpret the query and return results.
This parser takes the following parameters:
Parameter Description
q.operators Enables specific operations for parsing. By default, all operations are enabled, and this can be
used to disable specific operations as needed. Passing an empty string with this parameter
disables all operations.
Operator Description Example
+ Specifies AND token1+token2
| Specifies OR token1|token2
- Specifies NOT -token3
" Creates a phrase "term1 term2"
* Specifies a prefix query term*
~N At the end of terms, specifies a fuzzy query term~1
~N At the end of phrases, specifies a NEAR query "term1
term2"~5
( ) Specifies precedence; tokens inside the parenthesis will be
analyzed first. Otherwise, normal order is left to right.
token1 +
(token2 |
token3)
If needed, operators can be escaped with the \ character.
q.op Defines an operator to use by default if none are defined by the user. By default, OR is defined;
an alternative option is AND.
qf A list of query fields and boosts to use when building the query.
df Defines the default field if none is defined in schema.xml, or overrides the default field if it is already defined.
Any errors in syntax are ignored and the query parser will interpret the query as best it can. This can mean, however, odd results in some cases.
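For example, a hypothetical query combining several of these operators (a required term, an excluded term, and AND as the default operator) might look like:
q={!simple q.op=AND df=title}solr +lucene -tutorial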
Spatial Filter Query Parser
SpatialFilterQParser extends the by creating a spatial Filter based on the type of spatialQParserPlugin
point used. The field must implement . All units are in Kilometers.SpatialQueryable
This query parser takes the following parameters:
Parameter Description
sfield The field on which to filter. Required.
pt The point to use as a reference. Must match the dimension of the field. Required.
d The distance in km. Required.
The distance measure used currently depends on the FieldType. LatLonType defaults to using haversine, PointType defaults to Euclidean (2-norm).
This example shows the syntax:
{!geofilt sfield=<location_field> pt=<lat,lon> d=<distance>}
Here are some examples with values configured:
fq={!geofilt sfield=store pt=10.312,-20.556 d=3.5}
fq={!geofilt sfield=store}&pt=10.312,-20&d=3.5
fq={!geofilt}&sfield=store&pt=10.312,-20&d=3.5
If using geofilt with LatLonType, it is capable of producing scores equal to the computed distance from the point to the field, making it useful as a component of the main query or a boosting query.
There is more information about spatial searches available in the section Spatial Search.
Surround Query Parser
SurroundQParser extends the QParserPlugin. This provides support for the Surround query syntax, which provides proximity search functionality. There are two operators: w creates an ordered span query and n creates an unordered one. Both operators take a numeric value to indicate distance between two terms. The default is 1, and the maximum is 99. Note that the query string is not analyzed in any way.
Example:
{!surround} 3w(foo, bar)
This example would find documents where the terms "foo" and "bar" were no more than 3 terms away from each
other (i.e., no more than 2 terms between them).
This query parser will also accept boolean operators (AND, OR, and NOT, in either upper- or lowercase), wildcards, quoting for phrase searches, and boosting. The w and n operators can also be expressed in upper- or lowercase.
More information about Surround queries can be found at http://wiki.apache.org/solr/SurroundQueryParser.
Switch Query Parser
SwitchQParser is a QParserPlugin that acts like a "switch" or "case" statement.
The primary input string is trimmed and then prefixed with case. for use as a key to look up a "switch case" in the parser's local params. If a matching local param is found the resulting param value will then be parsed as a subquery, and returned as the parse result.
The case local param can optionally be specified as a switch case to match missing (or blank) input strings. The default local param can optionally be specified as a default case to use if the input string does not match any other switch case local params. If default is not specified, then any input which does not match a switch case local param will result in a syntax error.
In the examples below, the result of each query is "XXX":
{!switch case.foo=XXX case.bar=zzz case.yak=qqq}foo
{!switch case.foo=qqq case.bar=XXX case.yak=zzz} bar // extra whitespace is trimmed
{!switch case.foo=qqq case.bar=zzz default=XXX}asdf // fallback to the default
{!switch case=XXX case.bar=zzz case.yak=qqq} // blank input uses 'case'
A practical usage of this QParserPlugin is in specifying appends fq params in the configuration of a SearchHandler, to provide a fixed set of filter options for clients using custom parameter names. Using the example configuration below, clients can optionally specify the custom parameters in_stock and shipping to override the default filtering behavior, but are limited to the specific set of legal values (shipping=any|free, in_stock=yes|no|all).
<requestHandler name="/select" class="solr.SearchHandler">
<lst name="defaults">
<str name="in_stock">yes</str>
<str name="shipping">any</str>
</lst>
<lst name="appends">
<str name="fq">{!switch case.all='*:*'
case.yes='inStock:true'
case.no='inStock:false'
v=$in_stock}</str>
<str name="fq">{!switch case.any='*:*'
case.free='shipping_cost:0.0'
v=$shipping}</str>
</lst>
</requestHandler>
Term Query Parser
TermQParser extends the QParserPlugin by creating a single term query from the input value equivalent to readableToIndexed(). This is useful for generating filter queries from the external human readable terms returned by the faceting or terms components. The only parameter is f, for the field.
Example:
{!term f=weight}1.5
For text fields, no analysis is done since raw terms are already returned from the faceting and terms components. To apply analysis to text fields as well, see the Field Query Parser, above.
If no analysis or transformation is desired for any type of field, see the Raw Query Parser, above.
Terms Query Parser
TermsQParser functions similarly to the Term Query Parser but takes in multiple values separated by commas and returns documents matching any of the specified values. This can be useful for generating filter queries from the external human readable terms returned by the faceting or terms components, and may be more efficient in some cases than using the Standard Query Parser to generate a boolean query, since the default "method" implementation avoids scoring.
This query parser takes the following parameters:
Parameter Description
f The field on which to search. Required.
separator Separator to use when parsing the input. If set to " " (a single blank space), will trim additional white space from the input terms. Defaults to ",".
method The internal implementation to request for building the query: termsFilter, booleanQuery, automaton, or docValuesTermsFilter. Defaults to "termsFilter".
Examples:
{!terms f=tags}software,apache,solr,lucene
{!terms f=categoryId method=booleanQuery separator=" "}8 6 7 5309
Faceting
As described in the section Overview of Searching in Solr, faceting is the arrangement of search results into categories based on indexed terms. Searchers are presented with the indexed terms, along with numerical counts of how many matching documents were found for each term. Faceting makes it easy for users to explore search results, narrowing in on exactly the results they are looking for.
Topics covered in this section:
General Parameters
Field-Value Faceting Parameters
Range Faceting
Date Faceting Parameters
Pivot (Decision Tree) Faceting
Interval Faceting
Local Parameters for Faceting
Related Topics
General Parameters
The table below summarizes the general parameters for controlling faceting.
Parameter Description
facet If set to true, enables faceting.
facet.query Specifies a Lucene query to generate a facet count.
These parameters are described in the sections below.
The facet Parameter
If set to "true," this parameter enables facet counts in the query response. If set to "false" or to a blank or missing value, this parameter disables faceting. None of the other parameters listed below will have any effect unless this parameter is set to "true." The default value is blank.
The facet.query Parameter
This parameter allows you to specify an arbitrary query in the Lucene default syntax to generate a facet count.
By default, Solr's faceting feature automatically determines the unique terms for a field and returns a count for each of those terms. Using facet.query, you can override this default behavior and select exactly which terms or expressions you would like to see counted. In a typical implementation of faceting, you will specify a number of facet.query parameters. This parameter can be particularly useful for numeric-range-based facets or prefix-based facets.
You can set the facet.query parameter multiple times to indicate that multiple queries should be used as separate facet constraints.
To use facet queries in a syntax other than the default syntax, prefix the facet query with the name of the query notation. For example, to use the hypothetical myfunc query parser, you could set the facet.query parameter like so:
facet.query={!myfunc}name~fred
Field-Value Faceting Parameters
Several parameters can be used to trigger faceting based on the indexed terms in a field.
When using these parameters, it is important to remember that "term" is a very specific concept in Lucene: it relates to the literal field/value pairs that are indexed after any analysis occurs. For text fields that include stemming, lowercasing, or word splitting, the resulting terms may not be what you expect. If you want Solr to perform both analysis (for searching) and faceting on the full literal strings, use the copyField directive in the schema.xml file to create two versions of the field: one Text and one String. Make sure both are indexed="true". (For more information about the copyField directive, see Documents, Fields, and Schema Design.)
The table below summarizes Solr's field value faceting parameters.
Parameter Description
facet.field Identifies a field to be treated as a facet.
facet.prefix Limits the terms used for faceting to those that begin with the specified prefix.
facet.contains Limits the terms used for faceting to those that contain the specified substring.
facet.contains.ignoreCase If facet.contains is used, ignore case when searching for the specified substring.
facet.sort Controls how faceted results are sorted.
facet.limit Controls how many constraints should be returned for each facet.
facet.offset Specifies an offset into the facet results at which to begin displaying facets.
facet.mincount Specifies the minimum counts required for a facet field to be included in the
response.
facet.missing Controls whether Solr should compute a count of all matching results which have
no value for the field, in addition to the term-based constraints of a facet field.
facet.method Selects the algorithm or method Solr should use when faceting a field.
facet.enum.cache.minDF (Advanced) Specifies the minimum document frequency (the number of documents matching a term) for which the filterCache should be used when determining the constraint count for that term.
facet.overrequest.count (Advanced) A number of documents, beyond the effective facet.limit, to request from each shard in a distributed search.
facet.overrequest.ratio (Advanced) A multiplier of the effective facet.limit to request from each shard in a distributed search.
facet.threads (Advanced) Controls parallel execution of field faceting
These parameters are described in the sections below.
The facet.field Parameter
The facet.field parameter identifies a field that should be treated as a facet. It iterates over each Term in the field and generates a facet count using that Term as the constraint. This parameter can be specified multiple times in a query to select multiple facet fields.
If you do not set this parameter to at least one field in the schema, none of the other parameters described in this section will have any effect.
The facet.prefix Parameter
The facet.prefix parameter limits the terms on which to facet to those starting with the given string prefix. This does not limit the query in any way, only the facets that would be returned in response to the query.
This parameter can be specified on a per-field basis with the syntax of f.<fieldname>.facet.prefix.
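For example, a hypothetical request that facets on the cat field but only counts terms beginning with "elec" might be:
facet=true&facet.field=cat&facet.prefix=elec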
The facet.contains Parameter
The facet.contains parameter limits the terms on which to facet to those containing the given substring. This does not limit the query in any way, only the facets that would be returned in response to the query.
This parameter can be specified on a per-field basis with the syntax of f.<fieldname>.facet.contains.
The facet.contains.ignoreCase Parameter
If facet.contains is used, the facet.contains.ignoreCase parameter causes case to be ignored when matching the given substring against candidate facet terms.
This parameter can be specified on a per-field basis with the syntax of f.<fieldname>.facet.contains.ignoreCase.
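As a hypothetical illustration, the following request counts only facet terms that contain "tron", ignoring case:
facet=true&facet.field=cat&facet.contains=tron&facet.contains.ignoreCase=true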
The facet.sort Parameter
This parameter determines the ordering of the facet field constraints.
facet.sort Setting: Results
count: Sort the constraints by count (highest count first).
index: Return the constraints sorted in their index order (lexicographic by indexed term). For terms in the ASCII range, this will be alphabetically sorted.
The default is count if facet.limit is greater than 0, otherwise, the default is index.
This parameter can be specified on a per-field basis with the syntax of f.<fieldname>.facet.sort.
The facet.limit Parameter
This parameter specifies the maximum number of constraint counts (essentially, the number of facets for a field that are returned) that should be returned for the facet fields. A negative value means that Solr will return an unlimited number of constraint counts.
The default value is 100.
This parameter can be specified on a per-field basis to apply a distinct limit to each field with the syntax of f.<fieldname>.facet.limit.
The facet.offset Parameter
The facet.offset parameter indicates an offset into the list of constraints to allow paging.
The default value is 0.
This parameter can be specified on a per-field basis with the syntax of f.<fieldname>.facet.offset.
The facet.mincount Parameter
The facet.mincount parameter specifies the minimum counts required for a facet field to be included in the response. If a field's counts are below the minimum, the field's facet is not returned.
The default value is 0.
This parameter can be specified on a per-field basis with the syntax of f.<fieldname>.facet.mincount.
The facet.missing Parameter
If set to true, this parameter indicates that, in addition to the Term-based constraints of a facet field, a count of all results that match the query but which have no facet value for the field should be computed and returned in the response.
The default value is false.
This parameter can be specified on a per-field basis with the syntax of f.<fieldname>.facet.missing.
The facet.method Parameter
The facet.method parameter selects the type of algorithm or method Solr should use when faceting a field.
Setting: Results
enum: Enumerates all terms in a field, calculating the set intersection of documents that match the term with documents that match the query. This method is recommended for faceting multi-valued fields that have only a few distinct values. The average number of values per document does not matter. For example, faceting on a field with U.S. States such as Alabama, Alaska, ... Wyoming would lead to fifty cached filters which would be used over and over again. The filterCache should be large enough to hold all the cached filters.
fc: Calculates facet counts by iterating over documents that match the query and summing the terms that appear in each document. This is currently implemented using an UnInvertedField cache if the field either is multi-valued or is tokenized (according to FieldType.isTokenized()). Each document is looked up in the cache to see what terms/values it contains, and a tally is incremented for each value. This method is excellent for situations where the number of indexed values for the field is high, but the number of values per document is low. For multi-valued fields, a hybrid approach is used that uses term filters from the filterCache for terms that match many documents. The letters fc stand for field cache.
fcs: Per-segment field faceting for single-valued string fields. Enable with facet.method=fcs and control the number of threads used with the threads local parameter. This parameter allows faceting to be faster in the presence of rapid index changes.
The default value is fc (except for fields using the BoolField field type) since it tends to use less memory and is faster when a field has many unique terms in the index.
This parameter can be specified on a per-field basis with the syntax of f.<fieldname>.facet.method.
The facet.enum.cache.minDf Parameter
This parameter indicates the minimum document frequency (the number of documents matching a term) for which the filterCache should be used when determining the constraint count for that term. This is only used with the facet.method=enum method of faceting.
A value greater than zero decreases the filterCache's memory usage, but increases the time required for the query to be processed. If you are faceting on a field with a very large number of terms, and you wish to decrease memory usage, try setting this parameter to a value between 25 and 50, and run a few tests. Then, optimize the parameter setting as necessary.
The default value is 0, causing the filterCache to be used for all terms in the field.
This parameter can be specified on a per-field basis with the syntax of f.<fieldname>.facet.enum.cache.minDF.
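For instance, a hypothetical request faceting on a high-cardinality manu field with the enum method could restrict filterCache use to terms matching at least 30 documents:
facet=true&facet.field=manu&facet.method=enum&f.manu.facet.enum.cache.minDf=30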
Over-Request Parameters
In some situations, the accuracy in selecting the "top" constraints returned for a facet in a distributed Solr query can be improved by "Over Requesting" the number of desired constraints (i.e., facet.limit) from each of the individual shards. In these situations, each shard is by default asked for the top "10 + (1.5 * facet.limit)" constraints.
In some situations, depending on how your docs are partitioned across your shards, and what facet.limit value you used, you may find it advantageous to increase or decrease the amount of over-requesting Solr does. This can be achieved by setting the facet.overrequest.count (defaults to 10) and facet.overrequest.ratio (defaults to 1.5) parameters.
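As a hypothetical illustration, with the settings below each shard would be asked for the top 30 + (3.0 * 10) = 60 constraints:
facet.limit=10&facet.overrequest.count=30&facet.overrequest.ratio=3.0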
The facet.threads Parameter
This parameter will cause loading the underlying fields used in faceting to be executed in parallel with the number of threads specified. Specify as facet.threads=N where N is the maximum number of threads used. Omitting this parameter or specifying the thread count as 0 will not spawn any threads, and only the main request thread will be used. Specifying a negative number of threads will create up to Integer.MAX_VALUE threads.
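For example, a hypothetical request faceting on several fields could load them using up to four parallel threads:
facet=true&facet.field=cat&facet.field=manu&facet.field=features&facet.threads=4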
Range Faceting
You can use Range Faceting on any date field or any numeric field that supports range queries. This is particularly useful for stitching together a series of range queries (as facet by query) for things like prices. As of Solr 3.1, Range Faceting is preferred over Date Faceting (described below).
Parameter Description
facet.range Specifies the field to facet by range.
facet.range.start Specifies the start of the facet range.
facet.range.end Specifies the end of the facet range.
facet.range.gap Specifies the span of the range as a value to be added to the lower bound.
facet.range.hardend A boolean parameter that specifies how Solr handles a range gap that cannot be evenly divided between the range start and end values. If true, the last range constraint will have the facet.range.end value as an upper bound. If false, the last range will have the smallest possible upper bound greater than facet.range.end such that the range is the exact width of the specified range gap. The default value for this parameter is false.
facet.range.include Specifies inclusion and exclusion preferences for the upper and lower bounds of the range. See the facet.range.include topic for more detailed information.
facet.range.other Specifies counts for Solr to compute in addition to the counts for each facet range constraint.
The facet.range Parameter
The facet.range parameter defines the field for which Solr should create range facets. For example:
facet.range=price&facet.range=age
The facet.range.start Parameter
The facet.range.start parameter specifies the lower bound of the ranges. You can specify this parameter on a per field basis with the syntax of f.<fieldname>.facet.range.start. For example:
f.price.facet.range.start=0.0&f.age.facet.range.start=10
The facet.range.end Parameter
The facet.range.end parameter specifies the upper bound of the ranges. You can specify this parameter on a per field basis with the syntax of f.<fieldname>.facet.range.end. For example:
f.price.facet.range.end=1000.0&f.age.facet.range.end=99
The facet.range.gap Parameter
The span of each range expressed as a value to be added to the lower bound. For date fields, this should be expressed using the DateMathParser syntax (such as facet.range.gap=%2B1DAY ... '+1DAY'). You can specify this parameter on a per-field basis with the syntax of f.<fieldname>.facet.range.gap. For example:
f.price.facet.range.gap=100&f.age.facet.range.gap=10
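Putting these parameters together, a hypothetical request producing price ranges 100 wide from 0 to 1000 might look like:
q=*:*&facet=true&facet.range=price&f.price.facet.range.start=0.0&f.price.facet.range.end=1000.0&f.price.facet.range.gap=100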
The facet.range.hardend Parameter
The facet.range.hardend parameter is a Boolean parameter that specifies how Solr should handle cases where the facet.range.gap does not divide evenly between facet.range.start and facet.range.end. If true, the last range constraint will have the facet.range.end value as an upper bound. If false, the last range will have the smallest possible upper bound greater than facet.range.end such that the range is the exact width of the specified range gap. The default value for this parameter is false.
This parameter can be specified on a per field basis with the syntax f.<fieldname>.facet.range.hardend.
The facet.range.include Parameter
By default, the ranges used to compute range faceting between facet.range.start and facet.range.end are inclusive of their lower bounds and exclusive of the upper bounds. The "before" range defined with the facet.range.other parameter is exclusive and the "after" range is inclusive. This default, equivalent to "lower" below, will not result in double counting at the boundaries. You can use the facet.range.include parameter to modify this behavior using the following options:
Option Description
lower All gap-based ranges include their lower bound.
upper All gap-based ranges include their upper bound.
edge The first and last gap ranges include their edge bounds (lower for the first one, upper for the last one) even if the corresponding upper/lower option is not specified.
outer The "before" and "after" ranges will be inclusive of their bounds, even if the first or last ranges already include those boundaries.
all Includes all options: lower, upper, edge, outer.
To ensure you avoid double-counting, do not choose both lower and upper, do not choose outer, and do not choose all.
You can specify this parameter on a per field basis with the syntax of f.<fieldname>.facet.range.include, and you can specify it multiple times to indicate multiple choices.
The facet.range.other Parameter
The facet.range.other parameter specifies that in addition to the counts for each range constraint between facet.range.start and facet.range.end, counts should also be computed for these options:
Option Description
before All records with field values lower than the lower bound of the first range.
after All records with field values greater than the upper bound of the last range.
between All records with field values between the start and end bounds of all ranges.
none Do not compute any counts.
all Compute counts for before, between, and after.
This parameter can be specified on a per field basis with the syntax of f.<fieldname>.facet.range.other. In addition to the all option, this parameter can be specified multiple times to indicate multiple choices, but none will override all other options.
The facet.mincount Parameter in Range Faceting
The facet.mincount parameter, the same one as used in field faceting, is also applied to range faceting. When used, no ranges with a count below the minimum will be included in the response.
Date Ranges & Time Zones
Range faceting on date fields is a common situation where the TZ parameter can be useful to ensure that the "facet counts per day" or "facet counts per month" are based on a meaningful definition of when a given day/month "starts" relative to a particular TimeZone.
For more information, see the examples in the Working with Dates section.
Date Faceting Parameters
Date faceting using the facet.date type-specific parameters has been deprecated since Solr 3.1. Existing users are encouraged to switch to using the more general Range Faceting, which provides the same features for date fields but can also work with any numeric field.
The response format is slightly different, but the request parameters are virtually identical.
Pivot (Decision Tree) Faceting
Pivoting is a summarization tool that lets you automatically sort, count, total or average data stored in a table. The results are typically displayed in a second table showing the summarized data. Pivot faceting lets you create a summary table of the results from faceting documents by multiple fields.
Another way to look at it is that the query produces a Decision Tree, in that Solr tells you "for facet A, the
constraints/counts are X/N, Y/M, etc. If you were to constrain A by X, then the constraint counts for B would be
S/P, T/Q, etc.". In other words, it tells you in advance what the "next" set of facet results would be for a field if
you apply a constraint from the current facet results.
facet.pivot
The facet.pivot parameter defines the fields to use for the pivot. Multiple facet.pivot values will create multiple "facet_pivot" sections in the response. Separate each list of fields with a comma.
facet.pivot.mincount
The facet.pivot.mincount parameter defines the minimum number of documents that need to match in order for the facet to be included in results. The default is 1.
Using the " " example, A query URL like this one will returns the data below, withbin/solr -e techproducts
the pivot faceting results found in the section "facet_pivot":
http://localhost:8983/solr/techproducts/select?q=*:*&facet.pivot=cat,popularity,inSt
ock
&facet.pivot=popularity,cat&facet=true&facet.field=cat&facet.limit=5
&rows=0&wt=json&indent=true&facet.pivot.mincount=2
"facet_counts":{
"facet_queries":{},
"facet_fields":{
"cat":[
"electronics",14,
"currency",4,
"memory",3,
"connector",2,
"graphics card",2]},
"facet_dates":{},
"facet_ranges":{},
"facet_pivot":{
"cat,popularity,inStock":[{
"field":"cat",
"value":"electronics",
"count":14,
"pivot":[{
"field":"popularity",
"value":6,
"count":5,
"pivot":[{
"field":"inStock",
"value":true,
"count":5}]},
...
Combining Stats Component With Pivots
In addition to some of the general local parameters supported by other types of faceting, a stats local parameter can be used with facet.pivot to refer to stats.field instances (by tag) that you would like to have computed for each Pivot Constraint.
In the example below, two different (overlapping) sets of statistics are computed for each of the facet.pivot result
hierarchies:
stats=true
stats.field={!tag=piv1,piv2 min=true max=true}price
stats.field={!tag=piv2 mean=true}popularity
facet=true
facet.pivot={!stats=piv1}cat,inStock
facet.pivot={!stats=piv2}manu,inStock
Results:
"facet_pivot":{
"cat,inStock":[{
"field":"cat",
"value":"electronics",
"count":12,
"pivot":[{
"field":"inStock",
"value":true,
"count":8,
"stats":{
"stats_fields":{
"price":{
"min":74.98999786376953,
"max":399.0}}}},
{
"field":"inStock",
"value":false,
"count":4,
"stats":{
"stats_fields":{
"price":{
"min":11.5,
"max":649.989990234375}}}}],
"stats":{
"stats_fields":{
"price":{
"min":11.5,
"max":649.989990234375}}}},
{
"field":"cat",
"value":"currency",
"count":4,
"pivot":[{
"field":"inStock",
"value":true,
"count":4,
"stats":{
"stats_fields":{
"price":{
...
"manu,inStock":[{
"field":"manu",
"value":"inc",
"count":8,
"pivot":[{
"field":"inStock",
"value":true,
"count":7,
"stats":{
"stats_fields":{
"price":{
"min":74.98999786376953,
"max":2199.0},
"popularity":{
"mean":5.857142857142857}}}},
{
"field":"inStock",
"value":false,
"count":1,
"stats":{
"stats_fields":{
"price":{
"min":479.95001220703125,
"max":479.95001220703125},
"popularity":{
"mean":7.0}}}}],
...
Additional Pivot Parameters
Although facet.pivot.mincount deviates in name from the facet.mincount parameter used by field faceting, many other field faceting parameters described above can also be used with pivot faceting:
facet.limit
facet.offset
facet.sort
facet.overrequest.count
facet.overrequest.ratio
Interval Faceting
Another supported form of faceting is interval faceting. This sounds similar to range faceting, but the functionality
is really closer to doing facet queries with range queries. Interval faceting allows you to set variable intervals and
count the number of documents that have values within those intervals in the specified field.
In order to use interval faceting on a field, it is required that the field has docValues enabled. Even though
the same functionality can be achieved by using a facet query with range queries, the implementation of these
two methods is very different and will provide different performance depending on the context. If you are
concerned about the performance of your searches you should test with both options. Interval faceting tends to
be better with multiple intervals for the same fields, while facet queries tend to be better in environments where the cache is more effective (static indexes for example).
Name What it does
facet.interval Specifies the field to facet by interval.
facet.interval.set Sets the intervals for the field.
The facet.interval parameter
This parameter indicates the field where interval faceting must be applied. It can be used multiple times in the same request to indicate multiple fields. All the interval fields must have docValues="true" in the schema.
facet.interval=price&facet.interval=size
The facet.interval.set parameter
This parameter is used to set the intervals for the field; it can be specified multiple times to indicate multiple intervals. This parameter is global, which means that it will be used for all fields indicated with facet.interval unless there is an override for a specific field. To override this parameter on a specific field you can use f.<fieldname>.facet.interval.set, for example:
f.price.facet.interval.set=[0,10]&f.price.facet.interval.set=(10,100]
Interval Syntax
Intervals must begin with either '(' or '[', be followed by the start value, then a comma (','), the end value, and finally a closing ')' or ']'.
For example:
(1,10) -> will include values greater than 1 and lower than 10
[1,10) -> will include values greater than or equal to 1 and lower than 10
[1,10] -> will include values greater than or equal to 1 and lower than or equal to 10
The initial and end values cannot be empty. If the interval needs to be unbounded, the special character '*' can be used for both the start and end limits. When using '*', '(' and '[', and ')' and ']' will be treated as equal. [*,*] will include all documents with a value in the field. The interval limits may be strings but there is no need to add quotes. All the text until the comma will be treated as the start limit, and the text after that will be the end limit. For example: [Buenos Aires,New York]. Keep in mind that a string-like comparison will be done to match documents in string intervals (case-sensitive). The comparator can't be changed.
Commas, brackets and square brackets can be escaped by using '\' in front of them. Whitespaces before and after the values will be omitted. The start limit can't be greater than the end limit. Equal limits are allowed; this allows you to indicate the specific values that you want to count, like [A,A], [B,B] and [C,Z].
Interval faceting supports output key replacement described below. Output keys can be replaced in both the facet.interval parameter and in the facet.interval.set parameter. For example:
&facet.interval={!key=popularity}some_field
&facet.interval.set={!key=bad}[0,5]
&facet.interval.set={!key=good}[5,*]
&facet=true
Local Parameters for Faceting
The LocalParams syntax allows overriding global settings. It can also provide a method of adding metadata to other parameter values, much like XML attributes.
Tagging and Excluding Filters
You can tag specific filters and exclude those filters when faceting. This is useful when doing multi-select
faceting.
Consider the following example query with faceting:
q=mainquery&fq=status:public&fq=doctype:pdf&facet=true&facet.field=doctype
Because everything is already constrained by the filter doctype:pdf, the facet.field=doctype facet command is currently redundant and will return 0 counts for everything except doctype:pdf.
To implement a multi-select facet for doctype, a GUI may want to still display the other doctype values and their associated counts, as if the doctype:pdf constraint had not yet been applied. For example:
=== Document Type ===
[ ] Word (42)
[x] PDF (96)
[ ] Excel(11)
[ ] HTML (63)
To return counts for doctype values that are currently not selected, tag filters that directly constrain doctype, and
exclude those filters when faceting on doctype.
q=mainquery&fq=status:public&fq={!tag=dt}doctype:pdf&facet=true&facet.field
={!ex=dt}doctype
Filter exclusion is supported for all types of facets. Both the tag and ex local parameters may specify multiple values by separating them with commas.
Changing the Output Key
To change the output key for a faceting command, specify a new name with the key local parameter. For example:
facet.field={!ex=dt key=mylabel}doctype
The parameter setting above causes the field facet results for the "doctype" field to be returned using the key
"mylabel" rather than "doctype" in the response. This can be helpful when faceting on the same field multiple
times with different exclusions.
Related Topics
SimpleFacetParameters from the Solr Wiki.
Heatmap Faceting (Spatial)
Highlighting
Highlighting in Solr allows fragments of documents that match the user's query to be included with the query
response. The fragments are included in a special section of the response (the highlighting section), and the
client uses the formatting clues also included to determine how to present the snippets to users.
Solr provides a collection of highlighting utilities which allow a great deal of control over the fields fragments are
taken from, the size of fragments, and how they are formatted. The highlighting utilities can be called by various
Request Handlers and can be used with the DisMax, Extended DisMax, or standard query parsers.
There are three highlighting implementations available:
Standard Highlighter: The Standard Highlighter is the swiss-army knife of the highlighters. It has the
most sophisticated and fine-grained query representation of the three highlighters. For example, this
highlighter is capable of providing precise matches even for advanced query parsers such as the surround
parser. It does not require any special data structures such as termVectors, although it will use them if
they are present. If they are not, this highlighter will re-analyze the document on-the-fly to highlight it. This
highlighter is a good choice for a wide variety of search use-cases.
FastVector Highlighter: The FastVector Highlighter requires term vector options (termVectors,
termPositions, and termOffsets) on the field, and is optimized with that in mind. It tends to work better for
more languages than the Standard Highlighter, because it supports Unicode breakiterators. On the other
hand, its query-representation is less advanced than the Standard Highlighter: for example it will not work
well with the surround parser. This highlighter is a good choice for large documents and highlighting text
in a variety of languages.
Postings Highlighter: The Postings Highlighter requires storeOffsetsWithPositions to be
configured on the field. This is a much more compact and efficient structure than term vectors, but is not
appropriate for huge numbers of query terms (e.g. wildcard queries). Like the FastVector Highlighter, it
supports Unicode algorithms for dividing up the document. On the other hand, it has the most coarse
query-representation: it focuses on summary quality and ignores the structure of the query completely,
ranking passages based solely on query terms and statistics. This highlighter is a good choice for classic
full-text keyword search.
Configuring Highlighting
Whichever implementation is chosen, highlighting is configured by first defining a search component
and then referencing that component in one or more request handlers.
The exact parameters for the search component vary depending on the implementation, but there is a robust
example in the solrconfig.xml used in the "techproducts" example which shows how to configure both
the Standard Highlighter and the FastVector Highlighter (see the Postings Highlighter section for details on how
to configure that implementation).
Standard Highlighter
The standard highlighter (AKA the default highlighter) doesn't require any special indexing parameters on the
fields to highlight. However, you can optionally turn on termVectors, termPositions, and termOffsets for
any field to be highlighted. This will avoid having to run documents through the analysis chain at query time and
will make highlighting significantly faster and use less memory, particularly for large text fields, and even more so
when hl.usePhraseHighlighter is enabled.
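For example, a field definition in schema.xml that turns on these options might look like the following sketch (the
field and type names are only illustrative):
<field name="content" type="text_general" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>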
Standard Highlighting Parameters
The table below describes Solr's parameters for the Standard highlighter. These parameters can be defined in
the highlight search component, as defaults for the specific request handler, or passed to the request handler
with the query.
hl (default: blank, no highlighting): When set to true, enables highlighted snippets to be generated in the query response. If set to false or to a blank or missing value, disables highlighting.
hl.q (default: blank): Specifies an overriding query term for highlighting. If hl.q is specified, the highlighter will use that term rather than the main query term.
hl.qparser (default: blank): Specifies a qparser to use for the hl.q query. If blank, will use the defType of the overall query.
hl.fl (default: blank): Specifies a list of fields to highlight. Accepts a comma- or space-delimited list of fields for which Solr should generate highlighted snippets. If left blank, highlights the defaultSearchField (or the field specified in the df parameter if used) for the StandardRequestHandler. For the DisMaxRequestHandler, the qf fields are used as defaults. A '*' can be used to match field globs, such as 'text_*' or even '*' to highlight on all fields where highlighting is possible. When using '*', consider adding hl.requireFieldMatch=true.
hl.snippets (default: 1): Specifies the maximum number of highlighted snippets to generate per field. It is possible for any number of snippets from zero to this value to be generated. This parameter accepts per-field overrides.
hl.fragsize (default: 100): Specifies the size, in characters, of fragments to consider for highlighting. 0 indicates that no fragmenting should be considered and the whole field value should be used. This parameter accepts per-field overrides.
hl.mergeContiguous (default: false): Instructs Solr to collapse contiguous fragments into a single fragment. A value of true indicates contiguous fragments will be collapsed into a single fragment. This parameter accepts per-field overrides. The default value, false, is also the backward-compatible setting.
hl.requireFieldMatch (default: false): If set to true, highlights terms only if they appear in the specified field. If false, terms are highlighted in all requested fields regardless of which field matched the query.
hl.maxAnalyzedChars (default: 51200): Specifies the number of characters into a document that Solr should look for suitable snippets.
hl.maxMultiValuedToExamine (default: integer.MAX_VALUE): Specifies the maximum number of entries in a multi-valued field to examine before stopping. This can potentially return zero results if the limit is reached before any matches are found. If used with maxMultiValuedToMatch, whichever limit is reached first will determine when to stop looking.
hl.maxMultiValuedToMatch (default: integer.MAX_VALUE): Specifies the maximum number of matches in a multi-valued field that are found before stopping. If hl.maxMultiValuedToExamine is also defined, whichever limit is reached first will determine when to stop looking.
hl.alternateField (default: blank): Specifies a field to be used as a backup default summary if Solr cannot generate a snippet (i.e., because no terms match). This parameter accepts per-field overrides.
hl.maxAlternateFieldLength (default: unlimited): Specifies the maximum number of characters of the field to return. Any value less than or equal to 0 means the field's length is unlimited. This parameter is only used in conjunction with the hl.alternateField parameter.
hl.formatter (default: simple): Selects a formatter for the highlighted output. Currently the only legal value is simple, which surrounds a highlighted term with a customizable pre- and post-text snippet. This parameter accepts per-field overrides.
hl.simple.pre, hl.simple.post (default: <em> and </em>): Specifies the text that should appear before (hl.simple.pre) and after (hl.simple.post) a highlighted term, when using the simple formatter. This parameter accepts per-field overrides.
hl.fragmenter (default: gap): Specifies a text snippet generator for highlighted text. The standard fragmenter is gap, which creates fixed-sized fragments with gaps for multi-valued fields. Another option is regex, which tries to create fragments that resemble a specified regular expression. This parameter accepts per-field overrides.
hl.usePhraseHighlighter (default: true): If set to true, Solr will use the Lucene SpanScorer class to highlight phrase terms only when they appear within the query phrase in the document.
hl.highlightMultiTerm (default: true): If set to true, Solr will highlight phrase terms that appear in multi-term queries.
hl.regex.slop (default: 0.6): When using the regex fragmenter (hl.fragmenter=regex), this parameter defines the factor by which the fragmenter can stray from the ideal fragment size (given by hl.fragsize) to accommodate a regular expression. For instance, a slop of 0.2 with hl.fragsize=100 should yield fragments between 80 and 120 characters in length. It is usually good to provide a slightly smaller hl.fragsize value when using the regex fragmenter.
hl.regex.pattern (default: blank): Specifies the regular expression for fragmenting. This could be used to extract sentences.
hl.regex.maxAnalyzedChars (default: 10000): Instructs Solr to analyze only this many characters from a field when using the regex fragmenter (after which, the fragmenter produces fixed-sized fragments). Applying a complicated regex to a huge field is computationally expensive.
hl.preserveMulti (default: false): If true, multi-valued fields will return all values in the order they were saved in the index. If false, only values that match the highlight request will be returned.
hl.payloads (default: automatic): When hl.usePhraseHighlighter=true and the indexed field has payloads but not term vectors (generally quite rare), the index's payloads will be read into the highlighter's memory index along with the postings. If this may happen and you know you don't need them for highlighting (i.e., your queries don't filter by payload), then you can save a little memory by setting this to false.
Related Content
HighlightingParameters from the Solr wiki
Highlighting javadocs
FastVector Highlighter
The FastVectorHighlighter is a TermVector-based highlighter that offers higher performance than the standard
highlighter in many cases. To use the FastVectorHighlighter, set the hl.useFastVectorHighlighter
parameter to true.
You must also turn on termVectors, termPositions, and termOffsets for each field that will be
highlighted. Lastly, you should use a boundary scanner to prevent the FastVectorHighlighter from truncating your
terms. In most cases, using the breakIterator boundary scanner will give you excellent results. See the
section Using Boundary Scanners with the Fast Vector Highlighter for more details about boundary scanners.
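For example, assuming a field named content that has been indexed with the term vector options above, a
request enabling the FastVector Highlighter might look like:
q=content:apache&hl=true&hl.useFastVectorHighlighter=true&hl.fl=content&hl.boundaryScanner=breakIterator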
FastVector Highlighter Parameters
The table below describes Solr's parameters for this highlighter, many of which overlap with the standard
highlighter. These parameters can be defined in the highlight search component, as defaults for the specific
request handler, or passed to the request handler with the query.
hl (default: blank, no highlighting): When set to true, enables highlighted snippets to be generated in the query response. A false or blank value disables highlighting.
hl.useFastVectorHighlighter (default: false): When set to true, enables the FastVector Highlighter.
hl.q (default: blank): Specifies an overriding query term for highlighting. If hl.q is specified, the highlighter will use that term rather than the main query term.
hl.fl (default: blank): Specifies a list of fields to highlight. Accepts a comma- or space-delimited list of fields for which Solr should generate highlighted snippets. If left blank, highlights the defaultSearchField (or the field specified in the df parameter if used) for the StandardRequestHandler. For the DisMaxRequestHandler, the qf fields are used as defaults. A '*' can be used to match field globs, such as 'text_*' or even '*' to highlight on all fields where highlighting is possible. When using '*', consider adding hl.requireFieldMatch=true.
hl.snippets (default: 1): Specifies the maximum number of highlighted snippets to generate per field. It is possible for any number of snippets from zero to this value to be generated. This parameter accepts per-field overrides.
hl.fragsize (default: 100): Specifies the size, in characters, of fragments to consider for highlighting. 0 indicates that no fragmenting should be considered and the whole field value should be used. This parameter accepts per-field overrides.
hl.requireFieldMatch (default: false): If set to true, highlights terms only if they appear in the specified field. If false, terms are highlighted in all requested fields regardless of which field matched the query.
hl.maxMultiValuedToExamine (default: integer.MAX_VALUE): Specifies the maximum number of entries in a multi-valued field to examine before stopping. This can potentially return zero results if the limit is reached before any matches are found. If used with maxMultiValuedToMatch, whichever limit is reached first will determine when to stop looking.
hl.maxMultiValuedToMatch (default: integer.MAX_VALUE): Specifies the maximum number of matches in a multi-valued field that are found before stopping. If hl.maxMultiValuedToExamine is also defined, whichever limit is reached first will determine when to stop looking.
hl.alternateField (default: blank): Specifies a field to be used as a backup default summary if Solr cannot generate a snippet (i.e., because no terms match). This parameter accepts per-field overrides.
hl.maxAlternateFieldLength (default: unlimited): Specifies the maximum number of characters of the field to return. Any value less than or equal to 0 means the field's length is unlimited. This parameter is only used in conjunction with the hl.alternateField parameter.
hl.tag.pre, hl.tag.post (default: <em> and </em>): Specifies the text that should appear before (hl.tag.pre) and after (hl.tag.post) a highlighted term. This parameter accepts per-field overrides.
hl.phraseLimit (default: integer.MAX_VALUE): To improve the performance of the FastVectorHighlighter, you can set a limit on the number (int) of phrases to be analyzed for highlighting.
hl.usePhraseHighlighter (default: true): If set to true, Solr will use the Lucene SpanScorer class to highlight phrase terms only when they appear within the query phrase in the document.
hl.preserveMulti (default: false): If true, multi-valued fields will return all values in the order they were saved in the index. If false, the default, only values that match the highlight request will be returned.
hl.fragListBuilder (default: weighted): The snippet fragmenting algorithm. The weighted fragListBuilder uses IDF-weights to order fragments. Other options are single, which returns the entire field contents as one snippet, or simple. You can select a fragListBuilder with this parameter, or modify an existing implementation in solrconfig.xml to be the default by adding "default=true".
hl.fragmentsBuilder (default: default): The fragments builder is responsible for formatting the fragments, which use <em> and </em> markup (if hl.tag.pre and hl.tag.post are not defined). Another pre-configured choice is colored, which is an example of how to use the fragments builder to insert HTML into the snippets for colored highlights if you choose. You can also implement your own if you'd like. You can select a fragments builder with this parameter, or modify an existing implementation in solrconfig.xml to be the default by adding "default=true".
Using Boundary Scanners with the Fast Vector Highlighter
The Fast Vector Highlighter will occasionally truncate highlighted words. To prevent this, implement a boundary
scanner in solrconfig.xml, then use the hl.boundaryScanner parameter to specify the boundary scanner
for highlighting.
Solr supports two boundary scanners: breakIterator and simple.
The breakIterator Boundary Scanner
The breakIterator boundary scanner offers excellent performance right out of the box by taking locale and
boundary type into account. In most cases you will want to use the breakIterator boundary scanner. To
implement the breakIterator boundary scanner, add this code to the highlighting section of your
solrconfig.xml file, adjusting the type, language, and country values as appropriate to your application:
<boundaryScanner name="breakIterator"
class="solr.highlight.BreakIteratorBoundaryScanner">
<lst name="defaults">
<str name="hl.bs.type">WORD</str>
<str name="hl.bs.language">en</str>
<str name="hl.bs.country">US</str>
</lst>
</boundaryScanner>
Possible values for the hl.bs.type parameter are WORD, LINE, SENTENCE, and CHARACTER.
The simple Boundary Scanner
The simple boundary scanner scans term boundaries for a specified maximum character value (hl.bs.maxScan)
and for common delimiters such as punctuation marks (hl.bs.chars). The simple boundary scanner may
be useful for some custom highlighting. To implement the simple boundary scanner, add this code to the
highlighting section of your solrconfig.xml file, adjusting the values as appropriate to your application:
<boundaryScanner name="simple" class="solr.highlight.SimpleBoundaryScanner"
default="true">
<lst name="defaults">
<str name="hl.bs.maxScan">10</str>
<str name="hl.bs.chars">.,!?\t\n</str>
</lst>
</boundaryScanner>
Related Content
HighlightingParameters from the Solr wiki
Highlighting javadocs
Postings Highlighter
PostingsHighlighter focuses on good document summaries and efficiency, but is less flexible than the other
highlighters. It uses significantly less disk space, focuses on good document summaries, and provides a
performant approach if queries have a low number of terms relative to the number of results per page. However,
the drawbacks are that it is not a query matching debugger (it focuses on fast highlighting for full-text search) and
it does not allow broken analysis chains.
To use this highlighter, you must turn on storeOffsetsWithPositions for the field. There is no need to turn
on termVectors, termPositions, or termOffsets in fields since this highlighter does not make use of term
vectors.
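For example, a field definition in schema.xml that enables this option might look like the following sketch (the field
and type names are only illustrative):
<field name="content" type="text_general" indexed="true" stored="true"
       storeOffsetsWithPositions="true"/>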
Configuring Postings Highlighter
The configuration for the Postings Highlighter is done in .solrconfig.xml
First, define the search component:
<searchComponent class="solr.HighlightComponent" name="highlight">
<highlighting class="org.apache.solr.highlight.PostingsSolrHighlighter"/>
</searchComponent>
Note that in this example, we have named the search component "highlight". If you started with a default
solrconfig.xml file, then you already have a component with that name. You should either replace the default with this
example, or rename the search component that is already there so there is no confusion about which search
component implementation Solr should use.
Then in the request handler, you can define the defaults, as in this example:
<requestHandler name="standard" class="solr.StandardRequestHandler">
<lst name="defaults">
<int name="hl.snippets">1</int>
<str name="hl.tag.pre">&lt;em&gt;</str>
<str name="hl.tag.post">&lt;/em&gt;</str>
<str name="hl.tag.ellipsis">... </str>
<bool name="hl.defaultSummary">true</bool>
<str name="hl.encoder">simple</str>
<float name="hl.score.k1">1.2</float>
<float name="hl.score.b">0.75</float>
<float name="hl.score.pivot">87</float>
<str name="hl.bs.language"></str>
<str name="hl.bs.country"></str>
<str name="hl.bs.variant"></str>
<str name="hl.bs.type">SENTENCE</str>
<int name="hl.maxAnalyzedChars">10000</int>
</lst>
</requestHandler>
This example shows all of the defaults for each parameter. If you intend to keep all of the defaults, you would not
need to add anything to the request handler and could override the default values at query time as needed.
Postings Highlighter Parameters
The table below describes Solr's parameters for this highlighter. These parameters can be set as defaults (as in
the examples), or the default values can be changed in the request handler or at query time. Most of the
parameters can be specified per-field (exceptions noted below).
hl (default: blank, no highlighting): When set to true, enables highlighted snippets to be generated in the query response. If set to false or to a blank or missing value, disables highlighting.
hl.q (default: blank): Specifies an overriding query term for highlighting. If hl.q is specified, the highlighter will use that term rather than the main query term.
hl.fl (default: blank): Specifies a list of fields to highlight. Accepts a comma- or space-delimited list of fields for which Solr should generate highlighted snippets. If left blank, highlights the defaultSearchField (or the field specified in the df parameter if used) for the StandardRequestHandler. For the DisMaxRequestHandler, the qf fields are used as defaults. A '*' can be used to match field globs, such as 'text_*' or even '*' to highlight on all fields where highlighting is possible. When using '*', consider adding hl.requireFieldMatch=true.
hl.snippets (default: 1): Specifies the maximum number of highlighted snippets to generate per field. It is possible for any number of snippets from zero to this value to be generated. This parameter accepts per-field overrides.
hl.tag.pre (default: <em>): Specifies the text that should appear before a highlighted term.
hl.tag.post (default: </em>): Specifies the text that should appear after a highlighted term.
hl.tag.ellipsis (default: "... "): Specifies the text that should join two unconnected passages in the resulting snippet.
hl.maxAnalyzedChars (default: 10000): Specifies the number of characters into a document that Solr should look for suitable snippets. This parameter does not accept per-field overrides.
hl.multiValuedSeparatorChar (default: " ", a space): Specifies the logical separator between multi-valued fields.
hl.defaultSummary (default: true): If true, a field should have a default summary if highlighting finds no matching passages.
hl.encoder (default: simple): Defines the encoding for the resulting snippet. The value simple applies no escaping, while html will escape HTML characters in the text.
hl.score.k1 (default: 1.2): Specifies BM25 term frequency normalization parameter 'k1'. For example, it can be set to "0" to rank passages solely based on the number of query terms that match.
hl.score.b (default: 0.75): Specifies BM25 length normalization parameter 'b'. For example, it can be set to "0" to ignore the length of passages entirely when ranking.
hl.score.pivot (default: 87): Specifies BM25 average passage length in characters.
hl.bs.language (default: blank): Specifies the breakiterator language for dividing the document into passages.
hl.bs.country (default: blank): Specifies the breakiterator country for dividing the document into passages.
hl.bs.variant (default: blank): Specifies the breakiterator variant for dividing the document into passages.
hl.bs.type (default: SENTENCE): Specifies the breakiterator type for dividing the document into passages. Can be SENTENCE, WORD, CHARACTER, LINE, or WHOLE.
Related Content
PostingsHighlighter from the Solr wiki
PostingsSolrHighlighter javadoc
Spell Checking
The SpellCheck component is designed to provide inline query suggestions based on other, similar, terms. The
basis for these suggestions can be terms in a field in Solr, externally created text files, or fields in other Lucene
indexes.
Topics covered in this section:
Configuring the SpellCheckComponent
Spell Check Parameters
Distributed SpellCheck
Configuring the SpellCheckComponent
Define Spell Check in solrconfig.xml
The first step is to specify the source of terms in solrconfig.xml. There are three approaches to spell
checking in Solr, discussed below.
IndexBasedSpellChecker
The IndexBasedSpellChecker uses a Solr index as the basis for a parallel index used for spell checking. It
requires defining a field as the basis for the index terms; a common practice is to copy terms from some fields
(such as title, body, etc.) to another field created for spell checking. Here is a simple example of configuring
solrconfig.xml with the IndexBasedSpellChecker:
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
<lst name="spellchecker">
<str name="classname">solr.IndexBasedSpellChecker</str>
<str name="spellcheckIndexDir">./spellchecker</str>
<str name="field">content</str>
<str name="buildOnCommit">true</str>
</lst>
</searchComponent>
The first element defines the searchComponent to use the solr.SpellCheckComponent. The classname
is the specific implementation of the SpellCheckComponent, in this case solr.IndexBasedSpellChecker.
Defining the classname is optional; if not defined, it will default to IndexBasedSpellChecker.
The spellcheckIndexDir defines the location of the directory that holds the spellcheck index, while the
field defines the source field (defined in schema.xml) for spell check terms. When choosing a field for the
spellcheck index, it's best to avoid a heavily processed field to get more accurate results. If the field has many
word variations from processing synonyms and/or stemming, the dictionary will be created with those variations
in addition to more valid spelling data.
Finally, buildOnCommit defines whether to build the spell check index at every commit (that is, every time new
documents are added to the index). It is optional, and can be omitted if you would rather set it to false.
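For example, the content field used above could be populated from other fields with copyField rules such as
these in schema.xml (the source field names and the textSpell type are only illustrative):
<field name="content" type="textSpell" indexed="true" stored="false" multiValued="true"/>
<copyField source="title" dest="content"/>
<copyField source="body" dest="content"/>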
DirectSolrSpellChecker
The DirectSolrSpellChecker uses terms from the Solr index without building a parallel index like the
IndexBasedSpellChecker. This spell checker has the benefit of not having to be built regularly, meaning that the
terms are always up-to-date with terms in the index. Here is how this might be configured in solrconfig.xml:
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
<lst name="spellchecker">
<str name="name">default</str>
<str name="field">name</str>
<str name="classname">solr.DirectSolrSpellChecker</str>
<str name="distanceMeasure">internal</str>
<float name="accuracy">0.5</float>
<int name="maxEdits">2</int>
<int name="minPrefix">1</int>
<int name="maxInspections">5</int>
<int name="minQueryLength">4</int>
<float name="maxQueryFrequency">0.01</float>
<float name="thresholdTokenFrequency">.01</float>
</lst>
</searchComponent>
When choosing a field to query for this spell checker, you want one which has relatively little analysis
performed on it (particularly analysis such as stemming). Note that you need to specify a field to use for the
suggestions, so like the IndexBasedSpellChecker, you may want to copy data from fields like title, body,
etc., to a field dedicated to providing spelling suggestions.
Many of the parameters relate to how this spell checker should query the index for term suggestions. The
distanceMeasure defines the metric to use during the spell check query. The value "internal" uses the default
Levenshtein metric, which is the same metric used with the other spell checker implementations.
Because this spell checker is querying the main index, you may want to limit how often it queries the index to be
sure to avoid any performance conflicts with user queries. The accuracy setting defines the threshold for a valid
suggestion, while maxEdits defines the number of changes to the term to allow. Since most spelling mistakes
are only 1 letter off, setting this to 1 will reduce the number of possible suggestions (the default, however, is 2);
the value can only be 1 or 2. minPrefix defines the minimum number of characters the terms should share.
Setting this to 1 means that the spelling suggestions will all start with the same letter, for example.
The maxInspections parameter defines the maximum number of possible matches to review before returning
results; the default is 5. minQueryLength defines how many characters must be in the query before
suggestions are provided; the default is 4. maxQueryFrequency sets the maximum threshold for the number of
documents a term must appear in before being considered as a suggestion. This can be a percentage (such as
.01, or 1%) or an absolute value (such as 4). A lower threshold is better for small indexes. Finally,
thresholdTokenFrequency sets the minimum number of documents a term must appear in, and can also be
expressed as a percentage or an absolute value.
FileBasedSpellChecker
The FileBasedSpellChecker uses an external file as a spelling dictionary. This can be useful if using Solr as
a spelling server, or if spelling suggestions don't need to be based on actual terms in the index. In
solrconfig.xml, you would define the searchComponent as so:
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
<lst name="spellchecker">
<str name="classname">solr.FileBasedSpellChecker</str>
<str name="name">file</str>
<str name="sourceLocation">spellings.txt</str>
<str name="characterEncoding">UTF-8</str>
<str name="spellcheckIndexDir">./spellcheckerFile</str>
</lst>
</searchComponent>
The differences here are the use of the sourceLocation to define the location of the file of terms and the use
of characterEncoding to define the encoding of the terms file.
WordBreakSolrSpellChecker
WordBreakSolrSpellChecker offers suggestions by combining adjacent query terms and/or breaking terms
into multiple words. It is a SpellCheckComponent enhancement, leveraging Lucene's
WordBreakSpellChecker. It can detect spelling errors resulting from misplaced whitespace without the use of shingle-based
dictionaries and provides collation support for word-break errors, including cases where the user has a mix of
single-word spelling errors and word-break errors in the same query. It also provides shard support.
Here is how it might be configured in solrconfig.xml:
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
<lst name="spellchecker">
<str name="name">wordbreak</str>
<str name="classname">solr.WordBreakSolrSpellChecker</str>
<str name="field">lowerfilt</str>
<str name="combineWords">true</str>
<str name="breakWords">true</str>
<int name="maxChanges">10</int>
</lst>
</searchComponent>
Some of the parameters will be familiar from the discussion of the other spell checkers, such as name,
classname, and field. New for this spell checker is combineWords, which defines whether words should be combined
in a dictionary search (default is true); breakWords, which defines if words should be broken during a dictionary
search (default is true); and maxChanges, an integer which defines how many times the spell checker should
check collation possibilities against the index (default is 10).
The spellchecker can be configured with a traditional checker (i.e., DirectSolrSpellChecker). The results are
combined and collations can contain a mix of corrections from both spellcheckers.
In the previous example, name is used to name this specific definition of the spellchecker. Multiple
definitions can co-exist in a single solrconfig.xml, and the name helps to differentiate them when
they are defined in the schema.xml. If only defining one spellchecker, no name is required.
Add It to a Request Handler
Queries will be sent to a RequestHandler. If every request should generate a suggestion, then you would add the
following to the requestHandler that you are using:
<str name="spellcheck">true</str>
One of the possible parameters is the spellcheck.dictionary to use, and multiples can be defined. With
multiple dictionaries, all specified dictionaries are consulted and results are interleaved. Collations are created
with combinations from the different spellcheckers, with care taken that multiple overlapping corrections do not
occur in the same collation.
Here is an example with multiple dictionaries:
<requestHandler name="spellCheckWithWordbreak"
class="org.apache.solr.handler.component.SearchHandler">
<lst name="defaults">
<str name="spellcheck.dictionary">default</str>
<str name="spellcheck.dictionary">wordbreak</str>
<str name="spellcheck.count">20</str>
</lst>
<arr name="last-components">
<str>spellcheck</str>
</arr>
</requestHandler>
Spell Check Parameters
The SpellCheck component accepts the parameters described in the table below.
Parameter Description
spellcheck Turns on or off SpellCheck suggestions for the request. If true, then
spelling suggestions will be generated.
spellcheck.q or q Selects the query to be spellchecked.
spellcheck.build Instructs Solr to build a dictionary for use in spellchecking.
spellcheck.collate Causes Solr to build a new query based on the best suggestion for
each term in the submitted query.
spellcheck.maxCollations This parameter specifies the maximum number of collations to return.
spellcheck.maxCollationTries This parameter specifies the number of collation possibilities for Solr to
try before giving up.
spellcheck.maxCollationEvaluations This parameter specifies the maximum number of word correction
combinations to rank and evaluate prior to deciding which collation
candidates to test against the index.
spellcheck.collateExtendedResult If true, returns an expanded response detailing the collations found. If
spellcheck.collate is false, this parameter will be ignored.
spellcheck.collateMaxCollectDocs The maximum number of documents to collect when testing potential
collations.
spellcheck.collateParam.* Specifies param=value pairs that can be used to override normal query
params when validating collations
spellcheck.count Specifies the maximum number of spelling suggestions to be returned.
spellcheck.dictionary Specifies the dictionary that should be used for spellchecking.
spellcheck.extendedResults Causes Solr to return additional information about spellcheck results,
such as the frequency of each original term in the index (origFreq) as
well as the frequency of each suggestion in the index (frequency). Note
that this result format differs from the non-extended one as the returned
suggestion for a word is actually an array of lists, where each list holds
the suggested term and its frequency.
spellcheck.onlyMorePopular Limits spellcheck responses to queries that are more popular than the
original query.
spellcheck.maxResultsForSuggest The maximum number of hits the request can return in order to both
generate spelling suggestions and set the "correctlySpelled" element to
"false".
spellcheck.alternativeTermCount The count of suggestions to return for each query term existing in the
index and/or dictionary.
spellcheck.reload Reloads the spellchecker.
spellcheck.accuracy Specifies an accuracy value to help decide whether a result is
worthwhile.
spellcheck.<DICT_NAME>.key Specifies a key/value pair for the implementation handling a given
dictionary.
The spellcheck Parameter
This parameter turns on SpellCheck suggestions for the request. If true, then spelling suggestions will be
generated.
The spellcheck.q or q Parameter
This parameter specifies the query to spellcheck. If spellcheck.q is defined, then it is used; otherwise the
original input query is used. The spellcheck.q parameter is intended to be the original query, minus any extra
markup like field names, boosts, and so on. If the q parameter is specified, then the SpellingQueryConverter
class is used to parse it into tokens; otherwise the WhitespaceTokenizer is used. The choice of which one
to use is up to the application. Essentially, if you have a spelling "ready" version in your application, then it is
probably better to use spellcheck.q. Otherwise, if you just want Solr to do the job, use the q parameter.
The SpellingQueryConverter class does not deal properly with non-ASCII characters. In this case, you
have either to use spellcheck.q, or implement your own QueryConverter.
The spellcheck.build Parameter
If set to true, this parameter creates the dictionary that the SolrSpellChecker will use for spell-checking. In a
typical search application, you will need to build the dictionary before using the SolrSpellChecker. However, it's
not always necessary to build a dictionary first. For example, you can configure the spellchecker to use a
dictionary that already exists.
The dictionary will take some time to build, so this parameter should not be sent with every request.
The spellcheck.reload Parameter
If set to true, this parameter reloads the spellchecker. The results depend on the implementation of
SolrSpellChecker.reload(). In a typical implementation, reloading the spellchecker means reloading the dictionary.
The spellcheck.count Parameter
This parameter specifies the maximum number of suggestions that the spellchecker should return for a term. If
this parameter isn't set, the value defaults to 1. If the parameter is set but not assigned a number, the value
defaults to 5. If the parameter is set to a positive integer, that number becomes the maximum number of
suggestions returned by the spellchecker.
The spellcheck.onlyMorePopular Parameter
If true, Solr will return suggestions that result in more hits for the query than the existing query. Note that this
will return more popular suggestions even when the given query term is present in the index and considered
"correct".
The spellcheck.maxResultsForSuggest Parameter
For example, if this is set to 5 and the user's query returns 5 or fewer results, the spellchecker will report
"correctlySpelled=false" and also offer suggestions (and collations if requested). Setting this greater than zero is
useful for creating "did-you-mean?" suggestions for queries that return a low number of hits.
The spellcheck.alternativeTermCount Parameter
Specify the number of suggestions to return for each query term existing in the index and/or dictionary.
Presumably, users will want fewer suggestions for words with docFrequency>0. Also setting this value turns "on"
context-sensitive spell suggestions.
The spellcheck.extendedResults Parameter
This parameter causes Solr to include additional information about the suggestion, such as the frequency in
the index.
The spellcheck.collate Parameter
If true, this parameter directs Solr to take the best suggestion for each token (if one exists) and construct a new
query from the suggestions. For example, if the input query was "jawa class lording" and the best suggestion for
"jawa" was "java" and "lording" was "loading", then the resulting collation would be "java class loading".
The spellcheck.collate parameter only returns collations that are guaranteed to result in hits if re-queried, even
when applying original fq parameters. This is especially helpful when there is more than one correction per
query.
This only returns a query to be used. It does not actually run the suggested query.
The spellcheck.maxCollations Parameter
The maximum number of collations to return. The default is 1. This parameter is ignored if
spellcheck.collate is false.
The spellcheck.maxCollationTries Parameter
This parameter specifies the number of collation possibilities for Solr to try before giving up. Lower values ensure
better performance. Higher values may be necessary to find a collation that can return results. The default value
is 0, which maintains backwards-compatible (Solr 1.4) behavior (do not check collations). This parameter is
ignored if spellcheck.collate is false.
The spellcheck.maxCollationEvaluations Parameter
This parameter specifies the maximum number of word correction combinations to rank and evaluate prior to
deciding which collation candidates to test against the index. This is a performance safety-net in case a user
enters a query with many misspelled words. The default is 10,000 combinations, which should work well in most
situations.
The spellcheck.collateExtendedResult Parameter
If true, this parameter returns an expanded response format detailing the collations Solr found. The default value
is false and this is ignored if spellcheck.collate is false.
The spellcheck.collateMaxCollectDocs Parameter
This parameter specifies the maximum number of documents that should be collected when testing potential
collations against the index. A value of 0 indicates that all documents should be collected, resulting in exact
hit-counts. Otherwise an estimation is provided as a performance optimization in cases where exact hit-counts
are unnecessary – the higher the value specified, the more precise the estimation.
The default value for this parameter is 0, but when spellcheck.collateExtendedResults is false, the
optimization is always used as if a 1 had been specified.
The spellcheck.collateParam.* Parameter Prefix
This parameter prefix can be used to specify any additional parameters that you wish the Spellchecker to use
when internally validating collation queries. For example, even if your regular search results allow for loose
matching of one or more query terms via parameters like "q.op=OR" and "mm=20%", you can specify override params
such as "spellcheck.collateParam.q.op=AND&spellcheck.collateParam.mm=100%" to require that
only collations consisting of words that are all found in at least one document may be returned.
The spellcheck.dictionary Parameter
This parameter causes Solr to use the dictionary named in the parameter's argument. The default setting is
"default". This parameter can be used to invoke a specific spellchecker on a per request basis.
The spellcheck.accuracy Parameter
Specifies an accuracy value to be used by the spell checking implementation to decide whether a result is
worthwhile or not. The value is a float between 0 and 1. Defaults to Float.MIN_VALUE.
The spellcheck.<DICT_NAME>.key Parameter
Specifies a key/value pair for the implementation handling a given dictionary. The value that is passed through is
just key=value (spellcheck.<DICT_NAME>. is stripped off).
For example, given a dictionary called foo, spellcheck.foo.myKey=myValue would result in myKey=myValue
being passed through to the implementation handling the dictionary foo.
Example
Using Solr's "bin/solr -e techproducts" example, this query shows the results of a simple request that
defines a query using the spellcheck.q parameter, and forces the collations to require that all input terms
match:
http://localhost:8983/solr/techproducts/spell?df=text&spellcheck.q=delll+ultra+sharp&spellcheck=true&spellcheck.collateParam.q.op=AND
Results:
<lst name="spellcheck">
<lst name="suggestions">
<lst name="delll">
<int name="numFound">1</int>
<int name="startOffset">0</int>
<int name="endOffset">5</int>
<int name="origFreq">0</int>
<arr name="suggestion">
<lst>
<str name="word">dell</str>
<int name="freq">1</int>
</lst>
</arr>
</lst>
<lst name="ultra sharp">
<int name="numFound">1</int>
<int name="startOffset">6</int>
<int name="endOffset">17</int>
<int name="origFreq">0</int>
<arr name="suggestion">
<lst>
<str name="word">ultrasharp</str>
<int name="freq">1</int>
</lst>
</arr>
</lst>
</lst>
<bool name="correctlySpelled">false</bool>
<lst name="collations">
<lst name="collation">
<str name="collationQuery">dell ultrasharp</str>
<int name="hits">1</int>
<lst name="misspellingsAndCorrections">
<str name="delll">dell</str>
<str name="ultra sharp">ultrasharp</str>
</lst>
</lst>
</lst>
</lst>
Distributed SpellCheck
The SpellCheckComponent also supports spellchecking on distributed indexes. If you are using the
SpellCheckComponent on a request handler other than "/select", you must provide the following two parameters:
Parameter Description
shards Specifies the shards in your distributed indexing configuration. For more information about
distributed indexing, see Distributed Search with Index Sharding
shards.qt Specifies the request handler Solr uses for requests to shards. This parameter is not required for
the /select request handler.
For example: http://localhost:8983/solr/techproducts/spell?spellcheck=true&spellcheck.
build=true&spellcheck.q=toyata&shards.qt=/spell&shards=solr-shard1:8983/solr/techproducts,solr-shard2:8983/solr/techproducts
In case of a distributed request to the SpellCheckComponent, the shards are requested for at least five
suggestions even if the spellcheck.count parameter value is less than five. Once the suggestions are
collected, they are ranked by the configured distance measure (Levenshtein Distance by default) and then by
aggregate frequency.
Query Re-Ranking
Query Re-Ranking allows you to run a simple query (A) for matching documents and then re-rank the top N
documents using the scores from a more complex query (B). Since the more costly ranking from query B is only
applied to the top N documents, it will have less impact on performance than just using the complex query B by
itself – the trade-off is that documents which score very low using the simple query A may not be considered
during the re-ranking phase, even if they would score very highly using query B.
Specifying A Ranking Query
A Ranking query can be specified using the "rq" request parameter. The "rq" parameter must specify a query
string that, when parsed, produces a RankQuery. This could also be done with a custom QParserPlugin you have
written as a plugin, but most users can just use the "rerank" parser provided with Solr.
The "rerank" parser wraps a query specified by a local parameter, along with additional parameters indicating
how many documents should be re-ranked, and how the final scores should be computed:
reRankQuery (Mandatory): The query string for your complex ranking query - in most cases a variable will be used to refer to another request parameter.
reRankDocs (default: 200): The number of top N documents from the original query that should be re-ranked. This number will be treated as a minimum, and may be increased internally automatically in order to rank enough documents to satisfy the query (i.e., start+rows).
reRankWeight (default: 2.0): A multiplicative factor that will be applied to the score from the reRankQuery for each of the top matching documents, before that score is added to the original score.
In the example below, the top 1000 documents matching the query "greetings" will be re-ranked using the query
"(hi hello hey hiya)". The resulting scores for each of those 1000 documents will be 3 times their score from the
"(hi hello hey hiya)" query, plus the score from the original "greetings" query:
q=greetings&rq={!rerank reRankQuery=$rqq reRankDocs=1000
reRankWeight=3}&rqq=(hi+hello+hey+hiya)
If a document matches the original query, but does not match the re-ranking query, the document's original score
will remain.
Combining Ranking Queries With Other Solr Features
The " " parameter and the re-ranking feature in general works well with other Solr features. For example, it canrq
321Apache Solr Reference Guide 5.1
be used in conjunction with the to re-rank the group heads after they've been collapsed. It alsocollapse parser
preserves the order of documents elevated by the . And it even has it's own custom explainelevation component
so you can see how the re-ranking scores were derived when looking at .debug information
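For example, the following sketch (using a hypothetical group_id field) collapses results to one document per
group and then re-ranks the top group heads:
q=greetings&fq={!collapse field=group_id}&rq={!rerank reRankQuery=$rqq reRankDocs=200 reRankWeight=3}&rqq=(hi hello hey hiya)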
Transforming Result Documents
Document Transformers can be used to modify the information returned about each document in the results of
a query.
Using Document Transformers
When executing a request, a document transformer can be used by including it in the fl parameter using square
brackets, for example:
fl=id,name,score,[shard]
Some transformers allow, or require, local parameters which can be specified as key value pairs inside the
brackets:
fl=id,name,score,[explain style=nl]
As with regular fields, you can change the key used when a Transformer adds a field to a document via a prefix:
fl=id,name,score,my_val_a:[value v=42 t=int],my_val_b:[value v=7 t=float]
The sections below discuss exactly what these various transformers do.
Available Transformers
[value] - ValueAugmenterFactory
Modifies every document to include the exact same value, as if it were a stored field in every document:
q=*:*&fl=id,greeting:[value v='hello']
The above query would produce results like the following:
<result name="response" numFound="32" start="0">
<doc>
<str name="id">1</str>
<str name="greeting">hello</str></doc>
</doc>
...
By default, values are returned as a String, but a "t" parameter can be specified using a value of int, float,
double, or date to force a specific return type:
q=*:*&fl=id,my_number:[value v=42 t=int],my_string:[value v=42]
In addition to using these request parameters, you can configure additional named instances of
ValueAugmenterFactory, or override the default behavior of the existing [value] transformer in your
solrconfig.xml file:
<transformer name="mytrans2"
class="org.apache.solr.response.transform.ValueAugmenterFactory" >
<int name="value">5</int>
</transformer>
<transformer name="value"
class="org.apache.solr.response.transform.ValueAugmenterFactory" >
<double name="defaultValue">5</double>
</transformer>
The " " option forces an explicit value to always be used, while the " " option provides avalue defaultValue
default that can still be overridden using the " " and " " local parameters.v t
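With the mytrans2 configuration above, a request such as the following sketch would add a field named
my_label (an arbitrary key) with the configured value of 5 to every document:
q=*:*&fl=id,my_label:[mytrans2]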
[explain] - ExplainAugmenterFactory
Augments each document with an inline explanation of its score exactly like the information available about each
document in the debug section:
q=features:cache&wt=json&fl=id,[explain style=nl]
Supported values for "style" are "text", "html", and "nl", which returns the information as structured data:
"response":{"numFound":2,"start":0,"docs":[
{
"id":"6H500F0",
"[explain]":{
"match":true,
"value":1.052226,
"description":"weight(features:cache in 2) [DefaultSimilarity], result
of:",
"details":[{
...
A default style can be configured by specifying an "args" parameter in your configuration:
<transformer name="explain"
class="org.apache.solr.response.transform.ExplainAugmenterFactory" >
<str name="args">nl</str>
</transformer>
[child] - ChildDocTransformerFactory
This transformer returns all descendant documents of each parent document matching your query in a flat list
nested inside the matching parent document. This is useful when you have indexed nested child documents and
want to retrieve the child documents for the relevant parent documents for any type of search query.
fl=id,[child parentFilter=doc_type:book childFilter=doc_type:chapter limit=100]
Note that this transformer can be used even though the query itself is not a Block Join query.
When using this transformer, the parentFilter parameter must be specified, and works the same as in all
Block Join Queries. Additional optional parameters are:
childFilter - query to filter which child documents should be included, this can be particularly useful
when you have multiple levels of hierarchical documents (default: all children)
limit - the maximum number of child documents to be returned per parent document (default: 10)
[shard] - ShardAugmenterFactory
This transformer adds information about what shard each individual document came from in a distributed
request.
ShardAugmenterFactory does not support any request parameters, or configuration options.
[docid] - DocIdAugmenterFactory
This transformer adds the internal Lucene document id to each document – this is primarily only useful for
debugging purposes.
DocIdAugmenterFactory does not support any request parameters, or configuration options.
[elevated] and [excluded]
These transformers are available only when using the .Query Elevation Component
[elevated] annotates each document to indicate if it was elevated or not.
[excluded] annotates each document to indicate if it would have been excluded - this is only supported
if you also use the markExcludes parameter.
fl=id,[elevated],[excluded]&excludeIds=GB18030TEST&elevateIds=6H500F0&markExcludes=true
"response":{"numFound":32,"start":0,"docs":[
{
"id":"6H500F0",
"[elevated]":true,
"[excluded]":false},
{
"id":"GB18030TEST",
"[elevated]":false,
"[excluded]":true},
{
"id":"SP2514N",
"[elevated]":false,
"[excluded]":false},
...
Suggester
The SuggestComponent in Solr provides users with automatic suggestions for query terms. You can use this to
implement a powerful auto-suggest feature in your search application.
Although it is possible to use the Spell Checking functionality to power autosuggest behavior, Solr has a
dedicated SuggestComponent designed for this functionality. This approach utilizes Lucene's Suggester
implementation and supports all of the lookup implementations available in Lucene.
The main features of this Suggester are:
Lookup implementation pluggability
Term dictionary pluggability, giving you the flexibility to choose the dictionary implementation
Distributed support
The solrconfig.xml found in Solr's "techproducts" example has the new Suggester implementation
configured already. For more on search components, see the section RequestHandlers and
SearchComponents in SolrConfig.
Covered in this section:
Configuring Suggester in solrconfig.xml
Adding the Suggest Search Component
Adding the Suggest Request Handler
Example Usages
Get Suggestions with Weights
Multiple Dictionaries
Configuring Suggester in solrconfig.xml
The " " example has a search component and a requesttechproducts solrconfig.xml suggest /suggest
handler already configured. You can use that as the basis for your configuration, or create it from scratch, as
detailed below.
Adding the Suggest Search Component
The first step is to add a search component to solrconfig.xml to extend the SpellChecker. Here is some
sample code that could be used.
<searchComponent name="suggest" class="solr.SuggestComponent">
<lst name="suggester">
<str name="name">mySuggester</str>
<str name="lookupImpl">FuzzyLookupFactory</str>
<str name="dictionaryImpl">DocumentDictionaryFactory</str>
<str name="field">cat</str>
<str name="weightField">price</str>
<str name="suggestAnalyzerFieldType">string</str>
<str name="buildOnStartup">false</str>
</lst>
</searchComponent>
Suggester Search Component Parameters
The Suggester search component takes several configuration parameters. The choice of the lookup
implementation ( , how terms are found in the suggestion dictionary) and the dictionarylookupImpl
implementation ( , how terms are stored in the suggestion dictionary) will dictate some of thedictionaryImpl
parameters required. Below are the main parameters that can be used no matter what lookup or dictionary
implementation is used. In the following sections additional parameters are provided for each implementation.
Parameter Description
searchComponent
name
Arbitrary name for the search component.
name A symbolic name for this suggester. You can refer to this name in the URL parameters
and in the SearchHandler configuration. It is possible to have mutiples of these
lookupImpl Lookup implementation. There are several possible implementations, described below in
the section . If not set, the default lookup isLookup Implementations
JaspellLookupFactory.
dictionaryImpl The dictionary implementation to use. There are several possible implementations,
described below in the section . If not set, the default Dictionary Implementations
dictionary implementation is HighFrequencyDictionaryFactory unless a sourceLocatio
is used, in which case, the dictionary implementation will be FileDictionaryFactoryn
field A field from the index to use as the basis of suggestion terms. If issourceLocation
empty (meaning any dictionary implementation other than FileDictionaryFactory) then
terms from this field in the index will be used.
To be used as the basis for a suggestion, the field must be stored. You may want to use
to create a special 'suggest' field comprised of terms from other fields incopyField rules
documents. In any event, you likely want a minimal amount of analysis on the field, so an
additional option is to create a field type in your schema that only uses basic tokenizers
or filters. One option for such a field type is shown here:
<fieldType class="solr.TextField" name="textSuggest"
positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
However, this minimal analysis is not required if you want more analysis to occur on
terms. If using the AnalyzingLookupFactory as your lookupImpl, however, you have the
option of defining the field type rules to use for index and query time analysis.
sourceLocation The path to the dictionary file if using the FileDictionaryFactory. If this value is empty then
the main index will be used as a source of terms and weights.
storeDir The location to store the dictionary file.
326Apache Solr Reference Guide 5.1
buildOnCommit or
buildOnOptimize
If true, the lookup data structure will be rebuilt after a soft-commit. If false (the default),
the lookup data will be built only when requested by the URL parameter suggest.build=true.
Use buildOnCommit to rebuild the dictionary with every soft-commit, or buildOnOptimize
to build the dictionary only when the index is optimized. Some lookup implementations may
take a long time to build, especially with large indexes; in such cases, using buildOnCommit
or buildOnOptimize, particularly with a high frequency of soft-commits, is not recommended.
It is recommended instead to build the suggester manually at a lower frequency by issuing
requests with suggest.build=true.
buildOnStartup If true, the lookup data structure will be built when Solr starts or when the core is
reloaded. If this parameter is not specified, the suggester will check if the lookup data
structure is present on disk and build it if not found. Setting this to true could lead to the
core taking longer to load (or reload) as the suggester data structure needs to be built,
which can sometimes take a long time. It's usually preferred to have this setting set to
'false' and build suggesters manually by issuing requests with suggest.build=true.
Lookup Implementations
The parameter defines the algorithms used to look up terms in the suggest index. There arelookupImpl
several possible implementations to choose from, and some require additional parameters to be configured.
AnalyzingLookupFactory
A lookup that first analyzes the incoming text and adds the analyzed form to a weighted FST, and then does the
same thing at lookup time.
This implementation uses the following additional properties:
suggestAnalyzerFieldType: The field type to use for the query-time and build-time term suggestion
analysis.
exactMatchFirst: If true, the default, exact suggestions are returned first, even if they are prefixes or other
strings in the FST have larger weights.
preserveSep: If true, the default, then a separator between tokens is preserved. This means that
suggestions are sensitive to tokenization (e.g., baseball is different from base ball).
preservePositionIncrements: If true, the suggester will preserve position increments. This means that
token filters which leave gaps (for example, when StopFilter matches a stopword) the position would be
respected when building the suggester. The default is false.
FuzzyLookupFactory
This is a suggester which is an extension of the AnalyzingSuggester but is fuzzy in nature. The similarity is
measured by the Levenshtein algorithm.
This implementation uses the following additional properties:
exactMatchFirst: If true, the default, exact suggestions are returned first, even if they are prefixes or other
strings in the FST have larger weights.
preserveSep: If true, the default, then a separator between tokens is preserved. This means that
suggestions are sensitive to tokenization (e.g., baseball is different from base ball).
maxSurfaceFormsPerAnalyzedForm: Maximum number of surface forms to keep for a single analyzed
form. When there are too many surface forms we discard the lowest weighted ones.
maxGraphExpansions: When building the FST ("index-time"), we add each path through the tokenstream
graph as an individual entry. This places an upper-bound on how many expansions will be added for a
single suggestion. The default is -1 which means there is no limit.
preservePositionIncrements: If true, the suggester will preserve position increments. This means that
token filters which leave gaps (for example, when StopFilter matches a stopword) the position would be
respected when building the suggester. The default is false.
maxEdits: The maximum number of string edits allowed. The system's hard limit is 2. The default is 1.
transpositions: If true, the default, transpositions should be treated as a primitive edit operation.
nonFuzzyPrefix: The length of the common non fuzzy prefix match which must match a suggestion. The
default is 1.
minFuzzyLength: The minimum length of query before which any string edits will be allowed. The default
is 3.
unicodeAware: If true, maxEdits, minFuzzyLength, transpositions and nonFuzzyPrefix parameters will be
measured in unicode code points (actual letters) instead of bytes. The default is false.
AnalyzingInfixLookupFactory
Analyzes the input text and then suggests matches based on prefix matches to any tokens in the indexed text.
This uses a Lucene index for its dictionary.
This implementation uses the following additional properties.
indexPath: When using AnalyzingInfixSuggester you can provide your own path where the index will get
built. The default is analyzingInfixSuggesterIndexDir and will be created in your collection's data directory.
minPrefixChars: Minimum number of leading characters before PrefixQuery is used (default is 4). Prefixes
shorter than this are indexed as character ngrams (increasing index size but making lookups faster).
allTermsRequired: Boolean option for multiple terms. Default is true - all terms required.
highlight: Highlight suggest terms. Default is true.
BlendedInfixLookupFactory
An extension of the AnalyzingInfixSuggester which provides additional functionality to weight prefix matches
across the matched documents. You can tell it to score higher if a hit is closer to the start of the suggestion or
vice versa.
This implementation uses the following additional properties:
blenderType: used to calculate weight coefficient using the position of the first matching word. Can be one
of:
linear: weightFieldValue*(1 - 0.10*position): Matches to the start will be given a higher score
(Default)
reciprocal: weightFieldValue/(1+position): Matches to the end will be given a higher score.
numFactor: The factor to multiply the number of searched elements from which results will be pruned.
Default is 10.
indexPath: When using BlendedInfixSuggester you can provide your own path where the index will get
built. The default directory name is blendedInfixSuggesterIndexDir and will be created in your collection's
data directory.
minPrefixChars: Minimum number of leading characters before PrefixQuery is used (default 4). Prefixes
shorter than this are indexed as character ngrams (increasing index size but making lookups faster).
FreeTextLookupFactory
It looks at the last tokens plus the prefix of whatever final token the user is typing, if present, to predict the most
likely next token. The number of previous tokens that need to be considered can also be specified. This
suggester would only be used as a fallback, when the primary suggester fails to find any suggestions.
This implementation uses the following additional properties:
suggestFreeTextAnalyzerFieldType: The analyzer used at "query-time" and "build-time" to analyze
suggestions. This field is required.
ngrams: The maximum number of tokens considered when building the dictionary (i.e., the order of the
n-grams used). The default value is 2. Increasing this means more than the previous 2 tokens will be taken
into consideration when making the suggestions.
FSTLookupFactory
An automaton-based lookup. This implementation is slower to build, but provides the lowest memory cost. We
recommend using this implementation unless you need more sophisticated matching results, in which case you
should use the Jaspell implementation.
This implementation uses the following additional properties:
exactMatchFirst: If true, the default, exact suggestions are returned first, even if they are prefixes or other
strings in the FST have larger weights.
weightBuckets: The number of separate buckets for weights which the suggester will use while building its
dictionary.
TSTLookupFactory
A simple compact ternary trie based lookup.
WFSTLookupFactory
A weighted automaton representation which is an alternative to FSTLookup for more fine-grained ranking.
WFSTLookup does not use buckets, but instead a shortest path algorithm. Note that it expects weights to be
whole numbers. If weight is missing it's assumed to be 1.0. Weights affect the sorting of matching suggestions
when spellcheck.onlyMorePopular=true is selected: weights are treated as a "popularity" score, with
higher weights preferred over suggestions with lower weights.
JaspellLookupFactory
A more complex lookup based on a ternary trie from the JaSpell project. Use this implementation if you need
more sophisticated matching results.
Dictionary Implementations
The dictionary implementations define how terms are stored. There are several options, and multiple dictionaries
can be used in a single request if necessary.
DocumentDictionaryFactory
A dictionary with terms, weights, and an optional payload taken from the index.
This dictionary implementation takes the following parameters in addition to parameters described for the
Suggester generally and for the lookup implementation:
weightField: A field that is stored or a numeric DocValue field. This field is optional.
payloadField: The payloadField should be a field that is stored. This field is optional.
DocumentExpressionDictionaryFactory
This dictionary implementation is the same as the DocumentDictionaryFactory but allows users to specify an
arbitrary expression into the 'weightExpression' tag.
This dictionary implementation takes the following parameters in addition to parameters described for the
Suggester generally and for the lookup implementation:
payloadField: The payloadField should be a field that is stored. This field is optional.
weightExpression: An arbitrary expression used for scoring the suggestions. The fields used must be
numeric fields. This field is required.
HighFrequencyDictionaryFactory
This dictionary implementation allows adding a threshold to prune out less frequent terms in cases where very
common terms may overwhelm other terms.
This dictionary implementation takes one parameter in addition to parameters described for the Suggester
generally and for the lookup implementation:
threshold: A value between zero and one representing the minimum fraction of the total documents where
a term should appear in order to be added to the lookup dictionary.
FileDictionaryFactory
This dictionary implementation allows using an external file that contains suggest entries. Weights and payloads
can also be used.
If using a dictionary file, it should be a plain text file in UTF-8 encoding. Blank lines and lines that start with a '#'
are ignored. You can use both single terms and phrases in the dictionary file. If adding weights or payloads,
those should be separated from terms using the delimiter defined with the fieldDelimiter property (the
default is '\t', the tab representation).
This dictionary implementation takes one parameter in addition to parameters described for the Suggester
generally and for the lookup implementation:
fieldDelimiter: Specify the delimiter to be used separating the entries, weights and payloads. The default is
tab ('\t').
# This is a sample dictionary file.
acquire
accidentally\t2.0
accommodate\t3.0
Multiple Dictionaries
It is possible to include multiple dictionaryImpl definitions in a single SuggestComponent definition.
To do this, simply define separate suggesters, as in this example:
<searchComponent name="suggest" class="solr.SuggestComponent">
<lst name="suggester">
<str name="name">mySuggester</str>
<str name="lookupImpl">FuzzyLookupFactory</str>
<str name="dictionaryImpl">DocumentDictionaryFactory</str>
<str name="field">cat</str>
<str name="weightField">price</str>
<str name="suggestAnalyzerFieldType">string</str>
</lst>
<lst name="suggester">
<str name="name">altSuggester</str>
<str name="dictionaryImpl">DocumentExpressionDictionaryFactory</str>
<str name="lookupImpl">FuzzyLookupFactory</str>
<str name="field">product_name</str>
<str name="weightExpression">((price * 2) + ln(popularity))</str>
<str name="sortField">weight</str>
<str name="sortField">price</str>
<str name="storeDir">suggest_fuzzy_doc_expr_dict</str>
<str name="suggestAnalyzerFieldType">text_en</str>
</lst>
</searchComponent>
When using these Suggesters in a query, you would define multiple 'suggest.dictionary' parameters in the
request, referring to the names given for each Suggester in the search component definition. The response will
include the terms in sections for each Suggester. See the Examples section below for an example request and
response.
Adding the Suggest Request Handler
After adding the search component, a request handler must be added to solrconfig.xml. This request
handler works the same as any other request handler, and allows you to configure default parameters for serving
suggestion requests. The request handler definition must incorporate the "suggest" search component defined
previously.
<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
<lst name="defaults">
<str name="suggest">true</str>
<str name="suggest.count">10</str>
</lst>
<arr name="components">
<str>suggest</str>
</arr>
</requestHandler>
Suggest Request Handler Parameters
The following parameters allow you to set defaults for the Suggest request handler:
Parameter Description
suggest=true This parameter should always be true, because we always want to run the Suggester for
queries submitted to this handler.
suggest.dictionary The name of the dictionary component configured in the search component. This is a
mandatory parameter. It can be set in the request handler, or sent as a parameter at
query time.
suggest.q The query to use for suggestion lookups.
suggest.count Specifies the number of suggestions for Solr to return.
suggest.build If true, it will build the suggester index. This is likely useful only for initial requests; you
would probably not want to build the dictionary on every request, particularly in a
production system. If you would like to keep your dictionary up to date, you should use
the buildOnCommit or buildOnOptimize parameter for the search component.
suggest.reload If true, it will reload the suggester index.
suggest.buildAll If true, it will build all suggester indexes.
suggest.reloadAll If true, it will reload all suggester indexes.
These properties can also be overridden at query time, or not set in the request handler at all and always sent at
query time.
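For SolrJ clients, the same kind of request can be issued programmatically. The following is a minimal sketch, assuming the techproducts example with the /suggest handler and the "mySuggester" dictionary configured as shown earlier; the suggestions are read from the raw response, since the exact SolrJ convenience methods available depend on your client version.
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SuggestExample {
  public static void main(String[] args) throws Exception {
    // Assumes the /suggest handler and "mySuggester" configured as shown above.
    HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/techproducts");
    try {
      SolrQuery query = new SolrQuery();
      query.setRequestHandler("/suggest");
      query.set("suggest", true);
      query.set("suggest.dictionary", "mySuggester");
      query.set("suggest.q", "elec");
      // Trigger a manual build only when actually needed (e.g., after a batch of updates),
      // rather than relying on buildOnCommit for frequently changing indexes.
      query.set("suggest.build", true);

      QueryResponse rsp = client.query(query);
      // The suggestions are returned under the "suggest" section of the response.
      System.out.println(rsp.getResponse().get("suggest"));
    } finally {
      client.close();
    }
  }
}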
Example Usages
Get Suggestions with Weights
This is the basic suggestion using a single dictionary and a single Solr core.
Example query:
http://localhost:8983/solr/techproducts/suggest?suggest=true&suggest.build=true&sugg
est.dictionary=mySuggester&wt=json&suggest.q=elec
In this example, we've simply requested the string 'elec' with the suggest.q parameter and requested that the
suggestion dictionary be built with suggest.build (note, however, that you would likely not want to build the index
on every query - instead you should use buildOnCommit or buildOnOptimize if you have regularly changing
documents).
Example response:
{
"responseHeader": {
"status": 0,
"QTime": 35
},
"command": "build",
"suggest": {
"mySuggester": {
"elec": {
"numFound": 3,
"suggestions": [
{
"term": "electronics and computer1",
"weight": 2199,
"payload": ""
},
{
"term": "electronics",
"weight": 649,
"payload": ""
},
{
"term": "electronics and stuff2",
"weight": 279,
"payload": ""
}
]
}
}
}
}
Multiple Dictionaries
If you have defined multiple dictionaries, you can use them in queries.
Example query:
http://localhost:8983/solr/techproducts/suggest?suggest=true&suggest.dictionary=mySu
ggester&suggest.dictionary=altSuggester&wt=json&suggest.q=elec
In this example we have sent the string 'elec' as the suggest.q parameter and named two suggest.dictionary
definitions to be used.
Example response:
{
"responseHeader": {
"status": 0,
"QTime": 3
},
"suggest": {
"mySuggester": {
"elec": {
"numFound": 1,
"suggestions": [
{
"term": "electronics and computer1",
"weight": 100,
"payload": ""
}
]
}
},
"altSuggester": {
"elec": {
"numFound": 1,
"suggestions": [
{
"term": "electronics and computer1",
"weight": 10,
"payload": ""
}
]
}
}
}
}
MoreLikeThis
The MoreLikeThis search component enables users to query for documents similar to a document in their
result list. It does this by using terms from the original document to find similar documents in the index.
There are three ways to use MoreLikeThis. The first, and most common, is to use it as a request handler. In this
case, you would send text to the MoreLikeThis request handler as needed (as in when a user clicked on a
"similar documents" link). The second is to use it as a search component. This is less desirable since it performs
the MoreLikeThis analysis on every document returned. This may slow search results. The final approach is to
use it as a request handler but with externally supplied text. This case, also referred to as the
MoreLikeThisHandler, will supply information about similar documents in the index based on the text of the input
document.
Covered in this section:
How MoreLikeThis Works
Common Parameters for MoreLikeThis
Parameters for the MoreLikeThisComponent
Parameters for the MoreLikeThisHandler
Related Topics
How MoreLikeThis Works
MoreLikeThis constructs a Lucene query based on terms in a document. It does this by pulling terms from the
defined list of fields (see the mlt.fl parameter, below). For best results, the fields should have stored term
vectors in schema.xml. For example:
<field name="cat" ... termVectors="true" />
If term vectors are not stored, MoreLikeThis will generate terms from stored fields. A uniqueKey must also
be stored in order for MoreLikeThis to work properly.
The next phase filters terms from the original document using thresholds defined with the MoreLikeThis
parameters. Finally, a query is run with these terms, and any other query parameters that have been defined
(see the mlt.qf parameter, below), and a new document set is returned.
Common Parameters for MoreLikeThis
The table below summarizes the MoreLikeThis parameters supported by Lucene/Solr. These parameters can
be used with any of the three possible MoreLikeThis approaches.
Parameter Description
mlt.fl Specifies the fields to use for similarity. If possible, these should have stored termVectors.
mlt.mintf Specifies the Minimum Term Frequency, the frequency below which terms will be ignored in the
source document.
mlt.mindf Specifies the Minimum Document Frequency, the frequency at which words will be ignored
which do not occur in at least this many documents.
mlt.maxdf Specifies the Maximum Document Frequency, the frequency at which words will be ignored
which occur in more than this many documents.
mlt.minwl Sets the minimum word length below which words will be ignored.
mlt.maxwl Sets the maximum word length above which words will be ignored.
mlt.maxqt Sets the maximum number of query terms that will be included in any generated query.
mlt.maxntp Sets the maximum number of tokens to parse in each example document field that is not stored
with TermVector support.
mlt.boost Specifies if the query will be boosted by the interesting term relevance. It can be either "true" or
"false".
mlt.qf Query fields and their boosts using the same format as that used by the DisMaxRequestHandler.
These fields must also be specified in mlt.fl.
Parameters for the MoreLikeThisComponent
Using MoreLikeThis as a search component returns similar documents for each document in the response set. In
addition to the common parameters, these additional options are available:
Parameter Description
mlt If set to true, activates the MoreLikeThis component and enables Solr to return MoreLikeThis
results.
mlt.count Specifies the number of similar documents to be returned for each result. The default value is 5.
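As a rough illustration of the component approach, the following SolrJ sketch enables MoreLikeThis on an ordinary query against the techproducts example. The field names and thresholds are assumptions chosen for illustration, and the "moreLikeThis" section is read from the raw response rather than through any version-specific convenience method.
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.util.NamedList;

public class MltComponentExample {
  public static void main(String[] args) throws Exception {
    HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/techproducts");
    try {
      SolrQuery query = new SolrQuery("memory");
      query.set("mlt", true);           // enable the MoreLikeThis search component
      query.set("mlt.fl", "cat,name");  // fields used to find "interesting" terms (assumed fields)
      query.set("mlt.mintf", 1);        // minimum term frequency
      query.set("mlt.mindf", 1);        // minimum document frequency
      query.set("mlt.count", 3);        // similar docs per result
      query.setFields("id", "name");

      QueryResponse rsp = client.query(query);
      // Similar documents come back in the "moreLikeThis" section, keyed by document id.
      NamedList<?> mlt = (NamedList<?>) rsp.getResponse().get("moreLikeThis");
      System.out.println(mlt);
    } finally {
      client.close();
    }
  }
}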
Parameters for the MoreLikeThisHandler
The table below summarizes parameters accessible through the MoreLikeThisHandler. It supports faceting,
paging, and filtering using common query parameters, but does not work well with alternate query parsers.
Parameter Description
mlt.match.include Specifies whether or not the response should include the matched document. If set to
false, the response will look like a normal select response.
mlt.match.offset Specifies an offset into the main query search results to locate the document on which
the MoreLikeThis query should operate. By default, the query operates on the first
result for the q parameter.
mlt.interestingTerms Controls how the MoreLikeThis component presents the "interesting" terms (the top
TF/IDF terms) for the query. Supports three settings. The setting list lists the terms. The
setting none lists no terms. The setting details lists the terms along with the boost value
used for each term. Unless mlt.boost=true, all terms will have boost=1.0.
Related Topics
RequestHandlers and SearchComponents in SolrConfig
Pagination of Results
Basic Pagination
In most search application usage, the "top" matching results (sorted by score, or some other criteria) are then
displayed to some human user. In many applications the UI for these sorted results is displayed to the user in
"pages" containing a fixed number of matching results, and users don't typically look at results past the first few
pages worth of results.
In Solr, this basic paginated searching is supported using the start and rows parameters, and performance of
this common behaviour can be tuned by utilizing the queryResultCache and adjusting the queryResultWindowSize
configuration options based on your expected page sizes.
Basic Pagination Examples
The easiest way to think about simple pagination is to simply multiply the page number you want (treating the
"first" page number as "0") by the number of rows per page, such as in the following pseudo-code:
function fetch_solr_page($page_number, $rows_per_page) {
$start = $page_number * $rows_per_page
$params = [ q = $some_query, rows = $rows_per_page, start = $start ]
return fetch_solr($params)
}
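The same logic in SolrJ might look like the following minimal sketch; the collection name, page number, and page size are assumptions chosen for illustration.
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocumentList;

public class PagingExample {
  public static void main(String[] args) throws Exception {
    HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/techproducts");
    try {
      int rowsPerPage = 10;
      int pageNumber = 2;               // the "first" page is page 0

      SolrQuery query = new SolrQuery("*:*");
      query.setRows(rowsPerPage);
      query.setStart(pageNumber * rowsPerPage);   // start = page * rows

      SolrDocumentList page = client.query(query).getResults();
      System.out.println("Returning docs starting at offset " + page.getStart()
          + " of " + page.getNumFound() + " total matches");
    } finally {
      client.close();
    }
  }
}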
How Basic Pagination is Affected by Index Updates
The start param specified in a request to Solr indicates an absolute "offset" in the complete sorted list of
matches that the client wants Solr to use as the beginning of the current "page". If an index modification (such
as adding or removing documents) which affects the sequence of ordered documents matching a query occurs in
between two requests from a client for subsequent pages of results, then it is possible that these modifications
can result in the same document being returned on multiple pages, or documents being "skipped" as the result
set shrinks or grows.
For example: consider an index containing 26 documents like so:
id name
1 A
2 B
...
26 Z
Followed by the following requests & index modifications interleaved:
A client requests q=*:*&rows=5&start=0&sort=name asc
Documents with the ids 1-5 will be returned to the client
Document id 3 is deleted
The client requests "page #2" using q=*:*&rows=5&start=5&sort=name asc
Documents 7-11 will be returned
Document 6 has been skipped, since it is now the 5th document in the sorted set of all matching
results – it would be returned on a new request for "page #1"
3 new documents are now added with the ids 90, 91, and 92; all three documents have a name of A
The client requests "page #3" using q=*:*&rows=5&start=10&sort=name asc
Documents 9-13 will be returned
Documents 9, 10, and 11 have now been returned on both page #2 and page #3 since they moved
farther back in the list of sorted results
In typical situations these impacts from index changes on paginated searching don't significantly affect user
experience -- either because they happen extremely infrequently in fairly static collections, or because the users
recognize that the collection of data is constantly evolving and expect to see documents shift up and down in the
result sets.
Performance Problems with "Deep Paging"
In some situations, the results of a Solr search are not destined for a simple paginated user interface. When you
wish to fetch a very large number of sorted results from Solr to feed into an external system, using very large
values for the start or rows parameters can be very inefficient. Pagination using start and rows not only
requires Solr to compute (and sort) in memory all of the matching documents that should be fetched for the
current page, but also all of the documents that would have appeared on previous pages. So while a request for
start=0&rows=1000000 may be obviously inefficient because it requires Solr to maintain & sort in memory a
set of 1 million documents, a request for start=999000&rows=1000 is likewise equally inefficient for the same
reasons. Solr can't compute which matching document is the 999001st result in sorted order without first
determining what the first 999000 matching sorted results are.
Fetching A Large Number of Sorted Results: Cursors
As an alternative to increasing the "start" parameter to request subsequent pages of sorted results, Solr supports
using a "Cursor" to scan through results. Cursors in Solr are a logical concept, that doesn't involve caching any
state information on the server. Instead the sort values of the last document returned to the client are used to
compute a "mark" representing a logical point in the ordered space of sort values. That "mark" can be specified
in the parameters of subsequent requests to tell Solr where to continue.
Using Cursors
To use a cursor with Solr, specify a cursorMark parameter with the value of "*". You can think of this being
analogous to start=0 as a way to tell Solr "start at the beginning of my sorted results", except that it also
informs Solr that you want to use a Cursor. So in addition to returning the top N sorted results (where you can
control N using the rows parameter), the Solr response will also include an encoded String named
nextCursorMark. You then take the nextCursorMark String value from the response, and pass it back to Solr
as the cursorMark parameter for your next request. You can repeat this process until you've fetched as many
docs as you want, or until the nextCursorMark returned matches the cursorMark you've already specified --
indicating that there are no more results.
Constraints when using Cursors
There are a few important constraints to be aware of when using the cursorMark parameter in a Solr request:
cursorMark and start are mutually exclusive parameters
Your requests must either not include a start parameter, or it must be specified with a value of
"0".
sort clauses must include the uniqueKey field (either "asc" or "desc")
If id is your uniqueKey field, then sort params like "id asc" and "name asc, id desc" would both
work fine, but "name asc" by itself would not
Cursor mark values are computed based on the sort values of each document in the result, which means
multiple documents with identical sort values will produce identical Cursor mark values if one of them is the last
document on a page of results. In that situation, the subsequent request using that cursorMark would not
know which of the documents with the identical mark values should be skipped. Requiring that the uniqueKey
field be used as a clause in the sort criteria guarantees that a deterministic ordering will be returned, and that
every cursorMark value will identify a unique point in the sequence of documents.
Cursor Examples
Fetch All Docs
The pseudo-code shown here shows the basic logic involved in fetching all documents matching a query using a
cursor:
// when fetching all docs, you might as well use a simple id sort
// unless you really need the docs to come back in a specific order
$params = [ q => $some_query, sort => 'id asc', rows => $r, cursorMark => '*' ]
$done = false
while (not $done) {
$results = fetch_solr($params)
// do something with $results
if ($params[cursorMark] == $results[nextCursorMark]) {
$done = true
}
$params[cursorMark] = $results[nextCursorMark]
}
Using SolrJ, this pseudo-code would be:
SolrQuery q = (new SolrQuery(some_query)).setRows(r).setSort(SortClause.asc("id"));
String cursorMark = CursorMarkParams.CURSOR_MARK_START;
boolean done = false;
while (! done) {
q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
QueryResponse rsp = solrServer.query(q);
String nextCursorMark = rsp.getNextCursorMark();
doCustomProcessingOfResults(rsp);
if (cursorMark.equals(nextCursorMark)) {
done = true;
}
cursorMark = nextCursorMark;
}
If you wanted to do this by hand using curl, the sequence of requests would look something like this:
$ curl '...&rows=10&sort=id+asc&cursorMark=*'
{
"response":{"numFound":32,"start":0,"docs":[
// ... 10 docs here ...
]},
"nextCursorMark":"AoEjR0JQ"}
$ curl '...&rows=10&sort=id+asc&cursorMark=AoEjR0JQ'
{
"response":{"numFound":32,"start":0,"docs":[
// ... 10 more docs here ...
]},
"nextCursorMark":"AoEpVkRCREIxQTE2"}
$ curl '...&rows=10&sort=id+asc&cursorMark=AoEpVkRCREIxQTE2'
{
"response":{"numFound":32,"start":0,"docs":[
// ... 10 more docs here ...
]},
"nextCursorMark":"AoEmbWF4dG9y"}
$ curl '...&rows=10&sort=id+asc&cursorMark=AoEmbWF4dG9y'
{
"response":{"numFound":32,"start":0,"docs":[
// ... 2 docs here because we've reached the end.
]},
"nextCursorMark":"AoEpdmlld3Nvbmlj"}
$ curl '...&rows=10&sort=id+asc&cursorMark=AoEpdmlld3Nvbmlj'
{
"response":{"numFound":32,"start":0,"docs":[
// no more docs here, and note that the nextCursorMark
// matches the cursorMark param we used
]},
"nextCursorMark":"AoEpdmlld3Nvbmlj"}
Fetch first N docs, Based on Post Processing
Since the cursor is stateless from Solr's perspective, your client code can stop fetching additional results as soon
as you have decided you have enough information:
while (! done) {
q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
QueryResponse rsp = solrServer.query(q);
String nextCursorMark = rsp.getNextCursorMark();
boolean hadEnough = doCustomProcessingOfResults(rsp);
if (hadEnough || cursorMark.equals(nextCursorMark)) {
done = true;
}
cursorMark = nextCursorMark;
}
How cursors are Affected by Index Updates
Unlike basic pagination, Cursor pagination does not rely on using an absolute "offset" into the complete sorted
list of matching documents. Instead, the cursorMark specified in a request encapsulates information about
the relative position of the last document returned, based on the absolute sort values of that document. This
means that the impact of index modifications is much smaller when using a cursor compared to basic pagination.
Consider the same example index described when discussing basic pagination:
id name
1 A
2 B
...
26 Z
A client requests q=*:*&rows=5&start=0&sort=name asc, id asc&cursorMark=*
Documents with the ids 1-5 will be returned to the client in order
Document id 3 is deleted
The client requests 5 more documents using the nextCursorMark from the previous response
Documents 6-10 will be returned -- the deletion of a document that's already been returned doesn't
affect the relative position of the cursor
3 new documents are now added with the ids 90, 91, and 92; all three documents have a name of A
The client requests 5 more documents using the nextCursorMark from the previous response
Documents 11-15 will be returned -- the addition of new documents with sort values already past
does not affect the relative position of the cursor
Document id 1 is updated to change its 'name' to Q
Document id 17 is updated to change its 'name' to A
The client requests 5 more documents using the nextCursorMark from the previous response
The resulting documents are 16, 1, 18, 19, 20 in that order
Because the sort value of document 1 changed so that it is after the cursor position, the document
is returned to the client twice
Because the sort value of document 17 changed so that it is before the cursor position, the
document has been "skipped" and will not be returned to the client as the cursor continues to
progress
In a nutshell: When fetching all results matching a query using cursorMark, the only way index modifications
can result in a document being skipped, or returned twice, is if the sort value of the document changes.
"Tailing" a Cursor
Because Cursor requests are stateless, and the cursorMark values encapsulate the absolute sort values of
the last document returned from a search, it's possible to "continue" fetching additional results from a cursor that
has already reached its end -- if new documents are added (or existing documents are updated) to the end of the
results. You can think of this as similar to using something like "tail -f" in Unix.
The most common examples of how this can be useful is when you have a "timestamp" field recording when a
document has been added/updated in your index. Client applications can continuously poll a cursor using a
sort=timestamp asc, id asc for documents matching a query, and always be notified when a document is
added or updated matching the request criteria. Another common example is when you have uniqueKey values
that always increase as new documents are created, and you can continuously poll a cursor using sort=id
asc to be notified about new documents.
One way to ensure that a document will never be returned more than once is to use the uniqueKey field
as the primary (and therefore: only significant) sort criterion.
In this situation, you will be guaranteed that each document is only returned once, no matter how it may
be modified during the use of the cursor.
The pseudo-code for tailing a cursor is only a slight modification from our earlier example for processing all docs
matching a query:
while (true) {
$doneForNow = false
while (not $doneForNow) {
$results = fetch_solr($params)
// do something with $results
if ($params[cursorMark] == $results[nextCursorMark]) {
$doneForNow = true
}
$params[cursorMark] = $results[nextCursorMark]
}
sleep($some_configured_delay)
}
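A SolrJ version of this tailing loop, modeled on the earlier cursor example, might look like the following sketch; the collection name, row count, and polling delay are assumptions chosen for illustration.
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrQuery.SortClause;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.CursorMarkParams;

public class TailCursorExample {
  public static void main(String[] args) throws Exception {
    // Runs until interrupted; the client is intentionally kept open for the life of the loop.
    HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/techproducts");
    // Sorting on the uniqueKey alone works when new documents always get increasing ids.
    SolrQuery q = new SolrQuery("*:*");
    q.setRows(100);
    q.setSort(SortClause.asc("id"));

    String cursorMark = CursorMarkParams.CURSOR_MARK_START;
    while (true) {
      boolean doneForNow = false;
      while (!doneForNow) {
        q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
        QueryResponse rsp = client.query(q);
        String nextCursorMark = rsp.getNextCursorMark();
        // process rsp.getResults() here
        if (cursorMark.equals(nextCursorMark)) {
          doneForNow = true;            // caught up; wait for new documents
        }
        cursorMark = nextCursorMark;
      }
      Thread.sleep(10000);              // some configured delay before polling again
    }
  }
}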
Result Grouping
Result Grouping groups documents with a common field value into groups and returns the top documents for
each group. For example, if you searched for "DVD" on an electronic retailer's e-commerce site, you might be
returned three categories such as "TV and Video," "Movies," and "Computers," with three results per category. In
this case, the query term "DVD" appeared in all three categories, so Solr groups them together in order to
increase relevancy for the user.
Result Grouping is separate from Faceting. Though it is conceptually similar, faceting returns all relevant results
and allows the user to refine the results based on the facet category. For example, if you searched for "shoes" on
a footwear retailer's e-commerce site, you would be returned all results for that query term, along with selectable
facets such as "size," "color," "brand," and so on.
You can, however, combine grouping with faceting: Grouped faceting supports facet.field and facet.range
but currently doesn't support date and pivot faceting. The facet counts are computed based on the first
group.field parameter, and other group.field parameters are ignored.
Grouped faceting differs from non-grouped faceting, where (sum of all facets) == (total of products with that
property), as shown in the following example:
Object 1
name: Phaser 4620a
ppm: 62
product_range: 6
Object 2
name: Phaser 4620i
ppm: 65
product_range: 6
Object 3
name: ML6512
ppm: 62
product_range: 7
If you ask Solr to group these documents by "product_range", then the total number of groups is 2, but the facets
for ppm are 2 for 62 and 1 for 65.
Request Parameters
Result Grouping takes the following request parameters. Any number of these request parameters can be
included in a single request:
Parameter Type Description
group Boolean If true, query results will be grouped.
group.field string The name of the field by which to group results. The field must be
single-valued, and either be indexed or a field type that has a value
source and works in a function query, such as ExternalFileField.
It must also be a string-based field, such as StrField or TextField.
group.func query Group based on the unique values of a function query.
NOTE: This option does not work with distributed searches.
group.query query Return a single group of documents that match the given query.
rows integer The number of groups to return. The default value is 10.
start integer Specifies an initial offset for the list of groups.
group.limit integer Specifies the number of results to return for each group. The default
value is 1.
group.offset integer Specifies an initial offset for the document list of each group.
sort sortspec Specifies how Solr sorts the groups relative to each other. For
example, sort=popularity desc will cause the groups to be
sorted according to the highest popularity document in each group.
The default value is score desc.
group.sort sortspec Specifies how Solr sorts documents within a single group. The
default value is score desc.
group.format grouped/simple If this parameter is set to simple, the grouped documents are
presented in a single flat list, and the start and rows parameters
affect the numbers of documents instead of groups.
group.main Boolean If true, the result of the first field grouping command is used as the
main result list in the response, using group.format=simple.
group.ngroups Boolean If true, Solr includes the number of groups that have matched the
query in the results. The default value is false.
See Distributed Result Grouping Caveats below for when using
sharded indexes
group.truncate Boolean If true, facet counts are based on the most relevant document of
each group matching the query. The default value is false.
group.facet Boolean Determines whether to compute grouped facets for the field facets
specified in facet.field parameters. Grouped facets are computed
based on the first specified group. As with normal field faceting, fields
shouldn't be tokenized (otherwise counts are computed for each
token). Grouped faceting supports single and multivalued fields.
Default is false.
See Distributed Result Grouping Caveats below for when using
sharded indexes
group.cache.percent integer
between 0 and
100
Setting this parameter to a number greater than 0 enables caching
for result grouping. Result Grouping executes two searches; this
option caches the second search. The default value is 0. Testing has
shown that group caching only improves search time with Boolean,
wildcard, and fuzzy queries. For simple queries like term or "match
all" queries, group caching degrades performance.
Any number of group commands (group.field, group.func, group.query) may be specified in a single
request.
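As a rough SolrJ illustration of a grouped request, the following sketch groups the techproducts example by manu_exact and walks the grouped response; the query string and group.limit value are assumptions chosen for illustration.
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.Group;
import org.apache.solr.client.solrj.response.GroupCommand;
import org.apache.solr.client.solrj.response.QueryResponse;

public class GroupingExample {
  public static void main(String[] args) throws Exception {
    HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/techproducts");
    try {
      SolrQuery query = new SolrQuery("solr memory");
      query.set("group", true);
      query.set("group.field", "manu_exact");   // group by manufacturer
      query.set("group.limit", 2);               // top 2 docs per group
      query.setFields("id", "name");

      QueryResponse rsp = client.query(query);
      // Walk each grouping command, then each group and its document list.
      for (GroupCommand command : rsp.getGroupResponse().getValues()) {
        System.out.println(command.getName() + ": " + command.getMatches() + " total matches");
        for (Group group : command.getValues()) {
          System.out.println("  " + group.getGroupValue() + " -> "
              + group.getResult().getNumFound() + " docs in group");
        }
      }
    } finally {
      client.close();
    }
  }
}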
Examples
All of the following sample queries work with Solr's "bin/solr -e techproducts" example.
Grouping Results by Field
In this example, we will group results based on the manu_exact field, which specifies the manufacturer of the
items in the sample dataset.
http://localhost:8983/solr/techproducts/select?wt=json&indent=true&fl=id,name&q=so
lr+memory&group=true&group.field=manu_exact
{
...
"grouped":{
"manu_exact":{
"matches":6,
"groups":[{
"groupValue":"Apache Software Foundation",
"doclist":{"numFound":1,"start":0,"docs":[
{
"id":"SOLR1000",
"name":"Solr, the Enterprise Search Server"}]
}},
{
"groupValue":"Corsair Microsystems Inc.",
"doclist":{"numFound":2,"start":0,"docs":[
{
"id":"VS1GB400C3",
"name":"CORSAIR ValueSelect 1GB 184-Pin DDR SDRAM Unbuffered DDR 400
(PC 3200) System Memory - Retail"}]
}},
{
"groupValue":"A-DATA Technology Inc.",
"doclist":{"numFound":1,"start":0,"docs":[
{
"id":"VDBDB1A16",
"name":"A-DATA V-Series 1GB 184-Pin DDR SDRAM Unbuffered DDR 400 (PC
3200) System Memory - OEM"}]
}},
{
"groupValue":"Canon Inc.",
"doclist":{"numFound":1,"start":0,"docs":[
{
"id":"0579B002",
"name":"Canon PIXMA MP500 All-In-One Photo Printer"}]
}},
{
"groupValue":"ASUS Computer Inc.",
"doclist":{"numFound":1,"start":0,"docs":[
{
"id":"EN7800GTX/2DHTV/256M",
"name":"ASUS Extreme N7800GTX/2DHTV (256 MB)"}]
}
}
]
}
}
The response indicates that there are six total matches for our query. For each unique value of group.field,
Solr returns a docList with the top scoring document. The docList also includes the total number of matches
in that group as the numFound value. The groups are sorted by the score of the top document within each group.
We can run the same query with the request parameter group.main=true. This will format the results as a
single flat document list. This flat format does not include as much information as the normal result grouping
query results, but it may be easier for existing Solr clients to parse.
http://localhost:8983/solr/techproducts/select?wt=json&indent=true&fl=id,name,manu
facturer&q=solr+memory&group=true&group.field=manu_exact&group.main=true
{
"responseHeader":{
"status":0,
"QTime":1,
"params":{
"fl":"id,name,manufacturer",
"indent":"true",
"q":"solr memory",
"group.field":"manu_exact",
"group.main":"true",
"group":"true",
"wt":"json"}},
"grouped":{},
"response":{"numFound":6,"start":0,"docs":[
{
"id":"SOLR1000",
"name":"Solr, the Enterprise Search Server"},
{
"id":"VS1GB400C3",
"name":"CORSAIR ValueSelect 1GB 184-Pin DDR SDRAM Unbuffered DDR 400 (PC
3200) System Memory - Retail"},
{
"id":"VDBDB1A16",
"name":"A-DATA V-Series 1GB 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200)
System Memory - OEM"},
{
"id":"0579B002",
"name":"Canon PIXMA MP500 All-In-One Photo Printer"},
{
"id":"EN7800GTX/2DHTV/256M",
"name":"ASUS Extreme N7800GTX/2DHTV (256 MB)"}]
}
}
Grouping by Query
In this example, we will use the group.query parameter to find the top three results for "memory" in two
different price ranges: 0.00 to 99.99, and over 100.
http://localhost:8983/solr/techproducts/select?wt=json&indent=true&fl=name,price&q
=memory&group=true&group.query=price:[0+TO+99.99]&group.query=price:[100+TO+*]&gro
up.limit=3
{
"responseHeader":{
"status":0,
"QTime":42,
"params":{
"fl":"name,price",
"indent":"true",
"q":"memory",
"group.limit":"3",
"group.query":["price:[0 TO 99.99]",
"price:[100 TO *]"],
"group":"true",
"wt":"json"}},
"grouped":{
"price:[0 TO 99.99]":{
"matches":5,
"doclist":{"numFound":1,"start":0,"docs":[
{
"name":"CORSAIR ValueSelect 1GB 184-Pin DDR SDRAM Unbuffered DDR 400 (PC
3200) System Memory - Retail",
"price":74.99}]
}},
"price:[100 TO *]":{
"matches":5,
"doclist":{"numFound":3,"start":0,"docs":[
{
"name":"CORSAIR XMS 2GB (2 x 1GB) 184-Pin DDR SDRAM Unbuffered DDR 400
(PC 3200) Dual Channel
Kit System Memory - Retail",
"price":185.0},
{
"name":"Canon PIXMA MP500 All-In-One Photo Printer",
"price":179.99},
{
"name":"ASUS Extreme N7800GTX/2DHTV (256 MB)",
"price":479.95}]
}
}
}
}
In this case, Solr found five matches for "memory," but only returns four results grouped by price. This is
because one result for "memory" did not have a price assigned to it.
Distributed Result Grouping Caveats
Grouping is supported for distributed searches, with some caveats:
Currently group.func is not supported in any distributed searches
group.ngroups and group.facet require that all documents in each group must be co-located on the
same shard in order for accurate counts to be returned. Document routing via composite keys can be a
useful solution in many situations.
Collapse and Expand Results
The collapsing query parser and the expand component combine to form an approach to grouping documents for
field collapsing in search results.
Collapsing Query Parser
The CollapsingQParser is really a post filter that provides more performant field collapsing than Solr's
standard approach when the number of distinct groups in the result set is high. This parser collapses the result
set to a single document per group before it forwards the result set to the rest of the search components. So all
downstream components (faceting, highlighting, etc...) will work with the collapsed result set.
The CollapsingQParser accepts the following local parameters:
Parameter Description Default
field The field that is being collapsed on. The field must be a single valued String, Int or
Float
none
min | max Selects the group heads by the min or max value of a numeric field or function query.
If not specified group heads will be selected based on the highest scoring documents
in the result set.
none
nullPolicy There are three null policies:
ignore: removes documents with a null value in the collapse field. This is the
default.
expand: treats each document with a null value in the collapse field as a separate
group.
collapse: collapses all documents with a null value into a single group using either
highest score, or minimum/maximum.
ignore
hint Currently there is only one hint available "top_fc", which stands for top level
FieldCache. The top_fc hint is only available when collapsing on String fields. top_fc
provides the best query time speed but takes the longest to warm on startup or
following a commit. top_fc also will result in having the collapsed field cached in
memory twice if it's used for faceting or sorting.
none
size Sets the initial size of the collapse data structures when collapsing on a numeric field
only. The data structures used for collapsing grow dynamically when collapsing on
numeric fields. Setting the size above the number of results expected in the result set
will eliminate the resizing cost.
100,000
Sample Syntax:
Collapse based on the highest scoring document:
fq={!collapse field=field_name}
Collapse based on the minimum value of a numeric field:
fq={!collapse field=field_name min=field_name}
Collapse based on the maximum value of a numeric field:
fq={!collapse field=field_name max=field_name}
Collapse based on the min/max value of a function. The cscore() function can be used with the
CollapsingQParserPlugin to return the score of the current document being collapsed.
fq={!collapse field=field_name max=sum(cscore(),field(A))}
Collapse with a null policy:
fq={!collapse field=field_name nullPolicy=nullPolicy}
Collapse with a hint:
fq={!collapse field=field_name hint=top_fc}
The CollapsingQParserPlugin fully supports the QueryElevationComponent.
Expand Component
The ExpandComponent can be used to expand the groups that were collapsed by the CollapsingQParserPlugin.
Example usage with the CollapsingQParserPlugin:
q=foo&fq={!collapse field=ISBN}
In the query above, the CollapsingQParserPlugin will collapse the search results on the ISBN field. The main
search results will contain the highest ranking document from each book.
The ExpandComponent can now be used to expand the results so you can see the documents grouped by ISBN.
For example:
q=foo&fq={!collapse field=ISBN}&expand=true
The “expand=true” parameter turns on the ExpandComponent. The ExpandComponent adds a new section to
the search output labeled “expanded”.
Inside the expanded section there is a map with each group head pointing to the expanded documents that are
within the group. As applications iterate the main collapsed result set, they can access the expanded map to
retrieve the expanded groups.
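As a rough SolrJ sketch of combining the collapsing query parser with the ExpandComponent, the following example assumes a hypothetical "books" collection with an ISBN field, and that your SolrJ version exposes QueryResponse.getExpandedResults(); verify both against your own setup.
import java.util.Map;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class CollapseExpandExample {
  public static void main(String[] args) throws Exception {
    // "books" core and "ISBN" field are hypothetical, matching the example above.
    HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/books");
    try {
      SolrQuery query = new SolrQuery("foo");
      query.addFilterQuery("{!collapse field=ISBN}");  // one document per ISBN in the main results
      query.set("expand", true);                       // ask for the "expanded" section
      query.set("expand.rows", 5);

      QueryResponse rsp = client.query(query);
      SolrDocumentList collapsed = rsp.getResults();
      // Maps each group head's collapse value to the other documents in its group.
      Map<String, SolrDocumentList> expanded = rsp.getExpandedResults();
      for (SolrDocument head : collapsed) {
        String isbn = (String) head.getFieldValue("ISBN");
        SolrDocumentList groupDocs = expanded.get(isbn);
        System.out.println(isbn + " -> "
            + (groupDocs == null ? 0 : groupDocs.size()) + " expanded docs");
      }
    } finally {
      client.close();
    }
  }
}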
The ExpandComponent has the following parameters:
Parameter Description Default
expand.sort Orders the documents within the expanded groups score
desc
expand.rows The number of rows to display in each group 5
expand.q Overrides the main q parameter, determines which documents to include in the
main group.
main q
expand.fq Overrides main fq's, determines which documents to include in the main group. main fq's
Result Clustering
The clustering (or cluster analysis) plugin attempts to automatically discover groups of related search hits
(documents) and assign human-readable labels to these groups. By default in Solr, the clustering algorithm is
applied to the search result of each single query—this is called on-line clustering. While Solr contains an
extension for full-index clustering (off-line clustering), this section will focus on discussing on-line clustering
only.
Clusters discovered for a given query can be perceived as dynamic facets. This is beneficial when regular
faceting is difficult (field values are not known in advance) or when the queries are exploratory in nature. Take a
look at the Carrot2 project's demo page to see an example of search results clustering in action (the groups in
the visualization have been discovered automatically in the search results to the right; there is no external
information involved).
The query issued to the system was Solr. It seems clear that faceting could not yield a similar set of groups,
although the goals of both techniques are similar—to let the user explore the set of search results and either
rephrase the query or narrow the focus to a subset of current documents. Clustering is also similar to Result
Grouping in that it can help to look deeper into search results, beyond the top few hits.
Topics covered in this section:
Preliminary Concepts
Quick Start Example
Installation
Configuration
Tweaking Algorithm Settings
Performance Considerations
Additional Resources
Preliminary Concepts
Each document passed to the clustering component is composed of several logical parts:
a unique identifier,
origin URL,
the title,
the main content,
a language code of the title and content.
The identifier part is mandatory, everything else is optional but at least one of the text fields (title or content) will
be required to make the clustering process reasonable. It is important to remember that logical document parts
must be mapped to a particular schema and its fields. The content (text) for clustering can be sourced from either
a stored text field or context-filtered using a highlighter; all these options are explained below in the configuration
section.
A clustering algorithm is the actual logic (implementation) that discovers relationships among the documents in
the search result and forms human-readable cluster labels. Depending on the choice of the algorithm the clusters
may (and probably will) vary. Solr comes with several algorithms implemented in the open source Carrot2
project; commercial alternatives also exist.
Quick Start Example
The " " example included with Solr is pre-configured with all the necessary components for resulttechproducts
clustering - but they are disabled by default.
To enable the clustering component contrib and a dedicated search handler configured to use it, specify the " " -a
option to set a JVM System Property when running the example:
bin/solr start -e techproducts -a '-Dsolr.clustering.enabled=true'
You can now try out the clustering handler by opening the following URL in a browser:
http://localhost:8983/solr/techproducts/clustering?q=*:*&rows=100
The output XML should include search hits and an array of automatically discovered clusters at the end,
resembling the output shown here:
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">299</int>
</lst>
<result name="response" numFound="32" start="0" maxScore="1.0">
<doc>
<str name="id">GB18030TEST</str>
<str name="name">Test with some GB18030 encoded characters</str>
<arr name="features">
<str>No accents here</str>
<str></str>
<str>This is a feature (translated)</str>
<str></str>
<str>This document is very shiny (translated)</str>
</arr>
<float name="price">0.0</float>
<str name="price_c">0,USD</str>
<bool name="inStock">true</bool>
<long name="_version_">1448955395025403904</long>
<float name="score">1.0</float>
</doc>
<!-- more search hits, omitted -->
</result>
<arr name="clusters">
<lst>
<arr name="labels">
<str>DDR</str>
</arr>
<double name="score">3.9599865057283354</double>
<arr name="docs">
<str>TWINX2048-3200PRO</str>
<str>VS1GB400C3</str>
<str>VDBDB1A16</str>
</arr>
</lst>
<lst>
<arr name="labels">
<str>iPod</str>
</arr>
<double name="score">11.959228467119022</double>
<arr name="docs">
<str>F8V7067-APL-KIT</str>
<str>IW-02</str>
<str>MA147LL/A</str>
</arr>
</lst>
<!-- More clusters here, omitted. -->
<lst>
<arr name="labels">
<str>Other Topics</str>
</arr>
<double name="score">0.0</double>
<bool name="other-topics">true</bool>
<arr name="docs">
<str>adata</str>
<str>apple</str>
<str>asus</str>
<str>ati</str>
<!-- other unassigned document IDs here -->
</arr>
</lst>
</arr>
</response>
There were a few clusters discovered for this query (*:*), separating search hits into various categories: DDR,
iPod, Hard Drive, etc. Each cluster has a label and score that indicates the "goodness" of the cluster. The score
is algorithm-specific and is meaningful only in relation to the scores of other clusters in the same set. In other
words, if cluster A has a higher score than cluster B, cluster A should be of better quality (have a better label
and/or more coherent document set). Each cluster has an array of identifiers of documents belonging to it. These
identifiers correspond to the uniqueKey field declared in the schema.
Depending on the quality of input documents, some clusters may not make much sense. Some documents may
be left out and not be clustered at all; these will be assigned to the synthetic Other Topics group, marked with the
other-topics property set to true (see the XML dump above for an example). The score of the other topics
group is zero.
Installation
The clustering contrib extension requires dist/solr-clustering-*.jar and all JARs under
contrib/clustering/lib.
Configuration
Declaration of the Search Component and Request Handler
The clustering extension is a search component and must be declared in solrconfig.xml. Such a component can
then be appended to a request handler as the last component in the chain (because it requires search results
which must be previously fetched by the search component).
An example configuration could look as shown below.
1. Include the required contrib JARs. Note that by default paths are relative to the Solr core so they may
need adjustments to your configuration, or an explicit specification of the $solr.install.dir.
<lib dir="${solr.install.dir:../../..}/contrib/clustering/lib/"
regex=".*\.jar" />
<lib dir="${solr.install.dir:../../..}/dist/"
regex="solr-clustering-\d.*\.jar" />
2. Declaration of the search component. Each component can also declare multiple clustering pipelines
("engines"), which can be selected at runtime.
<searchComponent name="clustering"
class="solr.clustering.ClusteringComponent">
<!-- Lingo clustering algorithm -->
<lst name="engine">
<str name="name">lingo</str>
<str
name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm<
/str>
</lst>
<!-- An example definition for the STC clustering algorithm. -->
<lst name="engine">
<str name="name">stc</str>
<str
name="carrot.algorithm">org.carrot2.clustering.stc.STCClusteringAlgorithm</str
>
</lst>
</searchComponent>
3. A request handler to which we append the clustering component declared above.
<requestHandler name="/clustering"
class="solr.SearchHandler">
<lst name="defaults">
<bool name="clustering">true</bool>
<bool name="clustering.results">true</bool>
<!-- Logical field to physical field mapping. -->
<str name="carrot.url">id</str>
<str name="carrot.title">doctitle</str>
<str name="carrot.snippet">content</str>
<!-- Configure any other request handler parameters. We will cluster the
top 100 search results so bump up the 'rows' parameter. -->
<str name="rows">100</str>
<str name="fl">*,score</str>
</lst>
<!-- Append clustering at the end of the list of search components. -->
<arr name="last-components">
<str>clustering</str>
</arr>
</requestHandler>
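As a rough SolrJ sketch of querying the handler defined above, the following example sends a query to /clustering on the techproducts example and reads the discovered clusters from the raw response; the cast to a list of named lists reflects how the "clusters" array is typically deserialized, and should be verified against your client version.
import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.util.NamedList;

public class ClusteringExample {
  public static void main(String[] args) throws Exception {
    // Assumes the /clustering handler configured as shown above on the techproducts example.
    HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/techproducts");
    try {
      SolrQuery query = new SolrQuery("*:*");
      query.setRequestHandler("/clustering");
      query.setRows(100);   // cluster the top 100 hits, mirroring the handler default above

      QueryResponse rsp = client.query(query);
      // The discovered clusters come back as a list of named lists with "labels", "score" and "docs".
      @SuppressWarnings("unchecked")
      List<NamedList<Object>> clusters = (List<NamedList<Object>>) rsp.getResponse().get("clusters");
      for (NamedList<Object> cluster : clusters) {
        System.out.println(cluster.get("labels") + " -> " + cluster.get("docs"));
      }
    } finally {
      client.close();
    }
  }
}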
Configuration Parameters of the Clustering Component
The table below summarizes parameters of each clustering engine or the entire clustering component
(depending where they are declared).
Parameter Description
clustering When true, the clustering component is enabled.
clustering.engine Declares which clustering engine to use. If not present, the first declared
engine will become the default one.
clustering.results When true, the component will perform clustering of search results (this
should be enabled).
clustering.collection When true, the component will perform clustering of the whole document
index (this section does not cover full-index clustering).
At the engine declaration level, the following parameters are supported.
Parameter Description
carrot.algorithm The algorithm class.
carrot.resourcesDir Algorithm-specific resources and configuration files (stop words, other
lexical resources, default settings). By default points to
conf/clustering/carrot2/
carrot.outputSubClusters If true and the algorithm supports hierarchical clustering, sub-clusters will
also be emitted.
carrot.numDescriptions Maximum number of per-cluster labels to return (if the algorithm assigns
more than one label to a cluster).
The carrot.algorithm parameter should contain a fully qualified class name of an algorithm supported by
the Carrot2 framework. Currently, the following algorithms are available:
org.carrot2.clustering.lingo.LingoClusteringAlgorithm (open source)
org.carrot2.clustering.stc.STCClusteringAlgorithm (open source)
org.carrot2.clustering.kmeans.BisectingKMeansClusteringAlgorithm (open source)
com.carrotsearch.lingo3g.Lingo3GClusteringAlgorithm (commercial)
For a comparison of characteristics of these algorithms see the following links:
http://doc.carrot2.org/#section.advanced-topics.fine-tuning.choosing-algorithm
http://project.carrot2.org/algorithms.html
http://carrotsearch.com/lingo3g-comparison.html
The question of which algorithm to choose depends on the amount of traffic (STC is faster than Lingo, but arguably produces less intuitive clusters; Lingo3G is the fastest algorithm but is not free or open source), the expected result (Lingo3G provides hierarchical clusters, Lingo and STC provide flat clusters), and the input data (each algorithm will cluster the input slightly differently). There is no single answer as to which algorithm is "the best".
Contextual and Full Field Clustering
The clustering engine can apply clustering to the full content of (stored) fields or it can run an internal highlighter
pass to extract context-snippets before clustering. Highlighting is recommended when the logical snippet field
contains a lot of content (this would affect clustering performance). Highlighting can also increase the quality of
clustering because the content passed to the algorithm will be more focused around the query (it will be
query-specific context). The following parameters control the internal highlighter.
Parameter Description
carrot.produceSummary  When true, the clustering component will run a highlighter pass on the content of the logical fields pointed to by carrot.title and carrot.snippet. Otherwise the full content of those fields will be clustered.
carrot.fragSize  The size, in characters, of the snippets (aka fragments) created by the highlighter. If not specified, the default highlighting fragsize (hl.fragsize) will be used.
carrot.summarySnippets  The number of summary snippets to generate for clustering. If not specified, the default highlighting snippet count (hl.snippets) will be used.
Logical to Document Field Mapping
As already mentioned in Preliminary Concepts, the clustering component clusters "documents" consisting of logical parts that need to be mapped onto the physical schema of data stored in Solr. The field mapping attributes provide a connection between fields and logical document parts. Note that the content of the title and snippet fields must be stored so that it can be retrieved at search time.
Parameter Description
carrot.title The field (alternatively comma- or space-separated list of fields) that should be mapped
to the logical document's title. The clustering algorithms typically give more weight to the
content of the title field compared to the content (snippet). For best results, the field
should contain concise, noise-free content. If there is no clear title in your data, you can
leave this parameter blank.
carrot.snippet The field (alternatively comma- or space-separated list of fields) that should be mapped
to the logical document's main content. If this mapping points to very large content fields
the performance of clustering may drop significantly. An alternative then is to use
query-context snippets for clustering instead of full field content. See the description of the carrot.produceSummary parameter for details.
carrot.url The field that should be mapped to the logical document's content URL. Leave blank if
not required.
Clustering Multilingual Content
The field mapping specification can include a carrot.lang parameter, which defines the field that stores the ISO 639-1 code of the language in which the title and content of the document are written. This information can be stored in the index based on a priori knowledge of the documents' source, or by a language detection filter applied at indexing time. All algorithms inside the Carrot2 framework will accept ISO codes of languages defined in the LanguageCode enum.
The language hint makes it easier for clustering algorithms to separate documents from different languages on
input and to pick the right language resources for clustering. If you do have multi-lingual query results (or query
results in a language different than English), it is strongly advised to map the language field appropriately.
Parameter Description
carrot.lang The field that stores ISO 639-1 code of the language of the document's text fields.
carrot.lcmap  A mapping of arbitrary strings into ISO 639 two-letter codes used by carrot.lang. The syntax of this parameter is the same as langid.map.lcmap, for example: langid.map.lcmap=japanese:ja polish:pl english:en
The default language can also be set using Carrot2-specific algorithm attributes (in this case the MultilingualClustering.defaultLanguage attribute).
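As a hedged sketch, the language mapping parameters could be added to the clustering request handler defaults like this, assuming the index has a field named lang populated at indexing time (the field name is only an example):
<str name="carrot.lang">lang</str>
<str name="carrot.lcmap">english:en polish:pl</str>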
Tweaking Algorithm Settings
The algorithms that come with Solr use their default settings, which may be inadequate for all data sets. All algorithms have lexical resources and other resources (stop words, stemmers, parameters) that may require tweaking to get better clusters (and cluster labels). For Carrot2-based algorithms it is probably best to refer to a dedicated tuning application called Carrot2 Workbench (screenshot below). From this application one can export a set of algorithm attributes as an XML file, which can then be placed under the location pointed to by carrot.resourcesDir.
Providing Defaults
The default attributes for all engines (algorithms) declared in the clustering component are placed under carrot.resourcesDir and with an expected file name of engineName-attributes.xml. So for an engine named lingo and the default value of carrot.resourcesDir, the attributes would be read from a file in conf/clustering/carrot2/lingo-attributes.xml.
An example XML file changing the default language of documents to Polish is shown below.
<attribute-sets default="attributes">
<attribute-set id="attributes">
<value-set>
<label>attributes</label>
<attribute key="MultilingualClustering.defaultLanguage">
<value type="org.carrot2.core.LanguageCode" value="POLISH"/>
</attribute>
</value-set>
</attribute-set>
</attribute-sets>
Tweaking at Query-Time
The clustering component and Carrot2 clustering algorithms can accept query-time attribute overrides. Note that
certain things (for example lexical resources) can only be initialized once (at startup, via the XML configuration
files).
An example query that changes the LingoClusteringAlgorithm.desiredClusterCountBase parameter for the Lingo algorithm:
http://localhost:8983/solr/techproducts/clustering?q=*:*&rows=100&LingoClusteringAlgorithm.desiredClusterCountBase=20
Performance Considerations
Dynamic clustering of search results comes with two major performance penalties:
Increased cost of fetching a larger-than-usual number of search results (50, 100 or more documents),
Additional computational cost of the clustering itself.
For simple queries, the clustering time will usually dominate the fetch time. If the document content is very long, the retrieval of stored content can become a bottleneck. The performance impact of clustering can be lowered in
several ways:
feed less content to the clustering algorithm by enabling the carrot.produceSummary attribute,
perform clustering on selected fields (titles only) to make the input smaller,
use a faster algorithm (STC instead of Lingo, Lingo3G instead of STC),
tune the performance attributes related directly to a specific algorithm.
Some of these techniques are described in the Apache SOLR and Carrot2 integration strategies document, available at http://carrot2.github.io/solr-integration-strategies. The topic of improving performance is also included in the Carrot2 manual at http://doc.carrot2.org/#section.advanced-topics.fine-tuning.performance.
Additional Resources
The following resources provide additional information about the clustering component in Solr and its potential
applications.
Apache Solr and Carrot2 integration strategies: http://carrot2.github.io/solr-integration-strategies
Apache Solr Wiki (covers previous Solr versions, may be inaccurate): http://carrot2.github.io/solr-integration-strategies
Clustering and Visualization of Solr search results (video from Berlin BuzzWords conference, 2011): http://vimeo.com/26616444
Spatial Search
Solr supports location data for use in spatial/geospatial searches. Using spatial search, you can:
Index points or other shapes
Filter search results by a bounding box or circle or by other shapes
Sort or boost scoring by distance between points, or relative area between rectangles
Generate a 2D grid of facet count numbers for heatmap generation or point-plotting.
There are three main field types available for spatial search:
LatLonType (and its non-geodetic twin PointType)
SpatialRecursivePrefixTreeFieldType (RPT for short)
BBoxField
RPT offers more features than LatLonType and fast filter performance, although LatLonType is more appropriate
when efficient distance sorting/boosting is desired. They can both be used simultaneously for what each does
best – LatLonType for sorting/boosting, RPT for filtering.
BBoxField is for indexing bounding boxes, querying by a box, specifying a search predicate (Intersects, Within, Contains, Disjoint, Equals), and a relevancy sort/boost like overlapRatio or simply the area.
For more information on Solr spatial search, see http://wiki.apache.org/solr/SpatialSearch.
Indexing and Configuration
For indexing geodetic points (latitude and longitude), supply the pair of numbers as a string with a comma
separating them in latitude then longitude order. For non-geodetic points, the order is x,y for PointType, and for
RPT you must use a space instead of a comma, or use WKT.
See the SpatialRecursivePrefixTreeFieldType section below for RPT configuration specifics.
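For example, a geodetic point for the LatLonType store field used in the techproducts example could be supplied in an update document like this (a sketch; the field name depends on your schema):
<field name="store">45.17614,-93.87341</field>
The same point expressed in WKT, as accepted by RPT fields, would be POINT(-93.87341 45.17614), i.e. x y order (longitude then latitude).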
Spatial Filters
There are 2 types of Spatial filters, which both support the following parameters:
Parameter Description
d  the radial distance, usually in kilometers. (RPT & BBoxField can set other units via the distanceUnits setting)
pt the center point using the format "lat,lon" if latitude & longitude. Otherwise, "x,y" for PointType or
"x y" for RPT field types.
sfield a spatial indexed field
score  (Advanced option; RPT and BBoxField field types only) If the query is used in a scoring context (e.g. as the main query in q), this local-param determines what scores will be produced. Valid values are:
none - A fixed score of 1.0. (the default)
kilometers - distance in kilometers between the field value and the specified center point
miles - distance in miles between the field value and the specified center point
degrees - distance in degrees between the field value and the specified center point
distance - distance between the field value and the specified center point in the distanceUnits configured for this field
recipDistance - 1 / the distance
When used with BBoxField, additional options are supported:
overlapRatio - The relative overlap between the indexed shape & query shape.
area - haversine based area of the overlapping shapes expressed in terms of the distanceUnits configured for this field
area2D - cartesian coordinates based area of the overlapping shapes expressed in terms of the distanceUnits configured for this field
filter  (Advanced option; RPT and BBoxField field types only) If you only want the query to score (with the above score local-param), not filter, then set this local-param to false.
geofilt
The geofilt filter allows you to retrieve results based on the geospatial distance (AKA the "great circle distance") from a given point. Another way of looking at it is that it creates a circular shape filter. For example, to find all documents within five kilometers of a given lat/lon point, you could enter &q=*:*&fq={!geofilt sfield=store}&pt=45.15,-93.85&d=5. This filter returns all results within a circle of the given radius around the initial point.
bbox
The bbox filter is very similar to geofilt except it uses the bounding box of the calculated circle. See the blue box in the diagram below. It takes the same parameters as geofilt. Here's a sample query: &q=*:*&fq={!bbox sfield=store}&pt=45.15,-93.85&d=5. The rectangular shape is faster to compute and so it's sometimes used as an alternative to geofilt when it's acceptable to return points outside of the radius. However, if the ideal goal is a circle but you want it to run faster, then instead consider using the RPT field and try a large "distErrPct" value like 0.1 (10% radius). This will return results outside the radius but it will do so somewhat uniformly around the shape.
Filtering by an arbitrary rectangle
Sometimes the spatial search requirement calls for finding everything in a rectangular area, such as the area
covered by a map the user is looking at. For this case, geofilt and bbox won't cut it. This is somewhat of a trick,
but you can use Solr's range query syntax for this by supplying the lower-left corner as the start of the range and
the upper-right corner as the end of the range. Here's an example: &q=*:*&fq=store:[45,-94 TO 46,-93]. LatLonType does support rectangles that cross the dateline, but RPT does not. If you are using RPT with non-geospatial coordinates (geo="false") then you must quote the points due to the space, e.g. "x y".
Optimization: Solr Post Filtering
Most likely, the fastest spatial filters will be to simply use the RPT field type. However, sometimes it may be faster to use LatLonType with Solr post filtering in circumstances when both the spatial query isn't worth caching and there aren't many matching documents that match the non-spatial filters (e.g. keyword queries and other filters). To use Solr post filtering with LatLonType, use the bbox or geofilt query parsers in a filter query but specify cache=false and cost=100 (or greater) as local-params. Here's a short example:
&q=...mykeywords...&fq=...someotherfilters...&fq={!geofilt cache=false
cost=100}&sfield=store&pt=45.15,-93.85&d=5
Distance Function Queries
There are four distance function queries: geodist, see below, usually the most appropriate; dist, to calculate the p-norm distance between multi-dimensional vectors; hsin, to calculate the distance between two points on a sphere; and sqedist, to calculate the squared Euclidean distance between two points. For more information about these function queries, see the section on Function Queries.
geodist
geodist is a distance function that takes three optional parameters: (sfield,latitude,longitude). You can use the geodist function to sort results by distance or to return the distance as the score.
For example, to sort your results by ascending distance, enter ...&q=*:*&fq={!geofilt}&sfield=store&pt=45.15,-93.85&d=50&sort=geodist() asc.
To return the distance as the document score, enter ...&q={!func}geodist()&sfield=store&pt=45.15,-93.85&sort=score+asc.
When a bounding box includes a pole, the bounding box ends up being a "bounding bowl" (a spherical cap) that includes all values north of the lowest latitude of the circle if it touches the north pole (or south of the highest latitude if it touches the south pole).
More Examples
Here are a few more useful examples of what you can do with spatial search in Solr.
Use as a Sub-Query to Expand Search Results
Here we will query for results in Jacksonville, Florida, or within 50 kilometers of 45.15,-93.85 (near Buffalo,
Minnesota):
&q=*:*&fq=(state:"FL" AND city:"Jacksonville") OR
{!geofilt}&sfield=store&pt=45.15,-93.85&d=50&sort=geodist()+asc
Facet by Distance
To facet by distance, you can use the Frange query parser:
&q=*:*&sfield=store&pt=45.15,-93.85&facet.query={!frange l=0
u=5}geodist()&facet.query={!frange l=5.001 u=3000}geodist()
There are other ways to do it too, like using a {!geofilt} in each facet.query.
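For instance, a sketch of the same two buckets expressed with {!geofilt} instead of frange (the numbers mirror the example above):
&q=*:*&facet=true&facet.query={!geofilt sfield=store pt=45.15,-93.85 d=5}&facet.query={!geofilt sfield=store pt=45.15,-93.85 d=3000}
Note that buckets built this way are cumulative circles rather than non-overlapping rings, so the counts must be interpreted (or subtracted) accordingly.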
Boost Nearest Results
Using the DisMax or Extended DisMax query parsers, you can combine spatial search with the boost function to boost the nearest results:
&q.alt=*:*&fq={!geofilt}&sfield=store&pt=45.15,-93.85&d=50&bf=recip(geodist(),2,20
0,20)&sort=score desc
SpatialRecursivePrefixTreeFieldType (abbreviated as RPT)
This field type offers several functional improvements over LatLonType:
Query by polygons and other complex shapes, in addition to circles & rectangles
Multi-valued indexed fields
Ability to index non-point shapes (e.g. polygons) as well as point shapes
Rectangles with user-specified corners that can cross the dateline
Multi-value distance sort and score boosting (warning: non-optimized)
Well-Known-Text (WKT) shape syntax (required for specifying polygons & other complex shapes)
RPT incorporates the basic features of LatLonType and PointType, such as lat-lon bounding boxes and circles, in addition to supporting geofilt, bbox, geodist, and range queries.
Schema configuration
To use RPT, the field type must be registered and configured in schema.xml. There are many options for this field type.
Setting Description
name The name of the field type.
class  This should be solr.SpatialRecursivePrefixTreeFieldType. But be aware that the Lucene spatial module includes some other so-called "spatial strategies" other than RPT, notably TermQueryPT*, BBox, PointVector*, and SerializedDV. Solr requires a field type to parallel these in order to use them. The asterisked ones have them.
spatialContextFactory  If polygons or linestrings are required, then the JTS Topology Suite is needed to implement them. It's a JAR file that you need to put on Solr's classpath (but not via the standard solrconfig.xml mechanisms). If you intend to use those shapes, set this attribute to com.spatial4j.core.context.jts.JtsSpatialContextFactory. Furthermore, the context factory has its own options which are directly configurable on the Solr field type here; follow the link to the Javadocs, and remember to look at the superclass's options in SpatialContextFactory as well. One option in particular you should most likely enable is autoIndex (i.e. use PreparedGeometry) as it's been shown to be a major performance boost for polygons. Further details about specifying polygons to index or query are at Solr's Wiki linked below.
geo  If true, the default, latitude and longitude coordinates will be used and the mathematical model will generally be a sphere. If false, the coordinates will be generic X & Y on a 2D plane using Euclidean/Cartesian geometry.
distanceUnits  This is used to specify the units for distance measurements used throughout the use of this field. This can be degrees, kilometers or miles. It is applied to nearly all distance measurements involving the field: maxDistErr, distErr, d, geodist and the score when score is distance, area or area2D. However, it doesn't affect distances embedded in WKT strings (e.g. "BUFFER(POINT(200 10),0.2)"), which are still in degrees.
distanceUnits defaults to either "kilometers" if geo is "true", or "degrees" if geo is "false".
distanceUnits replaces the units attribute, which is now deprecated and mutually exclusive with this attribute.
distErrPct  Defines the default precision of non-point shapes (both index & query), as a fraction between 0.0 (fully precise) and 0.5. The closer this number is to zero, the more accurate the shape will be. However, more precise indexed shapes use more disk space and take longer to index. Bigger distErrPct values will make queries faster but less accurate.
maxDistErr Defines the highest level of detail required for indexed data. If left blank, the default is
one meter – just a bit less than 0.000009 degrees. This setting is used internally to
compute an appropriate maxLevels (see below).
worldBounds  Defines the valid numerical ranges for x and y, in the format of ENVELOPE(minX, maxX, maxY, minY). If geo="true", the standard lat-lon world boundaries are assumed. If geo=false, you should define your boundaries.
distCalculator  Defines the distance calculation algorithm. If geo=true, "haversine" is the default. If geo=false, "cartesian" will be the default. Other possible values are "lawOfCosines", "vincentySphere" and "cartesian^2".
prefixTree  Defines the spatial grid implementation. Since a PrefixTree (such as RecursivePrefixTree) maps the world as a grid, each grid cell is decomposed to another set of grid cells at the next level. If geo=true then the default prefix tree is "geohash", otherwise it's "quad". Geohash has 32 children at each level, quad has 4. Geohash can only be used for geo=true as it's strictly geospatial.
maxLevels  Sets the maximum grid depth for indexed data. Instead, it's usually more intuitive to compute an appropriate maxLevels by specifying maxDistErr.
<fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType"
spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
autoIndex="true"
distErrPct="0.025"
maxDistErr="0.001"
distanceUnits="kilometers" />
Once the field type has been defined, use it to define a field that uses it.
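For example, a field declaration using the type defined above might look like the following sketch (the field name is arbitrary; multiValued is only needed if a document can carry several shapes):
<field name="geo" type="location_rpt" indexed="true" stored="true" multiValued="true" />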
Because RPT has more advanced features, some of which are new and experimental, please review the Solr Wiki at http://wiki.apache.org/solr/SolrAdaptersForLuceneSpatial4 for more information about using this field type.
Heatmap Faceting
The RPT field supports generating a 2D grid of facet counts for documents having spatial data in each grid cell.
For high-detail grids, this can be used to plot points, and for lesser detail it can be used for heatmap generation.
The grid cells are determined at index-time based on RPT's configuration. At facet counting time, the indexed
cells in the region of interest are traversed and a grid of counters corresponding to each cell are incremented.
Solr can return the data in a straight-forward 2D array of integers or in a PNG which compresses better for
larger data sets but must be decoded.
The heatmap feature is accessed from Solr's faceting feature. As a part of faceting, it supports the key local-param as well as excluding tagged filter queries, just like other types of faceting do. This allows multiple heatmaps to be returned on the same field with different filters.
Parameter Description
facet  Set to true to enable faceting
facet.heatmap The field name of type RPT
facet.heatmap.geom The region to compute the heatmap on, specified using the rectangle-range syntax
or WKT. It defaults to the world. ex: ["-180 -90" TO "180 90"]
facet.heatmap.gridLevel A specific grid level, which determines how big each grid cell is. Defaults to being
computed via distErrPct (or distErr)
facet.heatmap.distErrPct A fraction of the size of geom used to compute gridLevel. Defaults to 0.15. It's
computed the same as a similarly named parameter for RPT.
facet.heatmap.distErr A cell error distance used to pick the grid level indirectly. It's computed the same
as a similarly named parameter for RPT.
facet.heatmap.format  The format, either ints2D (default) or png.
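Putting these parameters together, a request might look like the following sketch, which assumes an RPT field named geo and a query region roughly covering North America (the field name and box are illustrative, and special characters may need URL-encoding in practice):
http://localhost:8983/solr/techproducts/select?q=*:*&rows=0&facet=true&facet.heatmap=geo&facet.heatmap.geom=["-130 20" TO "-60 55"]&facet.heatmap.format=ints2D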
Here's some sample output in JSON (with some ..... inserted for brevity):
{gridLevel=6,columns=64,rows=64,minX=-180.0,maxX=180.0,minY=-90.0,maxY=90.0,
counts_ints2D=[[0, 0, 2, 1, ....],[1, 1, 3, 2, ...],...]}
The output shows the gridLevel which is interesting since it's often computed from other parameters. If an
interface being developed allows an explicit resolution increase/decrease feature then subsequent requests can
specify the gridLevel explicitly.
The minX, maxX, minY, maxY values report the region where the counts are. This is the minimally enclosing bounding rectangle of the input geom at the target grid level. This may wrap the dateline. The columns and rows values are how many columns and rows the output rectangle is evenly divided into. Note: Don't divide an on-screen projected map rectangle evenly to plot these rectangles/points since the cell data is in the coordinate space of decimal degrees if geo=true or whatever units were given if geo=false. This could be arranged to be the same as an on-screen map but won't necessarily be.
The counts_ints2D key has a 2D array of integers. The initial outer level is in row order (top-down), then the inner arrays are the columns (left-right). If any array would be all zeros, a null is returned instead for efficiency reasons. The entire value is null if there is no matching spatial data.
If format=png then the output key is counts_png. It's a base-64 encoded string of a 4-byte PNG. The PNG logically holds the same data that the ints2D does. Note that the alpha channel byte is flipped to make it easier to view the PNG for diagnostic purposes, since otherwise counts would have to exceed 2^24 before it becomes non-opaque. Thus counts greater than this value will become opaque.
Tip: For heatmap purposes, you'll experiment with different distErrPct values (probably 0.10 - 0.20) with various input geometries till the default size is what you're looking for. The specific details of how it's computed aren't important. For high-detail grids used in point-plotting (loosely one cell per pixel), set distErr to be the number of decimal-degrees of several pixels or so of the map being displayed.
BBoxField
The BBoxField field type indexes a single rectangle (bounding box) per document field and supports searching via a bounding box. It supports most spatial search predicates, and it has enhanced relevancy modes based on the overlap or area between the search rectangle and the indexed rectangle. It's particularly useful for its relevancy modes. To configure it in the schema, use a configuration like this:
<field name="bbox" type="bbox" />
<fieldType name="bbox" class="solr.BBoxField"
geo="true" units="kilometers" numberType="_bbox_coord"
storeSubFields="false"/>
<fieldType name="_bbox_coord" class="solr.TrieDoubleField" precisionStep="8"
docValues="true" stored="false"/>
BBoxField is actually based off of 4 instances of another field type referred to by numberType. It also uses a boolean to flag a dateline cross. Assuming you want to use the relevancy feature, docValues is required. Some of the attributes are in common with the RPT field, like geo, units, worldBounds, and spatialContextFactory, because they share some of the same spatial infrastructure.
To index a box, add a field value to a bbox field that's a string in the WKT/CQL ENVELOPE syntax. Example: ENVELOPE(-10, 20, 15, 10) which is minX, maxX, maxY, minY order. The parameter ordering is unintuitive but that's what the spec calls for.
To search, you can use the {!bbox} query parser, or the range syntax e.g. [10,-10 TO 15,20], or the ENVELOPE syntax wrapped in parentheses with a leading search predicate. The latter is the only way to choose a predicate other than Intersects. For example:
&q={!field f=bbox}Contains(ENVELOPE(-10, 20, 15, 10))
Now to sort the results by one of the relevancy modes, use it like this:
&q={!field f=bbox score=overlapRatio}Intersects(ENVELOPE(-10, 20, 15, 10))
The score local-param can be one of overlapRatio, area, and area2D. area scores by the document area using surface-of-a-sphere (assuming geo=true) math, while area2D uses simple width * height. overlapRatio computes a [0-1] ranged score based on how much overlap exists relative to the document's area and the query area. The javadocs of BBoxOverlapRatioValueSource have more info on the formula, if you're really curious. There is an additional parameter queryTargetProportion that allows you to weight the query side of the formula to the index (target) side of the formula. You can also use &debug=results to see useful score computation info.
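For instance, a sketch that biases the overlap score toward the query box and inspects the score computation (the 0.25 weight is arbitrary):
&q={!field f=bbox score=overlapRatio queryTargetProportion=0.25}Intersects(ENVELOPE(-10, 20, 15, 10))&debug=results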
The Terms Component
The Terms Component provides access to the indexed terms in a field and the number of documents that match
each term. This can be useful for building an auto-suggest feature or any other feature that operates at the term
level instead of the search or document level. Retrieving terms in index order is very fast since the
implementation directly uses Lucene's TermEnum to iterate over the term dictionary.
In a sense, this search component provides fast field-faceting over the whole index, not restricted by the base
query or any filters. The document frequencies returned are the number of documents that match the term,
including any documents that have been marked for deletion but not yet removed from the index.
Configuring the Terms Component
By default, the Terms Component is already configured in solrconfig.xml for each collection.
Defining the Terms Component
Defining the Terms search component is straightforward: simply give it a name and use the class solr.TermsComponent.
<searchComponent name="terms" class="solr.TermsComponent"/>
This makes the component available for use, but by itself it will not be usable until included with a request handler.
Using the Terms Component in a Request Handler
The /terms request handler is also defined in solrconfig.xml by default.
<requestHandler name="/terms" class="solr.SearchHandler" startup="lazy">
<lst name="defaults">
<bool name="terms">true</bool>
<bool name="distrib">false</bool>
</lst>
<arr name="components">
<str>terms</str>
</arr>
</requestHandler>
Note that the defaults for this request handler set the parameter "terms" to true, which allows terms to be returned on request. The parameter "distrib" is set to false, which allows this handler to be used only on a single Solr core. To finish out the configuration, the Terms Component is included as an available component to this request handler.
You could add this component to another handler if you wanted to, and pass "terms=true" in the HTTP request in
order to get terms back. If it is only defined in a separate handler, you must use that handler when querying in
order to get terms and not regular documents as results.
Terms Component Parameters
The parameters below allow you to control what terms are returned. You can also add any of these to the request handler if you'd like to set them permanently. Or, you can add them to the query request. These parameters are:
Parameter Required Default Description
terms No false If set to true, enables the Terms Component. By default, the Terms
Component is off.
Example: terms=true
terms.fl Yes null Specifies the field from which to retrieve terms.
Example: terms.fl=title
terms.limit No 10 Specifies the maximum number of terms to return. The default is 10.
If the limit is set to a number less than 0, then no maximum limit is
enforced. Although this is not required, either this parameter or terms.upper must be defined.
Example: terms.limit=20
terms.lower No empty
string
Specifies the term at which to start. If not specified, the empty string
is used, causing Solr to start at the beginning of the field.
Example: terms.lower=orange
terms.lower.incl No true If set to true, includes the lower-bound term (specified with terms.lower) in the result set.
Example: terms.lower.incl=false
terms.mincount No null Specifies the minimum document frequency to return in order for a
term to be included in a query response. Results are inclusive of the
mincount (that is, >= mincount).
Example: terms.mincount=5
terms.maxcount No null Specifies the maximum document frequency a term must have in
order to be included in a query response. The default setting is -1,
which sets no upper bound. Results are inclusive of the maxcount
(that is, <= maxcount).
Example: terms.maxcount=25
terms.prefix No null Restricts matches to terms that begin with the specified string.
Example: terms.prefix=inter
terms.raw No false If set to true, returns the raw characters of the indexed term,
regardless of whether it is human-readable. For instance, the
indexed form of numeric values is not human-readable.
Example: terms.raw=true
terms.regex No null Restricts matches to terms that match the regular expression.
Example: terms.regex=*pedist
terms.regex.flag No null Defines a Java regex flag to use when evaluating the regular expression defined with terms.regex. See http://docs.oracle.com/javase/tutorial/essential/regex/pattern.html for details of each flag.
Valid options are:
case_insensitive
comments
multiline
literal
dotall
unicode_case
canon_eq
unix_lines
Example: terms.regex.flag=case_insensitive
terms.sort No count Defines how to sort the terms returned. Valid options are count, which sorts by the term frequency, with the highest term frequency first, or index, which sorts in index order.
Example: terms.sort=index
terms.upper No null Specifies the term to stop at. Although this parameter is not required, either this parameter or terms.limit must be defined.
Example: terms.upper=plum
terms.upper.incl No false If set to true, the upper bound term is included in the result set. The
default is false.
Example: terms.upper.incl=true
The output is a list of the terms and their document frequency values. See below for examples.
Examples
All of the following sample queries work with Solr's "bin/solr -e techproducts" example.
Get Top 10 Terms
This query requests the first ten terms in the name field: http://localhost:8983/solr/techproducts/terms?terms.fl=name
Results:
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">2</int>
</lst>
<lst name="terms">
<lst name="name">
<int name="one">5</int>
<int name="184">3</int>
<int name="1gb">3</int>
<int name="3200">3</int>
<int name="400">3</int>
<int name="ddr">3</int>
<int name="gb">3</int>
<int name="ipod">3</int>
<int name="memory">3</int>
<int name="pc">3</int>
</lst>
</lst>
</response>
Get First 10 Terms Starting with Letter 'a'
This query requests the first ten terms in the name field, in index order (instead of the top 10 results by document count): http://localhost:8983/solr/techproducts/terms?terms.fl=name&terms.lower=a&terms.sort=index
Results:
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
</lst>
<lst name="terms">
<lst name="name">
<int name="a">1</int>
<int name="all">1</int>
<int name="apple">1</int>
<int name="asus">1</int>
<int name="ata">1</int>
<int name="ati">1</int>
<int name="belkin">1</int>
<int name="black">1</int>
<int name="british">1</int>
<int name="cable">1</int>
</lst>
</lst>
</response>
Using the Terms Component for an Auto-Suggest Feature
If the Suggester doesn't suit your needs, you can use the Terms component in Solr to build a similar feature for your own search application. Simply submit a query specifying whatever characters the user has typed so far as
a prefix. For example, if the user has typed "at", the search engine's interface would submit the following query:
http://localhost:8983/solr/techproducts/terms?terms.fl=name&terms.prefix=at
Result:
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
</lst>
<lst name="terms">
<lst name="name">
<int name="ata">1</int>
<int name="ati">1</int>
</lst>
</lst>
</response>
You can use the omitHeader=true parameter to omit the response header from the query response, like in this example, which also returns the response in JSON format: http://localhost:8983/solr/techproducts/terms?terms.fl=name&terms.prefix=at&indent=true&wt=json&omitHeader=true
Result:
{
"terms": {
"name": [
"ata",
1,
"ati",
1
]
}
}
Distributed Search Support
The TermsComponent also supports distributed indexes. For the /terms request handler, you must provide the following two parameters:
Parameter Description
shards  Specifies the shards in your distributed indexing configuration. For more information about distributed indexing, see Distributed Search with Index Sharding.
shards.qt Specifies the request handler Solr uses for requests to shards.
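For example, a distributed terms request might look like the following sketch (the shard addresses are placeholders for your own cluster):
http://localhost:8983/solr/techproducts/terms?terms.fl=name&shards=localhost:8983/solr/shard1,localhost:7574/solr/shard2&shards.qt=/terms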
More Resources
TermsComponent wiki page
TermsComponent javadoc
The Term Vector Component
The TermVectorComponent is a search component designed to return additional information about documents
matching your search.
For each document in the response, the TermVectorComponent can return the term vector, the term frequency, inverse document frequency, position, and offset information.
Configuration
The TermVectorComponent is not enabled implicitly in Solr - it must be explicitly configured in your solrconfig.xml file. The examples on this page show how it is configured in Solr's "techproducts" example:
bin/solr -e techproducts
To enable this component, you need to configure it using a searchComponent element:
<searchComponent name="tvComponent"
class="org.apache.solr.handler.component.TermVectorComponent"/>
A request handler must then be configured to use this component name. In this example, the component is associated with a special request handler named /tvrh, that enables term vectors by default using the tv=true parameter; but you can associate it with any request handler:
<requestHandler name="/tvrh"
class="org.apache.solr.handler.component.SearchHandler">
<lst name="defaults">
<bool name="tv">true</bool>
</lst>
<arr name="last-components">
<str>tvComponent</str>
</arr>
</requestHandler>
Once your handler is defined, you may use it to fetch term vectors for any fields configured with the termVector attribute in your schema.xml, for example:
<field name="includes"
type="text"
indexed="true"
stored="true"
multiValued="true"
termVectors="true"
termPositions="true"
termOffsets="true" />
Invoking the Term Vector Component
The example below shows an invocation of this component using the above configuration:
http://localhost:8983/solr/techproducts/tvrh?q=*%3A*&start=0&rows=10&fl=id,includes
...
<lst name="termVectors">
<lst name="warnings">
<arr name="noTermVectors">
<str>id</str>
</arr>
</lst>
<lst name="doc-5">
<str name="uniqueKey">MA147LL/A</str>
<lst name="includes">
<lst name="cabl"/>
<lst name="earbud"/>
<lst name="headphon"/>
<lst name="usb"/>
</lst>
</lst>
<str name="uniqueKeyFieldName">id</str>
<lst name="doc-9">
<str name="uniqueKey">3007WFP</str>
<lst name="includes">
<lst name="cabl"/>
<lst name="usb"/>
</lst>
</lst>
<str name="uniqueKeyFieldName">id</str>
<lst name="doc-12">
<str name="uniqueKey">9885A004</str>
<lst name="includes">
<lst name="32"/>
<lst name="av"/>
<lst name="batteri"/>
<lst name="cabl"/>
<lst name="card"/>
<lst name="mb"/>
<lst name="sd"/>
<lst name="usb"/>
</lst>
</lst>
<str name="uniqueKeyFieldName">id</str>
</lst>
Request Parameters
The example below shows the available request parameters for this component:
http://localhost:8983/solr/techproducts/tvrh?q=includes:[* TO
*]&rows=10&indent=true&tv=true&tv.tf=true&tv.df=true&tv.positions=true&tv.offsets=
true&tv.payloads=true&tv.fl=includes
Boolean Parameters  Description  Type
tv Should the component run or not boolean
tv.docIds  Returns term vectors for the specified list of Lucene document IDs (not the Solr Unique Key).  comma-separated integers
tv.fl  Returns term vectors for the specified list of fields. If not specified, the fl parameter is used.  comma-separated list of field names
tv.all A shortcut that invokes all the boolean parameters listed below. boolean
tv.df Returns the Document Frequency (DF) of the term in the collection. This can
be computationally expensive.
boolean
tv.offsets Returns offset information for each term in the document. boolean
tv.positions Returns position information. boolean
tv.payloads Returns payload information. boolean
tv.tf Returns document term frequency info per term in the document. boolean
tv.tf_idf  Calculates TF*IDF for each term. Requires the parameters tv.tf and tv.df to be "true". This can be computationally expensive. (The results are not shown in example output)
boolean
To learn more about TermVector component output, see the Wiki page: http://wiki.apache.org/solr/TermVectorComponentExampleOptions
For schema requirements, see the Wiki page: http://wiki.apache.org/solr/FieldOptionsByUseCase
SolrJ and the Term Vector Component
Neither the SolrQuery class nor the QueryResponse class offer specific method calls to set Term Vector Component parameters or get the "termVectors" output. However, there is a patch for it: SOLR-949.
The Stats Component
The Stats component returns simple statistics for numeric, string, and date fields within the document set.
The sample queries in this section assume you are running the "techproducts" example included with Solr: bin/solr -e techproducts
Stats Component Parameters
The Stats Component accepts the following parameters:
Parameter Description
stats  If true, then invokes the Stats component.
stats.field Specifies a field for which statistics should be generated. This parameter may be invoked
multiple times in a query in order to request statistics on multiple fields.
Local Parameters may be used to indicate which subset of the supported statistics should
be computed, and/or that statistics should be computed over the results of an arbitrary
numeric function (or query) instead of a simple field name. See the examples below.
stats.facet Returns sub-results for values within the specified facet.
This legacy parameter is not recommended for new users - instead please consider combining stats.field with facet.pivot
stats.calcdistinct  If true, distinct values will be calculated and returned as "countDistinct" and "distinctValues" in the response. This calculation can be very expensive for fields that do not have a tiny cardinality, so it is false by default.
This parameter can be specified using a per-field override (ie: f.<field>.stats.calcdistinct=true) but users are encouraged to specify it as a Local Parameter instead - as a top level request parameter, this option is deprecated.
Example
The query below demonstrates computing stats against two different numeric fields, as well as stats over the results of a 'termfreq()' function call using the 'text' field:
http://localhost:8983/solr/techproducts/select?q=*:*&stats=true&stats.field={!func
}termfreq('text','memory')&stats.field=price&stats.field=popularity&rows=0&indent=
true
<lst name="stats">
<lst name="stats_fields">
<lst name="termfreq(text,memory)">
<double name="min">0.0</double>
<double name="max">3.0</double>
<long name="count">32</long>
<long name="missing">0</long>
<double name="sum">10.0</double>
<double name="sumOfSquares">22.0</double>
<double name="mean">0.3125</double>
<double name="stddev">0.7803018439949604</double>
<lst name="facets"/>
</lst>
<lst name="price">
<double name="min">0.0</double>
<double name="max">2199.0</double>
<long name="count">16</long>
<long name="missing">16</long>
<double name="sum">5251.270030975342</double>
<double name="sumOfSquares">6038619.175900028</double>
<double name="mean">328.20437693595886</double>
<double name="stddev">536.3536996709846</double>
<lst name="facets"/>
</lst>
<lst name="popularity">
<double name="min">0.0</double>
<double name="max">10.0</double>
<long name="count">15</long>
<long name="missing">17</long>
<double name="sum">85.0</double>
<double name="sumOfSquares">603.0</double>
<double name="mean">5.666666666666667</double>
<double name="stddev">2.943920288775949</double>
<lst name="facets"/>
</lst>
</lst>
</lst>
Statistics Supported
The table below explains the statistics supported by the Stats component. Not all statistics are supported for all field types, and not all statistics are computed by default (see Local Parameters below for details).
Local Param  Sample Input  Output Name  Description  Supported Types  Computed by Default
min true min The minimum value of the
field/function in all documents in
the set.
All Yes
max true max The maximum value of the
field/function in all documents in
the set.
All Yes
sum true sum The sum of all values of the
field/function in all documents in
the set.
Numeric &
Date
Yes
count true count The number of values found in all
documents in the set for this
field/function.
All Yes
missing true missing The number of documents in the
set which do not have a value for
this field/function.
All Yes
sumOfSquares true sumOfSquares Sum of all values squared (a by-product of computing stddev)
Numeric &
Date
Yes
mean true mean The average (v1 + v2 .... +
vN)/N
Numeric &
Date
Yes
stddev true stddev Standard deviation, measuring
how widely spread the values in
the data set are.
Numeric &
Date
Yes
percentiles "1,99,99.9" percentiles A list of percentile values based
on cut-off points specified by the
param value. These values are
an approximation, using the t-digest algorithm.
Numeric No
calcdistinct true distinctValues The set of all distinct values for
the field/function in all of the
documents in the set. This
calculation can be very expensive
for fields that do not have a tiny
cardinality.
All No
calcdistinct true countDistinct The number of distinct values in
the field/function in all of the
documents in the set. This
calculation can be very expensive
for fields that do not have a tiny
cardinality.
All No
Local Parameters
Similar to the Facet Component, the stats.field parameter supports local parameters for:
Tagging & Excluding Filters: stats.field={!ex=filterA}price
Changing the Output Key: stats.field={!key=my_price_stats}price
Tagging stats for use with facet.pivot: stats.field={!tag=my_pivot_stats}price
Local parameters can also be used to specify individual statistics by name, overriding the set of statistics
computed by default, eg: stats.field={min=true max=true percentiles='99,99.9,99.99'}price
Additional "Expert" local params are supported in some cases for affecting the behavior of some statistics:
tdigestCompression - a positive numeric value defaulting to 100.0 controlling the compression factor of the T-Digest. Larger values mean more accuracy, but also use more memory.
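For example, a sketch requesting higher-accuracy percentiles at the cost of memory (the value 250.0 is arbitrary):
stats.field={!percentiles='99,99.9' tdigestCompression=250.0}price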
Examples
Here we compute some statistics for the price field. The min, max, mean, 90th, and 99th percentile price values
are computed against all products that are in stock (q=*:* and fq=inStock:true), and independently all of the default statistics are computed against all products regardless of whether they are in stock or not (by excluding that filter).
http://localhost:8983/solr/techproducts/select?q=*:*&fq={!tag=stock_check}inStock:
true&stats=true&stats.field={!ex=stock_check+key=instock_prices+min=true+max=true+
mean=true+percentiles='90,99'}price&stats.field={!key=all_prices}price&rows=0&inde
nt=true
<lst name="stats">
<lst name="stats_fields">
<lst name="instock_prices">
<double name="min">0.0</double>
<double name="max">2199.0</double>
<double name="mean">328.20437693595886</double>
<lst name="percentiles">
<double name="90.0">564.9700012207031</double>
<double name="99.0">1966.6484985351556</double>
</lst>
</lst>
<lst name="all_prices">
<double name="min">0.0</double>
<double name="max">2199.0</double>
<long name="count">12</long>
<long name="missing">5</long>
<double name="sum">4089.880027770996</double>
<double name="sumOfSquares">5385249.921747174</double>
<double name="mean">340.823335647583</double>
<double name="stddev">602.3683083752779</double>
</lst>
</lst>
</lst>
The Stats Component and Faceting
Although the stats.facet parameter is no longer recommended, sets of stats.field parameters can be referenced by 'tag' when using Pivot Faceting to compute multiple statistics at every level (i.e.: field) in the tree of pivot constraints.
For more information and a detailed example, please see Combining Stats Component With Pivots.
If any supported statistics are specified via local parameters, then the entire set of default statistics is
overridden and only the requested statistics are computed.
The Query Elevation Component
The Query Elevation Component lets you configure the top results for a given query regardless of the normal Lucene scoring. This is sometimes called "sponsored search," "editorial boosting," or "best bets." This component matches the user query text to a configured map of top results. The text can be any string or non-string IDs, as long as it's indexed. Although this component will work with any QueryParser, it makes the most sense to use with DisMax or eDisMax.
The Query Elevation Component is supported by distributed searching.
All of the sample configuration and queries used in this section assume you are running Solr's "techproducts" example:
bin/solr -e techproducts
Configuring the Query Elevation Component
You can configure the Query Elevation Component in the solrconfig.xml file.
<searchComponent name="elevator" class="solr.QueryElevationComponent" >
<!-- pick a fieldType to analyze queries -->
<str name="queryFieldType">string</str>
<str name="config-file">elevate.xml</str>
</searchComponent>
<requestHandler name="/elevate" class="solr.SearchHandler" startup="lazy">
<lst name="defaults">
<str name="echoParams">explicit</str>
</lst>
<arr name="last-components">
<str>elevator</str>
</arr>
</requestHandler>
Optionally, in the Query Elevation Component configuration you can also specify the following to distinguish
editorial results from "normal" results:
<str name="editorialMarkerFieldName">foo</str>
The Query Elevation Search Component takes the following arguments:
Argument Description
queryFieldType Specifies which fieldType should be used to analyze the incoming text. For example, it
may be appropriate to use a fieldType with a LowerCaseFilter.
config-file  Path to the file that defines query elevation. This file must exist in <instanceDir>/conf/<config-file> or <dataDir>/<config-file>.
If the file exists in the /conf/ directory it will be loaded once at startup. If it exists in the
data directory, it will be reloaded for each IndexReader.
forceElevation  By default, this component respects the requested sort parameter: if the request asks to sort by date, it will order the results by date. If forceElevation=true (the default is false), results will first return the boosted docs, then order by date.
elevate.xml
Elevated query results are configured in an external XML file specified in the config-file argument. An elevate.xml file might look like this:
<elevate>
<query text="foo bar">
<doc id="1" />
<doc id="2" />
<doc id="3" />
</query>
<query text="ipod">
<doc id="MA147LL/A" /> <!-- put the actual ipod at the top -->
<doc id="IW-02" exclude="true" /> <!-- exclude this cable -->
</query>
</elevate>
In this example, the query "foo bar" would first return documents 1, 2 and 3, then whatever normally appears for
the same query. For the query "ipod", it would first return "MA147LL/A", and would make sure that "IW-02" is not
in the result set.
Using the Query Elevation Component
The enableElevation Parameter
For debugging it may be useful to see results with and without the elevated docs. To hide results, use enableElevation=false:
http://localhost:8983/solr/techproducts/elevate?q=ipod&df=text&debugQuery=true&ena
bleElevation=true
http://localhost:8983/solr/techproducts/elevate?q=ipod&df=text&debugQuery=true&ena
bleElevation=false
The forceElevation Parameter
You can force elevation during runtime by adding forceElevation=true to the query URL:
http://localhost:8983/solr/techproducts/elevate?q=ipod&df=text&debugQuery=true&ena
bleElevation=true&forceElevation=true
The exclusive Parameter
You can force Solr to return only the results specified in the elevation file by adding exclusive=true to the URL:
http://localhost:8983/solr/techproducts/elevate?q=ipod&df=text&debugQuery=true&exc
lusive=true
Document Transformers and the markExcludes Parameter
The [elevated] Document Transformer can be used to annotate each document with information about whether or not it was elevated:
http://localhost:8983/solr/techproducts/elevate?q=ipod&df=text&fl=id,[elevated]
Likewise, it can be helpful when troubleshooting to see all matching documents – including documents that the elevation configuration would normally exclude. This is possible by using the markExcludes=true parameter, and then using the [excluded] transformer:
http://localhost:8983/solr/techproducts/elevate?q=ipod&df=text&markExcludes=true&f
l=id,[elevated],[excluded]
The elevateIds and excludeIds Parameters
When the elevation component is in use, the pre-configured list of elevations for a query can be overridden at
request time to use the unique keys specified in these request parameters.
For example, in the request below documents 3007WFP and 9885A004 will be elevated, and document IW-02
will be excluded -- regardless of what elevations or exclusions are configured for the query "cable" in elevate.xml:
http://localhost:8983/solr/techproducts/elevate?q=cable&df=text&excludeIds=IW-02&e
levateIds=3007WFP,9885A004
If either one of these parameters is specified at request time, the entire elevation configuration for the query is ignored.
For example, in the request below documents IW-02 and F8V7067-APL-KIT will be elevated, and no documents
will be excluded – regardless of what elevations or exclusions are configured for the query "ipod" in elevate.xml:
http://localhost:8983/solr/techproducts/elevate?q=ipod&df=text&elevateIds=IW-02,F8
V7067-APL-KIT
The fq Parameter
Query elevation respects the standard filter query (fq) parameter. That is, if the query contains the fq parameter, all results will be within that filter even if elevate.xml adds other documents to the result set.
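For example, in the sketch below the elevation configured for "ipod" only applies to documents that also satisfy the filter (inStock is a field in the techproducts example):
http://localhost:8983/solr/techproducts/elevate?q=ipod&df=text&fq=inStock:true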
Response Writers
A Response Writer generates the formatted response of a search. Solr supports a variety of Response Writers to
ensure that query responses can be parsed by the appropriate language or application.
The wt parameter selects the Response Writer to be used. The table below lists the most common settings for the wt parameter.
wt Parameter Setting Response Writer Selected
csv CSVResponseWriter
json JSONResponseWriter
php PHPResponseWriter
phps PHPSerializedResponseWriter
python PythonResponseWriter
ruby RubyResponseWriter
velocity VelocityResponseWriter
xml XMLResponseWriter
xslt XSLTResponseWriter
The Standard XML Response Writer
The XML Response Writer is the most general purpose and reusable Response Writer currently included with
Solr. It is the format used in most discussions and documentation about the response of Solr queries.
Note that the XSLT Response Writer can be used to convert the XML produced by this writer to other
vocabularies or text-based formats.
The behavior of the XML Response Writer can be driven by the following query parameters.
The version Parameter
The version parameter determines the XML protocol used in the response. Clients are strongly encouraged to always specify the protocol version, so as to ensure that the format of the response they receive does not change unexpectedly if the Solr server is upgraded and a new default format is introduced.
Currently supported version values are:
XML Version  Notes
2.2  The format of the responseHeader changed to use the same <lst> structure as the rest of the response.
The default value is the latest supported.
The stylesheet Parameter
The stylesheet parameter can be used to direct Solr to include a <?xml-stylesheet type="text/xsl" href="..."?> declaration in the XML response it returns.
The default behavior is not to return any stylesheet declaration at all.
Use of the stylesheet parameter is discouraged, as there is currently no way to specify external stylesheets, and no stylesheets are provided in the Solr distributions. This is a legacy parameter, which may be developed further in a future release.
The indent Parameter
If the indent parameter is used, and has a non-blank value, then Solr will make some attempts at indenting its XML response to make it more readable by humans.
The default behavior is not to indent.
The XSLT Response Writer
The XSLT Response Writer applies an XML stylesheet to output. It can be used for tasks such as formatting
results for an RSS feed.
tr Parameter
The XSLT Response Writer accepts one parameter: the tr parameter, which identifies the XML transformation to use. The transformation must be found in the Solr conf/xslt directory.
The Content-Type of the response is set according to the <xsl:output> statement in the XSLT transform, for example: <xsl:output media-type="text/html"/>
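For example, assuming a transform named example.xsl is present in conf/xslt (the sample_techproducts_configs config set ships with several example transforms), a request might look like:
http://localhost:8983/solr/techproducts/select?q=ipod&wt=xslt&tr=example.xsl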
Configuration
The example below, from the sample_techproducts_configs config set in the Solr distribution, shows how the XSLT Response Writer is configured.
<!--
Changes to XSLT transforms are taken into account
every xsltCacheLifetimeSeconds at most.
-->
<queryResponseWriter name="xslt"
class="org.apache.solr.request.XSLTResponseWriter">
<int name="xsltCacheLifetimeSeconds">5</int>
</queryResponseWriter>
A value of 5 for xsltCacheLifetimeSeconds is good for development, to see XSLT changes quickly. For
production you probably want a much higher value.
JSON Response Writer
A very commonly used Response Writer is the JsonResponseWriter, which formats output in JavaScript
Object Notation (JSON), a lightweight data interchange format specified in RFC 4627. Setting the wt
parameter to json invokes this Response Writer.
The default MIME type for the JSON writer is application/json, however this can be overridden in the
solrconfig.xml - such as in this example from the "techproducts" configuration:
<queryResponseWriter name="json" class="solr.JSONResponseWriter">
<!-- For the purposes of the tutorial, JSON response are written as
plain text so that it's easy to read in *any* browser.
If you are building applications that consume JSON, just remove
this override to get the default "application/json" mime type.
-->
<str name="content-type">text/plain</str>
</queryResponseWriter>
Python Response Writer
Solr has an optional Python response format that extends its JSON output in the following ways to allow the
response to be safely evaluated by the python interpreter:
true and false changed to True and False
Python unicode strings are used where needed
ASCII output (with unicode escapes) is used for less error-prone interoperability
newlines are escaped
null changed to None
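Setting the wt parameter to python invokes this Response Writer, as in this sketch (using the techproducts
example collection):
curl 'http://localhost:8983/solr/techproducts/select?q=ipod&wt=python'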
PHP Response Writer and PHP Serialized Response Writer
Solr has a PHP response format that outputs an array (as PHP code) which can be evaluated. Setting the wt
parameter to php invokes the PHP Response Writer.
Example usage:
$code =
file_get_contents('http://localhost:8983/solr/techproducts/select?q=iPod&wt=php');
eval("$result = " . $code . ";");
print_r($result);
Solr also includes a PHP Serialized Response Writer that formats output in a serialized array. Setting the wt
parameter to phps invokes the PHP Serialized Response Writer.
Example usage:
$serializedResult =
file_get_contents('http://localhost:8983/solr/techproducts/select?q=iPod&wt=phps');
$result = unserialize($serializedResult);
print_r($result);
Ruby Response Writer
Solr has an optional Ruby response format that extends its JSON output in the following ways to allow the
response to be safely evaluated by Ruby's interpreter:
Ruby's single quoted strings are used to prevent possible string exploits.
\ and ' are the only two characters escaped.
Unicode escapes are not used. Data is written as raw UTF-8.
nil used for null.
=> is used as the key/value separator in maps.
Here is a simple example of how one may query Solr using the Ruby response format:
require 'net/http'
h = Net::HTTP.new('localhost', 8983)
hresp, data = h.get('/solr/techproducts/select?q=iPod&wt=ruby', nil)
rsp = eval(data)
puts 'number of matches = ' + rsp['response']['numFound'].to_s
#print out the name field for each returned document
rsp['response']['docs'].each { |doc| puts 'name field = ' + doc['name'] }
CSV Response Writer
The CSV response writer returns a list of documents in comma-separated values (CSV) format. Other
information that would normally be included in a response, such as facet information, is excluded.
The CSV response writer supports multi-valued fields, as well as pseudo-fields, and the output of this CSV
format is compatible with Solr's CSV update format.
CSV Parameters
These parameters specify the CSV format that will be returned. You can accept the default values or specify your
own.
Parameter Default Value
csv.encapsulator "
csv.escape None
csv.separator ,
csv.header Defaults to true. If false, Solr does not print the column headers
csv.newline \n
csv.null Defaults to a zero length string. Use this parameter when a document has no value for a
particular field.
Multi-Valued Field CSV Parameters
These parameters specify how multi-valued fields are encoded. Per-field overrides for these values can be done
using f.<fieldname>.csv.separator=|.
Parameter Default Value
csv.mv.encapsulator None
csv.mv.escape \
csv.mv.separator Defaults to the csv.separator value
Example
http://localhost:8983/solr/techproducts/select?q=ipod&fl=id,cat,name,popularity,price,score&wt=csv returns:
id,cat,name,popularity,price,score
IW-02,"electronics,connector",iPod & iPod Mini USB 2.0 Cable,1,11.5,0.98867977
F8V7067-APL-KIT,"electronics,connector",Belkin Mobile Power Cord for iPod w/
Dock,1,19.95,0.6523595
MA147LL/A,"electronics,music",Apple 60 GB iPod with Video Playback
Black,10,399.0,0.2446348
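As a sketch of the per-field overrides described above, the following request changes the separator used when
encoding the multi-valued cat field to a pipe character (the field and value chosen here are only illustrative):
http://localhost:8983/solr/techproducts/select?q=ipod&fl=id,cat,name&wt=csv&f.cat.csv.separator=|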
Velocity Response Writer
The VelocityResponseWriter is an optional plugin available in the contrib/velocity directory. It is used to
power the Velocity Search UI in the example configuration.
Its JAR and dependencies must be added (via <lib> or solr/home lib inclusion), and it must be registered in
solrconfig.xml like this:
<queryResponseWriter name="velocity" class="solr.VelocityResponseWriter">
<str name="template.base.dir">${velocity.template.base.dir:}</str>
<!--
<str name="init.properties.file">velocity-init.properties</str>
<bool name="params.resource.loader.enabled">true</bool>
<bool name="solr.resource.loader.enabled">false</bool>
-->
</queryResponseWriter>
The above example shows the optional initialization parameters used by VelocityResponseWriter and detailed in
the following table. These initialization parameters are only specified in the writer registration in solrconfig.xml,
not as request-time parameters. See further below for request-time parameters.
VelocityResponseWriter initialization parameters
Parameter Description Default value
template.base.dir If specified and exists as a file system directory, a file resource
loader will be added for this directory. Templates in this directory
will override "solr" resource loader templates.
init.properties.file Specifies a properties file name which must exist in the Solr conf/
directory (not under a velocity/ subdirectory) or the root of a JAR file
in a <lib>.
params.resource.loader.enabled The "params" resource loader allows templates to be specified in
Solr request parameters. For example:
http://localhost:8983/solr/gettingstarted/select?q=*:*&wt=velocity&v.template=custom&v.template.custom=CUSTOM%3A%20%23core_name
where v.template=custom says to render a template called "custom"
and v.template.custom's value is the actual custom template. This is
disabled by default; it'd be a niche, unusual use case to need this
enabled.
false
solr.resource.loader.enabled The "solr" resource loader is the only template loader registered
by default. Templates are served from resources visible to the
SolrResourceLoader under a velocity/ subdirectory. The
VelocityResponseWriter itself has some built-in templates (in its
JAR file, under velocity/) that are available automatically through
this loader. These built-in templates can be overridden when the
same template name is in conf/velocity/ or by using the
template.base.dir option.
true
VelocityResponseWriter request parameters
Parameter Description Default value
v.template Specifies the name of the template to render.
v.layout Specifies a template name to use as the
layout around the main template specified
by v.template.
The main template is rendered into a string
value included into the layout rendering as
$content.
v.layout.enabled Determines if the main template should have
a layout wrapped around it. True by default,
but requires v.layout to be specified as well.
true
v.contentType Specifies the content type used in the HTTP
response. If not specified, the default will
depend on whether v.json is specified or not.
without json.wrf:
text/html;charset=UTF-8
with json.wrf:
application/json;charset=UTF-8
v.json Specifies a function name to wrap around the
response rendered as JSON. If specified, the
content type used in the response will be
"application/json;charset=UTF-8", unless
overridden by v.contentType.
Output will be in this format (with v.json=wrf):
wrf("result":"<Velocity
generated response string,
with quotes and backslashes
escaped>")
v.locale Locale to use with the $resource tool. The
default locale is Locale.ROOT. Localized
resources are loaded from standard Java
resource bundles named
resources[_locale-code].properties.
Resource bundles can be added by providing
a JAR file visible by the SolrResourceLoader
with resource bundles under a velocity
sub-directory. Resource bundles are not
loadable under conf/, as only the class
loader aspect of SolrResourceLoader can be
used here.
v.template.<template_name> When the "params" resource loader is
enabled, templates can be specified as part
of the Solr request.
VelocityResponseWriter context objects
Context
reference
Description
request SolrQueryRequest javadocs
response QueryResponse most of the time, but in some cases where QueryResponse doesn't like the
request handler's output (AnalysisRequestHandler, for example, causes a ClassCastException
parsing "response"), the response will be a SolrResponseBase object.
esc A Velocity EscapeTool instance
date A Velocity ComparisonDateTool instance
list A Velocity ListTool instance
math A Velocity MathTool instance
number A Velocity NumberTool instance
sort A Velocity SortTool instance
display A Velocity DisplayTool instance
resource A Velocity ResourceTool instance
engine The current VelocityEngine instance
page An instance of Solr's PageTool (only included if the response is a QueryResponse where paging
makes sense)
debug A shortcut to the debug part of the response, or null if debug is not on. This is handy for having
debug-only sections in a template using #if($debug)...#end
content The rendered output of the main template, when rendering the layout (v.layout.enabled=true and
v.layout=<template>).
For more information about the Velocity Response Writer, see
https://wiki.apache.org/solr/VelocityResponseWriter.
Binary Response Writer
Solr also includes a Response Writer that outputs binary format for use with a Java client. See Client APIs for
more details.
Near Real Time Searching
Near Real Time (NRT) search means that documents are available for search almost immediately after being
indexed: additions and updates to documents are seen in 'near' real time. Solr does not block updates while a
commit is in progress. Nor does it wait for background merges to complete before opening a new search of
indexes and returning.
With NRT, you can modify a commit command to be a soft commit, which avoids parts of a standard commit
that can be costly. You will still want to do standard commits to ensure that documents are in stable storage, but
soft commits let you see a very near real time view of the index in the meantime. However, pay special
attention to cache and autowarm settings as they can have a significant impact on NRT performance.
Commits and Optimizing
A commit operation makes index changes visible to new search requests. A hard commit uses the transaction
log to get the id of the latest document changes, and also calls fsync on the index files to ensure they have
been flushed to stable storage and no data loss will result from a power failure.
A soft commit is much faster since it only makes index changes visible and does not fsync index files or write
a new index descriptor. If the JVM crashes or there is a loss of power, changes that occurred after the last hard
commit will be lost. Search collections that have NRT requirements (that want index changes to be quickly
visible to searches) will want to soft commit often but hard commit less frequently. A softCommit may be "less
expensive" in terms of time, but not free, since it can slow throughput.
An optimize is like a hard commit except that it forces all of the index segments to be merged into a single
segment first. Depending on the use, this operation should be performed infrequently (e.g., nightly), if at all, since
it involves reading and re-writing the entire index. Segments are normally merged over time anyway (as
determined by the merge policy), and optimize just forces these merges to occur immediately.
Soft commit takes two parameters: maxDocs and maxTime.
Parameter Description
maxDocs Integer. Defines the number of documents to queue before pushing them to the index. It works in
conjunction with the update_handler_autosoftcommit_max_time parameter in that if
either limit is reached, the documents will be pushed to the index.
maxTime The number of milliseconds to wait before pushing documents to the index. It works in
conjunction with the update_handler_autosoftcommit_max_docs parameter in that if
either limit is reached, the documents will be pushed to the index.
Use maxDocs and maxTime judiciously to fine-tune your commit strategies.
AutoCommits
An autocommit also uses the parameters maxDocs and maxTime. However, it's useful in many strategies to use
both a hard autocommit and autosoftcommit to achieve more flexible commits.
A common configuration is to do a hard autocommit every 1-10 minutes and an autosoftcommit every
second. With this configuration, new documents will show up within about a second of being added, and if the
power goes out, soft commits are lost unless a hard commit has been done.
For example:
<autoSoftCommit>
<maxTime>1000</maxTime>
</autoSoftCommit>
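A matching hard autocommit for the strategy described above might look like the following sketch; the
ten-minute maxTime is illustrative and should be tuned to your needs (openSearcher is discussed in the section
UpdateHandlers in SolrConfig):
<autoCommit>
  <maxTime>600000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>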
It's better to use maxTime rather than maxDocs to modify an autoSoftCommit, especially when indexing a
large number of documents through the commit operation. It's also better to turn off autoSoftCommit for bulk
indexing.
Optional Attributes for commit and optimize
Parameter Valid
Attributes
Description
waitSearcher true, false Block until a new searcher is opened and registered as the main query
searcher, making the changes visible. Default is true.
softCommit true, false Perform a soft commit. This will refresh the view of the index faster, but
without guarantees that the document is stably stored. Default is false.
expungeDeletes true, false Valid for commit only. This parameter purges deleted data from segments.
The default is false.
maxSegments integer Valid for optimize only. Optimize down to at most this number of
segments. The default is 1.
Example of commit and optimize with optional attributes:
<commit waitSearcher="false"/>
<commit waitSearcher="false" expungeDeletes="true"/>
<optimize waitSearcher="false"/>
Passing commit and commitWithin parameters as part of the URL
Update handlers can also get commit-related parameters as part of the update URL. This example adds a small
test document and causes an explicit commit to happen immediately afterwards:
http://localhost:8983/solr/my_collection/update?stream.body=<add><doc>
<field name="id">testdoc</field></doc></add>&commit=true
Alternately, you may want to use this:
http://localhost:8983/solr/my_collection/update?stream.body=<optimize/>
This example causes the index to be optimized down to at most 10 segments, but won't wait around until it's
done (waitFlush=false):
curl
'http://localhost:8983/solr/my_collection/update?optimize=true&maxSegments=10&waitFl
ush=false'
This example adds a small test document with a commitWithin instruction that tells Solr to make sure the
document is committed no later than 10 seconds later (this method is generally preferred over explicit commits):
curl http://localhost:8983/solr/my_collection/update?commitWithin=10000
-H "Content-Type: text/xml" --data-binary
'<add><doc><field name="id">testdoc</field></doc></add>'
Changing default commitWithin Behavior
The commitWithin settings allow forcing document commits to happen in a defined time period. This is used
most frequently with Near Real Time Searching, and for that reason the default is to perform a soft commit. This
does not, however, replicate new documents to slave servers in a master/slave environment. If that's a
requirement for your implementation, you can force a hard commit by adding a parameter, as in this example:
<commitWithin>
<softCommit>false</softCommit>
</commitWithin>
With this configuration, when you call commitWithin as part of your update message, it will automatically
perform a hard commit every time.
RealTime Get
For index updates to be visible (searchable), some kind of commit must reopen a searcher to a new point-in-time
view of the index. The realtime get feature allows retrieval (by unique-key) of the latest version of any
documents without the associated cost of reopening a searcher. This is primarily useful when using Solr as a
NoSQL data store and not just a search index.
Real Time Get relies on the update log feature, which is enabled by default and can be configured in
solrconfig.xml:
<updateLog>
<str name="dir">${solr.ulog.dir:}</str>
</updateLog>
Real Time Get requests can be performed using the /get handler which exists implicitly in Solr - it's equivalent
to the following configuration:
<requestHandler name="/get" class="solr.RealTimeGetHandler">
<lst name="defaults">
<str name="omitHeader">true</str>
<str name="wt">json</str>
<str name="indent">true</str>
</lst>
</requestHandler>
For example, if you started Solr using the bin/solr -e techproducts example command, you could then
index a new document (without committing it) like so:
curl 'http://localhost:8983/solr/techproducts/update/json?commitWithin=10000000'
-H 'Content-type:application/json' -d '[{"id":"mydoc","name":"realtime-get
test!"}]'
If you do a normal search, this document should not be found yet:
http://localhost:8983/solr/techproducts/select?q=id:mydoc
...
"response":
{"numFound":0,"start":0,"docs":[]}
However if you use the Real Time Get handler exposed at /get, you can still retrieve that document:
http://localhost:8983/solr/techproducts/get?id=mydoc
...
{"doc":{"id":"mydoc","name":"realtime-get test!", "_version_":1487137811571146752}}
You can also specify multiple documents at once via the ids parameter and a comma-separated list of ids, or by
using multiple id parameters. If you specify multiple ids, or use the ids parameter, the response will mimic a
normal query response to make it easier for existing clients to parse.
For example:
http://localhost:8983/solr/techproducts/get?ids=mydoc,IW-02
http://localhost:8983/solr/techproducts/get?id=mydoc&id=IW-02
...
{"response":
{"numFound":2,"start":0,"docs":
[ { "id":"mydoc",
"name":"realtime-get test!",
"_version_":1487137811571146752},
{
"id":"IW-02",
"name":"iPod & iPod Mini USB 2.0 Cable",
...
]
}
}
Exporting Result Sets
It's possible to export fully sorted result sets using a special rank query parser and response writer specifically
designed to work together to handle scenarios that involve sorting and exporting millions of records. This uses a
stream sorting technique that begins to send records within milliseconds and continues to stream results until the
entire result set has been sorted and exported.
The cases where this functionality may be useful include: session analysis, distributed merge joins, time series
roll-ups, aggregations on high cardinality fields, fully distributed field collapsing, and sort based stats.
Field Requirements
All the fields being sorted and exported must have docValues set to true. For more information, see the section
on DocValues.
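For example, a field intended for sorting and exporting could be declared in the schema with docValues
enabled; this is only a sketch, and the field and type names are illustrative:
<field name="severity" type="string" indexed="true" stored="true" docValues="true"/>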
Defining the /export Request Handler
To export the full sorted result set you'll want to use a request handler explicitly configured to only run the "query"
component, using the export "rq" and "wt" params.
An "/export" request handler with the appropriate configuration is included in the techproducts example
solrconfig.xml. If, however, you would like to add it to an existing solrconfig.xml, you can add a section like
this:
<requestHandler name="/export" class="solr.SearchHandler">
<lst name="invariants">
<str name="rq">{!xport}</str>
<str name="wt">xsort</str>
<str name="distrib">false</str>
</lst>
<arr name="components">
<str>query</str>
</arr>
</requestHandler>
Note that this request handler's properties are defined as "invariants", which means they cannot be overridden by
other properties passed at another time (such as at query time).
Requesting Results Export
Once the /export request handler is defined, you can use it to make requests to export the result set of a
query.
All queries must include sort and fl parameters, or the query will return an error. Filter queries are also
supported. Results are always returned in JSON format.
Do NOT disable the realtime get handler at /get if you are using SolrCloud, otherwise any leader
election will cause a full sync in ALL replicas for the shard in question. Similarly, a replica recovery will
also always fetch the complete index from the leader because a partial sync will not be possible in the
absence of this handler.
Here is an example of what an export request of some indexed log data might look like:
http://localhost:8983/solr/core_name/export?q=my-query&sort=severity+desc,timestamp+
desc&fl=severity,timestamp,msg
Specifying the Sort Criteria
The sort property defines how documents will be sorted in the exported result set. Results can be sorted by any
field that has a field type of int, long, float, double, or string. The sort fields must be single-valued fields.
Up to four sort fields can be specified per request, with the 'asc' or 'desc' properties.
Specifying the Field List
The fl property defines the fields that will be exported with the result set. Any of the field types that can be
sorted (i.e., int, long, float, double, string) can be used in the field list. The fields can be single or multi-valued.
However, returning scores and wildcards are not supported at this time.
Distributed Support
The initial release treats all queries as non-distributed requests. So the client is responsible for making the calls
to each Solr instance and merging the results.
Using SolrJ’s CloudSolrClient as a model, developers could build clients that automatically send requests to all
the shards in a collection (or multiple collections) and then merge the sorted sets any way they wish.
The Well-Configured Solr Instance
This section tells you how to fine-tune your Solr instance for optimum performance. This section covers the
following topics:
Configuring solrconfig.xml: Describes how to work with the main configuration file for Solr, solrconfig.xml,
covering the major sections of the file.
Solr Cores and solr.xml: Describes how to work with solr.xml and core.properties to configure your Solr
core, or multiple Solr cores within a single instance.
Solr Plugins: Introduces Solr plugins with pointers to more information.
JVM Settings: Gives some guidance on best practices for working with Java Virtual Machines.
Configuring solrconfig.xml
The solrconfig.xml file is the configuration file with the most parameters affecting Solr itself. While
configuring Solr, you'll work with solrconfig.xml often. The file comprises a series of XML statements that set
configuration values for an individual collection.
In solrconfig.xml, you configure important features such as:
request handlers, which process the requests to Solr, such as requests to add documents to the index or
requests to return results for a query
listeners, processes that "listen" for particular query-related events; listeners can be used to trigger the
execution of special code, such as invoking some common queries to warm-up caches
the Request Dispatcher for managing HTTP communications
the Admin Web interface
parameters related to replication and duplication (these parameters are covered in detail in Legacy
Scaling and Distribution)
The solrconfig.xml file is located in the conf/ directory for each collection. Several well-commented
example files can be found in the server/solr/configsets/ directories demonstrating best practices for
many different types of installations.
We've covered the options in the following sections:
DataDir and DirectoryFactory in SolrConfig
Lib Directives in SolrConfig
Managed Schema Definition in SolrConfig
IndexConfig in SolrConfig
RequestHandlers and SearchComponents in SolrConfig
InitParams in SolrConfig
UpdateHandlers in SolrConfig
The focus of this section is generally on configuring a single Solr instance, but for those interested in
scaling a Solr implementation in a cluster environment, see also the section SolrCloud. There are also
options to scale through sharding or replication, described in the section Legacy Scaling and Distribution.
Query Settings in SolrConfig
RequestDispatcher in SolrConfig
Substituting Properties in Solr Config Files
Solr supports variable substitution of property values in config files, which allows runtime specification of various
configuration options in solrconfig.xml. The syntax is ${propertyname[:option default value]}.
This allows defining a default that can be overridden when Solr is launched. If a default value is not specified,
then the property must be specified at runtime or the configuration file will generate an error when parsed.
There are multiple methods for specifying properties that can be used in configuration files.
JVM System Properties
Any JVM System properties, usually specified using the -D flag when starting the JVM, can be used as variables
in any XML configuration file in Solr.
For example, in the sample solrconfig.xml files, you will see this value which defines the locking type to use:
<lockType>${solr.lock.type:native}</lockType>
Which means the lock type defaults to "native" but when starting Solr, you could override this using a JVM
system property by launching Solr with:
bin/solr start -Dsolr.lock.type=none
In general, any Java system property that you want to set can be passed through the bin/solr script using the
standard -Dproperty=value syntax. Alternatively, you can add common system properties to the SOLR_OPTS
environment variable defined in the Solr include file (bin/solr.in.sh). For more information about how the
Solr include file works, refer to: Taking Solr to Production.
solrcore.properties
If the configuration directory for a Solr core contains a file named solrcore.properties, that file can contain
any arbitrary user defined property names and values using the Java standard properties file format, and those
properties can be used as variables in the XML configuration files for that Solr core.
For example, the following solrcore.properties file could be created in the conf/ directory of a collection
using one of the example configurations, to override the lockType used.
#conf/solrcore.properties
solr.lock.type=none
User defined properties from core.properties
If you are using a core.properties file with solr.xml, then any user defined properties in that file may be
specified there and those properties will be available for substitution when parsing XML configuration files for that
Solr core.
The path and name of the solrcore.properties file can be overridden using the properties property
in core.properties.
For example, consider the following core.properties file:
#core.properties
name=collection2
my.custom.prop=edismax
The my.custom.prop property can then be used as a variable, such as in solrconfig.xml:
<requestHandler name="/select">
<lst name="defaults">
<str name="defType">${my.custom.prop}</str>
</lst>
</requestHandler>
Implicit Core Properties
Several attributes of a Solr core are available as "implicit" properties that can be used in variable substitution,
independent of where or how the underlying value is initialized. For example: regardless of whether the name
for a particular Solr core is explicitly configured in core.properties or inferred from the name of the instance
directory, the implicit solr.core.name property is available for use as a variable in that core's configuration
file...
<requestHandler name="/select">
<lst name="defaults">
<str name="collection_name">${solr.core.name}</str>
</lst>
</requestHandler>
All implicit properties use the solr.core. name prefix, and reflect the runtime value of the equivalent
core.properties property:
solr.core.name
solr.core.config
solr.core.schema
solr.core.dataDir
solr.core.transient
solr.core.loadOnStartup
DataDir and DirectoryFactory in SolrConfig
Specifying a Location for Index Data with the dataDir Parameter
By default, Solr stores its index data in a directory called /data under the Solr home. If you would like to specify
a different directory for storing index data, use the <dataDir> parameter in the solrconfig.xml file. You can
specify another directory either with a full pathname or a pathname relative to the instance dir of the SolrCore.
For example:
<dataDir>/var/data/solr/</dataDir>
If you are using replication to replicate the Solr index (as described in Legacy Scaling and Distribution), then the
<dataDir> directory should correspond to the index directory used in the replication configuration.
Specifying the DirectoryFactory For Your Index
The default, solr.StandardDirectoryFactory, is filesystem based, and tries to pick the best implementation
for the current JVM and platform. You can force a particular implementation by specifying
solr.MMapDirectoryFactory, solr.NIOFSDirectoryFactory, or solr.SimpleFSDirectoryFactory.
<directoryFactory name="DirectoryFactory"
class="${solr.directoryFactory:solr.StandardDirectoryFactory}"/>
The solr.RAMDirectoryFactory is memory based, not persistent, and does not work with replication. Use
this DirectoryFactory to store your index in RAM.
<directoryFactory class="org.apache.solr.core.RAMDirectoryFactory"/>
Lib Directives in SolrConfig
Solr allows loading plugins by defining <lib/> directives in solrconfig.xml.
The plugins are loaded in the order they appear in solrconfig.xml. If there are dependencies, list the lowest
level dependency jar first.
Regular expressions can be used to provide control over loading jars with dependencies on other jars in the same
directory. All directories are resolved as relative to the Solr instanceDir.
<lib dir="../../../contrib/extraction/lib" regex=".*\.jar" />
<lib dir="../../../dist/" regex="solr-cell-\d.*\.jar" />
<lib dir="../../../contrib/clustering/lib/" regex=".*\.jar" />
<lib dir="../../../dist/" regex="solr-clustering-\d.*\.jar" />
<lib dir="../../../contrib/langid/lib/" regex=".*\.jar" />
<lib dir="../../../dist/" regex="solr-langid-\d.*\.jar" />
<lib dir="../../../contrib/velocity/lib" regex=".*\.jar" />
<lib dir="../../../dist/" regex="solr-velocity-\d.*\.jar" />
Managed Schema Definition in SolrConfig
The Schema API enables schema modifications through a REST interface. (Read-only access to all schema
elements is also supported.)
There are challenges with allowing programmatic access to a configuration file that is also open to manual edits:
system-generated and manual edits may overlap and the system-generated edits may remove comments or
other customizations that are critical for the organization to understand why fields, field types, etc., are defined
the way they are. You may want to version the file with source control, or limit manual edits altogether.
The schemaFactory option in solrconfig.xml controls whether the Schema should be defined as a
"managed index schema": schema modification is only possible through the Schema API.
By default, if no schemaFactory is specified, then the default behavior is to use the
"ClassicIndexSchemaFactory", as in this example from the sample_techproducts_configs config set:
<schemaFactory class="ClassicIndexSchemaFactory"/>
The ClassicIndexSchemaFactory requires the use of a schema.xml file, which can be edited manually and
is only loaded when the collection is loaded. This setting disallows Schema API methods that modify the
schema.
In the data_driven_schema_configs config set, however, we see a usage of the
ManagedIndexSchemaFactory:
<!-- To disable dynamic schema REST APIs, use the following for <schemaFactory>:
<schemaFactory class="ClassicIndexSchemaFactory"/>
When ManagedIndexSchemaFactory is specified instead, Solr will load the
schema from
the resource named in 'managedSchemaResourceName', rather than from
schema.xml.
Note that the managed schema resource CANNOT be named schema.xml. If the
managed
schema does not exist, Solr will create it after reading schema.xml, then
rename
'schema.xml' to 'schema.xml.bak'.
Do NOT hand edit the managed schema - external modifications will be ignored
and
overwritten as a result of schema modification REST API calls.
When ManagedIndexSchemaFactory is specified with mutable = true, schema
modification REST API calls will be allowed; otherwise, error responses will
be
sent back for these requests.
-->
<schemaFactory class="ManagedIndexSchemaFactory">
<bool name="mutable">true</bool>
<str name="managedSchemaResourceName">managed-schema</str>
</schemaFactory>
Here you can see configuration for the managed schema. In order for schema modifications to be possible via
the Schema API, the ManagedIndexSchemaFactory will need to be used. The mutable parameter must also
be set to true. The managedSchemaResourceName, which defaults to "managed-schema", may also be
defined, and can be anything other than "schema.xml".
With the above configuration, you can use the Schema API to modify the schema as much as you want, and
then later change the value of mutable to false if you wish to "lock" the schema in place and prevent future
changes.
If you have an existing Solr collection you wish to convert to use a managed schema, you can simply modify the
solrconfig.xml to specify the use of the ManagedIndexSchemaFactory. Once Solr is restarted and it
detects that a schema.xml file exists, but the managedSchemaResourceName file does not exist, the existing
schema.xml file will be renamed to schema.xml.bak and the contents are written to a file with the name
defined as the managedSchemaResourceName. If you look at the resulting file, you'll see this at the top of the
page:
<!-- Solr managed schema - automatically generated - DO NOT EDIT -->
You are now free to use the Schema API as much as you want to make changes, and remove the
schema.xml.bak.
IndexConfig in SolrConfig
The <indexConfig> section of solrconfig.xml defines low-level behavior of the Lucene index writers. By
default, the settings are commented out in the sample solrconfig.xml included with Solr, which means the
defaults are used. In most cases, the defaults are fine.
<indexConfig>
...
</indexConfig>
Parameters covered in this section:
Sizing Index Segments
Merging Index Segments
Index Locks
Other Indexing Settings
Sizing Index Segments
ramBufferSizeMB
Once accumulated document updates exceed this much memory space (defined in megabytes), then the
pending updates are flushed. This can also create new segments or trigger a merge. Using this setting is
generally preferable to maxBufferedDocs. If both maxBufferedDocs and ramBufferSizeMB are set in
solrconfig.xml, then a flush will occur when either limit is reached. The default is 100MB.
<ramBufferSizeMB>100</ramBufferSizeMB>
maxBufferedDocs
Sets the number of document updates to buffer in memory before they are flushed as a new segment. This may
also trigger a merge. The default Solr configuration sets it to flush by RAM usage (ramBufferSizeMB).
<maxBufferedDocs>1000</maxBufferedDocs>
maxIndexingThreads
The maximum number of simultaneous threads used to index documents. Once this threshold is reached,
additional threads will wait for the others to finish. The default is 8.
<maxIndexingThreads>8</maxIndexingThreads>
UseCompoundFile
Setting <useCompoundFile> to true combines the various files of a segment into a single file, although the
default is false. On systems where the number of open files allowed per process is limited, setting this to false
may avoid hitting that limit (the open files limit might also be tunable for your OS with the Linux/Unix ulimit
command, or something similar for other operating systems). In some cases, other internal factors may set a
segment to "compound=false", even if this setting is explicitly set to true, so the compounding of the files in a
segment may not always happen.
Updating a compound index may incur a minor performance hit for various reasons, depending on the runtime
environment. For example, filesystem buffers are typically associated with open file descriptors, which may limit
the total cache space available to each index.
This setting may also affect how much data needs to be transferred during index replication operations.
The default is .false
<useCompoundFile>false</useCompoundFile>
Merging Index Segments
mergePolicy
Defines how merging segments is done. The default in Solr is TieredMergePolicy, which merges segments
of approximately equal size, subject to an allowed number of segments per tier. Other policies available are the
LogByteSizeMergePolicy and LogDocMergePolicy. For more information on these policies, please see the
MergePolicy javadocs.
<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
<int name="maxMergeAtOnce">10</int>
<int name="segmentsPerTier">10</int>
</mergePolicy>
mergeFactor
For the LogByteSizeMergePolicy, or the (default) TieredMergePolicy, the <mergeFactor> configuration option
can be used as a convenience setting to indicate how many segments a Lucene index is allowed to merge at
one time. It is equivalent to setting <int name="mergeFactor"> on the LogByteSizeMergePolicy, or
setting both <int name="maxMergeAtOnce"> and <int name="segmentsPerTier"> (to the same value)
on the TieredMergePolicy.
To understand why a mergeFactor is important, consider what happens when an update is made to an index:
Documents are always added to the most recently opened segment. When a segment fills up, a new segment is
created and subsequent updates are placed there. If creating a new segment would cause the number of
lowest-level segments to exceed the mergeFactor value, then all those segments are merged together to form
a single large segment. Thus, if the merge factor is 10, each merge results in the creation of a single segment
that is roughly ten times larger than each of its ten constituents. When there are 10 of these larger segments,
then they in turn are merged into an even larger single segment. This process can continue indefinitely.
Choosing the best merge factor is generally a trade-off of indexing speed vs. searching speed. Having fewer
segments in the index generally accelerates searches, because there are fewer places to look. It can also
result in fewer physical files on disk. But to keep the number of segments low, merges will occur more often,
which can add load to the system and slow down updates to the index.
Conversely, keeping more segments can accelerate indexing, because merges happen less often, making an
update less likely to trigger a merge. But searches become more computationally expensive and will likely be
slower, because search terms must be looked up in more index segments. Faster index updates also mean
shorter commit turnaround times, which means more timely search results.
The default behavior of both LogByteSizeMergePolicy and TieredMergePolicy is an effective mergeFactor of
10:
<mergeFactor>10</mergeFactor>
mergeScheduler
The merge scheduler controls how merges are performed. The default, ConcurrentMergeScheduler, performs
merges in the background using separate threads. The alternative, SerialMergeScheduler, does not perform
merges with separate threads.
<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
mergedSegmentWarmer
When using Solr for Near Real Time Searching, a merged segment warmer can be configured to warm the
reader on the newly merged segment, before the merge commits. This is not required for near real-time search,
but will reduce search latency on opening a new near real-time reader after a merge completes.
<mergedSegmentWarmer class="org.apache.lucene.index.SimpleMergedSegmentWarmer"/>
checkIntegrityAtMerge
If set to true, any actions that result in merging segments will first trigger an integrity check using checksums
stored in the index segments (if available). If the checksums are not correct, the merge will fail and throw an
Exception. (Defaults to "false" for backwards compatibility.)
<checkIntegrityAtMerge>true</checkIntegrityAtMerge>
Index Locks
lockType
The LockFactory options specify the locking implementation to use.
The set of valid lock type options depends on the DirectoryFactory you have configured. The values listed below
are supported by StandardDirectoryFactory (the default):
native (default) uses NativeFSLockFactory to specify native OS file locking. If a second Solr process
attempts to access the directory, it will fail. Do not use when multiple Solr web applications are attempting
to share a single index.
simple uses SimpleFSLockFactory to specify a plain file for locking.
single (expert) uses SingleInstanceLockFactory. Use for special situations of a read-only index
directory, or when there is no possibility of more than one process trying to modify the index (even
sequentially). This type will protect against multiple cores within the same JVM attempting to access the
same index. WARNING! If multiple Solr instances in different JVMs modify an index, this type will not
protect against index corruption.
For more information on the nuances of each LockFactory, see
http://wiki.apache.org/lucene-java/AvailableLockFactories.
<lockType>native</lockType>
unlockOnStartup
If true, any write or commit locks that have been held will be unlocked on system startup. This defeats the
locking mechanism that allows multiple processes to safely access a Lucene index. The default is false, and
changing this should only be done with care. This parameter is not used if the lockType is "none" or "single".
<unlockOnStartup>false</unlockOnStartup>
writeLockTimeout
The maximum time to wait for a write lock on an IndexWriter. The default is 1000, expressed in milliseconds.
<writeLockTimeout>1000</writeLockTimeout>
Other Indexing Settings
There are a few other parameters that may be important to configure for your implementation. These settings
affect how or when updates are made to an index.
Setting Description
reopenReaders Controls if IndexReaders will be re-opened, instead of closed and then opened, which is
often less efficient. The default is true.
deletionPolicy Controls how commits are retained in case of rollback. The default is SolrDeletionPolicy,
which has sub-parameters for the maximum number of commits to keep (maxCommitsToKeep),
the maximum number of optimized commits to keep (maxOptimizedCommitsToKeep), and
the maximum age of any commit to keep (maxCommitAge), which supports DateMathParser
syntax.
infoStream The InfoStream setting instructs the underlying Lucene classes to write detailed debug
information from the indexing process as Solr log messages.
<reopenReaders>true</reopenReaders>
<deletionPolicy class="solr.SolrDeletionPolicy">
<str name="maxCommitsToKeep">1</str>
<str name="maxOptimizedCommitsToKeep">0</str>
<str name="maxCommitAge">1DAY</str>
</deletionPolicy>
<infoStream>false</infoStream>
RequestHandlers and SearchComponents in SolrConfig
After the <query> section of solrconfig.xml, request handlers and search components are configured.
A request handler processes requests coming to Solr. These might be query requests or index update requests.
You will likely need several of these defined, depending on how you want Solr to handle the various requests you
will make.
A search component is a feature of search, such as highlighting or faceting. The search component is defined in
solrconfig.xml separate from the request handlers, and then registered with a request handler as needed.
These are often referred to as "requestHandler" and "searchComponent", which is how they are defined in
solrconfig.xml.
Topics covered in this section:
Request Handlers
SearchHandlers
UpdateRequestHandlers
ShardHandlers
Other Request Handlers
Search Components
Default Components
First-Components and Last-Components
Components
Other Useful Components
Request Handlers
Every request handler is defined with a name and a class. The name of the request handler is referenced with
the request to Solr, typically as a path. For example, if Solr is installed at http://localhost:8983/solr/
and you have a collection named "gettingstarted", you can make a request using URLs like this:
http://localhost:8983/solr/gettingstarted/select?q=solr
This query will be processed by the request handler with the name "/select". We've only used the "q"
parameter here, which includes our query term, a simple keyword of "solr". If the request handler has more
parameters defined, those will be used with any query we send to this request handler unless they are
over-ridden by the client (or user) in the query itself.
If you have another request handler defined, you would send your request with that name. For example,
/update is a request handler that handles index updates (i.e., sending new documents to the index). By default,
/select is a request handler that handles query requests.
Request handlers can also process requests for nested paths of their names, for example, a request using
/myhandler/extrapath may be processed by a request handler registered with the name /myhandler. If a
request handler is explicitly defined by the name /myhandler/extrapath, that would take precedence over
the nested path. This assumes you are using the request handler classes included with Solr; if you create your
own request handler, you should make sure it includes the ability to handle nested paths if you want to use them
with your custom request handler.
It is also possible to configure defaults for request handlers with a section called initParams. These defaults
can be used when you want to have common properties that will be used by each separate handler. For
example, if you intend to create several request handlers that will all request the same list of fields in the
response, you can configure an initParams section with your list of fields. For more information about
initParams, see the section InitParams in SolrConfig.
SearchHandlers
The primary request handler defined with Solr by default is the "SearchHandler", which handles search queries.
The request handler is defined, and then a list of defaults for the handler are defined with a defaults list.
For example, in the default solrconfig.xml, the first request handler defined looks like this:
<requestHandler name="/select" class="solr.SearchHandler">
<lst name="defaults">
<str name="echoParams">explicit</str>
<int name="rows">10</int>
</lst>
</requestHandler>
This example defines the rows parameter, which defines how many search results to return, to "10". The
echoParams parameter defines that the parameters defined in the query should be returned when debug information is
returned. Note also that the way the defaults are defined in the list varies if the parameter is a string, an integer,
or another type.
All of the parameters described in the section on searching can be defined as defaults for any of the
SearchHandlers.
Besides defaults, there are other options for the SearchHandler, which are:
appends: This allows definition of parameters that are added to the user query. These might be filter
queries, or other query rules that should be added to each query. There is no mechanism in Solr to allow
a client to override these additions, so you should be absolutely sure you always want these parameters
applied to queries.
<lst name="appends">
<str name="fq">inStock:true</str>
</lst>
In this example, the filter query "inStock:true" will always be added to every query.
invariants: This allows definition of parameters that cannot be overridden by a client. The values
defined in an invariants section will always be used regardless of the values specified by the user, by
the client, in defaults or in appends.
<lst name="invariants">
<str name="facet.field">cat</str>
<str name="facet.field">manu_exact</str>
<str name="facet.query">price:[* TO 500]</str>
<str name="facet.query">price:[500 TO *]</str>
</lst>
In this example, facet fields have been defined which limits the facets that will be returned by Solr. If the
client requests facets, the facets defined with a configuration like this are the only facets they will see.
The final section of a request handler definition is components, which defines a list of search components that
can be used with a request handler. They are only registered with the request handler. How to define a search
component is discussed further on in the section on Search Components. The components element can only be
used with a request handler that is a SearchHandler.
The solrconfig.xml file includes many other examples of SearchHandlers that can be used or modified as
needed.
UpdateRequestHandlers
The UpdateRequestHandlers are request handlers which process updates to the index.
In this guide, we've covered these handlers in detail in the section Uploading Data with Index Handlers.
ShardHandlers
It is possible to configure a request handler to search across shards of a cluster, used with distributed search.
More information about distributed search and how to configure the shardHandler is in the section
Distributed Search with Index Sharding.
Other Request Handlers
There are other request handlers defined in solrconfig.xml, covered in other sections of this guide:
RealTime Get
Index Replication
Ping
Search Components
Search components define the logic that is used by the SearchHandler to perform queries for users.
Default Components
There are several default search components that work with all SearchHandlers without any additional
configuration. If no components are defined (with the exception of first-components and
last-components - see below), these are executed by default, in the following order:
Component
Name
Class Name More Information
query solr.QueryComponent Described in the section Query Syntax and Parsing.
facet solr.FacetComponent Described in the section Faceting.
mlt solr.MoreLikeThisComponent Described in the section MoreLikeThis.
highlight solr.HighlightComponent Described in the section Highlighting.
stats solr.StatsComponent Described in the section The Stats Component.
debug solr.DebugComponent Described in the section on Common Query Parameters.
expand solr.ExpandComponent Described in the section Collapse and Expand Results.
If you register a new search component with one of these default names, the newly defined component will be
used instead of the default.
First-Components and Last-Components
It's possible to define some components as being used before (with first-components) or after (with
last-components) the default components listed above.
<arr name="first-components">
<str>mycomponent</str>
</arr>
<arr name="last-components">
<str>spellcheck</str>
</arr>
Components
If you define components, the default components (see above) will not be executed, and first-components
and last-components are disallowed:
<arr name="components">
<str>mycomponent</str>
<str>query</str>
<str>debug</str>
</arr>
Other Useful Components
Many of the other useful components are described in sections of this Guide for the features they support. These
are:
SpellCheckComponent, described in the section Spell Checking.
TermVectorComponent, described in the section The Term Vector Component.
QueryElevationComponent, described in the section The Query Elevation Component.
TermsComponent, described in the section The Terms Component.
first-components and/or last-components may only be used in conjunction with the default
components. If you define your own components, the default components will not be executed, and
first-components and last-components are disallowed.
InitParams in SolrConfig
An <initParams> section of solrconfig.xml allows you to define request handler parameters outside of
the handler configuration.
The use cases are:
Some handlers are implicitly defined in code and there should be a way to add/append/override some of
the implicitly defined properties
There are a few properties that are used across handlers. This helps you keep only a single definition of
those properties and apply them over multiple handlers
For example, if you want several of your search handlers to return the same list of fields, you can create an
<initParams> section without having to define the same set of parameters in each request handler definition. If you
have a single request handler that should return different fields, you can define the overriding parameters in
individual <requestHandler> sections as usual.
The properties and configuration of an <initParams> section mirror the properties and configuration of a
request handler. It can include sections for defaults, appends, and invariants, the same as any request handler.
For example, here is one of the <initParams> sections defined by default in the data_driven_config
example:
<initParams path="/update/**,/query,/select,/tvrh,/elevate,/spell,/browse">
<lst name="defaults">
<str name="df">_text_</str>
</lst>
</initParams>
This sets the default search field ("df") to be "_text_" for all of the request handlers named in the path section. If
we later want to change the /query request handler to search a different field by default, we could override the
<initParams> by defining the parameter in the <requestHandler> section for /query.
The syntax and semantics are similar to that of a <requestHandler>. The following are the attributes:
property Description
path A comma-separated list of paths which will use the parameters. Wildcards can be used in paths to
define nested paths, as described below.
name The name of this set of parameters. The name can be used directly in a requestHandler definition if a
path is not explicitly named. If you give your <initParams> a name, you can refer to the params
in a <requestHandler> that is not defined as a path.
For example, if an <initParams> section has the name "myParams", you can call the name
when defining your request handler:
<requestHandler name="/dump1" class="DumpRequestHandler"
initParams="myParams"/>
Wildcards
An <initParams> section can support wildcards to define nested paths that should use the parameters
defined. A single asterisk (*) denotes that a nested path one level deeper should use the parameters. Double
asterisks (**) denote all nested paths no matter how deep should use the parameters.
For example, if we have an <initParams> that looks like this:
<initParams name="myParams" path="/myhandler,/root/*,/root1/**">
<lst name="defaults">
<str name="fl">_text_</str>
</lst>
<lst name="invariants">
<str name="rows">10</str>
</lst>
<lst name="appends">
<str name="df">title</str>
</lst>
</initParams>
We've defined three paths with this section:
/myhandler declared as a direct path.
/root/* with a single asterisk to indicate the parameters should apply to paths that are one level deep.
/root1/** with double asterisks to indicate the parameters should apply to all nested paths, no matter
how deep.
When we define the request handlers, the wildcards will work in the following ways:
<requestHandler name="/myhandler" class="SearchHandler"/>
The /myhandler class was named as a path in the <initParams> so this will use those parameters.
Next we have a request handler named /root/search5:
<requestHandler name="/root/search5" class="SearchHandler"/>
We defined a wildcard for nested paths that are one level deeper than /root, so this request handler will use
the parameters. This one, however, will not, because /root/search5/test is more than one level deep from
/root:
<requestHandler name="/root/search5/test" class="SearchHandler"/>
If we want to define all levels of nested paths, we should use double asterisks, as in the example path
/root1/**:
<requestHandler name="/root1/search/tests" class="SearchHandler"/>
Any path under /root1, whether explicitly defined in a request handler or not, will use the parameters defined in
the matching initParams section.
UpdateHandlers in SolrConfig
The settings in this section are configured in the <updateHandler> element in solrconfig.xml and may
affect the performance of index updates. These settings affect how updates are done internally.
<updateHandler> configurations do not affect the higher level configuration of RequestHandlers that process
client update requests.
<updateHandler class="solr.DirectUpdateHandler2">
...
</updateHandler>
Topics covered in this section:
Commits
commit and softCommit
autoCommit
commitWithin
maxPendingDeletes
Event Listeners
Transaction Log
Commits
Data sent to Solr is not searchable until it has been committed to the index. The reason for this is that in some
cases commits can be slow and they should be done in isolation from other possible commit requests to avoid
overwriting data. So, it's preferable to provide control over when data is committed. Several options are available
to control the timing of commits.
commit and softCommit
In Solr, a commit is an action which asks Solr to "commit" those changes to the Lucene index files. By default
commit actions result in a "hard commit" of all the Lucene index files to stable storage (disk). When a client
includes a commit=true parameter with an update request, this ensures that all index segments affected by the
adds & deletes on an update are written to disk as soon as index updates are completed.
If an additional softCommit=true flag is specified, then Solr performs a 'soft commit', meaning that Solr will
commit your changes to the Lucene data structures quickly but not guarantee that the Lucene index files are
written to stable storage. This is an implementation of Near Real Time storage, a feature that boosts document
visibility, since you don't have to wait for background merges and storage (to ZooKeeper, if using SolrCloud) to
finish before moving on to something else. A full commit means that, if a server crashes, Solr will know exactly
where your data was stored; a soft commit means that the data is stored, but the location information isn't yet
stored. The tradeoff is that a soft commit gives you faster visibility because it's not waiting for background
merges to finish.
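As a minimal sketch, a client could request a soft commit explicitly when sending documents (the collection
name "techproducts" is only an illustration):
curl "http://localhost:8983/solr/techproducts/update?softCommit=true" -H 'Content-type:application/json' -d '[{"id":"doc1"}]'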
For more information about Near Real Time operations, see Near Real Time Searching.
autoCommit
These settings control how often pending updates will be automatically pushed to the index. An alternative to
autoCommit is to use commitWithin, which can be defined when making the update request to Solr (i.e., when
pushing documents), or in an update RequestHandler.
Setting Description
maxDocs The number of updates that have occurred since the last commit.
maxTime The number of milliseconds since the oldest uncommitted update.
openSearcher Whether to open a new searcher when performing a commit. If this is false (the default), the
commit will flush recent index changes to stable storage, but does not cause a new searcher
to be opened to make those changes visible.
If either of the maxDocs or maxTime limits is reached, Solr automatically performs a commit operation. If the
autoCommit tag is missing, then only explicit commits will update the index. The decision whether to use
auto-commit or not depends on the needs of your application.
Determining the best auto-commit settings is a tradeoff between performance and accuracy. Settings that cause
frequent updates will improve the accuracy of searches because new content will be searchable more quickly,
but performance may suffer because of the frequent updates. Less frequent updates may improve performance
but it will take longer for updates to show up in queries.
<autoCommit>
<maxDocs>10000</maxDocs>
<maxTime>1000</maxTime>
<openSearcher>false</openSearcher>
</autoCommit>
You can also specify 'soft' autoCommits in the same way that you can specify 'soft' commits, except that instead
of using autoCommit you set the autoSoftCommit tag.
<autoSoftCommit>
<maxTime>1000</maxTime>
</autoSoftCommit>
commitWithin
The commitWithin settings allow forcing document commits to happen in a defined time period. This is used
most frequently with Near Real Time Searching, and for that reason the default is to perform a soft commit. This
does not, however, replicate new documents to slave servers in a master/slave environment. If that's a
requirement for your implementation, you can force a hard commit by adding a parameter, as in this example:
<commitWithin>
<softCommit>false</softCommit>
</commitWithin>
With this configuration, when you call commitWithin as part of your update message, it will automatically
perform a hard commit every time.
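For example, a client could pass commitWithin directly on an update request; this is a sketch only, and the
collection name "techproducts" is just an illustration:
curl "http://localhost:8983/solr/techproducts/update?commitWithin=10000" -H 'Content-type:application/json' -d '[{"id":"doc2"}]'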
maxPendingDeletes
This value sets a limit on the number of deletions that Solr will buffer during document deletion. This can affect
how much memory is used during indexing.
<maxPendingDeletes>100000</maxPendingDeletes>
Event Listeners
The UpdateHandler section is also where update-related event listeners can be configured. These can be
triggered to occur after any commit (event="postCommit") or only after optimize commands
(event="postOptimize").
Users can write custom update event listener classes, but a common use case is to run external executables via
the RunExecutableListener:
Setting Description
exe The name of the executable to run. It should include the path to the file, relative to Solr home.
dir The directory to use as the working directory. The default is ".".
wait Forces the calling thread to wait until the executable returns a response. The default is true.
args Any arguments to pass to the program. The default is none.
env Any environment variables to set. The default is none.
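For example, a postCommit listener that runs an external script might be configured as follows; this is a sketch
only, and the executable path and arguments shown are illustrative:
<listener event="postCommit" class="solr.RunExecutableListener">
  <str name="exe">solr/bin/snapshooter</str>
  <str name="dir">.</str>
  <bool name="wait">true</bool>
  <arr name="args"><str>arg1</str><str>arg2</str></arr>
  <arr name="env"><str>MYVAR=val1</str></arr>
</listener>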
Transaction Log
As described in the section RealTime Get, a transaction log is required for that feature. It is configured in the
updateHandler section of solrconfig.xml.
Realtime Get currently relies on the update log feature, which is enabled by default. It relies on an update log,
which is configured in solrconfig.xml, in a section like:
<updateLog>
<str name="dir">${solr.ulog.dir:}</str>
</updateLog>
Two additional expert-level configuration settings affect how far a replica can fall behind on updates before it
must enter into full recovery - see the section on write side fault tolerance for more information:
Setting Name Type Default Description
numRecordsToKeep int 100 The number of update records to keep per log
maxNumLogsToKeep int 10 The maximum number of logs to keep
An example, to be included under <config><updateHandler> in solrconfig.xml, employing both of the
above settings:
<updateLog>
<str name="dir">${solr.ulog.dir:}</str>
<int name="numRecordsToKeep">500</int>
<int name="maxNumLogsToKeep">20</int>
</updateLog>
Query Settings in SolrConfig
The settings in this section affect the way that Solr will process and respond to queries. These settings are all
configured in child elements of the <query> element in solrconfig.xml.
<query>
...
</query>
Topics covered in this section:
Caches
Query Sizing and Warming
Query-Related Listeners
Caches
Solr caches are associated with a specific instance of an Index Searcher, a specific view of an index that doesn't
change during the lifetime of that searcher. As long as that Index Searcher is being used, any items in its cache
will be valid and available for reuse. Caching in Solr differs from caching in many other applications in that
cached Solr objects do not expire after a time interval; instead, they remain valid for the lifetime of the Index
Searcher.
When a new searcher is opened, the current searcher continues servicing requests while the new one
auto-warms its cache. The new searcher uses the current searcher's cache to pre-populate its own. When the
new searcher is ready, it is registered as the current searcher and begins handling all new search requests. The
old searcher will be closed once it has finished servicing all its requests.
In Solr, there are three cache implementations: solr.search.LRUCache, solr.search.FastLRUCache,
and solr.search.LFUCache.
The acronym LRU stands for Least Recently Used. When an LRU cache fills up, the entry with the oldest
last-accessed timestamp is evicted to make room for the new entry. The net effect is that entries that are
accessed frequently tend to stay in the cache, while those that are not accessed frequently tend to drop out and
will be re-fetched from the index if needed again.
The FastLRUCache, which was introduced in Solr 1.4, is designed to be lock-free, so it is well suited for caches
which are hit several times in a request.
Both LRUCache and FastLRUCache use an auto-warm count that supports both integers and percentages
which get evaluated relative to the current size of the cache when warming happens.
The LFUCache refers to the Least Frequently Used cache. This works in a way similar to the LRU cache, except
that when the cache fills up, the entry that has been used the least is evicted.
The Statistics page in the Solr Admin UI will display information about the performance of all the active caches.
This information can help you fine-tune the sizes of the various caches appropriately for your particular
application. When a Searcher terminates, a summary of its cache usage is also written to the log.
Each cache has settings to define its initial size (initialSize), maximum size (size) and number of items to
use during warming (autowarmCount). The LRU and FastLRU cache implementations can take a
percentage instead of an absolute value for autowarmCount.
Details of each cache are described below.
filterCache
This cache is used by SolrIndexSearcher for filters (DocSets) for unordered sets of all documents that match
a query. The numeric attributes control the number of entries in the cache.
Solr uses the filterCache to cache results of queries that use the fq search parameter. Subsequent queries
using the same parameter setting result in cache hits and rapid returns of results. See Searching for a detailed
discussion of the fq parameter.
Solr also uses this cache for faceting when the facet.method configuration parameter is set to fc. For a
discussion of faceting, see Searching.
<filterCache class="solr.LRUCache"
size="512"
initialSize="512"
autowarmCount="128"/>
queryResultCache
This cache holds the results of previous searches: ordered lists of document IDs (DocList) based on a query, a
sort, and the range of documents requested.
<queryResultCache class="solr.LRUCache"
size="512"
initialSize="512"
autowarmCount="128"/>
documentCache
This cache holds Lucene Document objects (the stored fields for each document). Since Lucene internal
document IDs are transient, this cache is not auto-warmed. The size for the documentCache should always be
greater than max_results times max_concurrent_queries, to ensure that Solr does not need to refetch
a document during a request. The more fields you store in your documents, the higher the memory usage of this
cache will be.
<documentCache class="solr.LRUCache"
size="512"
initialSize="512"
autowarmCount="0"/>
User Defined Caches
You can also define named caches for your own application code to use. You can locate and use your cache
object by name by calling the SolrIndexSearcher methods getCache(), cacheLookup() and
cacheInsert().
<cache name="myUserCache" class="solr.LRUCache"
size="4096"
initialSize="1024"
autowarmCount="1024"
regenerator="org.mycompany.mypackage.MyRegenerator" />
If you want auto-warming of your cache, include a regenerator attribute with the fully qualified name of a class
that implements solr.search.CacheRegenerator. You can also use the NoOpRegenerator, which simply
repopulates the cache with old items. Define it with the regenerator parameter as:
regenerator="solr.NoOpRegenerator".
Query Sizing and Warming
maxBooleanClauses
This sets the maximum number of clauses allowed in a boolean query. This can affect range or prefix queries
that expand to a query with a large number of boolean terms. If this limit is exceeded, an exception is thrown.
<maxBooleanClauses>1024</maxBooleanClauses>
enableLazyFieldLoading
If this parameter is set to true, then fields that are not directly requested will be loaded lazily as needed. This can
boost performance if the most common queries only need a small subset of fields, especially if infrequently
accessed fields are large in size.
<enableLazyFieldLoading>true</enableLazyFieldLoading>
useFilterForSortedQuery
This parameter configures Solr to use a filter to satisfy a search. If the requested sort does not include "score",
the filterCache will be checked for a filter matching the query. For most situations, this is only useful if the
same search is requested often with different sort options and none of them ever use "score".
<useFilterForSortedQuery>true</useFilterForSortedQuery>
queryResultWindowSize
Used with the queryResultCache, this will cache a superset of the requested number of document IDs. For
example, if a search in response to a particular query requests documents 10 through 19, and
queryResultWindowSize is 50, documents 0 through 49 will be cached.
<queryResultWindowSize>20</queryResultWindowSize>
This option modifies a global property that affects all Solr cores. If multiple solrconfig.xml files
disagree on this property, the value at any point in time will be based on the last Solr core that was
initialized.
queryResultMaxDocsCached
This parameter sets the maximum number of documents to cache for any entry in the queryResultCache.
<queryResultMaxDocsCached>200</queryResultMaxDocsCached>
useColdSearcher
This setting controls whether search requests for which there is not a currently registered searcher should wait
for a new searcher to warm up (false) or proceed immediately (true). When set to "false", requests will block until
the searcher has warmed its caches.
<useColdSearcher>false</useColdSearcher>
maxWarmingSearchers
This parameter sets the maximum number of searchers that may be warming up in the background at any given
time. Exceeding this limit will raise an error. For read-only slaves, a value of two is reasonable. Masters should
probably be set a little higher.
<maxWarmingSearchers>2</maxWarmingSearchers>
Query-Related Listeners
As described in the section on Caches, new Index Searchers are cached. It's possible to use the triggers for
listeners to perform query-related tasks. The most common use of this is to define queries to further "warm" the
Index Searchers while they are starting. One benefit of this approach is that field caches are pre-populated for
faster sorting.
Good query selection is key with this type of listener. It's best to choose your most common and/or heaviest
queries and include not just the keywords used, but any other parameters such as sorting or filtering requests.
There are two types of events that can trigger a listener. A firstSearcher event occurs when a new searcher
is being prepared but there is no current registered searcher to handle requests or to gain auto-warming data
from (i.e., on Solr startup). A newSearcher event is fired whenever a new searcher is being prepared and there
is a current searcher handling requests.
The (commented out) examples below can be found in the solrconfig.xml file of the
sample_techproducts_configs config set included with Solr, and demonstrate using the
solr.QuerySenderListener class to warm a set of explicit queries:
<listener event="newSearcher" class="solr.QuerySenderListener">
<arr name="queries">
<!--
<lst><str name="q">solr</str><str name="sort">price asc</str></lst>
<lst><str name="q">rocks</str><str name="sort">weight asc</str></lst>
-->
</arr>
</listener>
<listener event="firstSearcher" class="solr.QuerySenderListener">
<arr name="queries">
<lst><str name="q">static firstSearcher warming in solrconfig.xml</str></lst>
</arr>
</listener>
RequestDispatcher in SolrConfig
The requestDispatcher element of solrconfig.xml controls the way the Solr HTTP RequestDispatcher
implementation responds to requests. Included are parameters for defining if it should handle /select urls
(for Solr 1.1 compatibility), if it will support remote streaming, the maximum size of file uploads and how it will
respond to HTTP cache headers in requests.
Topics in this section:
handleSelect Element
requestParsers Element
httpCaching Element
handleSelect Element
The first configurable item is the handleSelect attribute on the <requestDispatcher> element itself. This
attribute can be set to one of two values, either "true" or "false". It governs how Solr responds to requests such
as /select?qt=XXX. The default value "false" will ignore requests to /select if a requestHandler is not
explicitly registered with the name /select. A value of "true" will route query requests to the parser defined with
the qt value.
In recent versions of Solr, a /select requestHandler is defined by default, so a value of "false" will work fine.
See the section RequestHandlers and SearchComponents in SolrConfig for more information.
<requestDispatcher handleSelect="true" >
...
</requestDispatcher>
The example listener configurations in the previous section come from a sample solrconfig.xml. A key best
practice is to modify these defaults before taking your application to production, but please note: while the
sample queries are commented out in the section for the "newSearcher" event, the sample query is not
commented out for the "firstSearcher" event. There is no point in auto-warming your Index Searcher with the
query string "static firstSearcher warming in solrconfig.xml" if that is not relevant to your search application.
handleSelect is for legacy back-compatibility; those new to Solr do not need to change anything about
the way this is configured by default.
requestParsers Element
The <requestParsers> sub-element controls values related to parsing requests. This is an empty XML
element that doesn't have any content, only attributes.
The enableRemoteStreaming attribute controls whether remote streaming of content is allowed. If set to
false, streaming will not be allowed. Setting it to true (the default) lets you specify the location of content to be
streamed using stream.file or stream.url parameters.
If you enable remote streaming, be sure that you have authentication enabled. Otherwise, someone could
potentially gain access to your content by accessing arbitrary URLs. It's also a good idea to place Solr behind a
firewall to prevent it being accessed from untrusted clients.
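For example, with remote streaming enabled, an update request could point Solr at a file already on the server;
this is a sketch only, and the collection name and file path are illustrative:
curl "http://localhost:8983/solr/techproducts/update?stream.file=/path/to/books.json&stream.contentType=application/json&commit=true"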
The multipartUploadLimitInKB attribute sets an upper limit in kilobytes on the size of a document that may
be submitted in a multi-part HTTP POST request. The value specified is multiplied by 1024 to determine the size
in bytes.
The formdataUploadLimitInKB attribute sets a limit in kilobytes on the size of form data
(application/x-www-form-urlencoded) submitted in an HTTP POST request, which can be used to pass request
parameters that will not fit in a URL.
The addHttpRequestToContext attribute can be used to indicate that the original HttpServletRequest
object should be included in the context map of the SolrQueryRequest using the key httpRequest. This
HttpServletRequest is not used by any Solr component, but may be useful when developing custom plugins.
<requestParsers enableRemoteStreaming="true"
multipartUploadLimitInKB="2048000"
formdataUploadLimitInKB="2048"
addHttpRequestToContext="false" />
httpCaching Element
The <httpCaching> element controls HTTP cache control headers. Do not confuse these settings with Solr's
internal cache configuration. This element controls caching of HTTP responses as defined by the W3C HTTP
specifications.
This element allows for three attributes and one sub-element. The attributes of the <httpCaching> element
control whether a 304 response to a GET request is allowed, and if so, what sort of response it should be. When
an HTTP client application issues a GET, it may optionally specify that a 304 response is acceptable if the
resource has not been modified since the last time it was fetched.
Parameter Description
never304 If present with the value true, then a GET request will never respond with a 304 code, even if
the requested resource has not been modified. When this attribute is set to true, the next two
attributes are ignored. Setting this to true is handy for development, as the 304 response can
be confusing when tinkering with Solr responses through a web browser or other client that
supports cache headers.
lastModFrom This attribute may be set to either openTime (the default) or dirLastMod. The value
openTime indicates that last modification times, as compared to the If-Modified-Since header sent by
the client, should be calculated relative to the time the Searcher started. Use dirLastMod if
you want times to exactly correspond to when the index was last updated on disk.
etagSeed The value of this attribute is sent as the value of the ETag header. Changing this value can be
helpful to force clients to re-fetch content even when the indexes have not changed---for
example, when you've made some changes to the configuration.
<httpCaching never304="false"
lastModFrom="openTime"
etagSeed="Solr">
<cacheControl>max-age=30, public</cacheControl>
</httpCaching>
cacheControl Element
In addition to these attributes, <httpCaching> accepts one child element: <cacheControl>. The content of
this element will be sent as the value of the Cache-Control header on HTTP responses. This header is used to
modify the default caching behavior of the requesting client. The possible values for the Cache-Control header
are defined by the HTTP 1.1 specification in Section 14.9.
Setting the max-age field controls how long a client may re-use a cached response before requesting it again
from the server. This time interval should be set according to how often you update your index and whether or
not it is acceptable for your application to use content that is somewhat out of date. Setting must-revalidate
will tell the client to validate with the server that its cached copy is still good before re-using it. This will ensure
that the most timely result is used, while avoiding a second fetch of the content if it isn't needed, at the cost of a
request to the server to do the check.
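For instance, a header combining these directives might be configured as follows (illustrative values):
<cacheControl>max-age=30, must-revalidate, public</cacheControl>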
Solr Cores and solr.xml
In Solr, the term core is used to refer to a single index and associated transaction log and configuration files
(including schema.xml and solrconfig.xml, among others). Your Solr installation can have multiple cores if
needed, which allows you to index data with different structures in the same server, and maintain more control
over how your data is presented to different audiences.
In older versions of Solr, cores had to be defined in solr.xml in order for Solr to know about them. Now,
however, Solr supports automatic discovery of cores and they no longer need to be explicitly defined. Instead,
core-specific properties (such as the directories to use for the indexes or configuration files, the core name, and
other options) are defined in a core.properties file. Any core.properties file in any directory of your Solr
installation (or in a directory under where solr_home is defined) will be found by Solr and the defined properties
will be used for the core named in the file.
The following sections describe these options in more detail.
Format of solr.xml: Details on how to define solr.xml, including the acceptable parameters for the
solr.xml file.
Defining core.properties: Details on placement of core.properties and available property options.
CoreAdmin API: Tools and commands for core administration using a REST API.
Config Sets: How to use configsets to avoid duplicating effort when defining a new core.
Format of solr.xml
This section will describe the default solr.xml file included with Solr and how to modify it for your needs. For
details on how to configure core.properties, see the section Defining core.properties.
Defining solr.xml
Solr.xml Parameters
The <solr> Element
The <solrcloud> element
The <logging> element
The <logging><watcher> element
The <shardHandlerFactory> element
Substituting JVM System Properties in solr.xml
Defining solr.xml
You can find solr.xml in your Solr Home directory. The default solr.xml file looks like this:
<solr>
<solrcloud>
<str name="host">${host:}</str>
<int name="hostPort">${jetty.port:8983}</int>
<str name="hostContext">${hostContext:solr}</str>
<int name="zkClientTimeout">${zkClientTimeout:15000}</int>
<bool name="genericCoreNodeNames">${genericCoreNodeNames:true}</bool>
</solrcloud>
<shardHandlerFactory name="shardHandlerFactory"
class="HttpShardHandlerFactory">
<int name="socketTimeout">${socketTimeout:0}</int>
<int name="connTimeout">${connTimeout:0}</int>
</shardHandlerFactory>
</solr>
As you can see, the discovery Solr configuration is "SolrCloud friendly". However, the presence of the
<solrcloud> element does not mean that the Solr instance is running in SolrCloud mode. Unless the -DzkHost
or -DzkRun are specified at startup time, this section is ignored.
Solr.xml Parameters
The <solr> Element
There are no attributes that you can specify in the <solr> tag, which is the root element of solr.xml. The
tables below list the child nodes of each XML element in solr.xml.
Node Description
adminHandler If used, this attribute should be set to the FQN (Fully qualified name) of a class that
inherits from CoreAdminHandler. For example,
adminHandler="com.myorg.MyAdminHandler" would configure the custom admin
handler (MyAdminHandler) to handle admin requests. If this attribute isn't set, Solr
uses the default admin handler, org.apache.solr.handler.admin.CoreAdminHandler.
For more information on this parameter, see the Solr Wiki at
http://wiki.apache.org/solr/CoreAdmin#cores.
collectionsHandler As above, for custom CollectionsHandler implementations.
infoHandler As above, for custom InfoHandler implementations.
coreLoadThreads Specifies the number of threads that will be assigned to load cores in parallel.
coreRootDirectory The root of the core discovery tree, defaults to SOLR_HOME.
managementPath Currently non-operational.
sharedLib Specifies the path to a common library directory that will be shared across all cores.
Any JAR files in this directory will be added to the search path for Solr plugins. This
path is relative to the top-level container's Solr Home.
shareSchema This attribute, when set to true, ensures that the multiple cores pointing to the same
schema.xml will be referring to the same IndexSchema Object. Sharing the
IndexSchema Object makes loading the core faster. If you use this feature, make
sure that no core-specific property is used in your schema.xml.
transientCacheSize Defines how many cores with transient=true that can be loaded before swapping
the least recently used core for a new core.
configSetBaseDir The directory under which configsets for solr cores can be found. Defaults to
SOLR_HOME/configsets
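As a sketch, several of these properties could be set as named, typed child elements of <solr>, following the
same syntax used in the default file shown above (the values here are illustrative):
<solr>
  <str name="sharedLib">lib</str>
  <int name="coreLoadThreads">4</int>
  <str name="configSetBaseDir">configsets</str>
</solr>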
The <solrcloud> element
This element defines several parameters that relate to SolrCloud. This section is ignored unless the solr
instance is started with either -DzkRun or -DzkHost.
Node Description
distribUpdateConnTimeout Used to set the underlying "connTimeout" for intra-cluster updates.
distribUpdateSoTimeout Used to set the underlying "socketTimeout" for intra-cluster updates.
host The hostname Solr uses to access cores.
hostContext The url context path.
hostPort The port Solr uses to access cores. In the default solr.xml file, this is
set to ${jetty.port:8983}, which will use the Solr port defined in
Jetty, and otherwise fall back to 8983.
leaderVoteWait When SolrCloud is starting up, how long each Solr node will wait for all
known replicas for that shard to be found before assuming that any nodes
that haven't reported are down.
leaderConflictResolveWait When trying to elect a leader for a shard, this property sets the maximum
time a replica will wait to see conflicting state information to be resolved;
temporary conflicts in state information can occur when doing rolling
restarts, especially when the node hosting the Overseer is restarted.
Typically, the default value of 180000 (ms) is sufficient for conflicts to be
resolved; you may need to increase this value if you have hundreds or
thousands of small collections in SolrCloud.
zkClientTimeout A timeout for connection to a ZooKeeper server. It is used with SolrCloud.
zkHost In SolrCloud mode, the URL of the ZooKeeper host that Solr should use
for cluster state information.
genericCoreNodeNames If TRUE, node names are not based on the address of the node, but on a
generic name that identifies the core. When a different machine takes
over serving that core things will be much easier to understand.
The <logging> element
Node Description
class The class to use for logging. The corresponding JAR file must be available to solr, perhaps
through a <lib> directive in solrconfig.xml.
enabled true/false - whether to enable logging or not.
The <logging><watcher> element
Node Description
size The number of log events that are buffered.
threshold The logging level above which your particular logging implementation will record. For example
when using log4j one might specify DEBUG, WARN, INFO, etc.
The <shardHandlerFactory> element
Custom shard handlers can be defined in solr.xml if you wish to create a custom shard handler.
<shardHandlerFactory name="ShardHandlerFactory" class="qualified.class.name">
Since this is a custom shard handler, sub-elements are specific to the implementation.
Substituting JVM System Properties in solr.xml
Solr supports variable substitution of JVM system property values in solr.xml, which allows runtime
specification of various configuration options. The syntax is ${propertyname[:option default value]}.
This allows defining a default that can be overridden when Solr is launched. If a default value is not specified,
then the property must be specified at runtime or the solr.xml file will generate an error when parsed.
Any JVM system properties, usually specified using the -D flag when starting the JVM, can be used as variables
in the solr.xml file.
For example, in the solr.xml file shown below, the socketTimeout and connTimeout values are each set
to "0". However, if you start Solr using 'bin/solr -DsocketTimeout=1000', the socketTimeout option of
the HttpShardHandlerFactory will be overridden using a value of 1000ms, while the connTimeout option
will continue to use the default property value of "0".
<solr>
<shardHandlerFactory name="shardHandlerFactory"
class="HttpShardHandlerFactory">
<int name="socketTimeout">${socketTimeout:0}</int>
<int name="connTimeout">${connTimeout:0}</int>
</shardHandlerFactory>
</solr>
Defining core.properties
Core discovery means that creating a core is as simple as a core.properties file located on disk. The
core.properties file is a simple Java Properties file where each line is just a key=value pair, e.g.,
name=core1. Notice that no quotes are required.
A minimal core.properties file looks like this (however, it can also be empty, see information on placement
of core.properties below):
name=my_core_name
Placement of core.properties
Solr cores are configured by placing a file named core.properties in a sub-directory under solr.home.
There are no a-priori limits to the depth of the tree, nor are there limits to the number of cores that can be
defined. Cores may be anywhere in the tree with the exception that cores may not be defined under an existing
core. That is, the following is not allowed:
./cores/core1/core.properties
./cores/core1/coremore/core5/core.properties
In this example, the enumeration will stop at "core1".
The following is legal:
./cores/somecores/core1/core.properties
./cores/somecores/core2/core.properties
./cores/othercores/core3/core.properties
./cores/extracores/deepertree/core4/core.properties
It is possible to segment Solr into multiple cores, each with its own configuration and indices. Cores may be
dedicated to a single application or to very different ones, but all are administered through a common
administration interface. You can create new Solr cores on the fly, shutdown cores, even replace one running
core with another, all without ever stopping or restarting Solr.
Your core.properties file can be empty if necessary. Suppose core.properties is located in
./cores/core1 (relative to solr_home) but is empty. In that case, the core name is assumed to be "core1". The
instanceDir will be the folder containing core.properties (i.e., ./cores/core1). The dataDir will be
./cores/core1/data, etc.
Defining core.properties Files
The minimal core.properties file is an empty file, in which case all of the properties are defaulted
appropriately.
Java properties files allow the hash ("#") or bang ("!") characters to specify comment-to-end-of-line.
This table defines the recognized properties:
Property Description
name The name of the SolrCore. You'll use this name to reference the SolrCore when running
commands with the CoreAdminHandler.
config The configuration file name for a given core. The default is solrconfig.xml.
schema The schema file name for a given core. The default is schema.xml.
dataDir The core's data directory (where indexes are stored) as a path relative to the core's
instanceDir. This is data by default.
configSet The name of a defined configset, if desired, to use to configure the core (see the Config
Sets section for more details).
properties The name of the properties file for this core. The value can be an absolute pathname or a
path relative to the value of instanceDir.
transient If true, the core can be unloaded if Solr reaches the transientCacheSize. The default
if not specified is false. Cores are unloaded in order of least recently used first. Setting to
true is not recommended in SolrCloud mode.
loadOnStartup If true, the default if it is not specified, the core will be loaded when Solr starts. Setting to
false is not recommended in SolrCloud mode.
coreNodeName Used only in SolrCloud, this is a unique identifier for the node hosting this replica. By
default a coreNodeName is generated automatically, but setting this attribute explicitly
allows you to manually assign a new core to replace an existing replica. For example:
when replacing a machine that has had a hardware failure by restoring from backups on a
new machine with a new hostname or port.
ulogDir The absolute or relative directory for the update log for this core (SolrCloud).
shard The shard to assign this core to (SolrCloud).
You can run Solr without configuring any cores.
collection The name of the collection this core is part of (SolrCloud).
roles Future param for SolrCloud or a way for users to mark nodes for their own use.
Additional "user defined" properties may be specified for use as variables. For more information on how to define
local properties, see the section Substituting Properties in Solr Config Files.
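For example, a core.properties file that combines several of these properties might look like this (a sketch;
the values are illustrative):
name=products
config=solrconfig.xml
schema=schema.xml
dataDir=data
loadOnStartup=true
transient=false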
CoreAdmin API
The CoreAdminHandler is a special SolrRequestHandler that is used to manage Solr cores. Unlike normal
SolrRequestHandlers, the CoreAdminHandler is not attached to a single core. Instead, it manages all the cores
running in a single Solr instance. Only one CoreAdminHandler exists for each top-level Solr instance.
To use the CoreAdminHandler, make sure that the adminPath attribute is defined on the <cores> element;
otherwise you will not be able to make HTTP requests to perform Solr core administration.
The action to perform is named by the HTTP request parameter "action", with arguments for a specific action
being provided as additional parameters.
All action names are uppercase, and are defined in depth in the sections below.
STATUS
CREATE
RELOAD
RENAME
SWAP
UNLOAD
MERGEINDEXES
SPLIT
REQUESTSTATUS
STATUS
The STATUS action returns the status of all running Solr cores, or status for only the named core.
http://localhost:8983/solr/admin/cores?action=STATUS&core=core0
Input
Query Parameters
Parameter Type Required Default Description
core string No The name of a core, as listed in the "name" attribute of a <core>
element in solr.xml.
indexInfo boolean No true If false, information about the index will not be returned with a
core STATUS request. In Solr implementations with a large
number of cores (i.e., more than hundreds), retrieving the index
information for each core can take a lot of time and isn't always
required.
CREATE
The CREATE action creates a new core and registers it.
If a Solr core with the given name already exists, it will continue to handle requests while the new core is
initializing. When the new core is ready, it will take new requests and the old core will be unloaded.
http://localhost:8983/solr/admin/cores?action=CREATE&name=coreX&instanceDir=pat
h/to/dir&config=config_file_name.xml&schema=schema_file_name.xml&dataDir=data
Input
Query Parameters
Parameter Type Required Default Description
name string Yes N/A The name of the new core. Same as "name" on the <core>
element.
instanceDir string No whatever is specified for the "name" parameter The directory where files for this
SolrCore should be stored. Same as instanceDir on the <core> element.
config string No Name of the config file (i.e., solrconfig.xml) relative to
instanceDir.
schema string No Name of the schema file (i.e., schema.xml) relative to
instanceDir.
dataDir string No Name of the data directory relative to instanceDir.
configSet string No Name of the configset to use for this core. For more
information, see the section Config Sets.
CREATE must be able to find a configuration!
Your CREATE call must be able to find a configuration, or it will not succeed.
When you are running SolrCloud and create a new core for a collection, the configuration will be
inherited from the collection – each collection is linked to a configName, which is stored in the zookeeper
database. This satisfies the config requirement.
When you are not running SolrCloud, if you have Config Sets defined, you can use the configSet
parameter as documented below. If there are no config sets, then the instanceDir specified in the
CREATE call must already exist, and it must contain a conf directory which in turn must contain
solrconfig.xml and schema.xml (or the filenames specified by the config and schema parameters), as well
as any files referenced in those configs. One thing you COULD do is use config and schema parameters
that point at absolute paths, but this can lead to confusing configurations unless you fully understand
what you are doing.
collection string No The name of the collection to which this core belongs. The
default is the name of the core. collection.<param>=<value>
causes a property of <param>=<value> to be set if a
new collection is being created. Use
collection.configName=<configname> to point to the configuration for a new
collection.
shard string No The shard id this core represents. Normally you want to be
auto-assigned a shard id.
property.name=value string No Sets the core property name to value. See the section on
defining core.properties file contents.
async string No Request ID to track this action which will be processed
asynchronously
Use collection.configName=<configname> to point to the config for a new collection.
Example
http://localhost:8983/solr/admin/cores?action=CREATE&name=my_core&collection=my_co
llection&shard=shard2
RELOAD
The RELOAD action loads a new core from the configuration of an existing, registered Solr core. While the new
core is initializing, the existing one will continue to handle requests. When the new Solr core is ready, it takes
over and the old core is unloaded.
http://localhost:8983/solr/admin/cores?action=RELOAD&core=core0
This is useful when you've made changes to a Solr core's configuration on disk, such as adding new field
definitions. Calling the RELOAD action lets you apply the new configuration without having to restart the Web
container.
Input
Query Parameters
Parameter Type Required Default Description
core string Yes N/A The name of the core, as listed in the "name" attribute of a <core>
element in solr.xml.
While it's possible to create a core for a non-existent collection, this approach is not supported and not
recommended. Always create a collection using the Collections API before creating a core directly for it.
RELOAD performs "live" reloads of SolrCore, reusing some existing objects. Some configuration options,
such as the dataDir location and IndexWriter-related settings in solrconfig.xml can not be
changed and made active with a simple RELOAD action.
RENAME
The RENAME action changes the name of a Solr core.
http://localhost:8983/solr/admin/cores?action=RENAME&core=core0&other=core5
Input
Query Parameters
Parameter Type Required Default Description
core string Yes The name of the Solr core to be renamed.
other string Yes The new name for the Solr core. If the persistent attribute of <solr>
is true, the new name will be written to solr.xml as the name
attribute of the <core> attribute.
async string No Request ID to track this action which will be processed
asynchronously
SWAP
SWAP atomically swaps the names used to access two existing Solr cores. This can be used to swap new content
into production. The prior core remains available and can be swapped back, if necessary. Each core will be
known by the name of the other, after the swap.
http://localhost:8983/solr/admin/cores?action=SWAP&core=core1&other=core0
Input
Query Parameters
Parameter Type Required Default Description
core string Yes The name of one of the cores to be swapped.
other string Yes The name of one of the cores to be swapped.
async string No Request ID to track this action which will be processed
asynchronously
UNLOAD
The UNLOAD action removes a core from Solr. Active requests will continue to be processed, but no new
requests will be sent to the named core. If a core is registered under more than one name, only the given name
is removed.
http://localhost:8983/solr/admin/cores?action=UNLOAD&core=core0
The UNLOAD action requires a parameter (core) identifying the core to be removed. If the persistent attribute of
<solr> is set to true, the <core> element with this name attribute will be removed from solr.xml.
Do not use SWAP with a SolrCloud node. It is not supported and can result in the core being unusable.
Input
Query Parameters
Parameter Type Required Default Description
core string Yes The name of one of the cores to be removed.
deleteIndex boolean No false If true, will remove the index when unloading the core.
deleteDataDir boolean No false If true, removes the data directory and all
sub-directories.
deleteInstanceDir boolean No false If true, removes everything related to the core, including
the index directory, configuration files and other related
files.
async string No Request ID to track this action which will be processed
asynchronously
MERGEINDEXES
The MERGEINDEXES action merges one or more indexes to another index. The indexes must have completed
commits, and should be locked against writes until the merge is complete or the resulting merged index may
become corrupted. The target core index must already exist and have a compatible schema with the one or more
indexes that will be merged to it. Another commit on the target core should also be performed after the merge is
complete.
http://localhost:8983/solr/admin/cores?action=MERGEINDEXES&core=new_core_name&i
ndexDir=/solr_home/core1/data/index&indexDir=/solr_home/core2/data/index
In this example, we use the indexDir parameter to define the index locations of the source cores. The core
parameter defines the target index. A benefit of this approach is that we can merge any Lucene-based index that
may not be associated with a Solr core.
Alternatively, we can instead use a srcCore parameter, as in this example:
http://localhost:8983/solr/admin/cores?action=mergeindexes&core=new_core_name&s
rcCore=core1&srcCore=core2
This approach allows us to define cores that may not have an index path that is on the same physical server as
the target core. However, we can only use Solr cores as the source indexes. Another benefit of this approach is
that we don't have as high a risk for corruption if writes occur in parallel with the source index.
We can make this call run asynchronously by specifying the async parameter and passing a request-id. This id
can then be used to check the status of the already submitted task using the REQUESTSTATUS API.
Input
Query Parameters
Unloading all cores in a SolrCloud collection causes the removal of that collection's metadata from
ZooKeeper.
Parameter Type Required Default Description
core string Yes The name of the target core/index.
indexDir string Multi-valued, directories that would be merged.
srcCore string Multi-valued, source cores that would be merged.
async string Request ID to track this action which will be processed
asynchronously
SPLIT
The SPLIT action splits an index into two or more indexes. The index being split can continue to handle
requests. The split pieces can be placed into a specified directory on the server's filesystem or they can be
merged into running Solr cores.
The SPLIT action supports five parameters, which are described in the table below.
Input
Query Parameters
Parameter Type Required Default Description
core string Yes The name of the core to be split.
path string Multi-valued, the directory path in which a piece of the index will
be written.
targetCore string Multi-valued, the target Solr core to which a piece of the index will
be merged
ranges string No A comma-separated list of hash ranges in hexadecimal format
split.key string No The key to be used for splitting the index
async string No Request ID to track this action which will be processed
asynchronously
Examples
The core index will be split into as many pieces as the number of path or targetCore parameters.
Usage with two targetCore parameters:
http://localhost:8983/solr/admin/cores?action=SPLIT&core=core0&targetCore=core1
&targetCore=core2
Here the core index will be split into two pieces and merged into the two targetCore indexes.
Either the path or targetCore parameter must be specified but not both. The ranges and split.key
parameters are optional and only one of the two should be specified, if at all required.
Usage with two path parameters:
http://localhost:8983/solr/admin/cores?action=SPLIT&core=core0&path=/path/to/in
dex/1&path=/path/to/index/2
The core index will be split into two pieces and written into the two directory paths specified.
Usage with the split.key parameter:
http://localhost:8983/solr/admin/cores?action=SPLIT&core=core0&targetCore=core1
&split.key=A!
Here all documents having the same route key as the split.key i.e. 'A!' will be split from the core index and
written to the targetCore.
Usage with ranges parameter:
http://localhost:8983/solr/admin/cores?action=SPLIT&core=core0&targetCore=core1
&targetCore=core2&targetCore=core3&ranges=0-1f4,1f5-3e8,3e9-5dc
This example uses the ranges parameter with hash ranges 0-500, 501-1000 and 1001-1500 specified in
hexadecimal. Here the index will be split into three pieces with each targetCore receiving documents matching
the hash ranges specified i.e. core1 will get documents with hash range 0-500, core2 will receive documents with
hash range 501-1000 and finally, core3 will receive documents with hash range 1001-1500. At least one hash
range must be specified. Please note that using a single hash range equal to a route key's hash range is NOT
equivalent to using the split.key parameter because multiple route keys can hash to the same range.
The targetCore must already exist and must have a compatible schema with the core index. A commit is
automatically called on the core index before it is split.
This command is used as part of the SPLITSHARD command but it can be used for non-cloud Solr cores as
well. When used against a non-cloud core without the split.key parameter, this action will split the source
index and distribute its documents alternately so that each split piece contains an equal number of documents.
If the split.key parameter is specified then only documents having the same route key will be split from the
source index.
REQUESTSTATUS
Request the status of an already submitted asynchronous CoreAdmin API call.
Input
Query Parameters
Parameter Type Required Default Description
requestid string Yes The user defined request-id for the Asynchronous request.
The call below will return the status of an already submitted Asynchronous CoreAdmin call.
http://localhost:8983/solr/admin/cores?action=REQUESTSTATUS&requestid=1
Config Sets
On a multicore Solr instance, you may find that you want to share configuration between a number of different
cores. You can achieve this using named configsets, which are essentially shared configuration directories
stored under a configurable configset base directory.
To create a configset, simply add a new directory under the configset base directory. The configset will be
identified by the name of this directory. Then copy the config directory you want to share into it. The structure
should look something like this:
/<configSetBaseDir>
/configset1
/conf
/schema.xml
/solrconfig.xml
/configset2
/conf
/schema.xml
/solrconfig.xml
The default base directory is $SOLR_HOME/configsets, and it can be configured in solr.xml.
To create a new core using a configset, pass configSet as one of the core properties. For example, if you do
this via the core admin API:
http://<solr>/admin/cores?action=CREATE&name=mycore&instanceDir=path/to/instanc
e&configSet=configset2
Configuration APIs
Solr includes several APIs that can be used to modify settings in solrconfig.xml.
Blob Store API
Config API
Managed Resources
Request Parameters API
Blob Store API
The Blob Store REST API provides REST methods to store, retrieve or list files in a Lucene index. This can be
used to upload a jar file which contains standard solr components such as RequestHandlers,
SearchComponents, or other custom code you have written for Solr.
When using the blob store, note that the API does not delete or overwrite a previous object if a new one is
uploaded with the same name. It always adds a new version of the blob to the index. Deletes can be performed
with standard REST delete commands.
The blob store is only available when running in SolrCloud mode. Solr in standalone mode does not support
use of a blob store.
The blob store API is implemented as a requestHandler. A special collection named ".system" must be created
as the collection that contains the blob store index.
Create a .system Collection
Before using the blob store, a special collection must be created and it must be named .system.
The BlobHandler is automatically registered in the .system collection. The schema.xml and other configuration
files for the collection are automatically provided by the system and don't need to be defined specifically.
If you do not use the -shards or -replicationFactor options, then defaults of 1 shard and 1 replica will be
used.
You can create the .system collection with the Collections API, as in this example:
curl
"http://localhost:8983/solr/admin/collections?action=CREATE&name=.system&replication
Factor=2"
Note that the bin/solr script cannot be used to create the .system collection at this time. Also, please
ensure that there is at least one collection created before creating the .system collection.
Upload Files to Blob Store
After the .system collection has been created, files can be uploaded to the blob store with a request similar to
the following:
curl -X POST -H 'Content-Type: application/octet-stream' --data-binary @{filename}
http://localhost:8983/solr/.system/blob/{blobname}
For example, to upload a file named "test1.jar" as a blob named "test", you would make a POST request like:
curl -X POST -H 'Content-Type: application/octet-stream' --data-binary @test1.jar
http://localhost:8983/solr/.system/blob/test
A GET request will return the list of blobs and other details:
curl http://localhost:8983/solr/.system/blob?omitHeader=true
Output:
{
"response":{"numFound":1,"start":0,"docs":[
{
"id":"test/1",
"md5":"20ff915fa3f5a5d66216081ae705c41b",
"blobName":"test",
"version":1,
"timestamp":"2015-02-04T16:45:48.374Z",
"size":13108}]
}
}
Details on individual blobs can be accessed with a request similar to:
curl http://localhost:8983/solr/.system/blob/{blobname}
For example, this request will return only the blob named 'test':
curl http://localhost:8983/solr/.system/blob/test?omitHeader=true
Output:
{
"response":{"numFound":1,"start":0,"docs":[
{
"id":"test/1",
"md5":"20ff915fa3f5a5d66216081ae705c41b",
"blobName":"test",
"version":1,
"timestamp":"2015-02-04T16:45:48.374Z",
"size":13108}]
}
}
The filestream response writer can return a particular version of a blob for download, as in:
curl http://localhost:8983/solr/.system/blob/{blobname}/{version}?wt=filestream >
{outputfilename}
For the latest version of a blob, the {version} can be omitted,
curl http://localhost:8983/solr/.system/blob/{blobname}?wt=filestream >
{outputfilename}
Use a Blob in a Handler or Component
To use the blob as the class for a request handler or search component, you create a request handler in
solrconfig.xml as usual. You will need to define the following parameters:
class: the fully qualified class name. For example, if you created a new request handler class called
CRUDHandler, you would enter org.apache.solr.core.CRUDHandler.
lib: the name of the blob given when you uploaded it to the blob store.
version: the blob version to use. As mentioned above, blobs are not overwritten, but saved with a
different version.
For example, to use a blob named test, you would configure solrconfig.xml like this:
<requestHandler name="/myhandler" class="org.apache.solr.core.myHandler" lib="test"
version="1">
</requestHandler>
If there are parameters available in the custom handler, you can define them in the same way as any other
request handler definition.
Config API
The Config API enables manipulating various aspects of your solrconfig.xml using REST-like API calls. This
feature is enabled by default and works similarly in both SolrCloud and standalone mode. Many commonly edited
properties (such as cache sizes and commit settings) and request handler definitions can be changed with this
API.
When using this API, solrconfig.xml is not changed. Instead, all edited configuration is stored in a file
called configoverlay.json. The values in configoverlay.json override the values in solrconfig.xml.
API Entry Points
Commands
Commands for Common Properties
Commands for Custom Handlers and Local Components
Commands for User-Defined Properties
How to Map solrconfig.xml Properties to JSON
Examples
Creating and Updating Common Properties
Creating and Updating Request Handlers
Creating and Updating User-Defined Properties
How It Works
Empty Command
Listening to config Changes
API Entry Points
/config: retrieve or modify the config. GET to retrieve and POST for executing commands
/config/overlay: retrieve the details in the configoverlay.json alone
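For example, the current overlay can be retrieved with a simple GET request (a sketch; the collection name
"techproducts" is only an illustration):
curl http://localhost:8983/solr/techproducts/config/overlay?omitHeader=true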
Commands
This API uses specific commands to tell Solr what property or type of property to add to configoverlay.json.
The commands are passed as part of the data sent with the request.
The config commands are categorized into 3 different sections which manipulate various data structures in
solrconfig.xml. Each of these is described below.
Common Properties
Components
User-defined properties
Commands for Common Properties
The common properties are those that frequently need to be customized in a Solr instance. They are
manipulated with two commands:
set-property: Set a well known property. The names of the properties are predefined and fixed. If the
property has already been set, this command will overwrite the previous setting.
unset-property: Remove a property set using the set-property command.
The properties that are configured with these commands are predefined and listed below. The names of these
properties are derived from their XML paths as found in solrconfig.xml.
updateHandler.autoCommit.maxDocs
updateHandler.autoCommit.maxTime
updateHandler.autoCommit.openSearcher
updateHandler.autoSoftCommit.maxDocs
updateHandler.autoSoftCommit.maxTime
updateHandler.commitWithin.softCommit
updateHandler.commitIntervalLowerBound
updateHandler.indexWriter.closeWaitsForMerges
query.filterCache.class
query.filterCache.size
query.filterCache.initialSize
query.filterCache.autowarmCount
query.filterCache.regenerator
query.queryResultCache.class
query.queryResultCache.size
query.queryResultCache.initialSize
query.queryResultCache.autowarmCount
query.queryResultCache.regenerator
query.documentCache.class
query.documentCache.size
query.documentCache.initialSize
query.documentCache.autowarmCount
query.documentCache.regenerator
query.fieldValueCache.class
query.fieldValueCache.size
query.fieldValueCache.initialSize
query.fieldValueCache.autowarmCount
query.fieldValueCache.regenerator
query.useFilterForSortedQuery
query.queryResultWindowSize
query.queryResultMaxDocCached
query.enableLazyFieldLoading
query.boolToFilterOptimizer
query.maxBooleanClauses
jmx.agentId
jmx.serviceUrl
jmx.rootName
requestDispatcher.handleSelect
requestDispatcher.requestParsers.multipartUploadLimitInKB
requestDispatcher.requestParsers.formdataUploadLimitInKB
requestDispatcher.requestParsers.enableRemoteStreaming
requestDispatcher.requestParsers.addHttpRequestToContext
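For instance, a minimal sketch (the value is illustrative) that sets one of the properties above using the set-property command:
curl http://localhost:8983/solr/techproducts/config -H
'Content-type:application/json' -d '{
  "set-property": {"updateHandler.autoCommit.maxTime": 15000}}'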
Commands for Custom Handlers and Local Components
Custom request handlers, search components, and other types of localized Solr components (such as custom query parsers, update processors, etc.) can be added, updated, and deleted with specific commands for the component being modified.
The syntax is similar in each case: add-<component-name>, update-<component-name>, and delete-<component-name>. Please note that the command name is not case sensitive, so Add-RequestHandler, ADD-REQUESTHANDLER, and add-requesthandler are all equivalent. In each case, add- commands add the new configuration to configoverlay.json, which will override any other settings for the component in solrconfig.xml; update- commands modify an existing setting in configoverlay.json; and delete- commands remove the setting from configoverlay.json. Settings removed from configoverlay.json are not removed from solrconfig.xml.
The full list of available commands follows below:
General Purpose Commands
These commands are the most commonly used:
add-requesthandler
update-requesthandler
delete-requesthandler
add-searchcomponent
update-searchcomponent
delete-searchcomponent
add-initparams
update-initparams
delete-initparams
add-queryresponsewriter
update-queryresponsewriter
delete-queryresponsewriter
Advanced Commands
These commands allow registering more advanced customizations to Solr:
add-queryparser
update-queryparser
delete-queryparser
add-valuesourceparser
update-valuesourceparser
delete-valuesourceparser
add-transformer
update-transformer
delete-transformer
add-updateprocessor
update-updateprocessor
delete-updateprocessor
add-queryconverter
update-queryconverter
delete-queryconverter
add-listener
update-listener
delete-listener
add-runtimelib
update-runtimelib
delete-runtimelib
See the Creating and Updating Request Handlers section below for examples of using these commands.
What about <updateRequestProcessorChain>?
The Config API does not let you create or edit <updateRequestProcessorChain> elements. However, it is possible to create <updateProcessor> entries and use them by name to create a chain.
For example:
curl http://localhost:8983/solr/techproducts/config -H
'Content-type:application/json' -d '{
"add-updateprocessor" : {"name" : "firstFld",
"class": "solr.FirstFieldValueUpdateProcessorFactory",
"fieldName":"test_s"}}'
You can use this directly in your request by adding a parameter for the specific update processor, processor=firstFld, in place of an <updateRequestProcessorChain>. A hedged example request follows.
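As a sketch of such an update request (the document fields are illustrative; the processor request parameter names the update processor defined above):
curl "http://localhost:8983/solr/techproducts/update?processor=firstFld&commit=true" -H
'Content-type:application/json' -d '[{"id":"doc1","test_s":["first","second"]}]'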
Commands for User-Defined Properties
Solr lets users templatize the solrconfig.xml using the place holder format ${variable_name:default_val}. You could set the values using system properties, for example, -Dvariable_name=my_customvalue. The same can be achieved during runtime using these commands:
set-user-property: Set a user-defined property. If the property has already been set, this command
will overwrite the previous setting.
unset-user-property: Remove a user-defined property.
The structure of the request is similar to the structure of requests using other commands, in the format of "command":{"variable_name":"property_value"}. You can add more than one variable at a time if necessary.
For more information about user-defined properties, see the section User defined properties from core.properties. See also the Creating and Updating User-Defined Properties section below for examples of how to use this type of command.
How to Map solrconfig.xml Properties to JSON
By using this API, you will be generating JSON representations of properties defined in solrconfig.xml. To understand how properties should be represented with the API, let's take a look at a few examples.
Here is what a request handler looks like in solrconfig.xml:
<requestHandler name="/query" class="solr.SearchHandler">
<lst name="defaults">
<str name="echoParams">explicit</str>
<str name="wt">json</str>
<str name="indent">true</str>
</lst>
</requestHandler>
The same request handler defined with the Config API would look like this:
{"add-requesthandler":{
"name":"/query",
"class":"solr.SearchHandler"
"defaults":{
"echoParams":"explicit",
"wt":"json",
"indent":true}
}
A searchComponent in solrconfig.xml looks like this:
<searchComponent name="elevator" class="solr.QueryElevationComponent" >
<str name="queryFieldType">string</str>
<str name="config-file">elevate.xml</str>
</searchComponent>
And the same searchComponent with the Config API:
{"add-searchcomponent":
"name":"elevator",
"class":"QueryElevationComponent",
"queryFieldType":"string",
"config-file":"elevate.xml"
}
Set autoCommit properties in solrconfig.xml:
<autoCommit>
<maxTime>15000</maxTime>
<openSearcher>false</openSearcher>
</autoCommit>
Define the same properties with the Config API:
{"set-property":{
"updateHandler.autoCommit.maxTime":15000,
"updateHandler.autoCommit.openSearcher":false
}
Name Components for the Config API
The Config API always allows changing the configuration of any component by name. However, some configurations, such as listener or initParams, do not require a name in solrconfig.xml. In order to be able to update and delete the same item in configoverlay.json, the name attribute becomes mandatory.
Examples
Creating and Updating Common Properties
This change sets the query.filterCache.autowarmCount to 1000 items and unsets the query.filterCache.size.
curl http://localhost:8983/solr/techproducts/config -H
'Content-type:application/json' -d'{
"set-property" : {"query.filterCache.autowarmCount":1000},
"unset-property" :"query.filterCache.size"}'
Using the /config/overlay endpoint, you can verify the changes with a request like this:
curl http://localhost:8983/solr/techproducts/config/overlay?omitHeader=true
And you should get a response like this:
{
"overlay":{
"znodeVersion":1,
"props":{"query":{"filterCache":{
"autowarmCount":1000,
"size":25}}}}}
Creating and Updating Request Handlers
To create a request handler, we can use the add-requesthandler command:
curl http://localhost:8983/solr/techproducts/config -H
'Content-type:application/json' -d '{
"add-requesthandler" : {"name": "/mypath" ,
"class":"solr.DumpRequestHandler" ,
"defaults":{"x":"y" ,"a":"b", "wt":"json",
"indent":true},
"useParams":"x"
},
}'
Make a call to the new request handler to check if it is registered:
curl http://localhost:8983/solr/techproducts/mypath?omitHeader=true
And you should see the following as output:
{
"params":{
"indent":"true",
"a":"b",
"x":"y",
"wt":"json"},
"context":{
"webapp":"/solr",
"path":"/mypath",
"httpMethod":"GET"}}
To update a request handler, you should use the update-requesthandler command:
curl http://localhost:8983/solr/techproducts/config -H
'Content-type:application/json' -d '{
"update-requesthandler" : {"name": "/mypath" ,
"class":"solr.DumpRequestHandler" ,
"defaults":{"x":"new value for X"
,"wt":"json","indent":true},
"useParams":"x"
}}'
As another example, we'll create another request handler, this time adding the 'terms' component as part of the
definition:
curl http://localhost:8983/solr/techproducts/config -H
'Content-type:application/json' -d '{
"add-requesthandler" : {"name": "/myterms" ,
"class":"solr.Searchhandler" ,
"defaults":{"terms":true ,"distrib":false},
"components":["terms"]
}}'
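To verify the new handler, you could issue a terms request against it; this is a sketch, and the field named with the standard terms.fl parameter is illustrative:
curl "http://localhost:8983/solr/techproducts/myterms?terms.fl=name&wt=json"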
Creating and Updating User-Defined Properties
This command sets a user property.
curl http://localhost:8983/solr/techproducts/config
-H'Content-type:application/json' -d '{
"set-user-property" : {"variable_name":"some_value"}}'
Again, we can use the /config/overlay endpoint to verify the changes have been made:
curl http://localhost:8983/solr/techproducts/config/overlay?omitHeader=true
And we would expect to see output like this:
{"overlay":{
"znodeVersion":5,
"userProps":{
"variable_name":"some_value"}}
}
To unset the variable, issue a command like this:
curl http://localhost:8983/solr/techproducts/config
-H'Content-type:application/json' -d '{"unset-user-property" : "variable_name"}'
How It Works
Every core watches the ZooKeeper directory for the configset being used with that core. In standalone mode, however, there is no watch (because ZooKeeper is not running). If there are multiple cores in the same node using the same configset, only one ZooKeeper watch is used. For instance, if the configset 'myconf' is used by a core, the node would watch /configs/myconf. Every write operation performed through the API would 'touch' the directory (sets an empty byte[] to trigger watches) and all watchers are notified. Every core would check if schema.xml, solrconfig.xml, or configoverlay.json is modified by comparing the znode versions and, if modified, the core is reloaded.
If params.json is modified, the params object is just updated without a core reload (see the section Request Parameters API for more information about params.json).
Empty Command
If an empty command is sent to the /config endpoint, the watch is triggered on all cores using this configset.
For example:
curl http://localhost:8983/solr/techproducts/config
-H'Content-type:application/json' -d '{}'
Directly editing any files without 'touching' the directory will not make it visible to all nodes.
It is possible for components to watch for the configset 'touch' events by registering a listener using SolrCore#registerConfListener().
Listening to config Changes
Any component can register a listener using:
SolrCore#addConfListener(Runnable listener)
to get notified for config changes. This is not very useful if the files modified result in core reloads (i.e., configoverlay.json or the managed schema). Components can use this to reload the files they are interested in.
Managed Resources
Managed resources expose a REST API endpoint for performing Create-Read-Update-Delete (CRUD)
operations on a Solr object. Any long-lived Solr object that has configuration settings and/or data is a good
candidate to be a managed resource. Managed resources complement other programmatically manageable
components in Solr, such as the RESTful schema API to add fields to a managed schema. Consider a
Web-based UI that offers Solr-as-a-Service where users need to configure a set of stop words and synonym
mappings as part of an initial setup process for their search application. This type of use case can easily be
supported using the Managed Stop Filter & Managed Synonym Filter Factories provided by Solr, via the
Managed resources REST API. Users can also write their own custom plugins that leverage the same internal
hooks to make additional resources REST managed.
All of the examples in this section assume you are running the "techproducts" Solr example:
bin/solr -e techproducts
Overview
Let's begin learning about managed resources by looking at a couple of examples provided by Solr for managing
stop words and synonyms using a REST API. After reading this section, you'll be ready to dig into the details of
how managed resources are implemented in Solr so you can start building your own implementation.
Stop words
To begin, you need to define a field type that uses the ManagedStopFilterFactory, such as:
<fieldType name="managed_en" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ManagedStopFilterFactory"
managed="english" />
</analyzer>
</fieldType>
There are two important things to notice about this field type definition. First, the filter implementation class is solr.ManagedStopFilterFactory. This is a special implementation of the StopFilterFactory that uses a set of stop words that are managed from a REST API. Second, the managed="english" attribute gives a name to the set of managed stop words, in this case indicating the stop words are for English text.
The REST endpoint for managing the English stop words in the techproducts collection is: /solr/techproducts/schema/analysis/stopwords/english.
The example resource path should be mostly self-explanatory. It should be noted that the ManagedStopFilterFactory implementation determines the /schema/analysis/stopwords part of the path, which makes sense because this is an analysis component defined by the schema. It follows that a field type that uses the following filter:
<filter class="solr.ManagedStopFilterFactory"
managed="french" />
would resolve to the path /solr/techproducts/schema/analysis/stopwords/french.
So now let’s see this API in action, starting with a simple GET request:
curl "http://localhost:8983/solr/techproducts/schema/analysis/stopwords/english"
Assuming you sent this request to Solr, the response body is a JSON document:
{
"responseHeader":{
"status":0,
"QTime":1
},
"wordSet":{
"initArgs":{"ignoreCase":true},
"initializedOn":"2014-03-28T20:53:53.058Z",
"managedList":[
"a",
"an",
"and",
"are",
... ]
}
}
The sample_techproducts_configs config set ships with a pre-built set of managed stop words; however, you should only interact with this file using the API and not edit it directly.
One thing that should stand out to you in this response is that it contains a managedList of words as well as initArgs. This is an important concept in this framework: managed resources typically have configuration and data. For stop words, the only configuration parameter is a boolean that determines whether to ignore the case of tokens during stop word filtering (ignoreCase=true|false). The data is a list of words, which is represented as a JSON array named managedList in the response.
Now, let’s add a new word to the English stop word list using an HTTP PUT:
curl -X PUT -H 'Content-type:application/json' --data-binary '["foo"]'
"http://localhost:8983/solr/techproducts/schema/analysis/stopwords/english"
Here we’re using cURL to PUT a JSON list containing a single word “foo” to the managed English stop words
set. Solr will return 200 if the request was successful. You can also put multiple words in a single PUT request.
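For example, a single PUT can add several words at once (the words themselves are illustrative):
curl -X PUT -H 'Content-type:application/json' --data-binary '["foo","bar","baz"]'
"http://localhost:8983/solr/techproducts/schema/analysis/stopwords/english"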
You can test to see if a specific word exists by sending a GET request for that word as a child resource of the
set, such as:
curl "http://localhost:8983/solr/techproducts/schema/analysis/stopwords/english/foo"
This request will return a status code of 200 if the child resource (foo) exists, or 404 if it does not exist in the managed list.
To delete a stop word, you would do:
curl -X DELETE
"http://localhost:8983/solr/techproducts/schema/analysis/stopwords/english/foo"
Note: PUT/POST is used to add terms to an existing list instead of replacing the list entirely. This is because it is
more common to add a term to an existing list than it is to replace a list altogether, so the API favors the more
common approach of incrementally adding terms, especially since deleting individual terms is also supported.
Synonyms
For the most part, the API for managing synonyms behaves similarly to the API for stop words, except instead of working with a list of words, it uses a map, where the value for each entry in the map is a set of synonyms for a term. As with stop words, the sample_techproducts_configs config set includes a pre-built set of synonym mappings suitable for the sample data, activated by the following field type definition in schema.xml:
<fieldType name="managed_en" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ManagedStopFilterFactory"
managed="english" />
<filter class="solr.ManagedSynonymFilterFactory"
managed="english" />
</analyzer>
</fieldType>
To get the map of managed synonyms, send a GET request to:
curl "http://localhost:8983/solr/techproducts/schema/analysis/synonyms/english"
This request will return a response that looks like:
{
"responseHeader":{
"status":0,
"QTime":3},
"synonymMappings":{
"initArgs":{
"ignoreCase":true,
"format":"solr"},
"initializedOn":"2014-12-16T22:44:05.33Z",
"managedMap":{
"GB":
["GiB",
"Gigabyte"],
"TV":
["Television"],
"happy":
["glad",
"joyful"]}}}
Managed synonyms are returned under the managedMap property, which contains a JSON map where the value of each entry is a set of synonyms for a term; for example, "happy" has the synonyms "glad" and "joyful" in the example above.
To add a new synonym mapping, you can PUT/POST a single mapping such as:
curl -X PUT -H 'Content-type:application/json' --data-binary
'{"mad":["angry","upset"]}'
"http://localhost:8983/solr/techproducts/schema/analysis/synonyms/english"
The API will return status code 200 if the PUT request was successful. To determine the synonyms for a specific term, you send a GET request for the child resource; for example, /schema/analysis/synonyms/english/mad would return ["angry","upset"]. Lastly, you can delete a mapping by sending a DELETE request to the managed endpoint.
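For example, to remove the "mad" mapping added above, a DELETE to the child resource (mirroring the stop words example) would look like this:
curl -X DELETE
"http://localhost:8983/solr/techproducts/schema/analysis/synonyms/english/mad"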
Applying Changes
Changes made to managed resources via this REST API are not applied to the active Solr components until the
Solr collection (or Solr core in single server mode) is reloaded. For example, after adding or deleting a stop word, you must reload the core/collection before changes become active.
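For example, a reload can be triggered with the Collections API in SolrCloud mode, or with the CoreAdmin API for a standalone core (the collection and core names here are illustrative):
# SolrCloud: reload the collection
curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=techproducts"
# Standalone: reload a single core
curl "http://localhost:8983/solr/admin/cores?action=RELOAD&core=techproducts"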
This approach is required when running in distributed mode so that we are assured changes are applied to all
cores in a collection at the same time so that behavior is consistent and predictable. It goes without saying that
you don’t want one of your replicas working with a different set of stop words or synonyms than the others.
One subtle outcome of this apply-changes-at-reload approach is that once you make changes with the API, there is no way to read the active data. In other words, the API returns the most up-to-date data from an API
perspective, which could be different than what is currently being used by Solr components. However, the intent
of this API implementation is that changes will be applied using a reload within a short time frame after making
them so the time in which the data returned by the API differs from what is active in the server is intended to be
negligible.
RestManager Endpoint
Metadata about registered ManagedResources is available using the /schema/managed and /config/managed endpoints for each collection. Assuming you have the managed_en field type shown above defined in your schema.xml, sending a GET request to the following resource will return metadata about which schema-related
resources are being managed by the RestManager:
curl "http://localhost:8983/solr/techproducts/schema/managed"
The response body is a JSON document containing metadata about managed resources under
the /schema root:
Changing things like stop words and synonym mappings typically requires re-indexing existing documents if they are used by index-time analyzers. The RestManager framework does not guard you from this; it simply makes it possible to programmatically build up a set of stop words, synonyms, etc.
{
"responseHeader":{
"status":0,
"QTime":3
},
"managedResources":[
{
"resourceId":"/schema/analysis/stopwords/english",
"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource",
"numObservers":"1"
},
{
"resourceId":"/schema/analysis/synonyms/english",
"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymFilterFactory$SynonymMan
ager",
"numObservers":"1"
}
]
}
You can also create a new managed resource using PUT/POST to the appropriate URL, before ever configuring anything that uses these resources.
For example, imagine we want to build up a set of German stop words. Before we can start adding stop words, we need to create the endpoint:
/solr/techproducts/schema/analysis/stopwords/german
To create this endpoint, send the following PUT/POST request to the endpoint we wish to create:
curl -X PUT -H 'Content-type:application/json' --data-binary \
'{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}' \
"http://localhost:8983/solr/techproducts/schema/analysis/stopwords/german"
Solr will respond with status code 200 if the request is successful. Effectively, this action registers a new
endpoint for a managed resource in the RestManager. From here you can start adding German stop words as
we saw above:
curl -X PUT -H 'Content-type:application/json' --data-binary '["die"]' \
"http://localhost:8983/solr/techproducts/schema/analysis/stopwords/german"
For most users, creating resources in this way should never be necessary, since managed resources are created
automatically when configured.
However, you may want to explicitly delete managed resources if they are no longer being used by a Solr component.
For instance, the managed resource for German that we created above can be deleted because there are no
Solr components that are using it, whereas the managed resource for English stop words cannot be deleted
because there is a token filter declared in schema.xml that is using it.
curl -X DELETE
"http://localhost:8983/solr/techproducts/schema/analysis/stopwords/german"
Related Topics
Using Solr’s REST APIs to manage stop words and synonyms by Tim Potter @ SearchHub.org
Request Parameters API
The Request Parameters API allows creating parameter sets that can override or take the place of parameters defined in solrconfig.xml. The parameter sets defined with this API can be used in requests to Solr, or referenced directly in solrconfig.xml request handler definitions.
It is really another endpoint of the Config API instead of a separate API, and it has distinct commands. It does not replace or modify any sections of solrconfig.xml, but instead provides another approach to handling parameters used in requests. It behaves in the same way as the Config API, by storing parameters in another file that will be used at runtime. In this case, the parameters are stored in a file named params.json. This file is kept in ZooKeeper or in the conf directory of a standalone Solr instance.
The settings stored in params.json are used at query time to override settings defined in solrconfig.xml in some cases, as described below.
When might you want to use this feature?
To avoid frequently editing your solrconfig.xml to update request parameters that change often.
To reuse parameters across various request handlers.
To mix and match parameter sets at request time.
To avoid a reload of your collection for small parameter changes.
The Request Parameters Endpoint
All requests are sent to the /config/params endpoint of the Config API.
Setting Request Parameters
The request to set, unset, or update request parameters is sent as a set of Maps with names. These objects can
be directly used in a request or a request handler definition.
The available commands are:
set: Create or overwrite a parameter set map.
unset: Delete a parameter set map.
update: Update a parameter set map. This essentially overwrites the old parameter set, so all parameters must be sent in each update request.
You can mix these commands into a single request if necessary.
Each map must include a name so it can be referenced later, either in a direct request to Solr or in a request
handler definition.
In the following example, we are setting 2 sets of parameters named 'myFacets' and 'myQueries'.
curl http://localhost:8983/solr/techproducts/config/params -H
'Content-type:application/json' -d '{
"set":{
"myFacets":{
"facet":"true",
"facet.limit":5}},
"set":{
"myQueries":{
"defType":"edismax",
"rows": "5",
"df":"text_all"}}
}'
To see the parameters that have been set, you can use the /config/params endpoint to read the contents of params.json, or use the parameter set name in the request:
curl http://localhost:8983/solr/techproducts/config/params
#Or use the params name
curl http://localhost:8983/solr/techproducts/config/params/myQueries
The useParams Parameter
When making a request, the useParams parameter applies the named parameter set to the request. This is translated at request time to the actual params.
For example (using the names we set up in the earlier example, please replace with your own name):
http://localhost/solr/techproducts/select?useParams=myQueries
It is possible to pass more than one parameter set in the same request. For example:
http://localhost/solr/techproducts/select?useParams=myFacets,myQueries
In the above example the param set 'myQueries' is applied on top of 'myFacets'. So, values in 'myQueries'
take precedence over values in 'myFacets'. Additionally, any values passed in the request take precedence over
'useParams' params. This acts like the "defaults" specified in the '<requestHandler>' definition in solrconfig.xml.
The parameter sets can be used directly in a request handler definition as follows. Please note that the
'useParams' specified here is ignored if the request itself contains useParams.
<requestHandler name="/terms" class="solr.SearchHandler" useParams="myQueries">
<lst name="defaults">
<bool name="terms">true</bool>
<bool name="distrib">false</bool>
</lst>
<arr name="components">
<str>terms</str>
</arr>
</requestHandler>
To summarize, parameters are applied in this order:
parameters defined in <invariants> in solrconfig.xml.
parameters defined in the request directly.
parameter sets defined in the request, in the order they have been listed with useParams.
parameter sets defined in params.json that have been defined in the request handler.
parameters defined in <defaults> in solrconfig.xml.
Public APIs
The RequestParams object can be accessed using the method SolrConfig#getRequestParams(). Each paramset can be accessed by its name using the method RequestParams#getRequestParams(String name).
Solr Plugins
Solr allows you to load custom code to perform a variety of tasks within Solr, from custom Request Handlers to
process your searches, to custom Analyzers and Token Filters for your text fields. You can even load custom
Field Types. These pieces of custom code are called plugins.
Not everyone will need to create plugins for their Solr instances - what's provided is usually enough for most
applications. However, if there's something that you need, you may want to review the Solr Wiki documentation on plugins at SolrPlugins.
If you have a plugin you would like to use, and you are running in SolrCloud mode, you can use the Blob Store
API and the Config API to load the jars to Solr. The commands to use are described in the section Adding Custom Plugins in SolrCloud Mode.
Adding Custom Plugins in SolrCloud Mode
When running Solr in SolrCloud mode and you want to use custom code (such as custom analyzers, tokenizers,
query parsers, and other plugins), it can be difficult to add jars to the classpath on all nodes in your cluster. Using
the and special commands with the , you can upload jars to a special system-levelBlob Store API Config API
collection and dynamically load them at runtime.
Uploading Jar Files
The first step is to use the Blob Store API to upload your jar files. This will put your jars in the .system collection and distribute them across your SolrCloud nodes. These jars are added to a separate classloader and are only accessible to components that are configured with the property runtimeLib=true. These components are loaded lazily because the .system collection may not be loaded when a particular core is loaded.
This Feature is Disabled By Default
In addition to requiring that Solr be running in SolrCloud mode, this feature is also disabled by default unless all Solr nodes are run with the -Denable.runtime.lib=true option on startup.
Before enabling this feature, users should carefully consider the issues discussed in the Securing Runtime Libraries section below.
Config API Commands to use Jars as Runtime Libraries
The runtime library feature uses a special set of commands for the Config API to add, update, or remove jar files currently available in the blob store to the list of runtime libraries.
The following commands are used to manage runtime libs:
add-runtimelib
update-runtimelib
delete-runtimelib
curl http://localhost:8983/solr/techproducts/config -H
'Content-type:application/json' -d '{
"add-runtimelib" : {"name": "jarblobname" , "version":2 },
"update-runtimelib" :{"name": "jarblobname" ,"version":3},
"delete-runtimelib" :"jarblobname"
}'
The name to use is the name of the blob that you specified when you uploaded your jar to the blob store. You
should also include the version of the jar found in the blob store that you want to use. These details are added to configoverlay.json.
The default SolrResourceLoader does not have visibility to the jars that have been defined as runtime libraries. There is a classloader that can access these jars which is made available only to those components
which are specially annotated.
Every pluggable component can have an optional extra attribute called runtimeLib=true, which means that the components are not loaded at core load time. Instead, they will be loaded on demand. If all the dependent
jars are not available when the component is loaded, an error is thrown.
This example shows creating a ValueSourceParser using a jar that has been loaded to the Blob store.
curl http://localhost:8983/solr/techproducts/config -H
'Content-type:application/json' -d '{
  "add-valuesourceparser" : {
    "name": "nvl",
    "runtimeLib": true,
    "class":"solr.org.apache.solr.search.function.NvlValueSourceParser",
    "nvlFloatValue":0.0 }
}'
Securing Runtime Libraries
A drawback of this feature is that it could be used to load malicious executable code into the system. However, it
is possible to restrict the system to load only trusted jars by using PKI to verify that the executables loaded into the system are trustworthy.
The following steps will allow you to enable security for this feature. The instructions assume you have started all your Solr nodes with the -Denable.runtime.lib=true option.
Step 1: Generate an RSA Private Key
The first step is to generate an RSA private key. The example below uses a 512-bit key, but you should use the
strength appropriate to your needs.
$ openssl genrsa -out priv_key.pem 512
Step 2: Output the Public Key
The public portion of the key should be output in DER format so Java can read it.
$ openssl rsa -in priv_key.pem -pubout -outform DER -out pub_key.der
Step 3: Load the Key to ZooKeeper
The .der files that are output from Step 2 should then be loaded to ZooKeeper under a /keys/exe node so they are available throughout every node. You can load any number of public keys to that node and all are valid. If a key is removed from the directory, the signatures created with that key will cease to be valid. So, before removing a key, make sure to update your runtime library configurations with valid signatures using the update-runtimelib command.
At the current time, you can only use the ZooKeeper zkCli.sh script (or zkCli.cmd on Windows) to issue these commands (the Solr version has the same name, but is not the same). If you are running the embedded ZooKeeper that is included with Solr, you do not have this script already; in order to use it, you will need to download a copy of ZooKeeper v3.4.6 from http://zookeeper.apache.org/. Don't worry about configuring the download, you're just trying to get the command line utility script. When you start the script, you will connect to the embedded ZooKeeper. If you have your own ZooKeeper ensemble running already, you can find the script in $ZK_INSTALL/bin/zkCli.sh (or zkCli.cmd if you are using Windows).
To load the keys, you will need to connect to ZooKeeper with zkCli.sh, create the directories, and then create the key file, as in the following example.
# Connect to ZooKeeper
# Replace the server location below with the correct ZooKeeper connect string for
your installation.
$ ./bin/zkCli.sh -server localhost:9983
# After connection, you will interact with the ZK prompt.
# Create the directories
[zk: localhost:9983(CONNECTED) 5] create /keys
[zk: localhost:9983(CONNECTED) 5] create /keys/exe
# Now create the public key file in ZooKeeper
# The second path is the path to the .der file on your local machine
[zk: localhost:9983(CONNECTED) 5] create /keys/exe/pub_key.der
/myLocal/pathTo/pub_key.der
After this, any attempt to load a jar will fail. All your jars must be signed with one of your private keys for Solr to trust them. The process to sign your jars and use the signature is outlined in Steps 4-6.
Step 4: Sign the jar File
Next you need to sign the sha1 digest of your jar file and get the base64 string.
$ openssl dgst -sha1 -sign priv_key.pem myjar.jar | openssl enc -base64
The output of this step will be a string that you will need in order to add the jar to your classpath in Step 6 below.
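As a small convenience, you can capture that string in a shell variable so it can be pasted into the Step 6 request; this is a sketch, and the file names simply match the earlier steps:
$ SIG=$(openssl dgst -sha1 -sign priv_key.pem myjar.jar | openssl enc -base64 | tr -d '\n')
$ echo "$SIG"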
Step 5: Load the jar to the Blob Store
Load your jar to the Blob store, using the Blob Store API. This step does not require a signature; you will need
the signature in Step 6 to add it to your classpath.
curl -X POST -H 'Content-Type: application/octet-stream' --data-binary @{filename}
http://localhost:8983/solr/.system/blob/{blobname}
The blob name that you give the jar file in this step will be used as the name in the next step.
Step 6: Add the jar to the Classpath
Finally, add the jar to the classpath using the Config API as detailed above. In this step, you will need to provide
the signature of the jar that you got in Step 4.
curl http://localhost:8983/solr/techproducts/config -H
'Content-type:application/json' -d '{
"add-runtimelib" : {
"name":"blobname",
"version":2,
"sig":"mW1Gwtz2QazjfVdrLFHfbGwcr8xzFYgUOLu68LHqWRDvLG0uLcy1McQ+AzVmeZFBf1yLPDEHBWJb5
KXr8bdbHN/
PYgUB1nsr9pk4EFyD9KfJ8TqeH/ijQ9waa/vjqyiKEI9U550EtSzruLVZ32wJ7smvV0fj2YYhrUaaPzOn9g0
=" }
}'
JVM Settings
Configuring your JVM can be a complex topic. A full discussion is beyond the scope of this document. Luckily,
most modern JVMs are quite good at making the best use of available resources with default settings. The
following sections contain a few tips that may be helpful when the defaults are not optimal for your situation.
For more general information about improving Solr performance, see https://wiki.apache.org/solr/SolrPerformanc
.eFactors
Choosing Memory Heap Settings
The most important JVM configuration settings are those that determine the amount of memory it is allowed to
allocate. There are two primary command-line options that set memory limits for the JVM. These are -Xms, which sets the initial size of the JVM's memory heap, and -Xmx, which sets the maximum size to which the heap is allowed to grow.
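For example (a sketch; the size is illustrative), the bin/solr start script accepts a -m option that sets both of these flags to the same value:
# Start Solr with both -Xms and -Xmx set to 2 GB
bin/solr start -m 2g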
If your Solr application requires more heap space than you specify with the option, the heap will grow-Xms
automatically. It's quite reasonable to not specify an initial size and let the heap grow as needed. The only
downside is a somewhat slower startup time since the application will take longer to initialize. Setting the initial
heap size higher than the default may avoid a series of heap expansions, which often results in objects being
shuffled around within the heap, as the application spins up.
The maximum heap size, set with -Xmx, is more critical. If the memory heap grows to this size, object creation may begin to fail and throw OutOfMemoryException. Setting this limit too low can cause spurious errors in
your application, but setting it too high can be detrimental as well.
It doesn't always cause an error when the heap reaches the maximum size. Before an error is raised, the JVM
will first try to reclaim any available space that already exists in the heap. Only if all garbage collection attempts
fail will your application see an exception. As long as the maximum is big enough, your app will run without error,
but it may run more slowly if forced garbage collection kicks in frequently.
The larger the heap the longer it takes to do garbage collection. This can mean minor, random pauses or, in
extreme cases, "freeze the world" pauses of a minute or more. As a practical matter, this can become a serious
problem for heap sizes that exceed about two gigabytes, even if far more physical memory is available. On
robust hardware, you may get better results running multiple JVMs, rather than just one with a large memory
heap. Some specialized JVM implementations may have customized garbage collection algorithms that do better
with large heaps. Also, Java 7 is expected to have a redesigned GC that should handle very large heaps
efficiently. Consult your JVM vendor's documentation.
When setting the maximum heap size, be careful not to let the JVM consume all available physical memory. If
the JVM process space grows too large, the operating system will start swapping it, which will severely impact
performance. In addition, the operating system uses memory space not allocated to processes for file system
cache and other purposes. This is especially important for I/O-intensive applications, like Lucene/Solr. The larger
your indexes, the more you will benefit from filesystem caching by the OS. It may require some experimentation
to determine the optimal tradeoff between heap space for the JVM and memory space for the OS to use.
On systems with many CPUs/cores, it can also be beneficial to tune the layout of the heap and/or the behavior of
the garbage collector. Adjusting the relative sizes of the generational pools in the heap can affect how often GC
sweeps occur and whether they run concurrently. Configuring the various settings of how the garbage collector
should behave can greatly reduce the overall performance impact when it does run. There is a lot of good
information on this topic available on Sun's website. A good place to start is here: Oracle's Java HotSpot Garbage Collection.
Use the Server HotSpot VM
If you are using Sun's JVM, add the -server command-line option when you start Solr. This tells the JVM that it should optimize for a long-running server process. If the Java runtime on your system is a JRE, rather than a full JDK distribution (including javac and other development tools), then it is possible that it may not support the -server JVM option. Test this by running java -help and looking for -server as an available option in the displayed usage message.
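For example, a quick way to perform that check from a shell (a sketch; java -help prints its usage to stderr on many JVMs, hence the redirect):
$ java -help 2>&1 | grep -- '-server'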
Checking JVM Settings
A great way to see what JVM settings your server is using, along with other useful information, is to use the admin RequestHandler, solr/admin/system. This request handler will display a wealth of server statistics and settings.
You can also use any of the tools that are compatible with the Java Management Extensions (JMX). See the section Using JMX with Solr in Managing Solr for more information.
Managing Solr
This section describes how to run Solr and how to look at Solr when it is running. It contains the following
sections:
Taking Solr to Production: Describes how to install Solr as a service on Linux for production environments.
Configuring Logging: Describes how to configure logging for Solr.
Enabling SSL: Describes how to configure single-node Solr and SolrCloud to encrypt internal and external
communication using SSL.
Backing Up: Describes backup strategies for your Solr indexes.
Using JMX with Solr: Describes how to use Java Management Extensions with Solr.
Managed Resources: Describes the REST APIs for dealing with resources that various Solr plugins may expose.
Running Solr on HDFS: How to use HDFS to store your Solr indexes and transaction logs.
Taking Solr to Production
This section provides guidance on how to set up Solr to run in production on *nix platforms, such as Ubuntu. Specifically, we'll walk through the process of setting up a single Solr instance on a Linux host and then
provide tips on how to support multiple Solr nodes running on the same host.
Service Installation Script
Planning your directory structure
Solr Installation Directory
Separate Directory for Writable Files
Create the Solr user
Run the Solr Installation Script
Solr Home Directory
Environment overrides include file
Log settings
init.d script
Progress Check
Fine tune your production setup
Memory and GC Settings
Out-of-Memory Shutdown Hook
SolrCloud
ZooKeeper chroot
Solr Hostname
Override settings in solrconfig.xml
Enable Remote JMX Access
Running multiple Solr nodes per host
Service Installation Script
Solr includes a service installation script (bin/install_solr_service.sh) to help you install Solr as a
service on Linux. Currently, the script only supports Red Hat, Ubuntu, Debian, and SUSE Linux distributions.
Before running the script, you need to determine a few parameters about your setup. Specifically, you need to
decide where to install Solr and which system user should be the owner of the Solr files and process.
Planning your directory structure
We recommend separating your live Solr files, such as logs and index files, from the files included in the Solr
distribution bundle, as that makes it easier to upgrade Solr and is considered a good practice to follow as a
system administrator.
Solr Installation Directory
By default, the service installation script will extract the distribution archive into /opt. You can change this location using the -i option when running the installation script. The script will also create a symbolic link to the
versioned directory of Solr. For instance, if you run the installation script for Solr 5.0.0, then the following
directory structure will be used:
/opt/solr-5.0.0
/opt/solr -> /opt/solr-5.0.0
Using a symbolic link insulates any scripts from being dependent on the specific Solr version. If, down the road,
you need to upgrade to a later version of Solr, you can just update the symbolic link to point to the upgraded
version of Solr. We'll use /opt/solr to refer to the Solr installation directory in the remaining sections of this page.
Separate Directory for Writable Files
You should also separate writable Solr files into a different directory; by default, the installation script uses /var/solr, but you can override this location using the -d option. With this approach, the files in /opt/solr will remain untouched and all files that change while Solr is running will live under /var/solr.
Create the Solr user
Running Solr as root is not recommended for security reasons. Consequently, you should determine the username of a system user that will own all of the Solr files and the running Solr process. By default, the installation script will create the solr user, but you can override this setting using the -u option. If your organization has specific requirements for creating new user accounts, then you should create the user before running the script. The installation script will make the Solr user the owner of the /opt/solr and /var/solr directories.
You are now ready to run the installation script.
Run the Solr Installation Script
To run the script, you'll need to download the latest Solr 5.x distribution archive and then do the following (NOTE: replace solr-5.0.0 with the actual version number):
$ tar xzf solr-5.0.0.tgz solr-5.0.0/bin/install_solr_service.sh --strip-components=2
The previous command extracts the install_solr_service.sh script from the archive into the current directory. If installing on Red Hat, please make sure lsof is installed before running the Solr installation script (sudo yum install lsof). The installation script must be run as root:
$ sudo bash ./install_solr_service.sh solr-5.0.0.tgz
By default, the script extracts the distribution archive into /opt, configures Solr to write files into /var/solr, and runs Solr as the solr user. Consequently, the following command produces the same result as the previous command:
$ sudo bash ./install_solr_service.sh solr-5.0.0.tgz -i /opt -d /var/solr -u solr -s
solr -p 8983
You can customize the service name, installation directories, port, and owner using options passed to the
installation script. To see available options, simply do:
$ sudo bash ./install_solr_service.sh -help
Once the script completes, Solr will be installed as a service and running in the background on your server (on
port 8983). To verify, you can do:
$ sudo service solr status
We'll cover some additional configuration settings you can make to fine-tune your Solr setup in a moment. Before
moving on, let's take a closer look at the steps performed by the installation script. This gives you a better
overview and will help you understand important details about your Solr installation when reading other pages in
this guide; such as when a page refers to Solr home, you'll know exactly where that is on your system.
Solr Home Directory
The Solr home directory (not to be confused with the Solr installation directory) is where Solr reads its main configuration file solr.xml and manages index files. By default, the installation script uses /var/solr/data. Take a moment to inspect the contents of the Solr home directory on your system. At a minimum, the home directory will contain a solr.xml file. When Solr starts up, the Solr start script passes the location of the home directory using the -Dsolr.solr.home system property.
Environment overrides include file
The service installation script creates an environment-specific include file that overrides defaults used by the bin/solr script. The main advantage of using an include file is that it provides a single location where all of your environment-specific overrides are defined. Take a moment to inspect the contents of the /var/solr/solr.in.sh file, which is the default path setup by the installation script. There are many settings that you can override using this file. However, at a minimum, this script needs to define the SOLR_PID_DIR and SOLR_HOME variables, such as:
SOLR_PID_DIR=/var/solr
SOLR_HOME=/var/solr/data
The SOLR_PID_DIR variable sets the directory where the start script will write out a file containing the Solr server's process ID.
Log settings
Solr uses Apache Log4J for logging. The installation script copies /opt/solr/server/resources/log4j.properties to /var/solr/log4j.properties and customizes it for your environment. Specifically, it updates the Log4J settings to create logs in the /var/solr/logs directory. Take a moment to verify that the Solr include file is configured to send logs to the correct location by checking the following settings in /var/solr/solr.in.sh:
LOG4J_PROPS=/var/solr/log4j.properties
SOLR_LOGS_DIR=/var/solr/logs
For more information about Log4J configuration, please see: Configuring Logging
init.d script
When running a service like Solr on Linux, it's common to set up an init.d script so that system administrators can control Solr using the service tool, such as: service solr start. The installation script creates a very basic init.d script, /etc/init.d/solr, to help you get started. Take a moment to inspect the script and notice that the following variables are set up for your environment based on the parameters passed to the installation script:
SOLR_INSTALL_DIR=/opt/solr
SOLR_ENV=/var/solr/solr.in.sh
RUNAS=solr
The SOLR_INSTALL_DIR and SOLR_ENV variables should be self-explanatory. The RUNAS variable sets the owner of the Solr process, such as solr; if you don't set this value, the script will run Solr as root, which is not recommended for production. You can use the /etc/init.d/solr script to start Solr by doing the following as root:
# service solr start
The /etc/init.d/solr script also supports the stop, restart, and status commands. Please keep in mind that the init script that ships with Solr is very basic and is intended to show you how to set up Solr as a service. However, it's also common to use more advanced tools like supervisord or upstart to control Solr as a service on Linux. While showing how to integrate Solr with tools like supervisord is beyond the scope of this guide, the init.d/solr script should provide enough guidance to help you get started. Also, the installation script sets the Solr service to start automatically when the host machine initializes.
Progress Check
In the next section, we cover some additional environment settings to help you fine-tune your production setup.
However, before we move on, let's review what we've achieved thus far. Specifically, you should be able to
control Solr using /etc/init.d/solr. Please verify the following commands work with your setup:
$ sudo service solr restart
$ sudo service solr status
The status command should give some basic information about the running Solr node that looks similar to:
Solr process PID running on port 8983
{
"version":"5.0.0 - ubuntu - 2014-12-17 19:36:58",
"startTime":"2014-12-19T19:25:46.853Z",
"uptime":"0 days, 0 hours, 0 minutes, 8 seconds",
"memory":"85.4 MB (%17.4) of 490.7 MB"}
If the status command is not successful, look for error messages in /var/solr/logs/solr.log.
Fine tune your production setup
Memory and GC Settings
By default, the bin/solr script sets the maximum Java heap size to 512M (-Xmx512m), which is fine for getting started with Solr. For production, you'll want to increase the maximum heap size based on the memory requirements of your search application; values between 10 and 20 gigabytes are not uncommon for production servers. When you need to change the memory settings for your Solr server, use the SOLR_JAVA_MEM variable in the include file, such as:
SOLR_JAVA_MEM="-Xms10g -Xmx10g"
Also, the include file comes with a set of pre-configured Java Garbage Collection settings that have been shown to work well with Solr for a number of different workloads. However, these settings may not work well for your specific use of Solr. Consequently, you may need to change the GC settings, which should also be done with the GC_TUNE variable in the /var/solr/solr.in.sh include file. For more information about tuning your memory and garbage collection settings, see: JVM Settings.
Out-of-Memory Shutdown Hook
The bin/solr script registers the bin/oom_solr.sh script to be called by the JVM if an OutOfMemoryError occurs. The oom_solr.sh script will issue a kill -9 to the Solr process that experiences the OutOfMemoryError. This behavior is recommended when running in SolrCloud mode so that ZooKeeper is immediately notified that a node has experienced a non-recoverable error. Take a moment to inspect the contents of the /opt/solr/bin/oom_solr.sh script so that you are familiar with the actions the script will perform if it is invoked by the JVM.
SolrCloud
To run Solr in SolrCloud mode, you need to set the ZK_HOST variable in the include file to point to your ZooKeeper ensemble and set SOLR_MODE=solrcloud. Running the embedded ZooKeeper is not supported in
production environments. For instance, if you have a ZooKeeper ensemble hosted on the following three hosts
on the default client port 2181 (zk1, zk2, and zk3), then you would set:
SOLR_MODE=solrcloud
ZK_HOST=zk1,zk2,zk3
This will cause the script to launch Solr in “cloud” mode.
ZooKeeper chroot
If you're using a ZooKeeper instance that is shared by other systems, it's recommended to isolate the SolrCloud
znode tree using ZooKeeper's chroot support. For instance, to ensure all znodes created by SolrCloud are stored
under /solr, you can put /solr on the end of your ZK_HOST connection string, such as:
ZK_HOST=zk1,zk2,zk3/solr
If using a chroot for the first time, you need to bootstrap the Solr znode tree in ZooKeeper by using the zkcli.sh script, such as:
$ server/scripts/cloud-scripts/zkcli.sh -zkhost ZK_HOST -cmd bootstrap -solrhome
/var/solr/data
Solr Hostname
Use the SOLR_HOST variable in the include file to set the hostname of the Solr server.
SOLR_HOST=solr1.example.com
Setting the hostname of the Solr server is recommended, especially when running in SolrCloud mode, as this
determines the address of the node when it registers with ZooKeeper.
Override settings in solrconfig.xml
Solr allows configuration properties to be overridden using Java system properties passed at startup using the -Dproperty=value syntax. For instance, in solrconfig.xml, the default auto soft commit settings are set to:
<autoSoftCommit>
<maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
</autoSoftCommit>
In general, whenever you see a property in a Solr configuration file that uses the ${solr.PROPERTY:DEFAULT_VALUE} syntax, you know it can be overridden using a Java system property. For instance, to set the maxTime for soft-commits to be 10 seconds, you can start Solr with -Dsolr.autoSoftCommit.maxTime=10000, such as:
$ bin/solr start -Dsolr.autoSoftCommit.maxTime=10000
The bin/solr script simply passes options starting with -D on to the JVM during startup. For running in production, we recommend setting these properties in the SOLR_OPTS variable defined in the include file. Keeping with our soft-commit example, in /var/solr/solr.in.sh, you would do:
SOLR_OPTS="$SOLR_OPTS -Dsolr.autoSoftCommit.maxTime=10000"
Enable Remote JMX Access
If you need to attach a JMX-enabled Java profiling tool, such as JConsole or VisualVM, to a remote Solr server,
then you need to enable remote JMX access when starting the Solr server. Simply change the ENABLE_REMOTE_JMX_OPTS property in the include file to true. You'll also need to choose a port for the JMX RMI connector to bind to, such as 18983. For example, if your Solr include script sets:
ENABLE_REMOTE_JMX_OPTS=true
RMI_PORT=18983
The JMX RMI connector will allow Java profiling tools to attach to port 18983. When enabled, the following
properties are passed to the JVM when starting Solr:
-Dcom.sun.management.jmxremote \
-Dcom.sun.management.jmxremote.local.only=false \
-Dcom.sun.management.jmxremote.ssl=false \
-Dcom.sun.management.jmxremote.authenticate=false \
-Dcom.sun.management.jmxremote.port=18983 \
-Dcom.sun.management.jmxremote.rmi.port=18983
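With that in place, a JMX client can be pointed at the host and RMI port; for example (a sketch, reusing the illustrative hostname from the Solr Hostname section above):
$ jconsole solr1.example.com:18983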
We don’t recommend enabling remote JMX access in production, but it can sometimes be useful when doing
performance and user-acceptance testing prior to going into production.
Running multiple Solr nodes per host
The bin/solr script was designed to allow you to run multiple Solr nodes on the same host. At a minimum, you
will need unique Solr home directories for each node you want to run; ideally, each home should be on a
different physical disk so that multiple Solr nodes don’t have to compete with each other when accessing files on
disk. Having different Solr home directories implies that you’ll need a different include file for each node.
Moreover, if using the /etc/init.d/solr script to control Solr as a service, then you'll need a separate script
for each node. The easiest approach is to use the service installation script to add multiple services on the same
host, such as:
$ sudo bash ./install_solr_service.sh solr-5.0.0.tgz -s solr2 -p 8984
The command shown above will add a service named solr2 running on port 8984, using /var/solr2 for writable (aka "live") files; the second server will still be owned and run by the solr user and will use the Solr distribution files in /opt. After installing the solr2 service, verify it works correctly by doing:
$ sudo service solr2 restart
$ sudo service solr2 status
Configuring Logging
Temporary Logging Settings
You can control the amount of logging output in Solr by using the Admin Web interface. Select the LOGGING link. Note that this page only lets you change settings in the running system; they are not saved for the next run. (For more information about the Admin Web interface, see Using the Solr Administration User Interface.)
In addition to the logging options described below, there is a way to configure which request parameters (such as parameters sent as part of queries) are logged, using an additional request parameter called logParamsList. See the section on Common Query Parameters for more information.
The Logging screen.
This part of the Admin Web interface allows you to set the logging level for many different log categories.
Fortunately, any categories that are unset will have the logging level of their parent. This makes it possible to change many categories at once by adjusting the logging level of their parent.
When you select Level, you see the following menu:
The Log Level Menu.
Directories are shown with their current logging levels. The Log Level Menu floats over these. To set a log level
for a particular directory, select it and click the appropriate log level button.
Log level settings are as follows:
Level Result
FINEST Reports everything.
FINE Reports everything but the least important messages.
CONFIG Reports configuration errors.
INFO Reports everything but normal status.
WARNING Reports all warnings.
SEVERE Reports only the most severe warnings.
OFF Turns off logging.
UNSET Removes the previous log setting.
Multiple settings at one time are allowed.
Permanent Logging Settings
Solr uses Log4J version 1.2 for logging and is configured using server/resources/log4j.properties.
Take a moment to inspect the contents of the log4j.properties file so that you are familiar with its
structure. By default, Solr log messages will be written to server/logs/solr.log and to stdout (console),
which is fine when you're just getting started with Solr.
When you're ready to deploy Solr in production, we recommend making a few minor changes to the log settings.
Edit log4j.properties and set the solr.log property to the location where you want Solr to write log files,
such as /var/solr/logs. Note that if you installed Solr as a service using the instructions provided at Taking
Solr to Production, then see /var/solr/log4j.properties instead of the default version in server/resources.
solr.log=/var/solr/logs
Alternatively, you can use the solr.solr.home system property to set the location of the log files, such as:
solr.log=${solr.solr.home}/../logs
During initialization, Log4J will resolve this to a path based on the solr.solr.home system property. While
you're customizing the log4j.properties file, we also recommend removing the CONSOLE appender from
the rootLogger by changing the log4j.rootLogger property to:
log4j.rootLogger=INFO, file
Also, the default log rotation size threshold of 4MB is too small for production servers and should be increased to
a larger value (such as 100MB or more).
log4j.appender.file.MaxFileSize=100MB
Logging Slow Queries
For high-volume search applications, logging every query can generate a large amount of logs and, depending
on the volume, potentially impact performance. If you mine these logs for additional insights into your application,
then logging every query request may be useful. On the other hand, if you're only concerned about warnings and
error messages related to requests, then you can set the log verbosity to WARN. However, this poses a potential
problem in that you won't know if any queries are slow, as slow queries are still logged at the INFO level. Solr
provides a way to set your log verbosity threshold to WARN and be able to set a latency threshold above which a
request is considered "slow" and log that request at the WARN level to help you identify slow queries in your
application. To enable this behavior, configure the <slowQueryThresholdMillis> element in the query
section of solrconfig.xml:
<slowQueryThresholdMillis>1000</slowQueryThresholdMillis>
Any queries that take longer than the specified threshold will be logged as "slow" queries at the WARN level.
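As a rough illustration, you could then scan the log for these entries; this sketch assumes the production log location used earlier in this section, and the exact message text may differ in your version:
# Look for slow-query warnings in the Solr log (path and message text are assumptions).
grep WARN /var/solr/logs/solr.log | grep -i slow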
Enabling SSL
Both SolrCloud and single-node Solr can encrypt communications to and from clients, and in SolrCloud between
nodes, with SSL. This section describes enabling SSL with the example Jetty server using a self-signed
certificate.
For background on SSL certificates and keys, see .http://www.tldp.org/HOWTO/SSL-Certificates-HOWTO/
Basic SSL Setup
Generate a self-signed certificate and a key
Convert the certificate and key to PEM format for use with cURL
Set common SSL related system properties
Run Single Node Solr using SSL
SolrCloud
Configure ZooKeeper
Run SolrCloud with SSL
Example Client Actions
Create a SolrCloud collection using bin/solr
Retrieve SolrCloud cluster status using cURL
Index documents using post.jar
Query using cURL
Index a document using CloudSolrClient
Basic SSL Setup
Generate a self-signed certificate and a key
To generate a self-signed certificate and a single key that will be used to authenticate both the server and the
client, we'll use the JDK keytool command and create a separate keystore. This keystore will also be used as
a truststore below. It's possible to use the keystore that comes with the JDK for these purposes, and to use a
separate truststore, but those options aren't covered here.
Run the commands below in the server/etc/ directory in the binary Solr distribution. It's assumed that you
have the JDK keytool utility on your PATH, and that openssl is also on your PATH. See
https://www.openssl.org/related/binaries.html for OpenSSL binaries for Windows and Solaris.
The keytool "-ext SAN=..." option allows you to specify all the DNS names and/or IP addresses that will be
allowed during hostname verification (but see below for how to skip hostname verification between Solr nodes so
that you don't have to specify all hosts here). In addition to localhost and 127.0.0.1, this example includes
the LAN IP address 192.168.1.3 for the machine the Solr nodes will be running on:
keytool -genkeypair -alias solr-ssl -keyalg RSA -keysize 2048 -keypass secret
-storepass secret -validity 9999 -keystore solr-ssl.keystore.jks -ext
SAN=DNS:localhost,IP:192.168.1.3,IP:127.0.0.1 -dname "CN=localhost,
OU=Organizational Unit, O=Organization, L=Location, ST=State, C=Country"
The above command will create a keystore file named solr-ssl.keystore.jks in the current directory.
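If you want to double-check what was generated, a quick sketch using the same JDK keytool utility (the password matches the -storepass used above):
# List the entries in the newly created keystore to confirm the solr-ssl key pair exists.
keytool -list -keystore solr-ssl.keystore.jks -storepass secret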
Convert the certificate and key to PEM format for use with cURL
cURL isn't capable of using JKS formatted keystores, so the JKS keystore needs to be converted to PEM format,
which cURL understands.
First convert the JKS keystore into PKCS12 format using keytool:
keytool -importkeystore -srckeystore solr-ssl.keystore.jks -destkeystore
solr-ssl.keystore.p12 -srcstoretype jks -deststoretype pkcs12
The keytool application will prompt you to create a destination keystore password and for the source keystore
password, which was set when creating the keystore ("secret" in the example shown above).
Next convert the PKCS12 format keystore, including both the certificate and the key, into PEM format using the
openssl command:
openssl pkcs12 -in solr-ssl.keystore.p12 -out solr-ssl.pem
If you want to use cURL on OS X Yosemite (10.10), you'll need to create a certificate-only version of the PEM
format, as follows:
openssl pkcs12 -nokeys -in solr-ssl.keystore.p12 -out solr-ssl.cacert.pem
Set common SSL related system properties
The Solr start script is already set up to pass SSL-related Java system properties to the JVM. To activate the SSL
settings, uncomment and update the SOLR_SSL_OPTS variable in bin/solr.in.sh (or bin\solr.in.cmd on
Windows). Note, if you set up Solr as a service on Linux using the steps outlined in Taking Solr to Production,
then make these changes in /var/solr/solr.in.sh instead.
bin/solr.in.sh example SOLR_SSL_OPTS configuration
SOLR_SSL_OPTS="-Djavax.net.ssl.keyStore=etc/solr-ssl.keystore.jks \
-Djavax.net.ssl.keyStorePassword=secret \
-Djavax.net.ssl.trustStore=etc/solr-ssl.keystore.jks \
-Djavax.net.ssl.trustStorePassword=secret"
When you start Solr, the bin/solr script includes the settings in bin/solr.in.sh and will pass these
SSL-related system properties to the JVM.
Similarly, when you start Solr on Windows, the bin\solr.cmd script includes the settings in bin\solr.in.cmd
- uncomment and update the SOLR_SSL_OPTS variable to pass these SSL-related system properties to the
JVM:
bin\solr.in.cmd example SOLR_SSL_OPTS configuration
set SOLR_SSL_OPTS=-Djavax.net.ssl.keyStore=etc\solr-ssl.keystore.jks ^
-Djavax.net.ssl.keyStorePassword=secret ^
-Djavax.net.ssl.trustStore=etc\solr-ssl.keystore.jks ^
-Djavax.net.ssl.trustStorePassword=secret
Run Single Node Solr using SSL
Start Solr using the command shown below; by default clients will not be required to authenticate:
bin/solr -p 8984
bin\solr.cmd -p 8984
Alternatively, to require clients to authenticate, you can set the jetty.ssl.clientAuth system property to
true (default is false):
bin/solr -p 8984 -Djetty.ssl.clientAuth=true
bin\solr.cmd -p 8984 -Djetty.ssl.clientAuth=true
Notice that Java system properties set using the -Dproperty=value syntax are passed through to the JVM by
the bin/solr (or bin\solr.cmd) script.
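As a quick sanity check that the server is now answering over HTTPS, a sketch using cURL and the PEM file created earlier (add -E solr-ssl.pem:secret if you enabled client authentication):
# Fetch basic system information over SSL to confirm HTTPS is working.
curl --cacert solr-ssl.pem "https://localhost:8984/solr/admin/info/system?wt=json"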
SolrCloud
This section describes how to run a two-node SolrCloud cluster with no initial collections and a single-node
external ZooKeeper. The commands below assume you have already created the keystore described above.
Configure ZooKeeper
Before you start any SolrCloud nodes, you must configure your solr cluster properties in ZooKeeper, so that
Solr nodes know to communicate via SSL.
This section assumes you have created and started a single-node external ZooKeeper on port 2181 on localhost
- see Setting Up an External ZooKeeper Ensemble.
ZooKeeper does not support encrypted communication with clients like Solr. There are several related
JIRA tickets where SSL support is being planned/worked on: ZOOKEEPER-235; ZOOKEEPER-236;
ZOOKEEPER-733; and ZOOKEEPER-1000.
The cluster-wide urlScheme property needs to be set to https before any Solr node starts up. The example
below uses the zkcli.sh client that comes with the binary Solr distribution to do this:
server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:2181 -cmd put
/clusterprops.json '{"urlScheme":"https"}'
server\scripts\cloud-scripts\zkcli.bat -zkhost localhost:2181 -cmd put
/clusterprops.json "{""urlScheme"":""https""}"
If you have set up your ZooKeeper cluster to use a chroot for Solr, make sure you place the clusterprops.json
file there instead: ... -cmd put /my_chroot/clusterprops.json '{"urlScheme":"https"}' (or
on Windows: ... -cmd put /my_chroot/clusterprops.json "{""urlScheme"":""https""}").
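To confirm that the property was stored as expected, a small sketch using the same zkcli.sh client to read the file back (adjust the path if you use a chroot):
# Read /clusterprops.json back from ZooKeeper and print it.
server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:2181 -cmd get /clusterprops.json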
Run SolrCloud with SSL
Create Solr home directories for two nodes
Create two copies of the server/solr/ directory which will serve as the Solr home directories for each of your
two SolrCloud nodes:
mkdir cloud
cp -r server/solr cloud/node1
cp -r server/solr cloud/node2
mkdir cloud
xcopy /E server\solr cloud\node1\
xcopy /E server\solr cloud\node2\
Start the first Solr node
Next, start the first Solr node on port 8984; be sure to stop the standalone server if you started it when working
through the previous section on this page:
bin/solr -cloud -s cloud/node1 -z localhost:2181 -p 8984
bin\solr.cmd -cloud -s cloud\node1 -z localhost:2181 -p 8984
Notice the use of the -s option to set the location of the Solr home directory for node1. Moreover, the start script
will set the -Djetty.ssl.port property to the value of the -p option (8984 in this case). You can override this
behavior by setting the SOLR_SSL_PORT variable in the include file.
If you created your SSL key without all DNS names/IP addresses on which Solr nodes will run, you can tell Solr
to skip hostname verification for inter-Solr-node communications by setting the solr.ssl.checkPeerName
system property to false:
bin/solr -cloud -s cloud/node1 -z localhost:2181 -p 8984
-Dsolr.ssl.checkPeerName=false
bin\solr.cmd -cloud -s cloud\node1 -z localhost:2181 -p 8984
-Dsolr.ssl.checkPeerName=false
Start the second Solr node
Finally, start the second Solr node on port 7574 - again, to skip hostname verification, add
-Dsolr.ssl.checkPeerName=false (not shown here):
bin/solr -cloud -s cloud/node2 -z localhost:2181 -p 7574
bin\solr.cmd -cloud -s cloud\node2 -z localhost:2181 -p 7574
Example Client Actions
Create a SolrCloud collection using bin/solr
Create a 2-shard, replicaFactor=1 collection named mycollection using the default configset
(data_driven_schema_configs):
bin/solr create -c mycollection -shards 2
bin\solr.cmd create -c mycollection -shards 2
The create action will pass the SOLR_SSL_OPTS set in your include file to the SolrJ code used to create the
collection.
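If you prefer to spell out the replication factor and configset rather than rely on the defaults, a sketch of the equivalent command (the values shown here are simply the defaults used above):
# Explicitly request 2 shards, 1 replica per shard, and the data_driven_schema_configs configset.
bin/solr create -c mycollection -shards 2 -replicationFactor 1 -d data_driven_schema_configs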
Retrieve SolrCloud cluster status using cURL
To get the resulting cluster status (again, if you have not enabled client authentication, remove the
-E solr-ssl.pem:secret option):
curl -E solr-ssl.pem:secret --cacert solr-ssl.pem
"https://localhost:8984/solr/admin/collections?action=CLUSTERSTATUS&wt=json&indent=on"
cURL on OS X Mavericks (10.9) has degraded SSL support. For more information and workarounds to
allow 1-way SSL, see http://curl.haxx.se/mail/archive-2013-10/0036.html. cURL on OS X Yosemite
(10.10) is improved - 2-way SSL is possible - see http://curl.haxx.se/mail/archive-2014-10/0053.html.
The cURL commands in the following sections will not work with the system curl on OS X Yosemite
(10.10). Instead, the certificate supplied with the -E param must be in PKCS12 format, and the file
supplied with the --cacert param must contain only the CA certificate, and no key (see above for
instructions on creating this file):
curl -E solr-ssl.keystore.p12:secret --cacert solr-ssl.cacert.pem ...
If your operating system does not include cURL, you can download binaries here: http://curl.haxx.se/download.html
You should get a response that looks like this:
{
"responseHeader":{
"status":0,
"QTime":2041},
"cluster":{
"collections":{
"mycollection":{
"shards":{
"shard1":{
"range":"80000000-ffffffff",
"state":"active",
"replicas":{"core_node1":{
"state":"active",
"base_url":"https://127.0.0.1:8984/solr",
"core":"mycollection_shard1_replica1",
"node_name":"127.0.0.1:8984_solr",
"leader":"true"}}},
"shard2":{
"range":"0-7fffffff",
"state":"active",
"replicas":{"core_node2":{
"state":"active",
"base_url":"https://127.0.0.1:7574/solr",
"core":"mycollection_shard2_replica1",
"node_name":"127.0.0.1:7574_solr",
"leader":"true"}}}},
"maxShardsPerNode":"1",
"router":{"name":"compositeId"},
"replicationFactor":"1"}},
"properties":{"urlScheme":"https"}}}
Index documents using post.jar
Use post.jar to index some example documents to the SolrCloud collection created above:
cd example/exampledocs
java -Djavax.net.ssl.keyStorePassword=secret
-Djavax.net.ssl.keyStore=../../server/etc/solr-ssl.keystore.jks
-Djavax.net.ssl.trustStore=../../server/etc/solr-ssl.keystore.jks
-Djavax.net.ssl.trustStorePassword=secret
-Durl=https://localhost:8984/solr/mycollection/update -jar post.jar *.xml
Query using cURL
Use cURL to query the SolrCloud collection created above, from a directory containing the PEM formatted
certificate and key created above (e.g. example/etc/) - if you have not enabled client authentication (system
property -Djetty.ssl.clientAuth=true), then you can remove the -E solr-ssl.pem:secret option:
curl -E solr-ssl.pem:secret --cacert solr-ssl.pem
"https://localhost:8984/solr/mycollection/select?q=*:*&wt=json&indent=on"
Index a document using CloudSolrClient
From a Java client using SolrJ, index a document. In the code below, the javax.net.ssl.* system properties
are set programmatically, but you could instead specify them on the java command line, as in the post.jar
example above:
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

// Point the JVM at the keystore/truststore created earlier so SolrJ can talk to Solr over SSL.
System.setProperty("javax.net.ssl.keyStore", "/path/to/solr-ssl.keystore.jks");
System.setProperty("javax.net.ssl.keyStorePassword", "secret");
System.setProperty("javax.net.ssl.trustStore", "/path/to/solr-ssl.keystore.jks");
System.setProperty("javax.net.ssl.trustStorePassword", "secret");

// Connect to the cluster via ZooKeeper and target the collection created above.
String zkHost = "127.0.0.1:2181";
CloudSolrClient server = new CloudSolrClient(zkHost);
server.setDefaultCollection("mycollection");

// Index a single document and commit so it becomes searchable.
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "1234");
doc.addField("name", "A lovely summer holiday");
server.add(doc);
server.commit();
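To verify that the document made it into the collection, a quick sketch using cURL (drop -E solr-ssl.pem:secret if client authentication is not enabled):
# Query for the document added by the SolrJ example above.
curl -E solr-ssl.pem:secret --cacert solr-ssl.pem "https://localhost:8984/solr/mycollection/select?q=id:1234&wt=json"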
Backing Up
If you are worried about data loss, and of course you should be, you need a way to back up your Solr indexes so
that you can recover quickly in case of catastrophic failure.
Making Backups with the Solr Replication Handler
The easiest way to make back-ups in Solr is to take advantage of the Replication Handler, which is described in
detail in Index Replication. The Replication Handler's primary purpose is to replicate an index on slave servers
for load-balancing, but the Replication Handler can be used to make a back-up copy of a server's index, even if
no slave servers are in operation.
Once you have configured the Replication Handler in solrconfig.xml, you can trigger a back-up with an
HTTP command like this:
http://master_host/solr/collection_name/replication?command=backup
For details on configuring the Replication Handler, see Legacy Scaling and Distribution.
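For illustration, a sketch of triggering a named backup with cURL; this assumes a collection called mycollection on localhost:8983 and a directory the Solr process can write to (the name and location parameters are optional):
# Ask the Replication Handler to write a backup named "nightly" to /var/backups/solr.
curl "http://localhost:8983/solr/mycollection/replication?command=backup&name=nightly&location=/var/backups/solr"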
Using JMX with Solr
Java Management Extensions (JMX) is a technology that makes it possible for complex systems to be controlled
by tools without the systems and tools having any previous knowledge of each other. In essence, it is a standard
interface by which complex systems can be viewed and manipulated.
Solr, like any other good citizen of the Java universe, can be controlled via a JMX interface. You can enable JMX
support by adding lines to solrconfig.xml. You can use a JMX client, like jconsole, to connect with Solr.
Check out the http://wiki.apache.org/solr/SolrJmx Wiki page for more information. You may also find the following
overview of JMX to be useful: http://docs.oracle.com/javase/7/docs/technotes/guides/management/agent.html.
Configuring JMX
JMX configuration is provided in solrconfig.xml. Please see the JMX Technology Home Page for more
details.
A rootName attribute can be used when configuring <jmx /> in solrconfig.xml. If this attribute is set, Solr
uses it as the root name for all the MBeans that Solr exposes via JMX. The default name is "solr" followed by the
core name.
Configuring an Existing MBeanServer
The command:
<jmx />
enables JMX support in Solr if and only if an existing MBeanServer is found. Use this if you want to configure
JMX with JVM parameters. Remove this to disable exposing Solr configuration and statistics to JMX. If this is
specified, Solr will try to list all available MBeanServers and use the first one to register MBeans.
Configuring an Existing MBeanServer with agentId
The command:
<jmx agentId="myMBeanServer" />
enables JMX support in Solr if and only if an existing MBeanServer is found matching the given agentId. If
multiple servers are found, the first one is used. If none is found, an exception is raised and depending on the
configuration, Solr may refuse to start.
Configuring a New MBeanServer
The command:
<jmx serviceUrl="service:jmx:rmi:///jndi/rmi://localhost:9999/solrjmx" />
creates a new MBeanServer exposed for remote monitoring at the specific service URL. If the
JMXConnectorServer can't be started (probably because the serviceUrl is bad), an exception is thrown.
Example
Solr's sample_techproducts_configs config set uses the simple <jmx /> configuration option. If you start
the example with the necessary JVM system properties to launch an internal MBeanServer, Solr will register with
it and you can connect using a tool like jconsole:
1. Launch the techproducts example with JMX enabled:
bin/solr -e techproducts -a '-Dcom.sun.management.jmxremote'
2. Start jconsole (provided with the Sun JDK in the bin directory).
3. Connect to the "start.jar" shown in the list of local processes.
4. Switch to the "MBeans" tab. You should be able to see "solr/techproducts" listed there, at which
point you can drill down and see details of every Solr plugin.
Enabling/disabling JMX and securing access to MBeanServers is left up to the user by specifying
appropriate JVM parameters and configuration. Please explore the JMX Technology Home Page for
more details.
Configuring a Remote Connection to Solr JMX
If you want to connect to Solr remotely, you need to pass in some extra parameters, documented here:
http://docs.oracle.com/javase/7/docs/technotes/guides/management/agent.html
Making JMX connections into machines running behind NATs (e.g. Amazon's EC2 service) is not a
simple task. The java.rmi.server.hostname system property may help, but running jconsole on
the server itself and using a remote desktop is often the simplest solution. See
http://web.archive.org/web/20130525022506/http://jmsbrdy.com/monitoring-java-applications-running-on-ec2-i.
Running Solr on HDFS
Solr has support for writing and reading its index and transaction log files to the HDFS distributed filesystem.
This does not use Hadoop Map-Reduce to process Solr data, rather it only uses the HDFS filesystem for index
and transaction log file storage.
Basic Configuration
To use HDFS rather than a local filesystem, you must be using Hadoop 2.x and configure prosolrconfig.xml
perly.
You need to use an HdfsDirectoryFactory and a data dir of the form hdfs://host:port/path
You need to specify an UpdateLog location of the form hdfs://host:port/path
You should specify a lock factory type of 'hdfs' or none.
With the default configuration files, you can start Solr on HDFS with the following command:
java -Dsolr.directoryFactory=HdfsDirectoryFactory
-Dsolr.lock.type=hdfs
-Dsolr.data.dir=hdfs://host:port/path
-Dsolr.updatelog=hdfs://host:port/path -jar start.jar
SolrCloud Configuration
In SolrCloud mode, it's best to leave the data and update log directories as the defaults Solr comes with and
simply specify the solr.hdfs.home. All dynamically created collections will create the appropriate directories
automatically under the solr.hdfs.home root directory.
Set solr.hdfs.home in the form hdfs://host:port/path
You should specify a lock factory type of 'hdfs' or none.
With the default configuration files, you can start SolrCloud on HDFS with the following command:
java -Dsolr.directoryFactory=HdfsDirectoryFactory
-Dsolr.lock.type=hdfs
-Dsolr.hdfs.home=hdfs://host:port/path
The Block Cache
For performance, the HdfsDirectoryFactory uses a Directory that will cache HDFS blocks. This caching
mechanism is meant to replace the standard file system cache that Solr utilizes so much. By default, this cache
is allocated off heap. This cache will often need to be quite large and you may need to raise the off heap memory
limit for the specific JVM you are running Solr in. For the Oracle/OpenJDK JVMs, the following is an example
command line parameter that you can use to raise the limit when starting Solr:
-XX:MaxDirectMemorySize=20g
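One way to make this setting permanent, assuming you manage Solr with the include file described in Taking Solr to Production, is to append it to SOLR_OPTS (a sketch; adjust the size to your hardware):
# In /var/solr/solr.in.sh: raise the off-heap limit used by the HDFS block cache.
SOLR_OPTS="$SOLR_OPTS -XX:MaxDirectMemorySize=20g"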
Settings
The HdfsDirectoryFactory has a number of settings.
Solr HDFS Settings
Param Example Value Default Description
solr.hdfs.home hdfs://host:port/path/solr N/A A root location in HDFS for Solr to write collection
data to. Rather than specifying an HDFS location
for the data directory or update log directory, use
this to specify one root location and have
everything automatically created within this HDFS
location.
Block Cache Settings
Param Default Description
solr.hdfs.blockcache.enabled true Enable the blockcache
solr.hdfs.blockcache.read.enabled true Enable the read cache
solr.hdfs.blockcache.write.enabled true Enable the write cache
solr.hdfs.blockcache.direct.memory.allocation true Enable direct memory allocation. If
this is false, heap is used
solr.hdfs.blockcache.slab.count 1 Number of memory slabs to
allocate. Each slab is 128 MB in
size.
solr.hdfs.blockcache.global false Enable/Disable using one global
cache for all SolrCores. The settings
used will be from the first
HdfsDirectoryFactory created.
NRTCachingDirectory Settings
Param Default Description
solr.hdfs.nrtcachingdirectory.enable true Enable the use of
NRTCachingDirectory
solr.hdfs.nrtcachingdirectory.maxmergesizemb 16 NRTCachingDirectory max segment
size for merges
solr.hdfs.nrtcachingdirectory.maxcachedmb 192 NRTCachingDirectory max cache
size
HDFS Client Configuration Settings
Param Default Description
solr.hdfs.confdir N/A Pass the location of HDFS client configuration files - needed for HDFS HA
for example.
Kerberos Authentication Settings
Hadoop can be configured to use the Kerberos protocol to verify user identity when trying to access core
services like HDFS. If your HDFS directories are protected using Kerberos, then you need to configure Solr's
HdfsDirectoryFactory to authenticate using Kerberos in order to read and write to HDFS. To enable Kerberos
authentication from Solr, you need to set the following parameters:
Param Default Description
solr.hdfs.security.kerberos.enabled false Set to true to enable Kerberos authentication
solr.hdfs.security.kerberos.keytabfile N/A A keytab file contains pairs of
Kerberos principals and encrypted keys which
allows for password-less authentication when
Solr attempts to authenticate with secure
Hadoop.
This file will need to be present on all Solr
servers at the same path provided in this
parameter.
solr.hdfs.security.kerberos.principal N/A The Kerberos principal that Solr should use to
authenticate to secure Hadoop; the format of
a typical Kerberos V5 principal is: primary/
instance@realm
Example
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
<str name="solr.hdfs.home">hdfs://host:port/solr</str>
<bool name="solr.hdfs.blockcache.enabled">true</bool>
<int name="solr.hdfs.blockcache.slab.count">1</int>
<bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
<int name="solr.hdfs.blockcache.blocksperbank">16384</int>
<bool name="solr.hdfs.blockcache.read.enabled">true</bool>
<bool name="solr.hdfs.blockcache.write.enabled">true</bool>
<bool name="solr.hdfs.nrtcachingdirectory.enable">true</bool>
<int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">16</int>
<int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">192</int>
</directoryFactory>
If using Kerberos, you will need to add the three Kerberos related properties to the <directoryFactory>
element in solrconfig.xml, such as:
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
...
<bool name="solr.hdfs.security.kerberos.enabled">true</bool>
<str name="solr.hdfs.security.kerberos.keytabfile">/etc/krb5.keytab</str>
<str name="solr.hdfs.security.kerberos.principal">solr/admin@KERBEROS.COM</str>
</directoryFactory>
Limitations
You must use an 'append-only' Lucene index codec because HDFS is an append-only filesystem. The current
default codec used by Solr is 'append-only' and is supported with HDFS.
Automatically Add Replicas in SolrCloud
One benefit to running Solr in HDFS is the ability to automatically add new replicas when the Overseer notices
that a shard has gone down. Because the "gone" index shards are stored in HDFS, a new core will be
created and the new core will point to the existing indexes in HDFS.
Collections created using autoAddReplicas=true on a shared file system have automatic addition of replicas
enabled. The following settings can be used to override the defaults in the <solrcloud> section of solr.xml.
Param Default Description
autoReplicaFailoverWorkLoopDelay 10000 The time (in ms) between clusterstate inspections by the
Overseer to detect and possibly act on creation of a
replacement replica.
autoReplicaFailoverWaitAfterExpiration 30000 The minimum time (in ms) to wait for initiating
replacement of a replica after first noticing it not being live.
This is important to prevent false positives while stopping
or starting the cluster.
autoReplicaFailoverBadNodeExpiration 60000 The delay (in ms) after which a replica marked as down
would be unmarked.
Temporarily disable autoAddReplicas for the entire cluster
When doing offline maintenance on the cluster and for various other use cases where an admin would like to
temporarily disable auto addition of replicas, the following APIs will disable and re-enable autoAddReplicas for all
collections in the cluster:
Disable auto addition of replicas cluster-wide by setting the autoAddReplicas cluster property to false:
http://localhost:8983/solr/admin/collections?action=CLUSTERPROP&name=autoAddReplicas
&val=false
Re-enable auto addition of replicas (for those collections created with autoAddReplicas=true) by unsetting the
autoAddReplicas cluster property (when no val param is provided, the cluster property is unset):
http://localhost:8983/solr/admin/collections?action=CLUSTERPROP&name=autoAddReplicas
SolrCloud
Apache Solr includes the ability to set up a cluster of Solr servers that combines fault tolerance and high
availability. Called SolrCloud, these capabilities provide distributed indexing and search capabilities, supporting
the following features:
Central configuration for the entire cluster
Automatic load balancing and fail-over for queries
ZooKeeper integration for cluster coordination and configuration.
SolrCloud is flexible distributed search and indexing, without a master node to allocate nodes, shards and
replicas. Instead, Solr uses ZooKeeper to manage these locations, depending on configuration files and
schemas. Documents can be sent to any server and ZooKeeper will figure it out.
In this section, we'll cover everything you need to know about using Solr in SolrCloud mode. We've split up the
details into the following topics:
Getting Started with SolrCloud
How SolrCloud Works
Nodes, Cores, Clusters and Leaders
Shards and Indexing Data in SolrCloud
Distributed Requests
Read and Write Side Fault Tolerance
NRT, Replication, and Disaster Recovery with SolrCloud
SolrCloud Configuration and Parameters
Setting Up an External ZooKeeper Ensemble
Using ZooKeeper to Manage Configuration Files
ZooKeeper Access Control
Collections API
Parameter Reference
Command Line Utilities
SolrCloud with Legacy Configuration Files
You can also find more information on the Solr wiki page on SolrCloud.
Getting Started with SolrCloud
SolrCloud is designed to provide a highly available, fault tolerant environment for distributing your indexed
content and query requests across multiple servers. It's a system in which data is organized into multiple pieces,
or shards, that can be hosted on multiple machines, with replicas providing redundancy for both scalability and
fault tolerance, and a ZooKeeper server that helps manage the overall structure so that both indexing and search
requests can be routed properly.
This section explains SolrCloud and its inner workings in detail, but before you dive in, it's best to have an idea of
what it is you're trying to accomplish. This page provides a simple tutorial to start Solr in SolrCloud mode, so you
can begin to get a sense for how shards interact with each other during indexing and when serving queries. To
that end, we'll use simple examples of configuring SolrCloud on a single machine, which is obviously not a real
production environment; a production environment would typically include several servers or virtual machines. In
a real production environment, you'll also use the real machine names instead of "localhost" which we've used here.
In this section you will learn how to start a SolrCloud cluster using startup scripts and a specific configset.
This tutorial assumes that you're already familiar with the basics of using Solr. If you need a refresher,
please see the Getting Started section to get a grounding in Solr concepts. If you load documents as part
of that exercise, you should start over with a fresh Solr installation for these SolrCloud tutorials.
SolrCloud Example
Interactive Startup
The bin/solr script makes it easy to get started with SolrCloud as it walks you through the process of
launching Solr nodes in cloud mode and adding a collection. To get started, simply do:
$ bin/solr -e cloud
This starts an interactive session to walk you through the steps of setting up a simple SolrCloud cluster with
embedded ZooKeeper. The script starts by asking you how many Solr nodes you want to run in your local
cluster, with the default being 2.
Welcome to the SolrCloud example!
This interactive session will help you launch a SolrCloud cluster on your local
workstation.
To begin, how many Solr nodes would you like to run in your local cluster? (specify
1-4 nodes) [2]
The script supports starting up to 4 nodes, but we recommend using the default of 2 when starting out. These
nodes will each exist on a single machine, but will use different ports to mimic operation on different servers.
Next, the script will prompt you for the port to bind each of the Solr nodes to, such as:
Please enter the port for node1 [8983]
Choose any available port for each node; the default for the first node is 8983 and 7574 for the second
node. The script will start each node in order and shows you the command it uses to start the server, such as:
solr start -cloud -s example/cloud/node1/solr -p 8983
The first node will also start an embedded ZooKeeper server bound to port 9983. The Solr home for the first
node is in example/cloud/node1/solr as indicated by the -s option.
After starting up all nodes in the cluster, the script prompts you for the name of the collection to create:
Please provide a name for your new collection: [gettingstarted]
The suggested default is "gettingstarted" but you might want to choose a name more appropriate for your
specific search application.
Next, the script prompts you for the number of shards to distribute the collection across. Sharding is covered in
more detail later on, so if you're unsure, we suggest using the default of 2 so that you can see how a collection is
distributed across multiple nodes in a SolrCloud cluster.
Next, the script will prompt you for the number of replicas to create for each shard. Replication is covered in
more detail later in the guide, so if you're unsure, then use the default of 2 so that you can see how replication is
handled in SolrCloud.
Lastly, the script will prompt you for the name of a configuration directory for your collection. You can choose
basic_configs, data_driven_schema_configs, or sample_techproducts_configs. The configuration directories
are pulled from server/solr/configsets/ so you can review them beforehand if you wish. The
data_driven_schema_configs configuration (the default) is useful when you're still designing a schema for your
documents and need some flexibility as you experiment with Solr.
At this point, you should have a new collection created in your local SolrCloud cluster. To verify this, you can run
the status command:
$ bin/solr status
If you encounter any errors during this process, check the Solr log files in example/cloud/node1/logs and
example/cloud/node2/logs.
You can see how your collection is deployed across the cluster by visiting the cloud panel in the Solr Admin UI:
http://localhost:8983/solr/#/~cloud. Solr also provides a way to perform basic diagnostics for a collection using the
healthcheck command:
$ bin/solr healthcheck -c gettingstarted
The healthcheck command gathers basic information about each replica in a collection, such as number of docs,
current status (active, down, etc), and address (where the replica lives in the cluster).
Documents can now be added to SolrCloud using the Post Tool.
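For example, a sketch of posting the sample XML documents that ship with Solr into the new collection (assuming the default collection name chosen above):
# Index the example documents into the gettingstarted collection with the Post Tool.
bin/post -c gettingstarted example/exampledocs/*.xml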
To stop Solr in SolrCloud mode, you would use the bin/solr script and issue the stop command, as in:
$ bin/solr stop -all
Starting with -noprompt
You can also get SolrCloud started with all the defaults instead of the interactive session using the following
command:
$ bin/solr -e cloud -noprompt
Restarting Nodes
You can restart your SolrCloud nodes using the bin/solr script. For instance, to restart node1 running on port
8983 (with an embedded ZooKeeper server), you would do:
$ bin/solr restart -c -p 8983 -s example/cloud/node1/solr
To restart node2 running on port 7574, you can do:
$ bin/solr restart -c -p 7574 -z localhost:9983 -s example/cloud/node2/solr
Notice that you need to specify the ZooKeeper address (-z localhost:9983) when starting node2 so that it can join
the cluster with node1.
Adding a node to a cluster
Adding a node to an existing cluster is a bit advanced and involves a little more understanding of Solr. Once you
startup a SolrCloud cluster using the startup scripts, you can add a new node to it by:
$ mkdir <solr.home for new solr node>
$ cp <existing solr.xml path> <new solr.home>
$ bin/solr start -cloud -s solr.home/solr -p <port num> -z <zk hosts string>
Notice that the above requires you to create a Solr home directory and copy solr.xml to it.
Example (with directory structure) that adds a node to an example started with "bin/solr -e cloud":
$ mkdir -p example/cloud/node3/solr
$ cp server/solr/solr.xml example/cloud/node3/solr
$ bin/solr start -cloud -s example/cloud/node3/solr -p 8987 -z localhost:9983
The previous command will start another Solr node on port 8987 with Solr home set to example/cloud/node3/solr.
The new node will write its log files to example/cloud/node3/logs.
Once you're comfortable with how the SolrCloud example works, we recommend using the process described in
Taking Solr to Production for setting up SolrCloud nodes in production.
How SolrCloud Works
In this section, we'll discuss generally how SolrCloud works, covering these topics:
Nodes, Cores, Clusters and Leaders
Shards and Indexing Data in SolrCloud
Distributed Requests
Read and Write Side Fault Tolerance
NRT, Replication, and Disaster Recovery with SolrCloud
If you are already familiar with SolrCloud concepts and functionality, you can skip to the section covering
SolrCloud Configuration and Parameters.
Basic SolrCloud Concepts
On a single node, Solr has a core that is essentially a single index. If you want multiple indexes, you create
multiple cores. With SolrCloud, a single index can span multiple Solr instances. This means that a single index
can be made up of multiple cores on different machines.
The cores that make up one logical index are called a collection. A collection is essentially a single index that
can span many cores, both for index scaling as well as redundancy. If, for instance, you wanted to move your
two-core Solr setup to SolrCloud, you would have 2 collections, each made up of multiple individual cores.
In SolrCloud you can have multiple collections. Collections can be divided into shards. Each shard can exist in
multiple copies; these copies of the same shard are called replicas. One of the replicas within a shard is the
leader, designated by a leader-election process. Each replica is a physical index, so one replica corresponds to one
core.
It is important to understand the distinction between a core and a collection. In classic single node Solr, a core is
basically equivalent to a collection in that it presents one logical index. In SolrCloud, the cores on multiple nodes
form a collection. This is still just one logical index, but multiple cores host different shards of the full collection.
So a core encapsulates a single physical index on an instance. A collection is a combination of all of the cores
that together provide a logical index that is distributed across many nodes.
Nodes, Cores, Clusters and Leaders
Nodes and Cores
In SolrCloud, a node is a Java Virtual Machine instance running Solr, commonly called a server. Each Solr core
can also be considered a node. Any node can contain both an instance of Solr and various kinds of data.
A Solr core is basically an index of the text and fields found in documents. A single Solr instance can contain
multiple "cores", which are separate from each other based on local criteria. It might be that they are going to
provide different search interfaces to users (customers in the US and customers in Canada, for example), or they
have security concerns (some users cannot have access to some documents), or the documents are really
different and just won't mix well in the same index (a shoe database and a dvd database).
When you start a new core in SolrCloud mode, it registers itself with ZooKeeper. This involves creating an
Ephemeral node that will go away if the Solr instance goes down, as well as registering information about the
core and how to contact it (such as the base Solr URL, core name, etc). Smart clients and nodes in the cluster
can use this information to determine who they need to talk to in order to fulfill a request.
New Solr cores may also be created and associated with a collection via CoreAdmin. Additional cloud-related
parameters are discussed in the Parameter Reference page. Terms used for the CREATE action are:
collection: the name of the collection to which this core belongs. Default is the name of the core.
shard: the shard id this core represents. (Optional: normally you want to be auto assigned a shard id.)
collection.<param>=<value>: causes a property of <param>=<value> to be set if a new collection is
being created. For example, use collection.configName=<configname> to point to the config for a
new collection.
For example:
curl 'http://localhost:8983/solr/admin/cores?
action=CREATE&name=mycore&collection=my_collection&shard=shard2'
Clusters
A cluster is a set of Solr nodes managed by ZooKeeper as a single unit. When you have a cluster, you can always
make requests to the cluster and if the request is acknowledged, you can be sure that it will be managed as a
unit and be durable, i.e., you won't lose data. Updates can be seen right after they are made and the cluster can
be expanded or contracted.
Creating a Cluster
A cluster is created as soon as you have more than one Solr instance registered with ZooKeeper. The section
Getting Started with SolrCloud reviews how to set up a simple cluster.
Resizing a Cluster
Clusters contain a settable number of shards. You set the number of shards for a new cluster by passing a
system property, numShards, when you start up Solr. The numShards parameter must be passed on the first
startup of any Solr node, and is used to auto-assign which shard each instance should be part of. Once you have
started up more Solr nodes than numShards, the nodes will create replicas for each shard, distributing them
evenly across the nodes, as long as they all belong to the same collection.
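For illustration, a sketch of passing numShards on the first startup; this assumes an external ZooKeeper at localhost:2181 and relies on bin/solr forwarding -D options to the JVM as described earlier in this guide:
# Start the first cloud node, requesting that the index be split into 2 shards.
bin/solr start -cloud -s server/solr -z localhost:2181 -p 8983 -DnumShards=2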
To add more cores to your collection, simply start the new core. You can do this at any time and the new core
will sync its data with the current replicas in the shard before becoming active.
You can also avoid numShards and manually assign a core a shard ID if you choose.
The number of shards determines how the data in your index is broken up, so you cannot change the number of
shards of the index after initially setting up the cluster.
However, you do have the option of breaking your index into multiple shards to start with, even if you are only
using a single machine. You can then expand to multiple machines later. To do that, follow these steps:
1. Set up your collection by hosting multiple cores on a single physical machine (or group of machines).
Each of these shards will be a leader for that shard.
2. When you're ready, you can migrate shards onto new machines by starting up a new replica for a given
shard on each new machine.
3. Remove the shard from the original machine. ZooKeeper will promote the replica to the leader for that
shard.
Leaders and Replicas
The concept of a leader is similar to that of master when thinking of traditional Solr replication. The leader is
responsible for making sure the replicas are up to date with the same information stored in the leader.
However, with SolrCloud, you don't simply have one master and one or more "slaves"; instead you likely have
distributed your search and index traffic to multiple machines. If you have bootstrapped Solr with numShards=2,
for example, your indexes are split across both shards. In this case, both shards are considered leaders. If you
start more Solr nodes after the initial two, these will be automatically assigned as replicas for the leaders.
Replicas are assigned to shards in the order they are started the first time they join the cluster. This is done in a
round-robin manner, unless the new node is manually assigned to a shard with the shardId parameter during
startup. This parameter is used as a system property, as in -DshardId=1, the value of which is the ID number
of the shard the new node should be attached to.
On subsequent restarts, each node joins the same shard that it was assigned to the first time the node was
started (whether that assignment happened manually or automatically). A node that was previously a replica,
however, may become the leader if the previously assigned leader is not available.
Consider this example:
Node A is started with the bootstrap parameters, pointing to a stand-alone ZooKeeper, with the numShards
parameter set to 2.
Node B is started and pointed to the stand-alone ZooKeeper.
Nodes A and B are both shards, and have fulfilled the 2 shard slots we defined when we started Node A. If we
look in the Solr Admin UI, we'll see that both nodes are considered leaders (indicated with a solid black circle).
Node C is started and pointed to the stand-alone ZooKeeper.
Node C will automatically become a replica of Node A because we didn't specify any other shard for it to belong
to, and it cannot become a new shard because we only defined two shards and those have both been taken.
Node D is started and pointed to the stand-alone ZooKeeper.
Node D will automatically become a replica of Node B, for the same reasons why Node C is a replica of Node A.
Upon restart, suppose that Node C starts before Node A. What happens? Node C will become the leader, while
Node A becomes a replica of Node C.
Shards and Indexing Data in SolrCloud
When your data is too large for one node, you can break it up and store it in sections by creating one or more
shards. Each is a portion of the logical index, or core, and it's the set of all nodes containing that section of the
index.
A shard is a way of splitting a core over a number of "servers", or nodes. For example, you might have a shard
for data that represents each state, or different categories that are likely to be searched independently, but are
often combined.
Before SolrCloud, Solr supported Distributed Search, which allowed one query to be executed across multiple
shards, so the query was executed against the entire Solr index and no documents would be missed from the
search results. So splitting the core across shards is not exclusively a SolrCloud concept. There were, however,
several problems with the distributed approach that necessitated improvement with SolrCloud:
Splitting of the core into shards was somewhat manual.
There was no support for distributed indexing, which meant that you needed to explicitly send documents
to a specific shard; Solr couldn't figure out on its own what shards to send documents to.
There was no load balancing or failover, so if you got a high number of queries, you needed to figure out
where to send them and if one shard died it was just gone.
SolrCloud fixes all those problems. There is support for distributing both the index process and the queries
automatically, and ZooKeeper provides failover and load balancing. Additionally, every shard can also have
multiple replicas for additional robustness.
In SolrCloud there are no masters or slaves. Instead, there are leaders and replicas. Leaders are automatically
elected, initially on a first-come-first-served basis, and then based on the ZooKeeper process described at
http://zookeeper.apache.org/doc/trunk/recipes.html#sc_leaderElection.
If a leader goes down, one of its replicas is automatically elected as the new leader. As each node is started, it's
assigned to the shard with the fewest replicas. When there's a tie, it's assigned to the shard with the lowest shard
ID.
When a document is sent to a machine for indexing, the system first determines if the machine is a replica or a
leader.
If the machine is a replica, the document is forwarded to the leader for processing.
If the machine is a leader, SolrCloud determines which shard the document should go to, forwards the
document to the leader for that shard, indexes the document for this shard, and forwards the index notation
to itself and any replicas.
Document Routing
Solr offers the ability to specify the router implementation used by a collection by specifying the router.name
parameter when creating your collection. If you use the "compositeId" router, you can send documents with a
prefix in the document ID which will be used to calculate the hash Solr uses to determine the shard a document
is sent to for indexing. The prefix can be anything you'd like it to be (it doesn't have to be the shard name, for
example), but it must be consistent so Solr behaves consistently. For example, if you wanted to co-locate
documents for a customer, you could use the customer name or ID as the prefix. If your customer is "IBM", for
example, with a document with the ID "12345", you would insert the prefix into the document id field:
"IBM!12345". The exclamation mark ('!') is critical here, as it distinguishes the prefix used to determine which
shard to direct the document to.
Then at query time, you include the prefix(es) in your query with the _route_ parameter (i.e.,
q=solr&_route_=IBM!) to direct queries to specific shards. In some situations, this may improve query performance
because it overcomes network latency when querying all the shards.
The _route_ parameter replaces shard.keys, which has been deprecated and will be removed in a
future Solr release.
The router supports prefixes containing up to 2 levels of routing. For example: a prefix routingcompositeId
first by region, then by customer: "USA!IBM!12345"
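As a concrete sketch, assuming a collection named mycollection created with the compositeId router and a schemaless (data-driven) configset so the fields below are accepted:
# Index a document whose ID carries the "IBM!" routing prefix.
curl "http://localhost:8983/solr/mycollection/update?commit=true" -H "Content-Type: application/json" -d '[{"id":"IBM!12345","name":"example document"}]'
# Query only the shard(s) that hold documents with that prefix.
curl "http://localhost:8983/solr/mycollection/select?q=*:*&_route_=IBM!"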
If you do not want to influence how documents are stored, you don't need to specify a prefix in your document ID.
If you created the collection and defined the "implicit" router at the time of creation, you can additionally define a
router.field parameter to use a field from each document to identify a shard where the document belongs. If
the field specified is missing in the document, however, the document will be rejected. You could also use the
_route_ parameter to name a specific shard.
Shard Splitting
When you create a collection in SolrCloud, you decide on the initial number of shards to be used. But it can be
difficult to know in advance the number of shards that you need, particularly when organizational requirements
can change at a moment's notice, and the cost of finding out later that you chose wrong can be high, involving
creating new cores and re-indexing all of your data.
The ability to split shards is in the Collections API. It currently allows splitting a shard into two pieces. The
existing shard is left as-is, so the split action effectively makes two copies of the data as new shards. You can
delete the old shard at a later time when you're ready.
More details on how to use shard splitting are in the section on the Collections API.
Ignoring Commits from Client Applications in SolrCloud
In most cases, when running in SolrCloud mode, indexing client applications should not send explicit commit
requests. Rather, you should configure auto commits with and auto soft-commits toopenSearcher=false
make recent updates visible in search requests. This ensures that auto commits occur on a regular schedule in
the cluster. To enforce a policy where client applications should not send explicit commits, you should update all
client applications that index data into SolrCloud. However, that is not always feasible, so Solr provides the
IgnoreCommitOptimizeUpdateProcessorFactory, which allows you to ignore explicit commits and/or
optimize requests from client applications without having to refactor your client application code. To
activate this request processor you'll need to add the following to your solrconfig.xml:
<updateRequestProcessorChain name="ignore-commit-from-client" default="true">
<processor class="solr.IgnoreCommitOptimizeUpdateProcessorFactory">
<int name="statusCode">200</int>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.DistributedUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
As shown in the example above, the processor will return 200 to the client but will ignore the commit / optimize
request. Notice that you need to wire-in the implicit processors needed by SolrCloud as well, since this custom
chain is taking the place of the default chain.
In the following example, the processor will raise an exception with a 403 code with a customized error message:
<updateRequestProcessorChain name="ignore-commit-from-client" default="true">
<processor class="solr.IgnoreCommitOptimizeUpdateProcessorFactory">
<int name="statusCode">403</int>
<str name="responseMessage">Thou shall not issue a commit!</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.DistributedUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
Lastly, you can also configure it to just ignore optimize and let commits pass through by doing:
<updateRequestProcessorChain name="ignore-optimize-only-from-client-403">
<processor class="solr.IgnoreCommitOptimizeUpdateProcessorFactory">
<str name="responseMessage">Thou shall not issue an optimize, but commits are
OK!</str>
<bool name="ignoreOptimizeOnly">true</bool>
</processor>
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
Distributed Requests
Limiting Which Shards are Queried
One of the advantages of using SolrCloud is the ability to distribute requests among various shards that may or
may not contain the data that you're looking for. You have the option of searching over all of your data or just
parts of it.
Querying all shards for a collection should look familiar; it's as though SolrCloud didn't even come into play:
http://localhost:8983/solr/gettingstarted/select?q=*:*
If, on the other hand, you wanted to search just one shard, you can specify that shard, as in:
http://localhost:8983/solr/gettingstarted/select?q=*:*&shards=localhost:7574/solr
If you want to search a group of shards, you can specify them together:
http://localhost:8983/solr/gettingstarted/select?q=*:*&shards=localhost:7574/solr,lo
calhost:8983/solr
Or you can specify a list of servers to choose from for load balancing purposes by using the pipe symbol (|):
http://localhost:8983/solr/gettingstarted/select?q=*:*&shards=localhost:7574/solr|lo
calhost:7500/solr
(If you have explicitly created your shards using ZooKeeper and have shard IDs, you can use those IDs rather
than server addresses.)
Configuring the ShardHandlerFactory
You can directly configure aspects of the concurrency and thread-pooling used within distributed search in Solr.
This allows for finer grained control and you can tune it to target your own specific requirements. The default
configuration favors throughput over latency.
To configure the standard handler, provide a configuration like this in the solrconfig.xml:
<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <!-- other params go here -->
  <shardHandlerFactory class="HttpShardHandlerFactory">
    <int name="socketTimeout">1000</int>
    <int name="connTimeout">5000</int>
  </shardHandlerFactory>
</requestHandler>
The parameters that can be specified are as follows:
Parameter Default Explanation
socketTimeout 0 (use OS default) The amount of time in ms that a socket is allowed to
wait.
connTimeout 0 (use OS default) The amount of time in ms that is accepted for binding /
connecting a socket
maxConnectionsPerHost 20 The maximum number of concurrent connections that
is made to each individual shard in a distributed search.
maxConnections 10000 The total maximum number of concurrent connections
in distributed searches.
corePoolSize 0 The retained lowest limit on the number of threads
used in coordinating distributed search.
maximumPoolSize Integer.MAX_VALUE The maximum number of threads used for coordinating distributed search.
maxThreadIdleTime 5 seconds The amount of time to wait before threads are scaled back in response to a reduction in load.
sizeOfQueue -1 If specified, the thread pool will use a backing queue instead of a direct handoff buffer. High-throughput systems will want to configure this to be a direct handoff (with -1). Systems that desire better latency will want to configure a reasonable queue size to handle variations in requests.
fairnessPolicy false Chooses the JVM specifics dealing with fair policy queuing. If enabled, distributed searches will be handled in a First-In-First-Out fashion at a cost to throughput. If disabled, throughput will be favored over latency.
Configuring statsCache implementation
Document and term statistics are needed in order to calculate relevancy using the vector space model. Solr
provides four implementations out of the box when it comes to document stats calculation:
LocalStatsCache: This only uses local term and document statistics to compute relevance. In cases with uniform term distribution across shards, this works reasonably well. This option is the default if no <statsCache> is configured.
ExactStatsCache: This implementation uses global values for document frequency.
ExactSharedStatsCache: This is exactly like the exact stats cache in its functionality, but the global stats are reused for subsequent requests with the same terms.
LRUStatsCache: This implementation uses an LRU cache to hold global stats, which are shared between requests.
The implementation can be selected by setting <statsCache> in solrconfig.xml. For example, the following line makes Solr use the ExactStatsCache implementation:
<statsCache class="org.apache.solr.search.stats.ExactStatsCache"/>
Avoiding Distributed Deadlock
Each shard serves top-level query requests and then makes sub-requests to all of the other shards. Care should
be taken to ensure that the max number of threads serving HTTP requests is greater than the possible number
of requests from both top-level clients and other shards. If this is not the case, the configuration may result in a
distributed deadlock.
For example, a deadlock might occur in the case of two shards, each with just a single thread to service HTTP
requests. Both threads could receive a top-level request concurrently, and make sub-requests to each other.
Because there are no more remaining threads to service requests, the incoming requests will be blocked until the
other pending requests are finished, but they will not finish since they are waiting for the sub-requests. By
ensuring that Solr is configured to handle a sufficient number of threads, you can avoid deadlock situations like
this.
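Where the limit on HTTP request-serving threads is set depends on how Solr is deployed; for the Jetty server bundled with Solr it is the servlet container's thread pool. The following is only a rough sketch of the kind of setting involved (the values and exact file layout are illustrative, not the shipped jetty.xml):
<!-- illustrative only: keep the request-serving thread pool large enough to
     handle top-level client requests plus inter-shard sub-requests -->
<New id="threadPool" class="org.eclipse.jetty.util.thread.QueuedThreadPool">
  <Set name="minThreads">10</Set>
  <Set name="maxThreads">10000</Set>
</New>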
Prefer Local Shards
Solr allows you to pass an optional boolean parameter named preferLocalShards to indicate that a distributed query should prefer local replicas of a shard when available. In other words, if a query includes preferLocalShards=true, then the query controller will look for local replicas to service the query instead of selecting replicas at random from across the cluster. This is useful when a query requests many fields or large fields to be returned per document because it avoids moving large amounts of data over the network when it is available locally. In addition, this feature can be useful for minimizing the impact of a problematic replica with degraded performance, as it reduces the likelihood that the degraded replica will be hit by other healthy replicas.
Lastly, it follows that the value of this feature diminishes as the number of shards in a collection increases because the query controller will have to direct the query to non-local replicas for most of the shards. In other words, this feature is mostly useful for optimizing queries directed towards collections with a small number of shards and many replicas. Also, this option should only be used if you are load balancing requests across all nodes that host replicas for the collection you are querying, as Solr's CloudSolrClient will do. If not load-balancing, this feature can introduce a hotspot in the cluster since queries won't be evenly distributed across the cluster.
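For example, a query that prefers local replicas might look like this (assuming the gettingstarted collection used in the earlier examples):
http://localhost:8983/solr/gettingstarted/select?q=*:*&preferLocalShards=true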
Read and Write Side Fault Tolerance
Read Side Fault Tolerance
With earlier versions of Solr, you had to set up your own load balancer. Now each individual node load balances
requests across the replicas in a cluster. You still need a load balancer on the 'outside' that talks to the cluster, or
you need a smart client (Solr provides a smart Java Solrj client called CloudSolrClient).
A smart client understands how to read and interact with ZooKeeper and only requests the ZooKeeper
ensemble's address to start discovering to which nodes it should send requests.
Each distributed search request is executed against all shards for a collection unless limited by the user with the 'shards' or '_route_' parameters. If one or more shards queried are unavailable, then the default is to fail the request. However, there are many use-cases where partial results are acceptable, and so Solr provides a boolean shards.tolerant parameter (default 'false'). If shards.tolerant=true, then partial results may be returned. If the returned response does not contain results from all the appropriate shards, then the response header contains a special flag called 'partialResults'. The client can specify 'shards.info' along with the 'shards.tolerant' parameter to retrieve more fine-grained details.
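For example, a query that tolerates unavailable shards and asks for per-shard details might look like this (assuming the gettingstarted collection used earlier):
http://localhost:8983/solr/gettingstarted/select?q=*:*&shards.tolerant=true&shards.info=true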
Example response with the partialResults flag set to 'true':
{
"responseHeader": {
"status": 0,
"partialResults": true,
"QTime": 20,
"params": {
"wt": "json"
}
},
"response": {
"numFound": 77,
"start": 0,
"docs": [ ]
}
}
Write Side Fault Tolerance
SolrCloud supports near real-time actions, elasticity, high availability, and fault tolerance. What this means,
basically, is that when you have a large cluster, you can always make requests to the cluster, and if a request is
acknowledged you are sure it will be durable; i.e., you won't lose data. Updates can be seen right after they are
made and the cluster can be expanded or contracted.
Recovery
A Transaction Log is created for each node so that every change to content or organization is noted. The log is
used to determine which content in the node should be included in a replica. When a new replica is created, it
refers to the Leader and the Transaction Log to know which content to include. If it fails, it retries.
Since the Transaction Log consists of a record of updates, it allows for more robust indexing because it includes
redoing the uncommitted updates if indexing is interrupted.
If a leader goes down, it may have sent requests to some replicas and not others. So when a new potential
leader is identified, it runs a synch process against the other replicas. If this is successful, everything should be
consistent, the leader registers as active, and normal actions proceed. If a replica is too far out of synch, the
system asks for a full replication/replay-based recovery.
If an update fails because cores are reloading schemas and some have finished but others have not, the leader
tells the nodes that the update failed and starts the recovery procedure.
Achieved Replication Factor
When using a replication factor greater than one, an update request may succeed on the shard leader but fail on
one or more of the replicas. For instance, consider a collection with one shard and replication factor of three. In
this case, you have a shard leader and two additional replicas. If an update request succeeds on the leader but
fails on both replicas, for whatever reason, the update request is still considered successful from the perspective
of the client. The replicas that missed the update will sync with the leader when they recover.
Behind the scenes, this means that Solr has accepted updates that are only on one of the nodes (the current leader). Solr supports the optional min_rf parameter on update requests that causes the server to return the achieved replication factor for an update request in the response. For the example scenario described above, if
the client application included min_rf >= 1, then Solr would return rf=1 in the Solr response header because the request only succeeded on the leader. The update request will still be accepted, as the min_rf parameter only tells Solr that the client application wishes to know what the achieved replication factor was for the update request. In other words, min_rf does not mean Solr will enforce a minimum replication factor, as Solr does not support rolling back updates that succeed on a subset of replicas.
On the client side, if the achieved replication factor is less than the acceptable level, then the client application can take additional measures to handle the degraded state. For instance, a client application may want to keep a log of which update requests were sent while the state of the collection was degraded and then resend the updates once the problem has been resolved. In short, min_rf is an optional mechanism for a client application to be warned that an update request was accepted while the collection is in a degraded state.
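For instance, a sketch of an update request that asks for the achieved replication factor (the collection name and document used here are purely illustrative):
curl 'http://localhost:8983/solr/gettingstarted/update?min_rf=2&commit=true' -H 'Content-Type: application/json' --data-binary '[{"id":"doc1"}]'
If only the leader applied the update, the response header would report rf=1 and the client could log the request and resend it once the collection has recovered.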
NRT, Replication, and Disaster Recovery with SolrCloud
SolrCloud and Replication
Replication ensures redundancy for your data, and enables you to send an update request to any node in the
shard. If that node is a replica, it will forward the request to the leader, which then forwards it to all existing
replicas, using versioning to make sure every replica has the most up-to-date version. This architecture enables
you to be certain that your data can be recovered in the event of a disaster, even if you are using Near Real
Time searching.
Near Real Time Searching
If you want to use the NearRealtimeSearch support, enable auto soft commits in your solrconfig.xml file before storing it into ZooKeeper. Otherwise you can send explicit soft commits to the cluster as you need them.
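A minimal sketch of enabling auto soft commits in solrconfig.xml (the 10-second interval shown here is only an illustrative value):
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- automatically open a new searcher with uncommitted changes every 10 seconds -->
  <autoSoftCommit>
    <maxTime>10000</maxTime>
  </autoSoftCommit>
</updateHandler>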
SolrCloud doesn't work very well with separated data clusters connected by an expensive pipe. The root problem
is that SolrCloud's architecture sends documents to all the nodes in the cluster (on a per-shard basis), and that
architecture is really dictated by the NRT functionality.
Imagine that you have a set of servers in China and one in the US that are aware of each other. Assuming 5
replicas, a single update to a shard may make multiple trips over the expensive pipe before it's all done, probably
slowing indexing speed unacceptably.
So the SolrCloud recommendation for this situation is to maintain these clusters separately; nodes in China don't
even know that nodes exist in the US and vice-versa. When indexing, you send the update request to one node
in the US and one in China and all the node-routing after that is local to the separate clusters. Requests can go
to any node in either country and maintain a consistent view of the data.
However, if your US cluster goes down, you have to re-synchronize the down cluster with up-to-date information
from China. The process requires you to replicate the index from China to the repaired US installation and then
get everything back up and working.
Disaster Recovery for an NRT system
Use of Near Real Time (NRT) searching affects the way that systems using SolrCloud behave during disaster
recovery.
The procedure outlined below assumes that you are maintaining separate clusters, as described above.
Consider, for example, an event in which the US cluster goes down (say, because of a hurricane), but the China
cluster is intact. Disaster recovery consists of creating the new system and letting the intact cluster create a replica for each shard on it, then promoting those replicas to be leaders of the newly created US cluster.
Here are the steps to take:
1. Take the downed system offline to all end users.
2. Take the indexing process offline.
3. Repair the system.
4. Bring up one machine per shard in the repaired system as part of the ZooKeeper cluster on the good system, and wait for replication to happen, creating a replica on that machine. (SoftCommits will not be repeated, but data will be pulled from the transaction logs if necessary.)
5. Bring the machines of the repaired cluster down, and reconfigure them to be a separate ZooKeeper cluster again, optionally adding more replicas for each shard.
6. Make the repaired system visible to end users again.
7. Start the indexing program again, delivering updates to both systems.
SolrCloud Configuration and Parameters
In this section, we'll cover the various configuration options for SolrCloud.
The following sections cover these topics:
Setting Up an External ZooKeeper Ensemble
Using ZooKeeper to Manage Configuration Files
ZooKeeper Access Control
Collections API
Parameter Reference
Command Line Utilities
SolrCloud with Legacy Configuration Files
Setting Up an External ZooKeeper Ensemble
Although Solr comes bundled with Apache ZooKeeper, you should consider yourself discouraged from using this
internal ZooKeeper in production, because shutting down a redundant Solr instance will also shut down its
ZooKeeper server, which might not be quite so redundant. Because a ZooKeeper ensemble must have a quorum
of more than half its servers running at any given time, this can be a problem.
The solution to this problem is to set up an external ZooKeeper ensemble. Fortunately, while this process can
seem intimidating due to the number of powerful options, setting up a simple ensemble is actually quite
straightforward. The basic steps are as follows:
Download Apache ZooKeeper
The first step in setting up Apache ZooKeeper is, of course, to download the software. It's available from http://zookeeper.apache.org/releases.html.
SolrCloud will automatically use old-style replication for the bulk load. By temporarily having only
one replica, you'll minimize data transfer across a slow connection.
Setting Up a Single ZooKeeper
Create the instance
Creating the instance is a simple matter of extracting the files into a specific target directory. The actual directory
itself doesn't matter, as long as you know where it is, and where you'd like to have ZooKeeper store its internal
data.
Configure the instance
The next step is to configure your ZooKeeper instance. To do that, create the following file: <ZOOKEEPER_HOME>/conf/zoo.cfg. To this file, add the following information:
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
The parameters are as follows:
tickTime: Part of what ZooKeeper does is to determine which servers are up and running at any given time, and the minimum session timeout is defined as two "ticks". The tickTime parameter specifies, in milliseconds, how long each tick should be.
dataDir: This is the directory in which ZooKeeper will store data about the cluster. This directory should start out
empty.
clientPort: This is the port on which Solr will access ZooKeeper.
Once this file is in place, you're ready to start the ZooKeeper instance.
Run the instance
To run the instance, you can simply use the ZOOKEEPER_HOME/bin/zkServer.sh script provided, as with this
command: zkServer.sh start
Again, ZooKeeper provides a great deal of power through additional configurations, but delving into them is
beyond the scope of this tutorial. For more information, see the ZooKeeper Getting Started page. For this
example, however, the defaults are fine.
Point Solr at the instance
Pointing Solr at the ZooKeeper instance you've created is a simple matter of using the -z parameter when using
the bin/solr script. For example, in order to point the Solr instance to the ZooKeeper you've started on port 2181,
this is what you'd need to do:
Starting the cloud example with ZooKeeper already running at port 2181 (with all other defaults):
When using stand-alone ZooKeeper, you need to take care to keep your version of ZooKeeper updated
with the latest version distributed with Solr. Since you are using it as a stand-alone application, it does
not get upgraded when you upgrade Solr.
Solr currently uses Apache ZooKeeper v3.4.6.
bin/solr start -e cloud -z localhost:2181 -noprompt
Add a node pointing to an existing ZooKeeper at port 2181:
bin/solr start -cloud -s <path to solr home for new node> -p 8987 -z localhost:2181
NOTE: When you are not using an example to start Solr, make sure you upload the configuration set to ZooKeeper before creating the collection.
Shut down ZooKeeper
To shut down ZooKeeper, use the zkServer script with the "stop" command: zkServer.sh stop.
Setting up a ZooKeeper Ensemble
With an external ZooKeeper ensemble, you need to set things up just a little more carefully as compared to the
Getting Started example.
The difference is that rather than simply starting up the servers, you need to configure them to know about and
talk to each other first. So your original zoo.cfg file might look like this:
dataDir=/var/lib/zookeeperdata/1
clientPort=2181
initLimit=5
syncLimit=2
server.1=localhost:2888:3888
server.2=localhost:2889:3889
server.3=localhost:2890:3890
Here you see three new parameters:
initLimit: Amount of time, in ticks, to allow followers to connect and sync to a leader. In this case, you have 5
ticks, each of which is 2000 milliseconds long, so the server will wait as long as 10 seconds to connect and sync
with the leader.
syncLimit: Amount of time, in ticks, to allow followers to sync with ZooKeeper. If followers fall too far behind a
leader, they will be dropped.
server.X: These are the IDs and locations of all servers in the ensemble, and the ports on which they communicate with each other. The server ID must additionally be stored in the <dataDir>/myid file and be located in the dataDir of each ZooKeeper instance. The ID identifies each server, so in the case of this first instance, you would create the file /var/lib/zookeeperdata/1/myid with the content "1".
Now, whereas with Solr you need to create entirely new directories to run multiple instances, all you need for a
new ZooKeeper instance, even if it's on the same machine for testing purposes, is a new configuration file. To
complete the example you'll create two more configuration files.
The <ZOOKEEPER_HOME>/conf/zoo2.cfg file should have the content:
tickTime=2000
dataDir=/var/lib/zookeeperdata/2
clientPort=2182
initLimit=5
syncLimit=2
server.1=localhost:2888:3888
server.2=localhost:2889:3889
server.3=localhost:2890:3890
You'll also need to create <ZOOKEEPER_HOME>/conf/zoo3.cfg:
tickTime=2000
dataDir=/var/lib/zookeeperdata/3
clientPort=2183
initLimit=5
syncLimit=2
server.1=localhost:2888:3888
server.2=localhost:2889:3889
server.3=localhost:2890:3890
Finally, create your myid files in each of the dataDir directories so that each server knows which instance it is. The id in the myid file on each machine must match the "server.X" definition. So, the ZooKeeper instance (or machine) named "server.1" in the above example must have a myid file containing the value "1". The myid file can be any integer between 1 and 255, and must match the server IDs assigned in the zoo.cfg files.
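For example, for the first instance (assuming the dataDir location shown earlier), the myid file could be created like this; repeat with "2" and "3" in the dataDir of each of the other instances:
echo "1" > /var/lib/zookeeperdata/1/myid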
To start the servers, you can simply explicitly reference the configuration files:
cd <ZOOKEEPER_HOME>
bin/zkServer.sh start zoo.cfg
bin/zkServer.sh start zoo2.cfg
bin/zkServer.sh start zoo3.cfg
Once these servers are running, you can reference them from Solr just as you did before:
bin/solr start -e cloud -z localhost:2181,localhost:2182,localhost:2183 -noprompt
For more information on getting the most power from your ZooKeeper installation, check out the ZooKeeper Administrator's Guide.
Using ZooKeeper to Manage Configuration Files
With SolrCloud your configuration files (particularly solrconfig.xml and schema.xml) are kept in ZooKeeper. These files are uploaded in any of the following cases:
When you start a SolrCloud example using the bin/solr script.
When you create a collection using the bin/solr script.
When you manually upload a configuration set to ZooKeeper.
Startup Bootstrap
When you try SolrCloud for the first time using bin/solr -e cloud, the related configset gets uploaded to
ZooKeeper automatically and is linked with the newly created collection.
The below command would start SolrCloud with the default collection name (gettingstarted) and default configset
(data_driven_schema_configs) uploaded and linked to it.
$ bin/solr -e cloud -noprompt
You can also explicitly upload a configuration directory when creating a collection using the bin/solr script with the -d option, such as:
$ bin/solr create -c mycollection -d data_driven_schema_configs
The create command will upload a copy of the data_driven_schema_configs configuration directory to ZooKeeper under /configs/mycollection. Refer to the Solr Start Script Reference page for more details about the create command for creating collections.
Once a configuration directory has been uploaded to ZooKeeper, you can update it using the ZooKeeper Command Line Interface (zkCLI).
Uploading configs using zkcli or SolrJ
In production situations, Config Sets can also be uploaded to ZooKeeper independent of collection creation using either Solr's zkcli.sh script, or the CloudSolrClient.uploadConfig java method.
The below command can be used to upload a new configset using the zkcli script.
$ sh zkcli.sh -cmd upconfig -zkhost <host:port> -confname <name for configset>
-solrhome <solrhome> -confdir <path to directory with configset>
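For instance, a sketch with illustrative values filled in (the ZooKeeper address, configset name, and paths below are examples only):
$ sh zkcli.sh -cmd upconfig -zkhost localhost:2181 -confname myconf -solrhome /opt/solr/server/solr -confdir /path/to/myconf/conf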
Managing Your SolrCloud Configuration Files
To update or change your SolrCloud configuration files:
1. Download the latest configuration files from ZooKeeper, using the source control checkout process.
2. Make your changes.
3. Commit your changed file to source control.
4. Push the changes back to ZooKeeper.
5. Reload the collection so that the changes will be in effect.
More information about the ZooKeeper Command Line Utility to help manage changes to configuration files can be found in the section on Command Line Utilities.
It's a good idea to keep these files under version control.
By default, solr.xml is not one of the Solr configuration files managed by ZooKeeper. If you would like to keep your solr.xml in ZooKeeper, you can push it to ZooKeeper with the zkcli.sh utility (using the putfile command). See the Command Line Utilities section for more information.
ZooKeeper Access Control
This page describes using ZooKeeper access control lists (ACLs) with Solr. For information about ZooKeeper
ACLs, see http://zookeeper.apache.org/doc/r3.4.6/zookeeperProgrammers.html#sc_ZooKeeperAccessControl
Motivation/Scope
How it works
Example usages
Motivation/Scope
SolrCloud uses ZooKeeper for shared information and for coordination.
This section describes how to configure Solr to add more restrictive ACLs to the ZooKeeper content it creates,
and how to tell Solr about the credentials required to access the content in ZooKeeper. You will have to activate this functionality explicitly, because by default Solr's behavior is still to use the open-unsafe ACL everywhere, with no credentials.
Changing Solr-related content in ZooKeeper might damage a SolrCloud cluster. For example:
1. Changing configuration might cause Solr to fail or behave in an unintended way.
2. Changing cluster state information into something wrong or inconsistent might very well make a SolrCloud cluster behave strangely.
3. Adding a delete-collection job to be carried out by the Overseer will cause data to be deleted from the cluster.
You may want to enable ZooKeeper ACLs with Solr if you grant access to your ZooKeeper ensemble to entities
you do not trust, or if you want to reduce risk of bad actions resulting from, e.g.:
Malware that found its way into your system.
Other systems using the same ZooKeeper ensemble (a "bad thing" might be done by accident).
You might even want to limit read-access, if you think there is stuff in ZooKeeper that not everyone should know about. Or you might just generally work on a need-to-know basis.
Protecting ZooKeeper itself could mean many different things. This section is about protecting Solr content in ZooKeeper. ZooKeeper content basically lives persisted on disk and (partly) in memory of the ZooKeeper processes. This section is not about protecting ZooKeeper data at the storage or ZooKeeper process levels - that's for ZooKeeper to deal with.
But this content is also available to "the outside" via the ZooKeeper API. Outside processes can connect to
ZooKeeper and create/update/delete/read content; for example, a Solr node in a SolrCloud cluster wants to
create/update/delete/read, and a SolrJ client wants to read from the cluster. It is the responsibility of the outside
processes that create/update content to set up ACLs on the content. ACLs describe who is allowed to read,
update, delete, create, etc. Each piece of information (znode/content) in ZooKeeper has its own set of ACLs, and
inheritance or sharing is not possible. The default behavior in Solr is to add one ACL on all the content it creates
- one ACL that gives anyone the permission to do anything (in ZooKeeper terms this is called "the open-unsafe
ACL").
How it works
We want to be able to:
1. Control the credentials Solr uses for its ZooKeeper connections. The credentials are used to get permission to perform operations in ZooKeeper.
2. Control which ACLs Solr will add to znodes (ZooKeeper files/folders) it creates in ZooKeeper.
3. Control it "from the outside", so that you do not have to modify and/or recompile Solr code to turn this on.
Solr nodes, clients and tools (e.g. ZkCLI) always use a Java class called SolrZkClient to deal with their ZooKeeper interaction. The implementation of the solution described here is all about changing SolrZkClient. If you "stole" SolrZkClient and use it in your application, the descriptions below will be true for your application too.
Controlling credentials
You control which credentials will be used by setting the "defaultZkCredentialsProvider" system property to the name of a class (on the classpath) implementing the following interface:
package org.apache.solr.common.cloud;
public interface ZkCredentialsProvider {
  public class ZkCredentials {
    String scheme;
    byte[] auth;
    public ZkCredentials(String scheme, byte[] auth) {
      super();
      this.scheme = scheme;
      this.auth = auth;
    }
    String getScheme() {
      return scheme;
    }
    byte[] getAuth() {
      return auth;
    }
  }
  Collection<ZkCredentials> getCredentials();
}
SolrZkClient determines which credentials to use by calling (once per SolrZkClient) the getCredentials() method of the class named by the "defaultZkCredentialsProvider" system property. If the system property hasn't been set, or if the named class implementing the interface can't be found on the classpath, SolrZkClient uses the default implementation.
Out of the box implementations
You can always make your own implementation, but Solr comes with:
org.apache.solr.common.cloud.DefaultZkCredentialsProvider
Its getCredentials() returns a list of length zero - no credentials used. This is the default and is used if you do not set the "defaultZkCredentialsProvider" system property.
org.apache.solr.common.cloud.VMParamsSingleSetCredentialsDigestZkCredentialsProvider
This one lets you define your credentials using system properties. It supports at most one set of credentials.
The scheme is "digest". The username and password are defined by system properties "zkDigestUsername" and "zkDigestPassword", respectively. This set of credentials will be added to the list of credentials returned by getCredentials() if both username and password are provided.
If the one set of credentials above is not added to the list, this implementation will fall back to default behavior and use the (empty) credentials list from DefaultZkCredentialsProvider.
Controlling ACLs
You control which ACLs will be added by setting the "defaultZkACLProvider" system property to the name of a class (on the classpath) implementing the following interface:
package org.apache.solr.common.cloud;
public interface ZkACLProvider {
  List<ACL> getACLsToAdd(String zNodePath);
}
When SolrZkClient wants to create a new znode, it determines which ACLs to put on the znode by calling the getACLsToAdd() method of the class named by the "defaultZkACLProvider" system property. If the system property has not been set, or if it does not name a class on the classpath implementing the interface, the default implementation is used.
Out of the box implementations
You can always make your own implementation, but Solr comes with:
org.apache.solr.common.cloud.DefaultZkACLProvider
It returns a list of length one for all zNodePath-s. The single ACL entry in the list is "open-unsafe". This is the default and is used if you do not set the "defaultZkACLProvider" system property.
org.apache.solr.common.cloud.VMParamsAllAndReadonlyDigestZkACLProvider
This one lets you define your ACLs using system properties. Its getACLsToAdd() implementation does not use zNodePath for anything, so all znodes will get the same set of ACLs. It supports adding one or both of:
A user that is allowed to do everything. The permission is "ALL" (corresponding to all of CREATE, READ, WRITE, DELETE, and ADMIN), and the scheme is "digest". The username and password are defined by system properties "zkDigestUsername" and "zkDigestPassword", respectively. This ACL will not be added to the list of ACLs unless both username and password are provided.
A user that is only allowed to perform read operations. The permission is "READ" and the scheme is "digest". The username and password are defined by system properties "zkDigestReadonlyUsername" and "zkDigestReadonlyPassword", respectively. This ACL will not be added to the list of ACLs unless both username and password are provided.
If neither of the above ACLs is added to the list, the (empty) ACL list of DefaultZkACLProvider will be used by default.
Notice the overlap in system property names with the credentials provider VMParamsSingleSetCredentialsDigestZkCredentialsProvider (described above). This is to let the two providers collaborate in a nice and
perhaps common way: we always protect access to content by limiting to two users - an admin-user and a
readonly-user - AND we always connect with credentials corresponding to this same admin-user, basically so
that we can do anything to the content/znodes we create ourselves.
You can give the readonly credentials to "clients" of your SolrCloud cluster - e.g. to be used by SolrJ clients.
They will be able to read whatever is necessary to run a functioning SolrJ client, but they will not be able to
modify any content in ZooKeeper.
Example usages
Let's say that you want all Solr-related content in ZooKeeper protected. You want an "admin" user that is able to
do anything to the content in ZooKeeper - this user will be used for initializing Solr content in ZooKeeper and for
server-side Solr nodes. You also want a "readonly" user that is only able to read content from ZooKeeper - this
user will be handed over to "clients".
In the examples below:
The "admin" user's username/password is admin-user/admin-password.
The "readonly" user's username/password is readonly-user/readonly-password.
To use ZkCLI:
SOLR_ZK_PROVIDERS="-DdefaultZkCredentialsProvider=org.apache.solr.common.cloud.VMParamsSingleSetCredentialsDigestZkCredentialsProvider \
-DdefaultZkACLProvider=org.apache.solr.common.cloud.VMParamsAllAndReadonlyDigestZkACLProvider"
SOLR_ZK_CREDS_AND_ACLS="-DzkDigestUsername=admin-user -DzkDigestPassword=admin-password \
-DzkDigestReadonlyUsername=readonly-user -DzkDigestReadonlyPassword=readonly-password"
java ... $SOLR_ZK_PROVIDERS $SOLR_ZK_CREDS_AND_ACLS ... org.apache.solr.cloud.ZkCLI -cmd ...
For operations using bin/solr, add the following at the bottom of bin/solr.in.sh:
SOLR_ZK_PROVIDERS="-DdefaultZkCredentialsProvider=org.apache.solr.common.cloud.VMParamsSingleSetCredentialsDigestZkCredentialsProvider \
-DdefaultZkACLProvider=org.apache.solr.common.cloud.VMParamsAllAndReadonlyDigestZkACLProvider"
SOLR_ZK_CREDS_AND_ACLS="-DzkDigestUsername=admin-user -DzkDigestPassword=admin-password \
-DzkDigestReadonlyUsername=readonly-user -DzkDigestReadonlyPassword=readonly-password"
SOLR_OPTS="$SOLR_OPTS $SOLR_ZK_PROVIDERS $SOLR_ZK_CREDS_AND_ACLS"
For operations using bin\solr.cmd, add the following at the bottom of bin\solr.in.cmd:
set SOLR_ZK_PROVIDERS=-DdefaultZkCredentialsProvider=org.apache.solr.common.cloud.VMParamsSingleSetCredentialsDigestZkCredentialsProvider ^
-DdefaultZkACLProvider=org.apache.solr.common.cloud.VMParamsAllAndReadonlyDigestZkACLProvider
set SOLR_ZK_CREDS_AND_ACLS=-DzkDigestUsername=admin-user -DzkDigestPassword=admin-password ^
-DzkDigestReadonlyUsername=readonly-user -DzkDigestReadonlyPassword=readonly-password
set SOLR_OPTS=%SOLR_OPTS% %SOLR_ZK_PROVIDERS% %SOLR_ZK_CREDS_AND_ACLS%
To start your own "clients" (using SolrJ):
SOLR_ZK_PROVIDERS="-DdefaultZkCredentialsProvider=org.apache.solr.common.cloud.VMParamsSingleSetCredentialsDigestZkCredentialsProvider"
SOLR_ZK_CREDS_AND_ACLS="-DzkDigestUsername=readonly-user -DzkDigestPassword=readonly-password"
java ... $SOLR_ZK_PROVIDERS $SOLR_ZK_CREDS_AND_ACLS ...
Or, since you yourself are writing the code creating the SolrZkClient-s, you might want to override the provider implementations at the code level instead.
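For example, a minimal sketch of a custom credentials provider implementing the ZkCredentialsProvider interface shown above; the class name and the environment variable names here are hypothetical, and you would point the "defaultZkCredentialsProvider" system property at this class:
package com.example.zk; // hypothetical package

import java.nio.charset.StandardCharsets;
import java.util.Collection;
import java.util.Collections;
import org.apache.solr.common.cloud.ZkCredentialsProvider;

// Reads digest credentials from environment variables instead of system properties.
public class EnvZkCredentialsProvider implements ZkCredentialsProvider {
  @Override
  public Collection<ZkCredentials> getCredentials() {
    String user = System.getenv("ZK_DIGEST_USERNAME");
    String pass = System.getenv("ZK_DIGEST_PASSWORD");
    if (user == null || pass == null) {
      // No credentials configured - behave like DefaultZkCredentialsProvider.
      return Collections.<ZkCredentials>emptyList();
    }
    byte[] auth = (user + ":" + pass).getBytes(StandardCharsets.UTF_8);
    return Collections.singletonList(new ZkCredentials("digest", auth));
  }
}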
Collections API
The Collections API is used to enable you to create, remove, or reload collections, but in the context of SolrCloud
you can also use it to create collections with a specific number of shards and replicas.
API Entry Points
The base URL for all API calls below is http://<hostname>:<port>/solr.
/admin/collections?action=CREATE: create a collection
/admin/collections?action=RELOAD: reload a collection
/admin/collections?action=SPLITSHARD: split a shard into two new shards
/admin/collections?action=CREATESHARD: create a new shard
/admin/collections?action=DELETESHARD: delete an inactive shard
/admin/collections?action=CREATEALIAS: create or modify an alias for a collection
/admin/collections?action=DELETEALIAS: delete an alias for a collection
/admin/collections?action=DELETE: delete a collection
/admin/collections?action=DELETEREPLICA: delete a replica of a shard
/admin/collections?action=ADDREPLICA: add a replica of a shard
/admin/collections?action=CLUSTERPROP: Add/edit/delete a cluster-wide property
/admin/collections?action=MIGRATE: Migrate documents to another collection
/admin/collections?action=ADDROLE: Add a specific role to a node in the cluster
/admin/collections?action=REMOVEROLE: Remove an assigned role
/admin/collections?action=OVERSEERSTATUS: Get status and statistics of the overseer
/admin/collections?action=CLUSTERSTATUS: Get cluster status
/admin/collections?action=REQUESTSTATUS: Get the status of a previous asynchronous request
/admin/collections?action=LIST: List all collections
/admin/collections?action=ADDREPLICAPROP: Add an arbitrary property to a replica specified by collection/shard/replica
/admin/collections?action=DELETEREPLICAPROP: Delete an arbitrary property from a replica specified by collection/shard/replica
/admin/collections?action=BALANCESHARDUNIQUE: Distribute an arbitrary property, one per shard, across the nodes in a collection
/admin/collections?action=REBALANCELEADERS: Distribute leader role based on the "preferredLeader" assignments
Create a Collection
/admin/collections?action=CREATE&name= &numShards= &replicationFactor=name number numbe
&maxShardsPerNode= &createNodeSet= &collection.configName=r number nodelist configname
Input
Query Parameters
Key Type Required Default Description
name string Yes The name of the collection to be created.
router.name string No compositeId The router name that will be used. The router defines how documents will be distributed among the shards. The value can be either implicit, which lets you explicitly assign documents to particular shards, or compositeId, which spreads documents across shards using an internal hash. When using the 'implicit' router, the shards parameter is required. When using the 'compositeId' router, the numShards parameter is required. For more information, see also the section Document Routing.
numShards integer No empty The number of shards to be created as part of the collection. This is a required parameter when using the 'compositeId' router.
shards string No empty A comma-separated list of shard names, e.g., shard-x,shard-y,shard-z. This is a required parameter when using the 'implicit' router.
replicationFactor integer No 1 The number of replicas to be created for each shard.
maxShardsPerNode integer No 1 When creating collections, the shards and/or replicas are spread across all available (i.e., live) nodes, and two replicas of the same shard will never be on the same node. If a node is not live when the CREATE operation is called, it will not get any parts of the new collection, which could lead to too many replicas being created on a single live node. Defining maxShardsPerNode sets a limit on the number of replicas CREATE will spread to each node. If the entire collection cannot be fit into the live nodes, no collection will be created at all.
createNodeSet string No empty Allows defining the nodes to spread the new collection across. If not provided, the CREATE operation will create shard-replicas spread across all live Solr nodes. The format is a comma-separated list of node_names, such as localhost:8983_solr,localhost:8984_solr,localhost:8985_solr.
createNodeSet.shuffle boolean No true Controls whether or not the shard-replicas created for this collection will be assigned to the nodes specified by the createNodeSet in a sequential manner, or if the list of nodes should be shuffled prior to creating individual replicas. A 'false' value makes the results of a collection creation predictable and gives more exact control over the location of the individual shard-replicas, but 'true' can be a better choice for ensuring replicas are distributed evenly across nodes. Ignored if createNodeSet is not also specified.
collection.configName string No empty Defines the name of the configuration (which must already be stored in ZooKeeper) to use for this collection. If not provided, Solr will default to the collection name as the configuration name.
router.field string No empty If this field is specified, the router will look at the value of the field in an input document to compute the hash and identify a shard instead of looking at the uniqueKey field. If the field specified is null in the document, the document will be rejected. Please note that RealTime Get or retrieval by id would also require the _route_ (or shard.keys) parameter to avoid a distributed search.
property.name=value string No Set core property name to value. See the section Defining core.properties for details on supported properties and values.
autoAddReplicas boolean No false When set to true, enables auto addition of replicas on shared file systems. See the section autoAddReplicas Settings for more details on settings and overrides.
async string No Request ID to track this action which will be processed asynchronously.
Output
Output Content
The response will include the status of the request and the new core names. If the status is anything other than
"success", an error message will explain why the request failed.
Examples
Input
http://localhost:8983/solr/admin/collections?action=CREATE&name=newCollection&numShards=2&replicationFactor=1
Output
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">3764</int>
</lst>
<lst name="success">
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">3450</int>
</lst>
<str name="core">newCollection_shard1_replica1</str>
</lst>
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">3597</int>
</lst>
<str name="core">newCollection_shard2_replica1</str>
</lst>
</lst>
</response>
Reload a Collection
/admin/collections?action=RELOAD&name=name
The RELOAD action is used when you have changed a configuration in ZooKeeper.
Input
Query Parameters
Key Type Required Description
name string Yes The name of the collection to reload.
Output
Output Content
The response will include the status of the request and the cores that were reloaded. If the status is anything
other than "success", an error message will explain why the request failed.
Examples
Input
http://localhost:8983/solr/admin/collections?action=RELOAD&name=newCollection
Output
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1551</int>
</lst>
<lst name="success">
<lst name="10.0.1.6:8983_solr">
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">761</int>
</lst>
</lst>
<lst name="10.0.1.4:8983_solr">
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1527</int>
</lst>
</lst>
</lst>
</response>
Split a Shard
/admin/collections?action=SPLITSHARD&collection=name&shard=shardID
Splitting a shard will take an existing shard and break it into two pieces which are written to disk as two (new)
shards. The original shard will continue to contain the same data as-is but it will start re-routing requests to the
new shards. The new shards will have as many replicas as the original shard. A soft commit is automatically
issued after splitting a shard so that documents are made visible on sub-shards. An explicit commit (hard or soft)
is not necessary after a split operation because the index is automatically persisted to disk during the split
operation.
This command allows for seamless splitting and requires no downtime. A shard being split will continue to accept
query and indexing requests and will automatically start routing them to the new shards once this operation is
complete. This command can only be used for SolrCloud collections created with "numShards" parameter,
meaning collections which rely on Solr's hash-based routing mechanism.
The split is performed by dividing the original shard's hash range into two equal partitions and dividing up the
documents in the original shard according to the new sub-ranges.
One can also specify an optional 'ranges' parameter to divide the original shard's hash range into arbitrary hash
range intervals specified in hexadecimal. For example, if the original hash range is 0-1500 then adding the
parameter: ranges=0-1f4,1f5-3e8,3e9-5dc will divide the original shard into three shards with hash range 0-500,
501-1000 and 1001-1500 respectively.
Another optional parameter 'split.key' can be used to split a shard using a route key such that all documents of
the specified route key end up in a single dedicated sub-shard. Providing the 'shard' parameter is not required in
this case because the route key is enough to figure out the right shard. A route key which spans more than one
shard is not supported. For example, suppose split.key=A! hashes to the range 12-15 and belongs to shard
'shard1' with range 0-20 then splitting by this route key would yield three sub-shards with ranges 0-11, 12-15 and
16-20. Note that the sub-shard with the hash range of the route key may also contain documents for other route
keys whose hash ranges overlap.
Shard splitting can be a long running process. In order to avoid timeouts, you should run this as an asynchronous call.
Input
Query Parameters
Key Type Required Description
collection string Yes The name of the collection that includes the shard to be split.
shard string Yes The name of the shard to be split.
ranges string No A comma-separated list of hash ranges in hexadecimal e.g.
ranges=0-1f4,1f5-3e8,3e9-5dc
split.key string No The key to use for splitting the index
property.name=value string No Set core property name to value. See the section Defining core.properties for details on supported properties and values.
async string No Request ID to track this action which will be processed asynchronously
Output
Output Content
The output will include the status of the request and the new shard names, which will use the original shard as
their basis, adding an underscore and a number. For example, "shard1" will become "shard1_0" and "shard1_1".
If the status is anything other than "success", an error message will explain why the request failed.
Examples
Input
Split shard1 of the "anotherCollection" collection.
http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=anotherCol
lection&shard=shard1
Output
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">6120</int>
</lst>
<lst name="success">
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">3673</int>
</lst>
<str name="core">anotherCollection_shard1_1_replica1</str>
</lst>
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">3681</int>
</lst>
<str name="core">anotherCollection_shard1_0_replica1</str>
</lst>
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">6008</int>
</lst>
</lst>
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">6007</int>
</lst>
</lst>
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">71</int>
</lst>
</lst>
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
</lst>
<str name="core">anotherCollection_shard1_1_replica1</str>
<str name="status">EMPTY_BUFFER</str>
</lst>
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
</lst>
<str name="core">anotherCollection_shard1_0_replica1</str>
<str name="status">EMPTY_BUFFER</str>
</lst>
</lst>
</response>
Create a Shard
Shards can only be created with this API for collections that use the 'implicit' router. Use SPLITSHARD for
collections using the 'compositeId' router. A new shard with a name can be created for an existing 'implicit'
collection.
/admin/collections?action=CREATESHARD&shard=shardName&collection=name
Input
Query Parameters
Key Type Required Description
collection string Yes The name of the collection to which the new shard will be added.
shard string Yes The name of the shard to be created.
createNodeSet string No Allows defining the nodes to spread the new shard-replicas across. If not provided, the operation will create shard-replicas spread across all live Solr nodes. The format is a comma-separated list of node_names, such as localhost:8983_solr,localhost:8984_solr,localhost:8985_solr.
property.name=value string No Set core property name to value. See the section Defining core.properties for details on supported properties and values.
Output
Output Content
The output will include the status of the request. If the status is anything other than "success", an error message
will explain why the request failed.
Examples
Input
Create 'shard-z' for the "anImplicitCollection" collection.
http://localhost:8983/solr/admin/collections?action=CREATESHARD&collection=anImplici
tCollection&shard=shard-z
Output
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">558</int>
</lst>
</response>
Delete a Shard
Deleting a shard will unload all replicas of the shard and remove them from clusterstate.json. It will only remove shards that are inactive, or which have no range given for custom sharding.
/admin/collections?action=DELETESHARD&shard=shardID&collection=name
Input
Query Parameters
Key Type Required Description
collection string Yes The name of the collection that includes the shard to be deleted.
shard string Yes The name of the shard to be deleted.
Output
Output Content
The output will include the status of the request. If the status is anything other than "success", an error message
will explain why the request failed.
Examples
Input
Delete 'shard1' of the "anotherCollection" collection.
http://localhost:8983/solr/admin/collections?action=DELETESHARD&collection=anotherCo
llection&shard=shard1
Output
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">558</int>
</lst>
<lst name="success">
<lst name="10.0.1.4:8983_solr">
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">27</int>
</lst>
</lst>
</lst>
</response>
Create or modify an Alias for a Collection
The CREATEALIAS action will create a new alias pointing to one or more collections. If an alias by the same name already exists, this action will replace the existing alias, effectively acting like an atomic "MOVE" command.
/admin/collections?action=CREATEALIAS&name=name&collections=collectionlist
Input
Query Parameters
Key Type Required Description
name string Yes The alias name to be created.
collections string Yes The list of collections to be aliased, separated by commas.
Output
Output Content
The output will simply be a responseHeader with details of the time it took to process the request. To confirm the creation of the alias, you can look in the Solr Admin UI, under the Cloud section, and find the aliases.json file.
Examples
Input
Create an alias named "testalias" and link it to the collections named "anotherCollection" and "testCollection".
http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=testalias&colle
ctions=anotherCollection,testCollection
Output
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">122</int>
</lst>
</response>
Delete a Collection Alias
/admin/collections?action=DELETEALIAS&name=name
Input
Query Parameters
Key Type Required Description
name string Yes The name of the alias to delete.
Output
Output Content
The output will simply be a responseHeader with details of the time it took to process the request. To confirm the removal of the alias, you can look in the Solr Admin UI, under the Cloud section, and find the aliases.json file.
Examples
Input
Remove the alias named "testalias".
http://localhost:8983/solr/admin/collections?action=DELETEALIAS&name=testalias
Output
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">117</int>
</lst>
</response>
Delete a Collection
/admin/collections?action=DELETE&name=collection
Input
Query Parameters
Key Type Required Description
name string Yes The name of the collection to delete.
Output
Output Content
The response will include the status of the request and the cores that were deleted. If the status is anything other
than "success", an error message will explain why the request failed.
Examples
Input
Delete the collection named "newCollection".
http://localhost:8983/solr/admin/collections?action=DELETE&name=newCollection
Output
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">603</int>
</lst>
<lst name="success">
<lst name="10.0.1.6:8983_solr">
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">19</int>
</lst>
</lst>
<lst name="10.0.1.4:8983_solr">
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">67</int>
</lst>
</lst>
</lst>
</response>
Delete a Replica
/admin/collections?action=DELETEREPLICA&collection=collection&shard=shard&replica=replica
Delete a replica from a given collection and shard. If the corresponding core is up and running, the core is unloaded and the entry is removed from the clusterstate. If the node/core is down, the entry is taken off the clusterstate and, if the core comes up later, it is automatically unregistered.
Input
Query Parameters
Key Type Required Description
collection string Yes The name of the collection.
shard string Yes The name of the shard that includes the replica to be removed.
replica string Yes The name of the replica to remove.
onlyIfDown boolean No When set to 'true', no action will be taken if the replica is active. Default is 'false'.
Examples
Input
http://localhost:8983/solr/admin/collections?action=DELETEREPLICA&collection=test2&s
hard=shard2&replica=core_node3
Output
Output Content
<response>
<lst name="responseHeader"><int name="status">0</int><int
name="QTime">110</int></lst>
</response>
Add Replica
/admin/collections?action=ADDREPLICA&collection=collection&shard=shard&node=solr_node_name
Add a replica to a shard in a collection. The node name can be specified if the replica is to be created in a specific node.
Input
Query Parameters
Key Type Required Description
collection string Yes The name of the collection.
shard string Yes* The name of the shard to which replica is to be added.
If shard is not specified, then _route_ must be.
_route_ string No* If the exact shard name is not known, users may pass the _route_ value
and the system would identify the name of the shard.
Ignored if the shard param is also specified.
node string No The name of the node where the replica should be created
instanceDir string No The instanceDir for the core that will be created
dataDir string No The directory in which the core should be created
property.name=value string No Set core property name to value. See Defining core.properties.
async string No Request ID to track this action which will be processed asynchronously
Examples
Input
http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=test2&shar
d=shard2&node=192.167.1.2:8983_solr
Output
Output Content
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">3764</int>
</lst>
<lst name="success">
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">3450</int>
</lst>
<str name="core">test2_shard2_replica4</str>
</lst>
</response>
Cluster Properties
/admin/collections?action=CLUSTERPROP&name=propertyName&val=propertyValue
Add, edit or delete a cluster-wide property.
Input
Query Parameters
Key Type Required Description
name string Yes The name of the property. The two supported property names are urlScheme and autoAddReplicas. Other names are rejected with an error.
val string Yes The value of the property. If the value is empty or null, the property is unset.
Output
Output Content
The response will include the status of the request and the properties that were updated or removed. If the status
is anything other than "0", an error message will explain why the request failed.
Examples
Input
http://localhost:8983/solr/admin/collections?action=CLUSTERPROP&name=urlScheme&val=https
Output
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
</lst>
</response>
Migrate Documents to Another Collection
/admin/collections?action=MIGRATE&collection=name&split.key=key1!&target.collection=target_collection&forward.timeout=60
The MIGRATE command is used to migrate all documents having the given routing key to another collection.
The source collection will continue to have the same data as-is but it will start re-routing write requests to the
target collection for the number of seconds specified by the forward.timeout parameter. It is the responsibility of
the user to switch to the target collection for reads and writes after the ‘migrate’ command completes.
The routing key specified by the ‘split.key’ parameter may span multiple shards on both the source and the target
collections. The migration is performed shard-by-shard in a single thread. One or more temporary collections
may be created by this command during the ‘migrate’ process but they are cleaned up at the end automatically.
This is a synchronous operation and therefore keeping a large read timeout on the invocation is advised. The
request may still timeout due to inherent limitations of the Collection APIs but that doesn’t necessarily mean that
the operation has failed. Users should check logs, cluster state, source and target collections before invoking the
operation again.
This command works only with collections having the compositeId router. The target collection must not receive
any writes during the time the migrate command is running otherwise some writes may be lost.
Please note that the migrate API does not perform any de-duplication on the documents so if the target collection
contains documents with the same uniqueKey as the documents being migrated then the target collection will
end up with duplicate documents.
Input
Query Parameters
Key Type Required Description
collection string Yes The name of the source collection from which documents will be split.
target.collection string Yes The name of the target collection to which documents will be migrated.
split.key string Yes The routing key prefix. For example, if uniqueKey is a!123, then you would use split.key=a!.
forward.timeout int No The timeout, in seconds, until which write requests made to the source collection for the given split.key will be forwarded to the target shard. The default is 60 seconds.
property.name=value string No Set core property name to value. See the section Defining core.properties for details on supported properties and values.
async string No Request ID to track this action which will be processed asynchronously.
Output
Output Content
The response will include the status of the request.
Examples
Input
http://localhost:8983/solr/admin/collections?action=MIGRATE&collection=test1&split.key=a!&target.collection=test2
Output
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">19014</int>
</lst>
<lst name="success">
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
</lst>
<str name="core">test2_shard1_0_replica1</str>
<str name="status">BUFFERING</str>
</lst>
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">2479</int>
</lst>
<str name="core">split_shard1_0_temp_shard1_0_shard1_replica1</str>
</lst>
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1002</int>
</lst>
</lst>
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">21</int>
</lst>
</lst>
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1655</int>
</lst>
<str name="core">split_shard1_0_temp_shard1_0_shard1_replica2</str>
</lst>
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">4006</int>
</lst>
</lst>
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">17</int>
</lst>
</lst>
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
</lst>
<str name="core">test2_shard1_0_replica1</str>
<str name="status">EMPTY_BUFFER</str>
</lst>
<lst name="192.168.43.52:8983_solr">
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">31</int>
</lst>
</lst>
<lst name="192.168.43.52:8983_solr">
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">31</int>
</lst>
</lst>
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
</lst>
<str name="core">test2_shard1_1_replica1</str>
<str name="status">BUFFERING</str>
</lst>
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1742</int>
</lst>
<str name="core">split_shard1_1_temp_shard1_1_shard1_replica1</str>
</lst>
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1002</int>
</lst>
</lst>
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">15</int>
</lst>
</lst>
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1917</int>
</lst>
<str name="core">split_shard1_1_temp_shard1_1_shard1_replica2</str>
</lst>
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">5007</int>
</lst>
</lst>
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">8</int>
</lst>
</lst>
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
</lst>
<str name="core">test2_shard1_1_replica1</str>
<str name="status">EMPTY_BUFFER</str>
</lst>
<lst name="192.168.43.52:8983_solr">
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">30</int>
</lst>
</lst>
<lst name="192.168.43.52:8983_solr">
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">30</int>
</lst>
</lst>
</lst>
</response>
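The same migration could also be run with a longer forwarding window and asynchronously by combining the forward.timeout and async parameters described above; for example (the 120 second timeout and the request id 2000 are illustrative values):
http://localhost:8983/solr/admin/collections?action=MIGRATE&collection=test1&split.key=a!&target.collection=test2&forward.timeout=120&async=2000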
Add Role
/admin/collections?action=ADDROLE&role= &node=roleName nodeName
Assign a role to a given node in the cluster. The only supported role as of 4.7 is 'overseer'. Use this API to
dedicate a particular node as Overseer. Invoke it multiple times to add more nodes. This is useful in large
clusters where an Overseer is likely to get overloaded. If available, one among the list of nodes which are
assigned the 'overseer' role will become the Overseer. The system will assign the role to any other node if
none of the designated nodes are up and running.
Input
Query Parameters
Key Type Required Description
role string Yes The name of the role. The only supported role as of now is overseer.
node string Yes The name of the node. It is possible to assign a role even before that node is
started.
Output
Output Content
The response will include the status of the request and the properties that were updated or removed. If the status
is anything other than "0", an error message will explain why the request failed.
Examples
Input
http://localhost:8983/solr/admin/collections?action=ADDROLE&role=overseer&node=192.167.1.2:8983_solr
Output
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
</lst>
</response>
Remove Role
/admin/collections?action=REMOVEROLE&role=roleName&node=nodeName
Remove an assigned role. This API is used to undo the roles assigned using the ADDROLE operation.
Input
Query Parameters
Key Type Required Description
role string Yes The name of the role. The only supported role as of now is overseer.
node string Yes The name of the node.
Output
Output Content
The response will include the status of the request and the properties that were updated or removed. If the status
is anything other than "0", an error message will explain why the request failed.
Examples
Input
http://localhost:8983/solr/admin/collections?action=REMOVEROLE&role=overseer&node=192.167.1.2:8983_solr
Output
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
</lst>
</response>
Overseer status and statistics
/admin/collections?action=OVERSEERSTATUS
Returns the current status of the Overseer, performance statistics of various Overseer APIs, as well as the last 10
failures per operation type.
Examples
Input:
http://localhost:8983/solr/admin/collections?action=OVERSEERSTATUS&wt=json
{
"responseHeader":{
"status":0,
"QTime":33},
"leader":"127.0.1.1:8983_solr",
"overseer_queue_size":0,
"overseer_work_queue_size":0,
"overseer_collection_queue_size":2,
"overseer_operations":[
"createcollection",{
"requests":2,
"errors":0,
"totalTime":1.010137,
"avgRequestsPerMinute":0.7467088842794136,
"5minRateRequestsPerMinute":7.525069023276674,
"15minRateRequestsPerMinute":10.271274280947182,
"avgTimePerRequest":0.5050685,
"medianRequestTime":0.5050685,
"75thPctlRequestTime":0.519016,
"95thPctlRequestTime":0.519016,
"99thPctlRequestTime":0.519016,
"999thPctlRequestTime":0.519016},
"removeshard",{
"requests":1,
"errors":0,
"totalTime":0.26784,
"avgRequestsPerMinute":0.4639267176178192,
"5minRateRequestsPerMinute":8.179027994326175,
"15minRateRequestsPerMinute":10.560587086130052,
"avgTimePerRequest":0.26784,
"medianRequestTime":0.26784,
"75thPctlRequestTime":0.26784,
"95thPctlRequestTime":0.26784,
"99thPctlRequestTime":0.26784,
"999thPctlRequestTime":0.26784},
"updateshardstate",{
"requests":1,
"errors":0,
"totalTime":0.609256,
"avgRequestsPerMinute":0.43725644039684236,
"5minRateRequestsPerMinute":8.043840552427673,
"15minRateRequestsPerMinute":10.502079828515368,
"avgTimePerRequest":0.609256,
"medianRequestTime":0.609256,
"75thPctlRequestTime":0.609256,
"95thPctlRequestTime":0.609256,
"99thPctlRequestTime":0.609256,
"999thPctlRequestTime":0.609256},
"state",{
"requests":29,
"errors":0,
"totalTime":25.777765,
"avgRequestsPerMinute":8.911471494053579,
"5minRateRequestsPerMinute":16.77961791015292,
"15minRateRequestsPerMinute":21.299616774565774,
"avgTimePerRequest":0.888888448275862,
"medianRequestTime":0.646322,
"75thPctlRequestTime":0.7662585,
"95thPctlRequestTime":4.9277995,
"99thPctlRequestTime":6.687749,
"999thPctlRequestTime":6.687749},
"createshard",{
"requests":2,
"errors":0,
"totalTime":0.328155,
"avgRequestsPerMinute":0.8384528317300947,
"5minRateRequestsPerMinute":15.560264184036232,
"15minRateRequestsPerMinute":20.772071869612244,
"avgTimePerRequest":0.1640775,
"medianRequestTime":0.1640775,
"75thPctlRequestTime":0.198494,
"95thPctlRequestTime":0.198494,
"99thPctlRequestTime":0.198494,
"999thPctlRequestTime":0.198494},
"leader",{
"requests":15,
"errors":0,
"totalTime":1.850757,
"avgRequestsPerMinute":4.664791390089222,
"5minRateRequestsPerMinute":15.267394345445812,
"15minRateRequestsPerMinute":20.61365640511346,
"avgTimePerRequest":0.1233838,
"medianRequestTime":0.095369,
"75thPctlRequestTime":0.190858,
"95thPctlRequestTime":0.245846,
"99thPctlRequestTime":0.245846,
"999thPctlRequestTime":0.245846},
"deletecore",{
"requests":2,
"errors":0,
"totalTime":0.1644,
"avgRequestsPerMinute":0.9277190814105167,
"5minRateRequestsPerMinute":16.35805598865235,
"15minRateRequestsPerMinute":21.121174172260105,
"avgTimePerRequest":0.0822,
"medianRequestTime":0.0822,
"75thPctlRequestTime":0.114723,
"95thPctlRequestTime":0.114723,
"99thPctlRequestTime":0.114723,
"999thPctlRequestTime":0.114723}],
"collection_operations":[
"overseerstatus",{
"requests":5,
"errors":0,
"totalTime":16.602856,
"avgRequestsPerMinute":1.8002951096636433,
"5minRateRequestsPerMinute":7.878245556506509,
"15minRateRequestsPerMinute":10.39984320341109,
"avgTimePerRequest":3.3205712000000003,
"medianRequestTime":3.42046,
"75thPctlRequestTime":4.0594019999999995,
"95thPctlRequestTime":4.563145,
"99thPctlRequestTime":4.563145,
"999thPctlRequestTime":4.563145},
"createalias",{
"requests":1,
"errors":0,
"totalTime":101.364917,
"avgRequestsPerMinute":8.304550290288862,
"5minRateRequestsPerMinute":12.0,
"15minRateRequestsPerMinute":12.0,
"avgTimePerRequest":101.364917,
"medianRequestTime":101.364917,
"75thPctlRequestTime":101.364917,
"95thPctlRequestTime":101.364917,
"99thPctlRequestTime":101.364917,
"999thPctlRequestTime":101.364917},
"splitshard",{
"requests":1,
"errors":1,
"recent_failures":[{
"request":{
"operation":"splitshard",
"shard":"shard2",
"collection":"example1"},
"response":[
"Operation splitshard caused
exception:","org.apache.solr.common.SolrException:org.apache.solr.common.SolrException:
No shard with the specified name exists: shard2",
"exception",{
"msg":"No shard with the specified name exists: shard2",
"rspCode":400}]}],
"totalTime":5905.432835,
"avgRequestsPerMinute":0.8198143044809885,
"5minRateRequestsPerMinute":8.043840552427673,
"15minRateRequestsPerMinute":10.502079828515368,
"avgTimePerRequest":2952.7164175,
"medianRequestTime":2952.7164175000003,
"75thPctlRequestTime":5904.384052,
"95thPctlRequestTime":5904.384052,
"99thPctlRequestTime":5904.384052,
"999thPctlRequestTime":5904.384052},
"createcollection",{
"requests":2,
"errors":0,
"totalTime":6294.35359,
"avgRequestsPerMinute":0.7466431055563431,
"5minRateRequestsPerMinute":7.5271593686145355,
"15minRateRequestsPerMinute":10.271591296400848,
"avgTimePerRequest":3147.176795,
"medianRequestTime":3147.1767950000003,
"75thPctlRequestTime":3387.162793,
"95thPctlRequestTime":3387.162793,
"99thPctlRequestTime":3387.162793,
"999thPctlRequestTime":3387.162793},
"deleteshard",{
"requests":1,
"errors":0,
"totalTime":320.071335,
"avgRequestsPerMinute":0.4637771550349566,
"5minRateRequestsPerMinute":8.179027994326175,
"15minRateRequestsPerMinute":10.560587086130052,
"avgTimePerRequest":320.071335,
"medianRequestTime":320.071335,
"75thPctlRequestTime":320.071335,
"95thPctlRequestTime":320.071335,
"99thPctlRequestTime":320.071335,
"999thPctlRequestTime":320.071335}],
"overseer_queue":[
"peek_wait100",{
"totalTime":2775.554755,
"avgRequestsPerMinute":12.440395120289685,
"5minRateRequestsPerMinute":18.487470843855192,
"15minRateRequestsPerMinute":22.052847430688917,
"avgTimePerRequest":69.388868875,
"medianRequestTime":101.1499165,
"75thPctlRequestTime":101.43390225,
"95thPctlRequestTime":101.9976678,
"99thPctlRequestTime":102.037032,
"999thPctlRequestTime":102.037032},
"peek_wait_forever",{
"totalTime":63247.861899,
"avgRequestsPerMinute":11.64420509572364,
"5minRateRequestsPerMinute":31.572546097788198,
"15minRateRequestsPerMinute":41.688934561096204,
"avgTimePerRequest":1664.4174183947368,
"medianRequestTime":636.5281970000001,
"75thPctlRequestTime":1629.3317682499999,
"95thPctlRequestTime":13220.58495709999,
"99thPctlRequestTime":16293.17735,
"999thPctlRequestTime":16293.17735},
"remove",{
"totalTime":92.528385,
"avgRequestsPerMinute":15.979782864505227,
"5minRateRequestsPerMinute":33.37988956147563,
"15minRateRequestsPerMinute":42.49548598991928,
"avgTimePerRequest":1.7793920192307693,
"medianRequestTime":1.769479,
"75thPctlRequestTime":2.22114175,
"95thPctlRequestTime":3.148778999999998,
"99thPctlRequestTime":4.393077,
"999thPctlRequestTime":4.393077},
"poll",{
"totalTime":94.686248,
"avgRequestsPerMinute":15.97973186166097,
"5minRateRequestsPerMinute":33.37988956147563,
"15minRateRequestsPerMinute":42.49548598991928,
"avgTimePerRequest":1.8208893846153844,
"medianRequestTime":1.819817,
"75thPctlRequestTime":2.266558,
"95thPctlRequestTime":3.2130298999999978,
"99thPctlRequestTime":4.433906,
"999thPctlRequestTime":4.433906}],
"overseer_internal_queue":[
"peek",{
"totalTime":0.516668,
"avgRequestsPerMinute":0.30642572162118586,
"5minRateRequestsPerMinute":6.696421749240565,
"15minRateRequestsPerMinute":9.879502985109362,
"avgTimePerRequest":0.516668,
"medianRequestTime":0.516668,
"75thPctlRequestTime":0.516668,
"95thPctlRequestTime":0.516668,
"99thPctlRequestTime":0.516668,
"999thPctlRequestTime":0.516668},
"offer",{
"totalTime":51.784521,
"avgRequestsPerMinute":15.979724576198302,
"5minRateRequestsPerMinute":33.37988956147563,
"15minRateRequestsPerMinute":42.49548598991928,
"avgTimePerRequest":0.9958561730769231,
"medianRequestTime":0.8628875,
"75thPctlRequestTime":1.1464622500000001,
"95thPctlRequestTime":1.6499188,
"99thPctlRequestTime":6.091519,
"999thPctlRequestTime":6.091519},
"remove",{
"totalTime":143.130248,
"avgRequestsPerMinute":27.6584163855513,
"5minRateRequestsPerMinute":64.95243565926378,
"15minRateRequestsPerMinute":84.18442055101546,
"avgTimePerRequest":1.5903360888888889,
"medianRequestTime":1.660893,
"75thPctlRequestTime":2.35234925,
"95thPctlRequestTime":3.19950245,
"99thPctlRequestTime":5.01803,
"999thPctlRequestTime":5.01803},
"poll",{
"totalTime":147.837065,
"avgRequestsPerMinute":27.65837219382363,
"5minRateRequestsPerMinute":64.95243565926378,
"15minRateRequestsPerMinute":84.18442055101546,
"avgTimePerRequest":1.6426340555555554,
"medianRequestTime":1.6923249999999999,
"75thPctlRequestTime":2.40090275,
"95thPctlRequestTime":3.2569366,
"99thPctlRequestTime":5.062005,
"999thPctlRequestTime":5.062005}],
"collection_queue":[
"remove_event",{
"totalTime":37.638197,
"avgRequestsPerMinute":3.9610733603305124,
"5minRateRequestsPerMinute":9.122591857306068,
"15minRateRequestsPerMinute":10.928990808126446,
"avgTimePerRequest":3.421654272727273,
"medianRequestTime":3.411283,
"75thPctlRequestTime":4.212892,
"95thPctlRequestTime":4.720874,
"99thPctlRequestTime":4.720874,
"999thPctlRequestTime":4.720874},
"peek_wait_forever",{
"totalTime":183048.91735,
"avgRequestsPerMinute":3.677073912023291,
"5minRateRequestsPerMinute":1.5867138429776346,
"15minRateRequestsPerMinute":0.6561136902644256,
"avgTimePerRequest":15254.076445833334,
"medianRequestTime":6745.20675,
"75thPctlRequestTime":27662.958113499997,
"95thPctlRequestTime":49871.380589,
"99thPctlRequestTime":49871.380589,
"999thPctlRequestTime":49871.380589}]}
Cluster Status
/admin/collections?action=CLUSTERSTATUS
Fetch the cluster status including collections, shards, replicas as well as collection aliases and cluster properties.
Input
Query Parameters
Key Type Required Description
collection string No The collection name for which information is requested. If omitted, information
on all collections in the cluster will be returned.
shard string No The shard(s) for which information is requested. Multiple shard names can be
specified as a comma separated list.
Output
Output Content
The response will include the status of the request and the cluster status.
Examples
Input
http://localhost:8983/solr/admin/collections?action=clusterstatus&wt=json
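The collection and shard parameters described above can also be used to restrict the response; for example, a request along these lines (the names are illustrative) would return status for two shards of a single collection only:
http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=collection1&shard=shard1,shard2&wt=json
The output below is for the unrestricted request: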
Output
{
"responseHeader":{
"status":0,
"QTime":333},
"cluster":{
"collections":{
"collection1":{
"shards":{
"shard1":{
"range":"80000000-ffffffff",
"state":"active",
"replicas":{
"core_node1":{
"state":"active",
"core":"collection1",
"node_name":"127.0.1.1:8983_solr",
"base_url":"http://127.0.1.1:8983/solr",
"leader":"true"},
"core_node3":{
"state":"active",
"core":"collection1",
"node_name":"127.0.1.1:8900_solr",
"base_url":"http://127.0.1.1:8900/solr"}}},
"shard2":{
"range":"0-7fffffff",
"state":"active",
"replicas":{
"core_node2":{
"state":"active",
"core":"collection1",
"node_name":"127.0.1.1:7574_solr",
"base_url":"http://127.0.1.1:7574/solr",
"leader":"true"},
"core_node4":{
"state":"active",
"core":"collection1",
"node_name":"127.0.1.1:7500_solr",
"base_url":"http://127.0.1.1:7500/solr"}}}},
"maxShardsPerNode":"1",
"router":{"name":"compositeId"},
"replicationFactor":"1",
"autoCreated":"true",
"aliases":["both_collections"]},
"collection2":{
"shards":{
"shard1":{
"range":"80000000-d554ffff",
"state":"active",
"replicas":{"core_node1":{
"state":"active",
"core":"collection2_shard1_replica1",
"node_name":"127.0.1.1:8983_solr",
"base_url":"http://127.0.1.1:8983/solr",
"leader":"true"}}},
"shard2":{
"range":"d5550000-2aa9ffff",
"state":"active",
"replicas":{"core_node2":{
"state":"active",
"core":"collection2_shard2_replica1",
"node_name":"127.0.1.1:7500_solr",
"base_url":"http://127.0.1.1:7500/solr",
"leader":"true"}}},
"shard3":{
"range":"2aaa0000-7fffffff",
"state":"active",
"replicas":{"core_node3":{
"state":"active",
"core":"collection2_shard3_replica1",
"node_name":"127.0.1.1:8900_solr",
"base_url":"http://127.0.1.1:8900/solr",
"leader":"true"}}}},
"maxShardsPerNode":"1",
"router":{"name":"compositeId"},
"replicationFactor":"1",
"autoAddReplicas":"false",
"aliases":["both_collections"]}},
"aliases":{"both_collections":"collection1,collection2"},
"roles":{"overseer":["127.0.1.1:8983_solr",
"127.0.1.1:7574_solr"]},
"live_nodes":["127.0.1.1:7574_solr",
"127.0.1.1:7500_solr",
"127.0.1.1:8983_solr",
"127.0.1.1:8900_solr"]}}
Request Status
/admin/collections?action=REQUESTSTATUS&requestid=request-id
Request the status of an already submitted Asynchronous Collection API call. This call is also used to clear up
the stored statuses (see below).
Input
Query Parameters
Key Type Required Description
requestid string Yes The user-defined request-id for the request. This can be used to track the
status of the submitted asynchronous task. -1 is a special request id which is
used to clean up the stored states for all of the already completed/failed tasks.
Examples
Input: Valid Request Status
http://localhost:8983/solr/admin/collections?action=REQUESTSTATUS&requestid=1000
Output
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
</lst>
<lst name="status">
<str name="state">completed</str>
<str name="msg">found 1000 in completed tasks</str>
</lst>
</response>
Input: Invalid RequestId
http://localhost:8983/solr/admin/collections?action=REQUESTSTATUS&requestid=1004
Output
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
</lst>
<lst name="status">
<str name="state">notfound</str>
<str name="msg">Did not find taskid [1004] in any tasks queue</str>
</lst>
</response>
Input: Clearing up all the stored statuses
http://localhost:8983/solr/admin/collections?action=REQUESTSTATUS&requestid=-1
List Collections
/admin/collections?action=LIST
Fetch the names of the collections in the cluster.
Example
Input
http://localhost:8983/solr/admin/collections?action=LIST&wt=json
Output
{
"responseHeader":{
"status":0,
"QTime":2011},
"collections":["collection1",
"example1",
"example2"]}
Add Replica Property
/admin/collections?action=ADDREPLICAPROP&collection=collectionName&shard=shardName&replica=replicaName&property=propertyName&property.value=value
Assign an arbitrary property to a particular replica and give it the value specified. If the property already exists, it
will be overwritten with the new value.
Input
Query Parameters
Key Type Required Description
collection string Yes The name of the collection this replica belongs to.
shard string Yes The name of the shard the replica belongs to.
replica string Yes The replica, e.g. core_node1.
property (1) string Yes The property to add. Note: this will have the literal 'property.'
prepended to distinguish it from system-maintained properties. So
these two forms are equivalent:
property=special
and
property=property.special
property.value string Yes The value to assign to the property.
shardUnique
(1)
Boolean No default: false. If true, then setting this property in one replica will
remove the property from all other replicas in that shard.
(1) There is one pre-defined property, "preferredLeader", for which shardUnique is forced to 'true' and an error
is returned if shardUnique is explicitly set to 'false'. preferredLeader is a boolean property; any value assigned that
is not equal (case insensitive) to 'true' will be interpreted as 'false' for preferredLeader.
Output
Output Content
The response will include the status of the request. If the status is anything other than "0", an error message will
explain why the request failed.
Examples
Input: This command would set the preferredLeader (property.preferredLeader) to true on core_node1, and
remove that property from any other replica in the shard.
http://localhost:8983/solr/admin/collections?action=ADDREPLICAPROP&shard=shard1&collection=collection1&replica=core_node1&property=preferredLeader&property.value=true
Output
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">46</int>
</lst>
</response>
Input: This pair of commands will set the "testprop" (property.testprop) to 'value1' and 'value2' respectively for
two nodes in the same shard.
http://localhost:8983/solr/admin/collections?action=ADDREPLICAPROP&shard=shard1&collection=collection1&replica=core_node1&property=testprop&property.value=value1
http://localhost:8983/solr/admin/collections?action=ADDREPLICAPROP&shard=shard1&collection=collection1&replica=core_node3&property=property.testprop&property.value=value2
Input: This pair of commands would result in core_node_3 having the testprop (property.testprop) value set
because the second command specifies shardUnique=true, which would cause the property to be removed from
core_node_1.
http://localhost:8983/solr/admin/collections?action=ADDREPLICAPROP&shard=shard1&collection=collection1&replica=core_node1&property=testprop&property.value=value1
http://localhost:8983/solr/admin/collections?action=ADDREPLICAPROP&shard=shard1&collection=collection1&replica=core_node3&property=testprop&property.value=value2&shardUnique=true
Delete Replica Property
/admin/collections?action=DELETEREPLICAPROP&collection=collectionName&shard=shardName&replica=replicaName&property=propertyName
Deletes an arbitrary property from a particular replica.
Input
Query Parameters
Key Type Required Description
collection string Yes The name of the collection this replica belongs to
shard string Yes The name of the shard the replica belongs to.
replica string Yes The replica, e.g. core_node1.
property string Yes The property to delete. Note: this will have the literal 'property.' prepended to
distinguish it from system-maintained properties. So these two forms are
equivalent:
property=special
and
property=property.special
Output
Output Content
The response will include the status of the request. If the status is anything other than "0", an error message will
explain why the request failed.
Examples
Input: This command would delete the preferredLeader (property.preferredLeader) from core_node1.
http://localhost:8983/solr/admin/collections?action=DELETEREPLICAPROP&shard=shard1&collection=collection1&replica=core_node1&property=preferredLeader
Output:
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">9</int>
</lst>
</response>
Balance a Property
/admin/collections?action=BALANCESHARDUNIQUE&collection=collectionName&property=propertyName
Ensures that a particular property is distributed evenly amongst the physical nodes that make up a collection. If
the property already exists on a replica, every effort is made to leave it there. If the property is not on any replica
on a shard, one is chosen and the property is added.
Input
Query Parameters
Key Type Required Description
collection string Yes The name of the collection to balance the property in.
property string Yes The property to balance. The literal "property." is prepended to this
property if not specified explicitly.
onlyactivenodes boolean No Defaults to true. Normally, the property is instantiated on active
nodes only. If this parameter is specified as "false", then inactive
nodes are also included for distribution.
shardUnique boolean No Something of a safety valve. There is one pre-defined property
(preferredLeader) that defaults this value to "true". For all other
properties that are balanced, this must be set to "true" or an error
message is returned.
Output
Output Content
The response will include the status of the request. If the status is anything other than "0", an error message will
explain why the request failed.
Examples
Input: Either of these commands would put the "preferredLeader" property on one replica in every shard in the
"collection1" collection.
http://localhost:8983/solr/admin/collections?action=BALANCESHARDUNIQUE&collection=collection1&property=preferredLeader
http://localhost:8983/solr/admin/collections?action=BALANCESHARDUNIQUE&collection=collection1&property=property.preferredLeader
Output:
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">9</int>
</lst>
</response>
Examining the clusterstate after issuing this call should show exactly one replica in each shard that has this
property.
Rebalance Leaders
Reassign leaders in a collection according to the preferredLeader property across active nodes.
/admin/collections?action=REBALANCELEADERS&collection=collectionName
Assigns leaders in a collection according to the preferredLeader property on active nodes. This command should
be run after the preferredLeader property has been assigned via the BALANCESHARDUNIQUE or
ADDREPLICAPROP commands. NOTE: it is not required that all shards in a collection have a preferredLeader
property. Rebalancing will only attempt to reassign leadership to those replicas that have the preferredLeader
property set to "true", are not currently the shard leader, and are currently active.
Input
Query Parameters
Key Type Required Description
collection string Yes The name of the collection to rebalance preferredLeaders on.
maxAtOnce string No The maximum number of reassignments to have queued up at once.
Values <= 0 use the default value Integer.MAX_VALUE. When
this number is reached, the process waits for one or more leaders to
be successfully assigned before adding more to the queue.
maxWaitSeconds string No Defaults to 60. This is the timeout value when waiting for leaders to
be reassigned. NOTE: if maxAtOnce is less than the number of
reassignments that will take place, this is the maximum interval for
any single wait for at least one reassignment. For example, if 10
reassignments are to take place and maxAtOnce is 1 and
maxWaitSeconds is 60, the upper bound on the time that the
command may wait is 10 minutes.
Output
Output Content
The response will include the status of the request. If the status is anything other than "0", an error message will
explain why the request failed.
Examples
Input: Either of these commands would cause all the active replicas that had the "preferredLeader" property set
and were not already the preferred leader to become leaders.
http://localhost:8983/solr/admin/collections?action=REBALANCELEADERS&collection=collection1
http://localhost:8983/solr/admin/collections?action=REBALANCELEADERS&collection=collection1&maxAtOnce=5&maxWaitSeconds=30
Output: In this example, two replicas in the "alreadyLeaders" section already had the leader assigned to the
same node as the preferredLeader property so no action was taken. The replica in the "inactivePreferreds"
section had the preferredLeader property set but the node was down and no action was taken. The three nodes
in the "successes" section were made leaders because they had the preferredLeader property set but were not
leaders and they were active.
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">123</int>
</lst>
<lst name="alreadyLeaders">
<lst name="core_node1">
<str name="status">success</str>
<str name="msg">Already leader</str>
<str name="nodeName">192.168.1.167:7400_solr</str>
</lst>
<lst name="core_node17">
<str name="status">success</str>
<str name="msg">Already leader</str>
<str name="nodeName">192.168.1.167:7600_solr</str>
</lst>
</lst>
<lst name="inactivePreferreds">
<lst name="core_node4">
<str name="status">skipped</str>
<str name="msg">Node is a referredLeader, but it's inactive. Skipping</str>
<str name="nodeName">192.168.1.167:7500_solr</str>
</lst>
</lst>
<lst name="successes">
<lst name="_collection1_shard3_replica1">
<str name="status">success</str>
<str name="msg">
Assigned 'Collection: 'collection1', Shard: 'shard3', Core:
'collection1_shard3_replica1', BaseUrl:
'http://192.168.1.167:8983/solr'' to be leader
</str>
</lst>
<lst name="_collection1_shard5_replica3">
<str name="status">success</str>
<str name="msg">
Assigned 'Collection: 'collection1', Shard: 'shard5', Core:
'collection1_shard5_replica3', BaseUrl:
'http://192.168.1.167:7200/solr'' to be leader
</str>
</lst>
<lst name="_collection1_shard4_replica2">
<str name="status">success</str>
<str name="msg">
Assigned 'Collection: 'collection1', Shard: 'shard4', Core:
'collection1_shard4_replica2', BaseUrl:
'http://192.168.1.167:7300/solr'' to be leader
</str>
</lst>
</lst>
</response>
Examining the clusterstate after issuing this call should show that every live node that has the "preferredLeader"
property should also have the "leader" property set to true.
Asynchronous Calls
Since some collection API calls can be long-running tasks, e.g. Shard Split, you can optionally have the calls run
asynchronously. Specifying async=<request-id> enables you to make an asynchronous call, the status of which
can be requested at any point using the REQUESTSTATUS call.
As of now, REQUESTSTATUS does not automatically clean up the tracking data structures, i.e. the status of
completed/failed tasks stays stored in ZooKeeper unless cleared manually. Sending a REQUESTSTATUS call
with a requestid of -1 clears the stored statuses.
Example
Input
http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=collection1&shard=shard1&async=1000
Output
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">99</int>
</lst>
<str name="requestid">1000</str>
</response>
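The progress of this call could then be polled with the REQUESTSTATUS action described earlier, using the same request id:
http://localhost:8983/solr/admin/collections?action=REQUESTSTATUS&requestid=1000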
Parameter Reference
Cluster Parameters
numShards Defaults
to 1
The number of shards to hash documents to. There must be one leader per shard
and each leader can have N replicas.
SolrCloud Instance Parameters
These are set in solr.xml, but by default the host and hostContext parameters are set up to also work with
system properties.
host Defaults to
the first local
host address
found
If the wrong host address is found automatically, you can override the host
address with this parameter.
hostPort Defaults to the port specified via bin/solr -p <port>, or 8983 if not specified. The port that Solr is running
on. This value is only used when -DzkRun is specified without a value (see below), to calculate the default port on
which embedded ZooKeeper will run. In the solr.xml shipped with Solr, the hostPort system property is not
referenced, and so is ignored. If you want to run Solr on a non-default port, use bin/solr -p <port> rather than
specifying -DhostPort.
hostContext Defaults to solr. The context path for the Solr web application.
SolrCloud Instance ZooKeeper Parameters
zkRun Defaults to localhost:<hostPort+1000>. Causes Solr to run an embedded version of
ZooKeeper. Set to the address of ZooKeeper on this node; this allows us to know who you are in
the list of addresses in the zkHost connect string. Use -DzkRun (with no value) to get the default
value.
zkHost No default The host address for ZooKeeper. Usually this is a
comma-separated list of addresses to each node
in your ZooKeeper ensemble.
zkClientTimeout Defaults to 15000 The time a client is allowed to not talk to
ZooKeeper before its session expires.
zkRun and zkHost are set up using system properties. zkClientTimeout is set up in solr.xml by default,
but can also be set using a system property.
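A hedged sketch of how these settings typically appear in the <solrcloud> section of solr.xml (the entries shown and the default values are illustrative and may differ from the file shipped with your Solr version):
<solrcloud>
  <str name="host">${host:}</str>
  <int name="hostPort">${jetty.port:8983}</int>
  <str name="hostContext">${hostContext:solr}</str>
  <int name="zkClientTimeout">${zkClientTimeout:15000}</int>
</solrcloud>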
SolrCloud Core Parameters
shardId Defaults to being automatically assigned based
on numShards
Allows you to specify the id used to group
cores into shards.
shardId can be configured in solr.xml for each core element as an attribute.
Additional cloud-related parameters are discussed in Solr Cores and solr.xml.
Command Line Utilities
Solr's Administration page (found by default at http://hostname:8983/solr/) provides a section with
menu items for monitoring indexing and performance statistics, information about index distribution and
replication, and information on all threads running in the JVM at the time. There is also a section where you can
run queries, and an assistance area.
In addition, SolrCloud provides its own administration page (found at http://localhost:8983/solr/#/~cloud), as well
as a few tools available via a ZooKeeper Command Line Utility (CLI). The CLI scripts found in
server/scripts/cloud-scripts let you upload configuration information to ZooKeeper, in the same two ways that were
shown in the examples in Parameter Reference. They also provide a few other commands that let you link
configuration sets to collections, make ZooKeeper paths or clear them, and download configurations from
ZooKeeper to the local filesystem.
Using Solr's ZooKeeper CLI
Both zkcli.sh (for Unix environments) and zkcli.bat (for Windows environments) support the following
command line options:
Short Parameter
Usage
Meaning
-cmd <arg> CLI command to be executed: bootstrap, upconfig, downconfig, linkconfig, makepath, get, getfile,
put, putfile, list, or clear.
This parameter is mandatory.
-z -zkhost <locations> ZooKeeper host address.
This parameter is mandatory for all CLI commands.
-c -collection <name> For linkconfig: name of the collection.
-d -confdir <path> For upconfig: a directory of configuration files.
-h -help Display help text.
-n -confname <arg> For upconfig, linkconfig: name of the configuration set.
-r -runzk <port> Run ZooKeeper internally by passing the Solr run port; only for clusters on one machine.
-s -solrhome <path> For bootstrap or when using -runzk: the mandatory solrhome location.
The short form parameter options may be specified with a single dash (eg: -c mycollection).
The long form parameter options may be specified using either a single dash (eg: -collection mycollection)
or a double dash (eg: --collection mycollection).
ZooKeeper CLI Examples
Below are some examples of using the zkcli.sh CLI which assume you have already started the SolrCloud
example (bin/solr -e cloud -noprompt).
If you are on a Windows machine, simply replace zkcli.sh with zkcli.bat in these examples.
Solr's zkcli.sh vs ZooKeeper's zkCli.sh
The zkcli.sh provided by Solr is not the same as the zkCli.sh included in ZooKeeper distributions.
ZooKeeper's zkCli.sh provides a completely general, application-agnostic shell for manipulating data
in ZooKeeper. Solr's zkcli.sh – discussed in this section – is specific to Solr, and has command line
arguments specific to dealing with Solr data in ZooKeeper.
Upload a configuration directory
./server/scripts/cloud-scripts/zkcli.sh -zkhost 127.0.0.1:9983 \
-cmd upconfig -confname my_new_config -confdir
server/solr/configsets/basic_configs/conf
Put arbitrary data into a new ZooKeeper file
./server/scripts/cloud-scripts/zkcli.sh -zkhost 127.0.0.1:9983 \
-cmd put /my_zk_file.txt 'some data'
Put a local file into a new ZooKeeper file
./server/scripts/cloud-scripts/zkcli.sh -zkhost 127.0.0.1:9983 \
-cmd putfile /my_zk_file.txt /tmp/my_local_file.txt
Link a collection to a configuration set
./server/scripts/cloud-scripts/zkcli.sh -zkhost 127.0.0.1:9983 \
-cmd linkconfig -collection gettingstarted -confname my_new_config
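Download a configuration directory from ZooKeeper
A hedged sketch of the reverse operation using the downconfig command listed in the options table above, assuming it accepts the same -confname and -confdir options as upconfig (the local target directory /tmp/my_config_copy is illustrative):
./server/scripts/cloud-scripts/zkcli.sh -zkhost 127.0.0.1:9983 \
   -cmd downconfig -confname my_new_config -confdir /tmp/my_config_copy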
SolrCloud with Legacy Configuration Files
All of the required configuration is already set up in the sample configurations shipped with Solr. You only need
to add the following if you are migrating old configuration files. Do not remove these files and parameters from a
new Solr instance if you intend to use Solr in SolrCloud mode.
These properties exist in 3 files: , , and .schema.xml solrconfig.xml solr.xml
1. In schema.xml, you must have a _version_ field defined:
<field name="_version_" type="long" indexed="true" stored="true"
multiValued="false"/>
2. In solrconfig.xml, you must have an UpdateLog defined. This should be defined in the updateHandler
section.
<updateHandler>
...
<updateLog>
<str name="dir">${solr.data.dir:}</str>
</updateLog>
...
</updateHandler>
3. The DistributedUpdateProcessor is part of the default update chain and is automatically injected into any of
your custom update chains, so you don't actually need to make any changes for this capability. However, should
you wish to add it explicitly, you can still add it to the solrconfig.xml file as part of an
updateRequestProcessorChain. For example:
<updateRequestProcessorChain name="sample">
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.DistributedUpdateProcessorFactory"/>
<processor class="my.package.UpdateFactory"/>
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
If you do not want the DistributedUpdateProcessorFactory auto-injected into your chain (for example, if you want to
use SolrCloud functionality, but you want to distribute updates yourself), then specify the
NoOpDistributingUpdateProcessorFactory update processor factory in your chain:
<updateRequestProcessorChain name="sample">
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.NoOpDistributingUpdateProcessorFactory"/>
<processor class="my.package.MyDistributedUpdateFactory"/>
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
In the update process, Solr skips updating processors that have already been run on other nodes.
Legacy Scaling and Distribution
This section describes how to set up distribution and replication in Solr. It is considered "legacy" behavior, since
while it is still supported in Solr, the SolrCloud functionality described in the previous chapter is where the current
development is headed. However, if you don't need all that SolrCloud delivers, search distribution and index
replication may be sufficient.
This section covers the following topics:
Introduction to Scaling and Distribution: Conceptual information about distribution and replication in Solr.
Distributed Search with Index Sharding: Detailed information about implementing distributed searching in Solr.
Index Replication: Detailed information about replicating your Solr indexes.
Combining Distribution and Replication: Detailed information about replicating shards in a distributed index.
Merging Indexes: Information about combining separate indexes in Solr.
Introduction to Scaling and Distribution
Both Lucene and Solr were designed to scale to support large implementations with minimal custom coding. This
section covers:
distributing an index across multiple servers
replicating an index on multiple servers
merging indexes
If you need full-scale distribution of indexes and queries, as well as replication, load balancing and failover, you
may want to use SolrCloud. Full details on configuring and using SolrCloud are available in the section SolrCloud.
What Problem Does Distribution Solve?
If searches are taking too long or the index is approaching the physical limitations of its machine, you should
consider distributing the index across two or more Solr servers.
To distribute an index, you divide the index into partitions called shards, each of which runs on a separate
machine. Solr then partitions searches into sub-searches, which run on the individual shards, reporting results
collectively. The architectural details underlying index sharding are invisible to end users, who simply experience
faster performance on queries against very large indexes.
What Problem Does Replication Solve?
Replicating an index is useful when:
You have a large search volume which one machine cannot handle, so you need to distribute searches
across multiple read-only copies of the index.
There is a high volume/high rate of indexing which consumes machine resources and reduces search
performance on the indexing machine, so you need to separate indexing and searching.
You want to make a backup of the index (see Backing Up).
Distributed Search with Index Sharding
It is highly recommended that you use SolrCloud when needing to scale up or scale out. The setup described
below is legacy and was used prior to the existence of SolrCloud. SolrCloud provides for a truly distributed set of
features with support for things like automatic routing, leader election, optimistic concurrency and other sanity
checks that are expected out of a distributed system.
Everything on this page is specific to legacy setup of distributed search. Users trying out SolrCloud should not
follow any of the steps or information below.
Update reorders can occur (i.e., replica A may see update X then Y, and replica B may see update Y then X).
deleteByQuery also handles reorders the same way, to ensure replicas are consistent. All replicas of a shard are
consistent, even if the updates arrive in a different order on different replicas.
Distributing Documents across Shards
When not using SolrCloud, it is up to you to get all your documents indexed on each shard of your server farm.
Solr supports distributed indexing (routing) in its true form only in SolrCloud mode.
In the legacy distributed mode, Solr does not calculate universal term/doc frequencies. For most large-scale
implementations, it is not likely to matter that Solr calculates TF/IDF at the shard level. However, if your
collection is heavily skewed in its distribution across servers, you may find misleading relevancy results in your
searches. In general, it is probably best to randomly distribute documents to your shards.
Executing Distributed Searches with the shards Parameter
If a query request includes the shards parameter, the Solr server distributes the request across all the shards
listed as arguments to the parameter. The shards parameter uses this syntax:
host : port / base_url [, host : port / base_url ]*
For example, the shards parameter below causes the search to be distributed across two Solr servers: solr1 and
solr2, both of which are running on port 8983:
http://localhost:8983/solr/core1/select?shards=solr1:8983/solr/core1,solr2:8983/solr/core1&indent=true&q=ipod+solr
Rather than require users to include the shards parameter explicitly, it is usually preferred to configure this
parameter as a default in the RequestHandler section of solrconfig.xml.
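A hedged sketch of what such a default might look like in solrconfig.xml (the handler name /distrib and the shard host list are illustrative; per the caution below, this is defined as a separate handler rather than added to the standard one):
<requestHandler name="/distrib" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="shards">solr1:8983/solr/core1,solr2:8983/solr/core1</str>
  </lst>
</requestHandler>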
Currently, only query requests are distributed. This includes requests to the standard request handler (and
subclasses such as the DisMax RequestHandler), and any other handler
(org.apache.solr.handler.component.SearchHandler) using standard components that support distributed search.
As in SolrCloud mode, when shards.info=true, distributed responses will include information about the
shard (where each shard represents a logically different index or physical location).
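For example, adding shards.info=true to the distributed request shown above would return per-shard information alongside the results:
http://localhost:8983/solr/core1/select?shards=solr1:8983/solr/core1,solr2:8983/solr/core1&shards.info=true&q=ipod+solr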
The following components support distributed search:
The Query component, which returns documents matching a query.
The Facet component, which processes facet.query and facet.field requests where facets are sorted by
count (the default).
Do not add the shards parameter to the standard requestHandler; otherwise, search queries may enter
an infinite loop. Instead, define a new requestHandler that uses the shards parameter, and pass
distributed search requests to that handler.
The Highlighting component, which enables Solr to include "highlighted" matches in field values.
The Stats component, which returns simple statistics for numeric fields within the DocSet.
The Debug component, which helps with debugging.
Limitations to Distributed Search
Distributed searching in Solr has the following limitations:
Each document indexed must have a unique key.
If Solr discovers duplicate document IDs, Solr selects the first document and discards subsequent ones.
The index for distributed searching may become momentarily out of sync if a commit happens between
the first and second phase of the distributed search. This might cause a situation where a document that
once matched a query and was subsequently changed may no longer match the query but will still be
retrieved. This situation is expected to be quite rare, however, and is only possible for a single query
request.
The number of shards is limited by the number of characters allowed for the GET method's URI; most Web
servers generally support at least 4000 characters, but many servers limit URI length to reduce their
vulnerability to Denial of Service (DoS) attacks.
TF/IDF computations are per shard. This may not matter if content is well (randomly) distributed.
Shard information can be returned with each document in a distributed search by including fl=id,[shard]
in the search request. This returns the shard URL.
In a distributed search, the data directory from the core descriptor overrides any data directory in
solrconfig.xml.
Update commands may be sent to any server with distributed indexing configured correctly. Document
adds and deletes are forwarded to the appropriate server/shard based on a hash of the unique document
id. commit commands and deleteByQuery commands are sent to every server in shards.
Avoiding Distributed Deadlock
Like in SolrCloud mode, inter-shard requests could lead to a distributed deadlock. It can be avoided by following
the instructions here.
Testing Index Sharding on Two Local Servers
For simple functionality testing, it's easiest to just set up two local Solr servers on different ports. (In a production
environment, of course, these servers would be deployed on separate machines.)
1. Make two Solr home directories:
mkdir example/nodes
mkdir example/nodes/node1
# Copy solr.xml into this solr.home
cp server/solr/solr.xml example/nodes/node1/.
# Repeat the above steps for the second node
mkdir example/nodes/node2
cp server/solr/solr.xml example/nodes/node2/.
2. Start the two Solr instances:
# Start first node on port 8983
bin/solr start -s example/nodes/node1 -p 8983
# Start second node on port 8984
bin/solr start -s example/nodes/node2 -p 8984
3. Create a core on both the nodes with the sample_techproducts_configs:
bin/solr create_core -c core1 -p 8983 -d sample_techproducts_configs
# Create a core on the Solr node running on port 8984
bin/solr create_core -c core1 -p 8984 -d sample_techproducts_configs
4. In the third window, index an example document to each of the servers:
bin/post -c core1 example/exampledocs/monitor.xml -port 8983
bin/post -c core1 example/exampledocs/monitor2.xml -port 8984
5. Search on the node on port 8983:
curl "http://localhost:8983/solr/core1/select?q=*:*&wt=xml&indent=true"
This should bring back one document.
Search on the node on port 8984:
curl "http://localhost:8984/solr/core1/select?q=*:*&wt=xml&indent=true"
This should also bring back a single document.
Now do a distributed search across both servers with your browser or curl. In the example below, an
extra parameter 'fl' is passed to restrict the returned fields to id and name.
curl "http://localhost:8983/solr/core1/select?q=*:*&indent=true&shards=localhost:8983/solr/core1,localhost:8984/solr/core1&fl=id,name"
This should return both documents, as shown below:
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">8</int>
<lst name="params">
<str name="q">*:*</str>
<str
name="shards">localhost:8983/solr/core1,localhost:8984/solr/core1</str>
<str name="indent">true</str>
<str name="fl">id,name</str>
<str name="wt">xml</str>
</lst>
</lst>
<result name="response" numFound="2" start="0" maxScore="1.0">
<doc>
<str name="id">3007WFP</str>
<str name="name">Dell Widescreen UltraSharp 3007WFP</str>
</doc>
<doc>
<str name="id">VA902B</str>
<str name="name">ViewSonic VA902B - flat panel display - TFT - 19"</str>
</doc>
</result>
</response>
Index Replication
Index Replication distributes complete copies of a master index to one or more slave servers. The master server
continues to manage updates to the index. All querying is handled by the slaves. This division of labor enables
Solr to scale to provide adequate responsiveness to queries against large search volumes.
The figure below shows a Solr configuration using index replication. The master server's index is replicated on
the slaves.
A Solr index can be replicated across multiple slave servers, which then process requests.
Topics covered in this section:
Index Replication in Solr
Replication Terminology
Configuring the ReplicationHandler
Setting Up a Repeater with the ReplicationHandler
Commit and Optimize Operations
Slave Replication
HTTP API Commands for the ReplicationHandler
Distribution and Optimization
Index Replication in Solr
Solr includes a Java implementation of index replication that works over HTTP:
The configuration affecting replication is controlled by a single file, solrconfig.xml
Supports the replication of configuration files as well as index files
Works across platforms with the same configuration
No reliance on OS-dependent file system features (eg: hard links)
Tightly integrated with Solr; an admin page offers fine-grained control of each aspect of replication
The Java-based replication feature is implemented as a request handler. Configuring replication is
therefore similar to any normal request handler.
Replication Terminology
The table below defines the key terms associated with Solr replication.
Term Definition
Index A Lucene index is a directory of files. These files make up the searchable and returnable data
of a Solr Core.
Distribution The copying of an index from the master server to all slaves. The distribution process takes
advantage of Lucene's index file structure.
Inserts and
Deletes
As inserts and deletes occur in the index, the directory remains unchanged. Documents are
always inserted into newly created files. Documents that are deleted are not removed from the
files. They are flagged in the file, deletable, and are not removed from the files until the index is
optimized.
Replication In SolrCloud
Although there is no explicit concept of "master/slave" nodes in a SolrCloud cluster, the ReplicationHandler
discussed on this page is still used by SolrCloud as needed to support "shard recovery" – but
this is done in a peer-to-peer manner. When using SolrCloud, the ReplicationHandler must be
available via the /replication path. Solr does this implicitly unless overridden explicitly in your
solrconfig.xml, but if you wish to override the default behavior, make certain that you do not explicitly set
any of the "master" or "slave" configuration options mentioned below, or they will interfere with normal
SolrCloud operation.
Master and
Slave
A Solr replication master is a single node which receives all updates initially and keeps
everything organized. Solr replication slave nodes receive no updates directly, instead all
changes (such as inserts, updates, deletes, etc.) are made against the single master node.
Changes made on the master are distributed to all the slave nodes which service all query
requests from the clients.
Update An update is a single change request against a single Solr instance. It may be a request to
delete a document, add a new document, change a document, delete all documents matching
a query, etc. Updates are handled synchronously within an individual Solr instance.
Optimization A process that compacts the index and merges segments in order to improve query
performance. Optimization should only be run on the master nodes. An optimized index may
give query performance gains compared to an index that has become fragmented over a
period of time with many updates. Distributing an optimized index requires a much longer time
than the distribution of new segments to an un-optimized index.
Segments A self contained subset of an index consisting of some documents and data structures related
to the inverted index of terms in those documents.
mergeFactor A parameter that controls the number of segments in an index. For example, when
mergeFactor is set to 3, Solr will fill one segment with documents until the limit
maxBufferedDocs is met, then it will start a new segment. When the number of segments
specified by mergeFactor is reached (in this example, 3) then Solr will merge all the segments
into a single index file, then begin writing new documents to a new segment.
Snapshot A directory containing hard links to the data files of an index. Snapshots are distributed from
the master nodes when the slaves pull them, "smart copying" any segments the slave node
does not already have. The snapshot directory contains hard links to the most recent index data
files.
Configuring the ReplicationHandler
In addition to ReplicationHandler configuration options specific to the master/slave roles, there are a few
special configuration options that are generally supported (even when using SolrCloud).
maxNumberOfBackups: an integer value dictating the maximum number of backups this node will keep
on disk as it receives backup commands.
Similar to most other request handlers in Solr, you may configure a set of "defaults, invariants, and/or
appends" parameters corresponding with any request parameters supported by the ReplicationHandler
when processing commands.
Configuring the Replication RequestHandler on a Master Server
Before running a replication, you should set the following parameters on initialization of the handler:
Name Description
replicateAfter String specifying action after which replication should occur. Valid values are
commit, optimize, or startup. There can be multiple values for this parameter. If you
use "startup", you need to have a "commit" and/or "optimize" entry also if you want
to trigger replication on future commits or optimizes.
backupAfter String specifying action after which a backup should occur. Valid values are
commit, optimize, or startup. There can be multiple values for this parameter. It is
not required for replication, it just makes a backup.
maxNumberOfBackups Integer specifying how many backups to keep. This can be used to delete all but
the most recent N backups.
confFiles The configuration files to replicate, separated by a comma.
commitReserveDuration If your commits are very frequent and your network is slow, you can tweak this
parameter to increase the amount of time taken to download 5Mb from the master
to a slave. The default is 10 seconds.
The example below shows a possible 'master' configuration for the ReplicationHandler, including a fixed
number of backups and an invariant setting for the maxWriteMBPerSec request parameter to prevent slaves
from saturating its network interface:
<requestHandler name="/replication" class="solr.ReplicationHandler" >
<lst name="master">
<str name="replicateAfter">optimize</str>
<str name="backupAfter">optimize</str>
<str name="confFiles">schema.xml,stopwords.txt,elevate.xml</str>
<str name="commitReserveDuration">00:00:10</str>
</lst>
<int name="maxNumberOfBackups">2</int>
<lst name="invariants">
<str name="maxWriteMBPerSec">16</str>
</lst>
</requestHandler>
Replicating solrconfig.xml
In the configuration file on the master server, include a line like the following:
<str name="confFiles">solrconfig_slave.xml:solrconfig.xml,x.xml,y.xml</str>
This ensures that the local configuration solrconfig_slave.xml will be saved as solrconfig.xml on the
slave. All other files will be saved with their original names.
On the master server, the file name of the slave configuration file can be anything, as long as the name is
correctly identified in the confFiles string; then it will be saved as whatever file name appears after the colon
':'.
Configuring the Replication RequestHandler on a Slave Server
The code below shows how to configure a ReplicationHandler on a slave.
<requestHandler name="/replication" class="solr.ReplicationHandler" >
<lst name="slave">
<!--fully qualified url for the replication handler of master. It is
possible to pass on this as
a request param for the fetchindex command-->
<str name="masterUrl">http://remote_host:port/solr/core_name/replication</str>
<!--Interval in which the slave should poll master .Format is HH:mm:ss . If
this is absent slave does not
poll automatically.
But a fetchindex can be triggered from the admin or the http API -->
<str name="pollInterval">00:00:20</str>
<!-- THE FOLLOWING PARAMETERS ARE USUALLY NOT REQUIRED-->
<!--to use compression while transferring the index files. The possible
values are internal|external
if the value is 'external' make sure that your master Solr has the settings
to honor the
accept-encoding header.
See here for details: http://wiki.apache.org/solr/SolrHttpCompression
If it is 'internal' everything will be taken care of automatically.
USE THIS ONLY IF YOUR BANDWIDTH IS LOW . THIS CAN ACTUALLY SLOWDOWN
REPLICATION IN A LAN-->
<str name="compression">internal</str>
<!--The following values are used when the slave connects to the master to
download the index files.
Default values implicitly set as 5000ms and 10000ms respectively. The user
DOES NOT need to specify
these unless the bandwidth is extremely low or if there is an extremely
high latency-->
<str name="httpConnTimeout">5000</str>
<str name="httpReadTimeout">10000</str>
<!-- If HTTP Basic authentication is enabled on the master, then the slave
can be
configured with the following -->
<str name="httpBasicAuthUser">username</str>
<str name="httpBasicAuthPassword">password</str>
</lst>
</requestHandler>
Setting Up a Repeater with the ReplicationHandler
A master may be able to serve only so many slaves without affecting performance. Some organizations have
deployed slave servers across multiple data centers. If each slave downloads the index from a remote data
center, the resulting download may consume too much network bandwidth. To avoid performance degradation in
cases like this, you can configure one or more slaves as repeaters. A repeater is simply a node that acts as both
a master and a slave.
To configure a server as a repeater, the definition of the Replication requestHandler in the solrconfig.xml
file must include file lists of use for both masters and slaves.
Be sure to set the replicateAfter parameter to commit, even if replicateAfter is set to optimize
on the main master. This is because on a repeater (or any slave), a commit is called only after the index is
downloaded. The optimize command is never called on slaves.
Optionally, one can configure the repeater to fetch compressed files from the master through the
compression parameter to reduce the index download time.
Here is an example of a ReplicationHandler configuration for a repeater:
<requestHandler name="/replication" class="solr.ReplicationHandler">
<lst name="master">
<str name="replicateAfter">commit</str>
<str name="confFiles">schema.xml,stopwords.txt,synonyms.txt</str>
</lst>
<lst name="slave">
<str name="masterUrl">http://master.solr.company.com:8983/solr/core_name/replication</str>
<str name="pollInterval">00:00:60</str>
</lst>
</requestHandler>
Commit and Optimize Operations
When a commit or optimize operation is performed on the master, the RequestHandler reads the list of file
names which are associated with each commit point. This relies on the replicateAfter parameter in the
configuration to decide which types of events should trigger replication.
Setting on the Master Description
commit Triggers replication whenever a commit is performed on the master index.
optimize Triggers replication whenever the master index is optimized.
startup Triggers replication whenever the master index starts up.
The replicateAfter parameter can accept multiple arguments. For example:
<str name="replicateAfter">startup</str>
<str name="replicateAfter">commit</str>
<str name="replicateAfter">optimize</str>
Slave Replication
The master is totally unaware of the slaves. The slave continuously keeps polling the master (depending on the
pollInterval parameter) to check the current index version of the master. If the slave finds out that the
master has a newer version of the index it initiates a replication process. The steps are as follows:
The slave issues a filelist command to get the list of the files. This command returns the names of
the files as well as some metadata (for example, size, a lastmodified timestamp, an alias if any).
The slave checks with its own index if it has any of those files in the local index. It then runs the filecontent
command to download the missing files. This uses a custom format (akin to the HTTP chunked encoding)
to download the full content or a part of each file. If the connection breaks in between, the download
resumes from the point it failed. At any point, the slave tries 5 times before giving up a replication
altogether.
The files are downloaded into a temp directory, so that if either the slave or the master crashes during the
download process, no files will be corrupted. Instead, the current replication will simply abort.
After the download completes, all the new files are moved to the live index directory and each file's
timestamp is the same as its counterpart on the master.
A commit command is issued on the slave by the Slave's ReplicationHandler and the new index is loaded.
Replicating Configuration Files
To replicate configuration files, list them using using the parameter. Only files found in the direconfFiles conf
ctory of the master's Solr instance will be replicated.
Solr replicates configuration files only when the index itself is replicated. That means even if a configuration file is
changed on the master, that file will be replicated only after there is a new commit/optimize on master's index.
Unlike the index files, where the timestamp is good enough to figure out if they are identical, configuration files
are compared against their checksum. The schema.xml files (on master and slave) are judged to be identical if
their checksums are identical.
As a precaution when replicating configuration files, Solr copies configuration files to a temporary directory before
moving them into their ultimate location in the conf directory. The old configuration files are then renamed and
kept in the same conf/ directory. The ReplicationHandler does not automatically clean up these old files.
If a replication involved downloading of at least one configuration file, the ReplicationHandler issues a
core-reload command instead of a commit command.
Resolving Corruption Issues on Slave Servers
If documents are added to the slave, then the slave is no longer in sync with its master. However, the slave will
not undertake any action to put itself in sync, until the master has new index data. When a commit operation
takes place on the master, the index version of the master becomes different from that of the slave. The slave
then fetches the list of files and finds that some of the files present on the master are also present in the local
index but with different sizes and timestamps. This means that the master and slave have incompatible indexes.
To correct this problem, the slave then copies all the index files from master to a new index directory and asks
the core to load the fresh index from the new directory.
HTTP API Commands for the ReplicationHandler
You can use the HTTP commands below to control the ReplicationHandler's operations.
Command Description
http://master_host:port/solr/core_name/replication?command=enablereplication
Enables replication on the master for all its slaves.
http://master_host:port/solr/core_name/replication?command=disablereplication
Disables replication on the master for all its slaves.
http://host:port/solr/core_name/replication?command=indexversion
Returns the version of the latest replicatable index on the specified master or slave.
http://slave_host:port/solr/core_name/replication?command=fetchindex
Forces the specified slave to fetch a copy of the index from its master.
If you like, you can pass an extra attribute such as masterUrl or compression (or any other parameter which is
specified in the <lst name="slave"> tag) to do a one-time replication from a master. This obviates the need for
hard-coding the master in the slave.
http://slave_host:port/solr/core_name/replication?command=abortfetch
Aborts copying an index from a master to the specified slave.
http://slave_host:port/solr/core_name/replication?command=enablepoll
Enables the specified slave to poll for changes on the master.
http://slave_host:port/solr/core_name/replication?command=disablepoll
Disables the specified slave from polling for changes on the master.
http://slave_host:port/solr/core_name/replication?command=details
Retrieves configuration details and current status.
http://host:port/solr/core_name/replication?command=filelist&indexversion=<index-version-number>
Retrieves a list of Lucene files present in the specified host's index. You can discover the version number of the
index by running the indexversion command.
http://master_host:port/solr/core_name/replication?command=backup
Creates a backup on the master if there are committed index data in the server; otherwise, does nothing. This
command is useful for making periodic backups.
Supported request parameters:
numberToKeep: can be used with the backup command unless the maxNumberOfBackups initialization
parameter has been specified on the handler – in which case maxNumberOfBackups is always used and
attempts to use the numberToKeep request parameter will cause an error.
name: (optional) backup name. The snapshot will be created in a directory called snapshot.<name> within the
data directory of the core. By default the name is generated using the date in yyyyMMddHHmmssSSS format. If
the location parameter is passed, that would be used instead of the data directory.
location: backup location.
http://master_host:port/solr/core_name/replication?command=deletebackup
Deletes any backup created using the backup command.
Request parameters:
name: the name of the snapshot. A snapshot with the name snapshot.<name> must exist. If not, an error is
thrown.
location: location where the snapshot is created.
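For example, a one-time pull from a different master could be requested with something like the following
(host names and the core name are purely illustrative):
http://slave_host:8983/solr/core_name/replication?command=fetchindex&masterUrl=http://other_master:8983/solr/core_name/replication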
Distribution and Optimization
Optimizing an index is not something most users should generally worry about - but in particular users should be
aware of the impacts of optimizing an index when using the ReplicationHandler.
The time required to optimize a master index can vary dramatically. A small index may be optimized in minutes.
A very large index may take hours. The variables include the size of the index and the speed of the hardware.
Distributing a newly optimized index may take only a few minutes or up to an hour or more, again depending on
the size of the index and the performance capabilities of network connections and disks. During optimization the
machine is under load and does not process queries very well. Given a schedule of updates being driven a few
times an hour to the slaves, we cannot run an optimize with every committed snapshot.
Copying an optimized index means that the entire index will need to be transferred during the next snappull. This
is a large expense, but not nearly as huge as running the optimize everywhere. Consider this example: on a
three-slave one-master configuration, distributing a newly-optimized index takes approximately 80 seconds total.
Rolling the change across a tier would require approximately ten minutes per machine (or machine group). If this
optimize were rolled across the query tier, and if each slave node being optimized were disabled and not
receiving queries, a rollout would take at least twenty minutes and potentially as long as an hour and a half.
Additionally, the files would need to be synchronized so that, following the optimize, the snappull would not think
that the independently optimized files were different in any way. This would also leave the door open to
independent corruption of indexes instead of each being a perfect copy of the master.
Optimizing on the master allows for a straight-forward optimization operation. No query slaves need to be taken
out of service. The optimized index can be distributed in the background as queries are being normally serviced.
The optimization can occur at any time convenient to the application providing index updates.
While optimizing may have some benefits in some situations, a rapidly changing index will not retain those
benefits for long, and since optimization is an intensive process, it may be better to consider other options, such
as lowering the merge factor (discussed in the section on Index Configuration).
Combining Distribution and Replication
When your index is too large for a single machine and you have a query volume that single shards cannot keep
up with, it's time to replicate each shard in your distributed search setup.
The idea is to combine distributed search with replication. As shown in the figure below, a combined
distributed-replication configuration features a master server for each shard and then 1-n slaves that are
replicated from the master. As in a standard replicated configuration, the master server handles updates and
optimizations without adversely affecting query handling performance.
Query requests should be load balanced across each of the shard slaves. This gives you both increased query
handling capacity and fail-over backup if a server goes down.
A Solr configuration combining both replication and master-slave distribution.
None of the master shards in this configuration know about each other. You index to each master, the index is
replicated to each slave, and then searches are distributed across the slaves, using one slave from each
master/slave shard.
For high availability you can use a load balancer to set up a virtual IP for each shard's set of slaves. If you are
new to load balancing, HAProxy (http://haproxy.1wt.eu/) is a good open source software load-balancer. If a slave
server goes down, a good load-balancer will detect the failure using some technique (generally a heartbeat
system), and forward all requests to the remaining live slaves that served with the failed slave. A single virtual IP
should then be set up so that requests can hit a single IP, and get load balanced to each of the virtual IPs for the
search slaves.
With this configuration you will have a fully load balanced, search-side fault-tolerant system (Solr does not yet
support fault-tolerant indexing). Incoming searches will be handed off to one of the functioning slaves, then the
slave will distribute the search request across a slave for each of the shards in your configuration. The slave will
issue a request to each of the virtual IPs for each shard, and the load balancer will choose one of the available
slaves. Finally, the results will be combined into a single results set and returned. If any of the slaves go down,
they will be taken out of rotation and the remaining slaves will be used. If a shard master goes down, searches
can still be served from the slaves until you have corrected the problem and put the master back into production.
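To illustrate, a load-balanced distributed query in such a setup might look something like the following; the host
names are purely illustrative, and each entry in the shards parameter would typically point at the virtual IP for
that shard's set of slaves:
http://lb-host:8983/solr/core_name/select?q=cheese&shards=shard1-vip:8983/solr/core_name,shard2-vip:8983/solr/core_name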
Merging Indexes
If you need to combine indexes from two different projects or from multiple servers previously used in a
distributed configuration, you can use either the IndexMergeTool included in lucene-misc or the
CoreAdminHandler.
To merge indexes, they must meet these requirements:
The two indexes must be compatible: their schemas should include the same fields and they should
analyze fields the same way.
The indexes must not include duplicate data.
Optimally, the two indexes should be built using the same schema.
Using IndexMergeTool
To merge the indexes, do the following:
1. Find the lucene-core and lucene-misc JAR files that your version of Solr is using. You can do this by
copying your solr.war file somewhere and unpacking it (jar xvf solr.war). These two JAR files
should be in WEB-INF/lib. They are probably called something like lucene-core-VERSION.jar and
lucene-misc-VERSION.jar.
2. Copy them somewhere easy to find.
3. Make sure that both indexes you want to merge are closed.
4. Issue this command:
java -cp /path/to/lucene-core-VERSION.jar:/path/to/lucene-misc-VERSION.jar org.apache.lucene.misc.IndexMergeTool /path/to/newindex /path/to/old/index1 /path/to/old/index2
This will create a new index at /path/to/newindex that contains both index1 and index2.
5. Copy this new directory to the location of your application's Solr index (move the old one aside first, of
course) and start Solr.
For example:
java -cp /tmp/lucene-core-4.4.0.jar:/tmp/lucene-misc-4.4.0.jar org.apache.lucene.misc.IndexMergeTool ./newindex ./solr_home/core1/data/index ./solr_home/core2/data/index
Using CoreAdmin
The command of the can be used to merge indexes into a new core – eitherMERGEINDEXES CoreAdminHandler
from one or more arbitrary directories or by merging from one or more existing core names.indexDir srcCore
See the section for details.CoreAdminHandler
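For example, merging two index directories into an existing target core might be requested with something like
the following (the core name and paths are illustrative):
http://localhost:8983/solr/admin/cores?action=MERGEINDEXES&core=new_core&indexDir=/path/to/old/index1&indexDir=/path/to/old/index2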
Client APIs
This section discusses the available client APIs for Solr. It covers the following topics:
Introduction to Client APIs: A conceptual overview of Solr client APIs.
Choosing an Output Format: Information about choosing a response format in Solr.
Using JavaScript: Explains why a client API is not needed for JavaScript responses.
Using Python: Information about Python and JSON responses.
Client API Lineup: A list of all Solr Client APIs, with links.
Using SolrJ: Detailed information about SolrJ, an API for working with Java applications.
Using Solr From Ruby: Detailed information about using Solr with Ruby applications.
MBean Request Handler: Describes the MBean request handler for programmatic access to Solr server statistics
and information.
Introduction to Client APIs
At its heart, Solr is a Web application, but because it is built on open protocols, any type of client application can
use Solr.
HTTP is the fundamental protocol used between client applications and Solr. The client makes a request and
Solr does some work and provides a response. Clients use requests to ask Solr to do things like perform queries
or index documents.
Client applications can reach Solr by creating HTTP requests and parsing the HTTP responses. Client APIs
encapsulate much of the work of sending requests and parsing responses, which makes it much easier to write
client applications.
Clients use Solr's five fundamental operations to work with Solr. The operations are query, index, delete, commit,
and optimize.
Queries are executed by creating a URL that contains all the query parameters. Solr examines the request URL,
performs the query, and returns the results. The other operations are similar, although in certain cases the HTTP
request is a POST operation and contains information beyond whatever is included in the request URL. An index
operation, for example, may contain a document in the body of the request.
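As a minimal illustration of this request/response cycle, the following Java sketch issues a query URL directly and
prints the raw response; the host, core name, and query string are assumptions based on the techproducts
example, and a client API would normally handle the parsing for you.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class RawSolrQuery {
  public static void main(String[] args) throws Exception {
    // The query URL carries all the parameters; wt=json asks Solr for a JSON response
    URL url = new URL("http://localhost:8983/solr/techproducts/select?q=cheese&wt=json");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    try (BufferedReader reader =
        new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line); // raw response body; a client API would parse this for you
      }
    }
  }
}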
Solr also features an EmbeddedSolrServer that offers a Java API without requiring an HTTP connection. For
details, see Using SolrJ.
Choosing an Output Format
Many programming environments are able to send HTTP requests and retrieve responses. Parsing the
responses is a slightly more thorny problem. Fortunately, Solr makes it easy to choose an output format that will
be easy to handle on the client side.
Specify a response format using the wt parameter in a query. The available response formats are documented in
Response Writers.
Most client APIs hide this detail for you, so for many types of client applications, you won't ever have to specify a
wt parameter. In JavaScript, however, the interface to Solr is a little closer to the metal, so you will need to add
this parameter yourself.
Client API Lineup
The Solr Wiki contains a list of client APIs at http://wiki.apache.org/solr/IntegratingSolr.
Here is the list of client APIs, current at this writing (November 2011):
Name Environment URL
SolRuby Ruby https://github.com/rsolr/rsolr
DelSolr Ruby https://github.com/avvo/delsolr
acts_as_solr Rails http://acts-as-solr.rubyforge.org/, http://rubyforge.org/projects/background-solr/
Flare Rails http://wiki.apache.org/solr/Flare
SolPHP PHP http://wiki.apache.org/solr/SolPHP
SolrJ Java http://wiki.apache.org/solr/SolJava
Python API Python http://wiki.apache.org/solr/SolPython
PySolr Python http://code.google.com/p/pysolr/
SolPerl Perl http://wiki.apache.org/solr/SolPerl
Solr.pm Perl http://search.cpan.org/~garafola/Solr-0.03/lib/Solr.pm
SolrForrest Forrest/Cocoon http://wiki.apache.org/solr/SolrForrest
SolrSharp C# http://www.codeplex.com/solrsharp
SolColdfusion ColdFusion http://solcoldfusion.riaforge.org/
SolrNet .NET http://code.google.com/p/solrnet/
AJAX Solr AJAX http://github.com/evolvingweb/ajax-solr/wiki
Using JavaScript
Using Solr from JavaScript clients is so straightforward that it deserves a special mention. In fact, it is so
straightforward that there is no client API. You don't need to install any packages or configure anything.
HTTP requests can be sent to Solr using the standard XMLHttpRequest mechanism.
Out of the box, Solr can send JavaScript Object Notation (JSON) responses, which are easily interpreted in
JavaScript. Just add wt=json to the request URL to have responses sent as JSON.
For more information and an excellent example, read the SolJSON page on the Solr Wiki:
http://wiki.apache.org/solr/SolJSON
Using Python
Solr includes an output format specifically for Python, but JSON output is a little more robust.
Simple Python
Making a query is a simple matter. First, tell Python you will need to make HTTP connections.
from urllib2 import *
Now open a connection to the server and get a response. The wt query parameter tells Solr to return results in a
format that Python can understand.
connection = urlopen(
'http://localhost:8983/solr/collection_name/select?q=cheese&wt=python')
response = eval(connection.read())
Now interpreting the response is just a matter of pulling out the information that you need.
print response['response']['numFound'], "documents found."
# Print the name of each document.
for document in response['response']['docs']:
print " Name =", document['name']
Python with JSON
JSON is a more robust response format, but you will need to add a Python package in order to use it. At a
command line, install the simplejson package like this:
$ sudo easy_install simplejson
Once that is done, making a query is nearly the same as before. However, notice that the wt query parameter is
now json, and the response is now digested by simplejson.load().
from urllib2 import *
import simplejson
connection =
urlopen('http://localhost:8983/solr/collection_name/select?q=cheese&wt=json')
response = simplejson.load(connection)
print response['response']['numFound'], "documents found."
# Print the name of each document.
for document in response['response']['docs']:
print " Name =", document['name']
Using SolrJ
SolrJ is an API that makes it easy for Java applications to talk to Solr. SolrJ hides a lot of the details of
connecting to Solr and allows your application to interact with Solr with simple high-level methods.
The center of SolrJ is the org.apache.solr.client.solrj package, which contains just five main classes.
Begin by creating a SolrClient, which represents the Solr instance you want to use. Then send SolrRequests
or SolrQuerys and get back SolrResponses.
SolrClient is abstract, so to connect to a remote Solr instance, you'll actually create an instance of either
HttpSolrClient or CloudSolrClient. Both communicate with Solr via HTTP, the difference is that
HttpSolrClient is configured using an explicit Solr URL, while CloudSolrClient is configured using the zkHost
String for a SolrCloud cluster.
Single node Solr client:
String urlString = "http://localhost:8983/solr/techproducts";
SolrClient solr = new HttpSolrClient(urlString);
SolrCloud client:
String zkHostString = "zkServerA:2181,zkServerB:2181,zkServerC:2181/solr";
SolrClient solr = new CloudSolrClient(zkHostString);
Once you have a SolrClient, you can use it by calling methods like query(), add(), and commit().
Building and Running SolrJ Applications
The SolrJ API is included with Solr, so you do not have to download or install anything else. However, in order to
build and run applications that use SolrJ, you have to add some libraries to the classpath.
At build time, the examples presented with this section require solr-solrj-x.y.z.jar to be in the classpath.
At run time, the examples in this section require the libraries found in the 'dist/solrj-lib' directory.
The Ant script bundled with this section's examples includes the libraries as appropriate when building and
running.
You can sidestep a lot of the messing around with the JAR files by using Maven instead of Ant. All you will need
to do to include SolrJ in your application is to put the following dependency in the project's pom.xml:
<dependency>
<groupId>org.apache.solr</groupId>
<artifactId>solr-solrj</artifactId>
<version>x.y.z</version>
</dependency>
If you are worried about the SolrJ libraries expanding the size of your client application, you can use a code
obfuscator like ProGuard to remove APIs that you are not using.
Setting XMLResponseParser
SolrJ uses a binary format, rather than XML, as its default format. Users of earlier Solr releases who wish to
continue working with XML must explicitly set the parser to the XMLResponseParser, like so:
server.setParser(new XMLResponseParser());
Performing Queries
Use query() to have Solr search for results. You have to pass a SolrQuery object that describes the query,
and you will get back a QueryResponse (from the org.apache.solr.client.solrj.response package).
SolrQuery has methods that make it easy to add parameters to choose a request handler and send
parameters to it. Here is a very simple example that uses the default request handler and sets the q parameter:
SolrQuery parameters = new SolrQuery();
parameters.set("q", mQueryString);
To choose a different request handler, for example, just set the qt parameter like this:
parameters.set("qt", "/spellCheckCompRH");
Once you have your SolrQuery set up, submit it with query():
QueryResponse response = solr.query(parameters);
The client makes a network connection and sends the query. Solr processes the query, and the response is sent
and parsed into a QueryResponse.
The QueryResponse is a collection of documents that satisfy the query parameters. You can retrieve the
documents directly with getResults() and you can call other methods to find out information about
highlighting or facets.
SolrDocumentList list = response.getResults();
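For example, assuming the documents have a stored "name" field (as in the techproducts sample data), you
could walk the returned list like this:
for (SolrDocument document : list) {
  // Each SolrDocument exposes its stored fields by name
  System.out.println("Name = " + document.getFieldValue("name"));
}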
Indexing Documents
Other operations are just as simple. To index (add) a document, all you need to do is create a
SolrInputDocument and pass it along to the SolrClient's add() method. This example assumes that the
SolrClient object called 'solr' is already created based on the examples shown earlier.
SolrInputDocument document = new SolrInputDocument();
document.addField("id", "552199");
document.addField("name", "Gouda cheese wheel");
document.addField("price", "49.99");
UpdateResponse response = solr.add(document);
// Remember to commit your changes!
solr.commit();
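Deleting documents follows the same pattern. A minimal sketch, reusing the same 'solr' client (the id and query
values are only illustrations):
// Delete a single document by its unique id
solr.deleteById("552199");
// Or delete everything matching a query
solr.deleteByQuery("manu:Belkin");
// Remember to commit your changes!
solr.commit();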
Uploading Content in XML or Binary Formats
SolrJ lets you upload content in binary format instead of the default XML format. Use the following to
upload using binary format, which is the same format SolrJ uses to fetch results.
server.setRequestWriter(new BinaryRequestWriter());
Using the ConcurrentUpdateSolrClient
When implementing Java applications that will be bulk loading a lot of documents at once,
ConcurrentUpdateSolrClient is an alternative to consider instead of using HttpSolrClient. The
ConcurrentUpdateSolrClient buffers all added documents and writes them into open HTTP connections.
This class is thread safe.
Although any SolrClient request can be made with this implementation, it is only recommended to use
ConcurrentUpdateSolrClient for /update requests.
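A minimal sketch of bulk loading with ConcurrentUpdateSolrClient might look like the following; the URL,
queue size, and thread count are illustrative values you would tune for your own environment.
// Buffer up to 10 documents and flush them over 2 background threads (illustrative sizes)
SolrClient bulkClient =
    new ConcurrentUpdateSolrClient("http://localhost:8983/solr/techproducts", 10, 2);
for (int i = 0; i < 1000; i++) {
  SolrInputDocument doc = new SolrInputDocument();
  doc.addField("id", "doc-" + i);
  bulkClient.add(doc);
}
bulkClient.commit();
bulkClient.close();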
EmbeddedSolrServer
The EmbeddedSolrServer class provides an implementation of the SolrClient client API talking directly to
a micro-instance of Solr running directly in your Java application. This embedded approach is not
recommended in most cases and is fairly limited in the set of features it supports – in particular it cannot be used
with SolrCloud or Index Replication. EmbeddedSolrServer exists primarily to help facilitate testing.
For information on how to use EmbeddedSolrServer, please review the SolrJ JUnit tests in the
org.apache.solr.client.solrj.embedded package of the Solr source release.
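As one possible sketch (the Solr home path and core name are assumptions, and the exact setup may differ
between releases), an embedded client can be created from a CoreContainer:
// Load cores from a local Solr home directory and talk to one of them in-process
CoreContainer container = new CoreContainer("/path/to/solr/home");
container.load();
SolrClient embedded = new EmbeddedSolrServer(container, "collection1");
QueryResponse rsp = embedded.query(new SolrQuery("*:*"));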
Related Topics
SolrJ API documentation
Solr Wiki page on SolrJ
Indexing and Basic Data Operations
Using Solr From Ruby
Solr has an optional Ruby response format that extends its JSON output in the following ways to allow the
response to be safely eval'd by Ruby's interpreter:
Ruby's single quoted strings are used to prevent possible string exploits
\ and ' are the only two characters escaped...
unicode escapes not used... data is written as raw UTF-8
nil used for null
=> used as the key/value separator in maps
Here's an example Ruby response from Solr, for http://localhost:8983/solr/techproducts/select?q=iPod&wt=ruby
&indent=on (with Solr launched using `bin/solr start -e techproducts`):
{
'responseHeader'=>{
'status'=>0,
'QTime'=>0,
'params'=>{
'q'=>'iPod',
'indent'=>'on',
'wt'=>'ruby'}},
'response'=>{'numFound'=>3,'start'=>0,'docs'=>[
{
'id'=>'IW-02',
'name'=>'iPod & iPod Mini USB 2.0 Cable',
'manu'=>'Belkin',
'manu_id_s'=>'belkin',
'cat'=>['electronics',
'connector'],
'features'=>['car power adapter for iPod, white'],
'weight'=>2.0,
'price'=>11.5,
'price_c'=>'11.50,USD',
'popularity'=>1,
'inStock'=>false,
'store'=>'37.7752,-122.4232',
'manufacturedate_dt'=>'2006-02-14T23:55:59Z',
'_version_'=>1491038048794705920},
{
'id'=>'F8V7067-APL-KIT',
'name'=>'Belkin Mobile Power Cord for iPod w/ Dock',
'manu'=>'Belkin',
'manu_id_s'=>'belkin',
'cat'=>['electronics',
'connector'],
'features'=>['car power adapter, white'],
'weight'=>4.0,
'price'=>19.95,
'price_c'=>'19.95,USD',
'popularity'=>1,
'inStock'=>false,
'store'=>'45.18014,-93.87741',
'manufacturedate_dt'=>'2005-08-01T16:30:25Z',
'_version_'=>1491038048792608768},
{
'id'=>'MA147LL/A',
'name'=>'Apple 60 GB iPod with Video Playback Black',
'manu'=>'Apple Computer Inc.',
'manu_id_s'=>'apple',
'cat'=>['electronics',
'music'],
'features'=>['iTunes, Podcasts, Audiobooks',
'Stores up to 15,000 songs, 25,000 photos, or 150 hours of video',
'2.5-inch, 320x240 color TFT LCD display with LED backlight',
'Up to 20 hours of battery life',
'Plays AAC, MP3, WAV, AIFF, Audible, Apple Lossless, H.264 video',
'Notes, Calendar, Phone book, Hold button, Date display, Photo wallet,
Built-in games, JPEG photo playback, Upgradeable firmware, USB 2.0 compatibility,
Playback speed control, Rechargeable capability, Battery level indication'],
'includes'=>'earbud headphones, USB cable',
'weight'=>5.5,
'price'=>399.0,
'price_c'=>'399.00,USD',
'popularity'=>10,
'inStock'=>true,
'store'=>'37.7752,-100.0232',
'manufacturedate_dt'=>'2005-10-12T08:00:00Z',
'_version_'=>1491038048799948800}]
}}
Here is a simple example of how one may query Solr using the Ruby response format:
require 'net/http'
h = Net::HTTP.new('localhost', 8983)
http_response = h.get('/solr/techproducts/select?q=iPod&wt=ruby')
rsp = eval(http_response.body)
puts 'number of matches = ' + rsp['response']['numFound'].to_s
#print out the name field for each returned document
rsp['response']['docs'].each { |doc| puts 'name field = ' + doc['name'] }
For simple interactions with Solr, this may be all you need! If you are building complex interactions with Solr, then
consider the libraries mentioned at https://wiki.apache.org/solr/Ruby%20Response%20Format
MBean Request Handler
The MBean Request Handler offers programmatic access to the information provided on the Plugin/Stats page of
the Admin UI.
The MBean Request Handler accepts the following parameters:
Parameter Type Default Description
key multivalued all Restricts results by object key.
cat multivalued all Restricts results by category name.
stats boolean false Specifies whether statistics are returned with results. You can override
the stats parameter on a per-field basis.
wt multivalued xml The output format. This operates the same as the wt parameter in a
query.
Examples
The following examples assume you are running Solr's techproducts example configuration:
bin/solr start -e techproducts
To return information about the CACHE category only:
http://localhost:8983/solr/techproducts/admin/mbeans?cat=CACHE
To return information and statistics about the CACHE category only, formatted in JSON:
http://localhost:8983/solr/techproducts/admin/mbeans?stats=true&cat=CACHE&indent=true&wt=json
To return information for everything, and statistics for everything except the fieldCache:
http://localhost:8983/solr/techproducts/admin/mbeans?stats=true&f.fieldCache.stats=false
To return information and statistics for the fieldCache only:
http://localhost:8983/solr/techproducts/admin/mbeans?key=fieldCache&stats=true
Further Assistance
There is a very active user community around Solr and Lucene. The solr-user mailing list and #solr IRC channel
are both great resources for asking questions.
To view the mailing list archives, subscribe to the list, or join the IRC channel, please see
https://lucene.apache.org/solr/resources#community
Solr Glossary
Where possible, terms are linked to relevant parts of the Solr Reference Guide for more information.
Jump to a letter:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
A
Atomic updates
An approach to updating only one or more fields of a document, instead of reindexing the entire document.
B
Boolean operators
These control the inclusion or exclusion of keywords in a query by using operators such as AND, OR, and NOT.
C
Cluster
In Solr, a cluster is a set of Solr nodes managed as a unit. They may contain many cores, collections, shards,
and/or replicas. See also #SolrCloud.
Collection
In Solr, one or more documents grouped together in a single logical index. A collection must have a single
schema, but can be spread across multiple cores.
In #ZooKeeper, a group of cores managed together as part of a SolrCloud installation.
Commit
To make document changes permanent in the index. In the case of added documents, they would be searchable
after a commit.
Core
An individual Solr instance (represents a logical index). Multiple cores can run on a single node. See also
#SolrCloud.
Core reload
To re-initialize Solr after changes to schema.xml, solrconfig.xml or other configuration files.
D
Distributed search
Distributed search is one where queries are processed across more than one shard.
Document
A group of fields and their values. Documents are the basic unit of data in a collection. Documents are assigned
to shards using standard hashing, or by specifically assigning a shard within the document ID. Documents are
versioned after each write operation.
E
Ensemble
A #ZooKeeper term to indicate multiple ZooKeeper instances running simultaneously.
F
Facet
The arrangement of search results into categories based on indexed terms.
Field
The content to be indexed/searched along with metadata defining how the content should be processed by Solr.
I
Inverse document frequency (IDF)
A measure of the general importance of a term. It is calculated as the number of total Documents divided by the
number of Documents in the collection that a particular word occurs in. See http://en.wikipedia.org/wiki/Tf-idf and
the Lucene TFIDFSimilarity javadocs for more info on TF-IDF based scoring and Lucene scoring in particular. See
also #Term frequency.
Inverted index
A way of creating a searchable index that lists every word and the documents that contain those words, similar to
an index in the back of a book which lists words and the pages on which they can be found. When performing
keyword searches, this method is considered more efficient than the alternative, which would be to create a list
of documents paired with every word used in each document. Since users search using terms they expect to be
in documents, finding the term before the document saves processing resources and time.
L
Leader
The main node for each shard that routes document adds, updates, or deletes to other replicas in the same
shard - this is a transient responsibility assigned to a node via an election, if the current Shard Leader goes
down, a new node will be elected to take it's place. See also .#SolrCloud
M
Metadata
Literally, data about data. Metadata is information about a document, such as its title, author, or location.
N
Natural language query
A search that is entered as a user would normally speak or write, as in, "What is aspirin?"
Node
A JVM instance running Solr. Also known as a Solr server.
O
Optimistic concurrency
Also known as "optimistic locking", this is an approach that allows for updates to documents currently in the
index while retaining locking or version control.
Overseer
A single node in SolrCloud that is responsible for processing actions involving the entire cluster. It keeps track of
the state of existing nodes and shards, and assigns shards to nodes - this is a transient responsibility assigned to
a node via an election; if the current Overseer goes down, a new node will be elected to take its place. See also
#SolrCloud.
Q
Query parser
A query parser processes the terms entered by a user.
R
Recall
The ability of a search engine to retrieve all of the possible matches to a user's query.
Relevance
The appropriateness of a document to the search conducted by the user.
Replica
A copy of a shard or single logical index, for use in failover or load balancing.
Replication
A method of copying a master index from one server to one or more "slave" or "child" servers.
RequestHandler
Logic and configuration parameters that tell Solr how to handle incoming "requests", whether the requests are to
return search results, to index documents, or to handle other custom situations.
S
SearchComponent
Logic and configuration parameters used by request handlers to process query requests. Examples of search
components include faceting, highlighting, and "more like this" functionality.
Shard
In SolrCloud, a logical section of a single collection. This may be spread across multiple nodes. See also
#SolrCloud.
SolrCloud
Umbrella term for a suite of functionality in Solr which allows managing a cluster of Solr servers for scalability,
fault tolerance, and high availability.
Solr Schema (schema.xml)
The Apache Solr index schema. The schema defines the fields to be indexed and the type for the field (text,
integers, etc.) The schema is stored in schema.xml and is located in the Solr home conf directory.
SolrConfig (solrconfig.xml)
The Apache Solr configuration file. Defines indexing options, RequestHandlers, highlighting, spellchecking and
various other configurations. The file, solrconfig.xml is located in the Solr home conf directory.
Spell Check
The ability to suggest alternative spellings of search terms to a user, as a check against spelling errors causing
few or zero results.
Stopwords
Generally, words that have little meaning to a user's search but which may have been entered as part of a
natural language query. Stopwords are generally very small pronouns, conjunctions and prepositions (such as
"the", "with", or "and").
Suggester
Functionality in Solr that provides the ability to suggest possible query terms to users as they type.
Synonyms
Synonyms generally are terms which are near to each other in meaning and may substitute for one another. In a
search engine implementation, synonyms may be abbreviations as well as words, or terms that are not
consistently hyphenated. Examples of synonyms in this context would be "Inc." and "Incorporated" or "iPod" and
"i-pod".
T
Term frequency
The number of times a word occurs in a given document. See http://en.wikipedia.org/wiki/Tf-idf and the Lucene
TFIDFSimilarity javadocs for more info on TF-IDF based scoring and Lucene scoring in particular.
See also #Inverse document frequency (IDF).
Transaction log
An append-only log of write operations maintained by each node. This log is only required with SolrCloud
implementations and is created and managed automatically by Solr.
W
Wildcard
A wildcard allows a substitution of one or more letters of a word to account for possible variations in spelling or
tenses.
Z
ZooKeeper
Also known as Apache ZooKeeper. The system used by SolrCloud to keep track of configuration files and node
names for a cluster. A ZooKeeper cluster is used as the central configuration store for the cluster, a coordinator
for operations requiring distributed synchronization, and the system of record for cluster topology. See also
#SolrCloud.
Major Changes from Solr 4 to Solr 5
There are some major changes in Solr 5 to consider before starting to migrate your configurations and indexes.
There are many hundreds of changes, so a thorough review of the Upgrading Solr section as well as the
CHANGES.txt file in your Solr instance will help you plan your migration to Solr 5. This section attempts to
highlight some of the major changes you should be aware of.
Topics discussed in this section:
Highlights of New Features in Solr 5
Solr is Now a Standalone Server
Index Format
solr.xml Format
Collections API
SolrJ changes
Highlights of New Features in Solr 5
Some of the major improvements in Solr 5 include:
Ease Of Use
Improved bin/solr and new bin/post scripts
New, and restructured, example Config Sets available by default
SolrJ now has first class support for the Collections API
Implicit registration of /replication, /get, and /admin/* request handlers
Operational Support
Scripts to support installing and running Solr as a service on Linux
Logging improvements:
Transaction log replay status is now logged
Optional logging of slow requests.
Config API that supports paramsets for easily configuring Solr parameters for request handlers and editing
common solrconfig.xml options.
New Blob Store API with support for uploading and distributing plugin jars.
SolrCloud & Distributed Requests
Splitting of clusterstate to per-collection enables scalability improvement in SolrCloud
Distributed IDF Support
Option to configure max bandwidth usage by ReplicationHandler
New Search Functionality
pivot.facet results can now include nested stats.field results constrained by those pivots
stats.field can be used to generate statistics over the results of arbitrary numeric functions
A new DateRangeField has been added for indexing date ranges, especially multi-valued ones
Spatial fields that used to require units=degrees now take distanceUnits=degrees/kilometers/miles
instead
MoreLikeThis query parser allows requesting for documents similar to an existing document
timeAllowed is now used to prematurely terminate requests during query expansion and SolrClient
request retry
Solr is Now a Standalone Server
Beginning with Solr 5.0, Solr is no longer distributed as a "war" (Web Application Archive) suitable for
deployment in any Servlet Container. Solr is now distributed as a standalone Java server application, including
start and stop scripts for use on Unix and MS-Windows platforms, as well as an installation script for setting up a
"production" installation of Solr on *nix platforms managed via /etc/init.d.
The Upgrading a Solr 4.x Cluster to Solr 5.0 section of this guide provides detailed steps for existing users to
upgrade their clusters regardless of what servlet container they were using with Solr 4.
Index Format
Solr 5 has no support for reading Lucene/Solr 3.x and earlier indexes. Be sure to run the Lucene IndexUpgrader
included with Solr 4.10 if you might still have old 3.x formatted segments in your index. Alternatively: fully
optimize your index with Solr 4.10 to make sure it consists only of one up-to-date index segment.
solr.xml Format
Solr 4.4 introduced a new format for the solr.xml file, eliminating the need for each <core> to be explicitly
listed. Instead a process known as "Core Discovery" was introduced, relying on core.properties files in each
SolrCore instance directory.
Solr 5.0 has eliminated all support for the older solr.xml format, and moved entirely to core discovery. See the
Moving to the New solr.xml Format section for more details.
Collections API
Solr 5.0 only supports creating and removing SolrCloud collections through the Collections API, unlike previous
versions. While not using the collections API may still work in 5.0, it is unsupported, not recommended, and the
behavior will change in a 5.x release.
SolrJ changes
Many SolrJ methods that were marked deprecated in Solr 4.10 have been removed. Please consult the Solr 4.10
javadocs for the SolrJ package for information regarding the new supported methods.
In addition, the SolrServer family of classes are now deprecated, and developers are encouraged to instead
use the SolrClient family of classes.
Internally, Solr is still implemented via Servlet APIs and is powered by Jetty -- but this is simply an
implementation detail. Deployment as a "webapp" to other Servlet Containers (or other instances of
Jetty) is not supported, and may not work in future 5.x versions of Solr when additional changes are
likely to be made to Solr internally to leverage custom networking stack features.
Moving to the New solr.xml Format
If you have an existing solr.xml file that contains a <cores> section, it is no longer supported and you will
need to manually convert it to the "core discovery" format before you can upgrade to the currently supported
latest version of Solr.
Migration from this older style of solr.xml to core discovery is very straightforward.
Moving <core> definitions
To migrate to discovery-based solr.xml, you should start by removing all of the <core> elements and the
enclosing <cores> element from solr.xml. The individual <core> elements should be replaced with
core.properties files in the instanceDir for each of your existing SolrCores. The instanceDir property must be
removed as it is now defined by the location of the core.properties file, but all other existing attributes can be
specified in the new file. These property files can be empty if all defaults are acceptable. In
particular: the dataDir will be a sub-directory of the instanceDir named data, and the name of the core is
assumed to be the name of the instanceDir in which the core.properties file was discovered.
Individual attributes on the <cores> element of your older solr.xml file should be added as new child
elements of a new <solrcloud> element, using the appropriate type specific element name (eg: <str>, <int>,
etc...). Note that adminPath and defaultCoreName are no longer supported and should be removed.
<solr> and <shardHandlerFactory> definitions
Individual attributes on the <solr> element of your older solr.xml file should be replaced with new child
elements of the <solr> element, using the appropriate type specific element name (eg: <str>, <int>, etc...).
Note that persist is no longer supported since re-writing the solr.xml is no longer necessary to support
dynamic creation of cores.
If you have a <shardHandlerFactory> configured in your existing solr.xml, it should continue to work
without any modifications in your new solr.xml file.
Example
Here's an example of what a legacy solr.xml file might look like and the equivalent discovery-based solr.xml
and core.properties files:
<solr persistent="${solr.xml.persist:false}">
<cores adminPath="/admin/cores" defaultCoreName="collection1" host="127.0.0.1"
hostPort="${hostPort:8983}"
hostContext="${hostContext:solr}"
zkClientTimeout="${solr.zkclienttimeout:30000}" shareSchema="${shareSchema:false}"
genericCoreNodeNames="${genericCoreNodeNames:true}">
<core name="core1" instanceDir="core1" shard="${shard:}"
collection="${collection:core1}" config="${solrconfig:solrconfig.xml}"
schema="${schema:schema.xml}" coreNodeName="${coreNodeName:}"/>
<core name="core2" instanceDir="core2" />
<shardHandlerFactory name="shardHandlerFactory" class="HttpShardHandlerFactory">
<int name="socketTimeout">${socketTimeout:120000}</int>
<int name="connTimeout">${connTimeout:15000}</int>
</shardHandlerFactory>
</cores>
</solr>
The new-style solr.xml might look like what is below.
<solr>
<solrcloud>
<str name="host">127.0.0.1</str>
<int name="hostPort">${hostPort:8983}</int>
<str name="hostContext">${hostContext:solr}</str>
<int name="zkClientTimeout">${solr.zkclienttimeout:30000}</int>
<str name="shareSchema">${shareSchema:false}</str>
<bool name="genericCoreNodeNames">${genericCoreNodeNames:true}</bool>
</solrcloud>
<shardHandlerFactory name="shardHandlerFactory" class="HttpShardHandlerFactory">
<int name="socketTimeout">${socketTimeout:120000}</int>
<int name="connTimeout">${connTimeout:15000}</int>
</shardHandlerFactory>
</solr>
In each of "core1" and "core2" directories, there would be a file that might look like these.core.properties
core1:
name=core1
shard=${shard:}
collection=${collection:core1}
config=${solrconfig:solrconfig.xml}
schema=${schema:schema.xml}
coreNodeName=${coreNodeName:}
core2:
name=core2
In fact, the core2 core.properties file could even be empty and the name would default to the directory in
which the core.properties file was found.
Upgrading a Solr 4.x Cluster to Solr 5.0
This page covers how to upgrade an existing Solr 4.x cluster running on Linux platforms to Solr 5.0.
Planning Your Upgrade
Upgrade Process
Step 1: Stop Solr
Step 2: Install Solr 5 as a Service
Step 3: Set Environment Variable Overrides
Step 4: Start Solr 5
Step 5: Run Healthcheck
Planning Your Upgrade
Here is a checklist of things you need to prepare before starting the upgrade process:
Examine the Upgrading Solr page to determine if any changes in Solr 5 will affect your installation.
If not using replication, then you should make a backup of each collection. If all of your collections use
replication, then you don't technically need to make a backup since you will be upgrading and verifying
each node individually.
Determine which Solr node is hosting the Overseer leader process in SolrCloud, as you should upgrade
this node last. To determine the Overseer, use the Overseer Status API, see: Collections API.
Plan to perform your upgrade during a system maintenance window if possible. You'll be doing a rolling
restart of your cluster (each node, one-by-one), but we still recommend doing the upgrade when system
usage is minimal.
Verify the cluster is currently healthy and all replicas are active, as you should not perform an upgrade on
a degraded cluster.
Ensure your existing cluster is not using the legacy solr.xml format. If you are, then you need to migrate
your existing cluster to use the new solr.xml format before proceeding, see: Moving to the New solr.xml
Format.
Re-build all custom server-side components against Solr 5.0 JAR files.
Determine the values of the following variables that are used by the Solr 5 start scripts:
ZK_HOST: The ZooKeeper connection string your current SolrCloud nodes use to connect to
ZooKeeper; this value will be the same for all nodes in the cluster.
SOLR_HOST: The hostname each Solr node used to register with ZooKeeper when joining the
SolrCloud cluster; this value will be used to set the host Java system property when starting the
new Solr 5 process.
SOLR_PORT: The port each Solr node is listening on, such as 8983.
SOLR_HOME: The absolute path to the Solr home directory for each Solr node; this directory must
contain a solr.xml file. This value will be passed to the new Solr 5 process using the
solr.solr.home system property, see: Solr Cores and solr.xml.
You should now be ready to upgrade your cluster. Please verify this process in a test / staging cluster before
doing it in production.
Upgrade Process
The approach we recommend is to perform the upgrade of each Solr node, one-by-one. In other words, you will
need to stop a node, upgrade it to 5.0, and restart it before moving on to the next node. This means that for a
short period of time, there will be a mix of Solr 4.x and Solr 5.0 nodes running in your cluster. We also assume
that you will point the new Solr 5 node to your existing Solr home directory where the Lucene index files are
managed for each collection on the node. This means that you won't need to move any index files around to
perform the upgrade. This also implies that you can apply this process to an existing install even if your current
cluster runs on something other than the default Jetty container shipped with Solr 4.x. For instance, if your
current cluster runs on Tomcat, you can still use this process to migrate to Solr 5 because you will simply point
the new Solr 5 process at the existing Solr home used by Tomcat.
Step 1: Stop Solr
Begin by stopping the Solr node you want to upgrade. After stopping the node, if using replication, verify that all
leaders hosted on the downed node have successfully migrated to other replicas; you can do this by visiting the
Cloud panel in the Solr Admin UI. If not using replication, then any collections hosted on the downed node will be
temporarily off-line.
Step 2: Install Solr 5 as a Service
Please follow the instructions to install Solr as a Service on Linux documented at Taking Solr to Production. After
running the install script, the new Solr 5 node will be running, so please stop it by doing: sudo service solr
stop. You need to update the /var/solr/solr.in.sh include file to complete the upgrade process.
Step 3: Set Environment Variable Overrides
Open /var/solr/solr.in.sh with a text editor and set the following variables at the bottom of the include
file:
ZK_HOST=
SOLR_MODE=solrcloud
SOLR_HOST=
SOLR_PORT=
SOLR_HOME=
You must set SOLR_HOME to an absolute path on your system and not use a path that includes a symbolic link
due to a limitation in the /etc/init.d/solr script. Also, due to a limitation in the bin/solr start script, you
need to set the SOLR_MODE variable to "solrcloud", i.e. SOLR_MODE=solrcloud. Make sure the user you plan to
own the Solr process is the owner of the SOLR_HOME directory. For instance, if you plan to run Solr as the "solr"
user and SOLR_HOME is /var/solr/data, then you would do: sudo chown -R solr: /var/solr/data
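For example, a filled-in set of overrides might look like the following; the hostnames and paths are purely
illustrative.
ZK_HOST=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181
SOLR_MODE=solrcloud
SOLR_HOST=solr1.example.com
SOLR_PORT=8983
SOLR_HOME=/var/solr/data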
Step 4: Start Solr 5
You are now ready to start the upgraded Solr 5 node by doing: sudo service solr start. The upgraded
instance will join the existing cluster because you're using the same SOLR_HOME, SOLR_PORT, and SOLR_HOST
settings used by the old Solr 4.x node; thus, the new server will look like the old node to the running cluster. Be
sure to look in /var/solr/logs/solr.log for errors during startup.
Step 5: Run Healthcheck
You should run the Solr healthcheck command for all collections that are hosted on the upgraded node before
proceeding to upgrade the next node in your cluster. For instance, if the newly upgraded node hosts a replica for
the MyDocuments collection, then you can run the following command (replace ZK_HOST with the ZooKeeper
connection string):
$ /opt/solr/bin/solr healthcheck -c MyDocuments -z ZK_HOST
Look for any problems reported about any of the replicas for the collection.
Lastly, repeat Steps 1-5 for all nodes in your cluster.
Errata
Errata For This Documentation
Any mistakes found in this documentation after its release will be listed on the on-line version of this page:
https://cwiki.apache.org/confluence/display/solr/Errata
Errata For Past Versions of This Documentation
Any known mistakes in past releases of this documentation will be noted below.
Using copyField directives with suggester: Previous versions of this guide advocated using copyField
directives to accumulate the contents of multiple fields into a single field to be used with Solr suggester
components. This will not work prior to Solr 5.1; attempting to build the suggester will result in errors being
reported in the logs if the field is multiValued. As a work-around, indexing clients should accumulate all of the
contents into the field before sending the documents to Solr, and any fields used with the suggesters should
have multiValued="false".
The variable facet.range.gap parameter was included in documentation even though the patch was not
committed. As of yet there is no ability to specify variable gaps via a comma-separated list for facet.range.gap.
Some of this functionality can be achieved by interval faceting, see SOLR-6216.
