Apache Solr Ref Guide 4.10


Apache Solr Reference Guide
Covering Apache Solr 4.10
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
Apache and the Apache feather logo are trademarks of The Apache Software Foundation. Apache Lucene, Apache
Solr and their respective logos are trademarks of the Apache Software Foundation. Please see the Apache
Trademark Policy for more information.
Apache Solr Reference Guide
This reference guide describes Apache Solr, the open source solution for search. You can download Apache Solr
from the Solr website at http://lucene.apache.org/solr/.
This Guide contains the following sections:
Getting Started: This section guides you through the
installation and setup of Solr.
Using the Solr Administration User Interface: This
section introduces the Solr Web-based user interface.
From your browser you can view configuration files,
submit queries, view logfile settings and Java
environment settings, and monitor and control distributed
configurations.
Documents, Fields, and Schema Design: This section
describes how Solr organizes its data for indexing. It
explains how a Solr schema defines the fields and field
types which Solr uses to organize data within the
document files it indexes.
Understanding Analyzers, Tokenizers, and Filters:
This section explains how Solr prepares text for indexing
and searching. Analyzers parse text and produce a
stream of tokens, lexical units used for indexing and
searching. Tokenizers break field data down into tokens.
Filters perform other transformational or selective work
on token streams.
Indexing and Basic Data Operations: This section
describes the indexing process and basic index
operations, such as commit, optimize, and rollback.
Searching: This section presents an overview of the
search process in Solr. It describes the main components
used in searches, including request handlers, query
parsers, and response writers. It lists the query parameters
that can be passed to Solr, and it describes features such
as boosting and faceting, which can be used to fine-tune
search results.
The Well-Configured Solr Instance: This section
discusses performance tuning for Solr. It begins with an
overview of the solrconfig.xml file, then tells you how
to configure cores with solr.xml, how to configure the
Lucene index writer, and more.
Managing Solr: This section discusses important topics for
running and monitoring Solr. It describes running Solr in
the Apache Tomcat servlet runner and Web server. Other
topics include how to back up a Solr instance, and how to
run Solr with Java Management Extensions (JMX).
SolrCloud: This section describes the newest and most
exciting of Solr's new features, SolrCloud, which provides
comprehensive distributed capabilities.
Legacy Scaling and Distribution: This section tells you
how to grow a Solr distribution by dividing a large index
into sections called shards, which are then distributed
across multiple servers, or by replicating a single index
across multiple servers.
Client APIs: This section tells you how to access Solr
through various client APIs, including JavaScript, JSON,
and Ruby.
About This Guide
This guide describes all of the important features and functions of Apache Solr. It is free to download from
http://lucene.apache.org/solr/.
Designed to provide high-level documentation, this guide is intended to be more encyclopedic and less of a
cookbook. It is structured to address a broad spectrum of needs, ranging from new developers getting started to
well-experienced developers extending their application or troubleshooting. It will be of use at any point in the
application life cycle, for whenever you need authoritative information about Solr.
The material as presented assumes that you are familiar with some basic search concepts and that you can read
XML. It does not assume that you are a Java programmer, although knowledge of Java is helpful when working
directly with Lucene or when developing custom extensions to a Lucene/Solr installation.
Special Inline Notes
Special notes are included throughout these pages.
Note Type - Look & Description
Information - Notes with a blue background are used for information that is important for you to know.
Notes - Yellow notes are further clarifications of important points to keep in mind while using Solr.
Tip - Notes with a green background are Helpful Tips.
Warning - Notes with a red background are warning messages.
Hosts and Port Examples
The default port configured for Solr during the install process is 8983. The samples, URLs and screenshots in this
guide may show different ports, because the port number that Solr uses is configurable. If you have not customized
your installation of Solr, please make sure that you use port 8983 when following the examples, or configure your
own installation to use the port numbers shown in the examples. For information about configuring port numbers
used by Tomcat or Jetty, see Managing Solr.
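If you are running the Jetty server bundled in the example directory, the listen port is normally taken from the
jetty.port system property in example/etc/jetty.xml. The following is only a sketch of the relevant setting,
assuming the stock 4.x example configuration (your jetty.xml may use a different connector class):

<!-- Sketch only: reads the port from the jetty.port system property, defaulting to 8983 -->
<Call name="addConnector">
  <Arg>
    <New class="org.eclipse.jetty.server.bio.SocketConnector">
      <Set name="port"><SystemProperty name="jetty.port" default="8983"/></Set>
    </New>
  </Arg>
</Call>

With a setting like this, starting Jetty with java -Djetty.port=8984 -jar start.jar overrides the default port.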
Similarly, URL examples use 'localhost' throughout; if you are accessing Solr from a location remote to the server
hosting Solr, replace 'localhost' with the proper domain or IP where Solr is running.
Paths
Path information is given relative to solr.home, which is the location under the main Solr installation where Solr's
collections and their conf and data directories are stored. In the default Solr package, solr.home is
example/solr, which is itself relative to where you unpackaged the application; if you have moved this location
for your servlet container or for another reason, the path to solr.home may be different than the default.
Getting Started
Solr makes it easy for programmers to develop sophisticated, high-performance search applications with advanced
features such as faceting (arranging search results in columns with numerical counts of key terms). Solr builds on
another open source search technology: Lucene, a Java library that provides indexing and search technology, as
well as spellchecking, hit highlighting and advanced analysis/tokenization capabilities. Both Solr and Lucene are
managed by the Apache Software Foundation (http://www.apache.org).
The Lucene search library currently ranks among the top 15 open source projects and is one of the top 5 Apache
projects, with installations at over 4,000 companies. Lucene/Solr downloads have grown nearly ten times over the
past three years, with a current run-rate of over 6,000 downloads a day. The Solr search server, which provides
application builders a ready-to-use search platform on top of the Lucene search library, is the fastest growing
Lucene sub-project. Apache Lucene/Solr offers an attractive alternative to the proprietary licensed search and
discovery software vendors.
This section helps you get Solr up and running quickly, and introduces you to the basic Solr architecture and
features. It covers the following topics:
Installing Solr: A walkthrough of the Solr installation process.
Running Solr: An introduction to running Solr. Includes information on starting up the servers, adding documents,
and running queries.
A Quick Overview: A high-level overview of how Solr works.
A Step Closer: An introduction to Solr's home directory and configuration options.
Installing Solr
This section describes how to install Solr. You can install Solr anywhere that a suitable Java Runtime Environment
(JRE) is available, as detailed below. Currently this includes Linux, OS X, and Microsoft Windows. The instructions
in this section should work for any platform, with a few exceptions for Windows as noted.
Got Java?
You will need the Java Runtime Environment (JRE) version 1.7 or higher. At a command line, check your Java
version like this:
$ java -version
java version "1.7.0_55"
Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)
The output will vary, but you need to make sure you have version 1.7 or higher. If you don't have the required
version, or if the java command is not found, download and install the latest version from Oracle at
http://www.oracle.com/technetwork/java/javase/downloads/index.html.
Installing Solr
Solr is available from the Solr website at http://lucene.apache.org/solr/.
For Linux/Unix/OSX systems, download the .tgz file. For Microsoft Windows systems, download the .zip file.
Solr runs inside a Java servlet container such as Tomcat, Jetty, or Resin. The Solr distribution includes a working
demonstration server in the example directory that runs in Jetty. You can use the example server as a template for
your own installation, whether or not you are using Jetty as your servlet container. For more information about the
demonstration server, see the Solr Tutorial.

To install Solr:

1. Unpack the Solr distribution to your desired location.
2. Stop your Java servlet container.
3. Copy the solr.war file from the Solr distribution to the webapps directory of your servlet container. Do not
   change the name of this file: it must be named solr.war.
4. Copy the Solr Home directory solr-4.x.0/example/solr/ from the distribution to your desired Solr Home
   location.
5. Start your servlet container, passing to it the location of your Solr Home in one of these ways:
   - Set the Java system property solr.solr.home to your Solr Home (for example, using the example jetty
     setup: java -Dsolr.solr.home=/some/dir -jar start.jar).
   - Configure the servlet container so that a JNDI lookup of java:comp/env/solr/home by the Solr webapp
     will point to your Solr Home (a hedged Tomcat example is sketched after this list).
   - Start the servlet container in the directory containing ./solr: the default Solr Home is solr under the
     JVM's current working directory ($CWD/solr).
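For example, with Tomcat the JNDI approach is usually handled through a context fragment. The snippet below is
only a sketch, and the file location and paths shown are hypothetical:

<!-- Hypothetical $CATALINA_HOME/conf/Catalina/localhost/solr.xml context fragment -->
<Context docBase="/opt/tomcat/webapps/solr.war" debug="0" crossContext="true">
  <!-- Makes java:comp/env/solr/home resolve to the Solr Home directory -->
  <Environment name="solr/home" type="java.lang.String"
               value="/opt/solr/home" override="true"/>
</Context>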
To confirm your installation, go to the Solr Admin page at http://localhost:8983/solr/. Note that your
servlet container may have started on a different port: check the documentation for your servlet container to
troubleshoot that issue. Also note that if that port is already in use, Solr will not start. In that case, shut down the
servlet container running on that port, or change your Solr port.
For more information about installing and running Solr on different Java servlet containers, see the SolrInstall page
on the Solr Wiki.
Related Topics
SolrInstall
Running Solr
This section describes how to run Solr with an example schema, how to add documents, and how to run queries.
Start the Server
If you didn't start Solr after installing it, you can start it by running bin/solr from the Solr directory.
$ bin/solr -f
If you are running Windows, you can start the Web server by running bin\solr.cmd instead.
C:\Applications\Solr\bin\solr.cmd -f
Solr ships with a working Jetty server, with optimized settings for Solr, inside the example directory. It is
recommended that you use the provided Jetty server for optimal performance. If you absolutely must use a different
servlet container, then see the previous section on how to install Solr.

This will start Solr in the foreground, listening on port 8983. The bin/solr and bin\solr.cmd scripts allow you to
customize how you start Solr. Let's work through a few examples of using the bin/solr script (if you're running
Solr on Windows, bin\solr.cmd works the same as what is shown in the examples below):
Solr Script Options
The bin/solr script has several options.
Script Help
To see how to use the bin/solr script, execute:
$ bin/solr -help
For specific usage instructions for the start command, do:
$ bin/solr start -help
Start Solr in the Background
As you saw above, the -f flag will start Solr running in the foreground. Since Solr is a server, it is more common to
run it in the background, especially on Unix/Linux. To start Solr running in the background, simply do:
$ bin/solr start
When you start Solr in the background, the script will wait to make sure Solr starts correctly before returning to the
command line prompt.
Start Solr with a Different Port
To change the port Solr listens on, you can use the -p parameter, such as:
$ bin/solr start -p 8984
Stop Solr
When running Solr in the foreground (using -f), you can stop it using Ctrl-C. However, when running in the
background, you should use the stop command, such as:
$ bin/solr stop
Start Solr with a Specific Example Configuration
Solr also provides a number of useful examples to help you learn about key features. You can launch the examples
using the -e flag. For instance, to launch the Data Import Handler example, you would do:
$ bin/solr -e dih
Currently, the available examples you can run are: default, dih, schemaless, and cloud.
Check if Solr is Running
If you're not sure if Solr is running locally, you can use the info flag (-i):
$ bin/solr -i
This will search for running Solr instances on your computer and then gather basic information about them, such as
the version and memory usage.
For more information on starting Solr in cloud mode, see Getting Started with SolrCloud.
That's it! Solr is running. If you need convincing, use a Web browser to see the Admin Console.
http://localhost:8983/solr/
The Solr Admin interface.
If Solr is not running, your browser will complain that it cannot connect to the server. Check your port number and try
again.
Add Documents
Solr is built to find documents that match queries. Solr's schema provides an idea of how content is structured (more
on the schema later), but without documents there is nothing to find. Solr needs input before it can do anything.
You may want to add a few sample documents before trying to index your own content. The Solr installation comes
with example documents located in the example/exampledocs directory of your installation.
In the exampledocs directory is the SimplePostTool, a Java-based command line tool, post.jar, which can be
used to index the documents. Do not worry too much about the details for now. The Indexing and Basic Data
Operations section has all the details on indexing.
To see some information about the usage of post.jar, use the -help option.
$ java -jar post.jar -help
The SimplePostTool is a simple command line tool for POSTing raw XML to a Solr port. XML data can be read from
files specified as command line arguments, as raw command line arg strings, or via STDIN.
Examples:
java -Ddata=files -jar post.jar *.xml
java -Ddata=args -jar post.jar '<delete><id>42</id></delete>'
java -Ddata=stdin -jar post.jar < hd.xml
Other options controlled by System Properties include the Solr URL to POST to, and whether a commit should be
executed. These are the defaults for all System Properties:
-Ddata=files
-Durl=http://localhost:8983/solr/update
-Dcommit=yes
Go ahead and add all the documents in the exampledocs directory as follows:
$ java -Durl=http://localhost:8983/solr/update -jar post.jar *.xml
SimplePostTool: version 1.2
SimplePostTool: WARNING: Make sure your XML documents are encoded in UTF-8, other
encodings are not currently supported
SimplePostTool: POSTing files to http://10.211.55.8:8983/solr/update..
SimplePostTool: POSTing file hd.xml
SimplePostTool: POSTing file ipod_other.xml
SimplePostTool: POSTing file ipod_video.xml
SimplePostTool: POSTing file mem.xml
SimplePostTool: POSTing file monitor.xml
SimplePostTool: POSTing file monitor2.xml
SimplePostTool: POSTing file mp500.xml
SimplePostTool: POSTing file sd500.xml
SimplePostTool: POSTing file solr.xml
SimplePostTool: POSTing file spellchecker.xml
SimplePostTool: POSTing file utf8-example.xml
SimplePostTool: POSTing file vidcard.xml
SimplePostTool: COMMITting Solr index changes..
Time spent: 0:00:00.633
$
That's it! Solr has indexed the documents contained in the files.
Ask Questions
Now that you have indexed documents, you can perform queries. The simplest way is by building a URL that
includes the query parameters. This is exactly the same as building any other HTTP URL.
For example, the following query searches all document fields for "video":
http://localhost:8983/solr/select?q=video
Notice how the URL includes the host name (localhost), the port number where the server is listening (8983), the
application name (solr), the request handler for queries (select), and finally, the query itself (q=video).
The results are contained in an XML document, which you can examine directly by clicking on the link above. The
document contains two parts. The first part is the responseHeader, which contains information about the response
itself. The main part of the reply is in the result tag, which contains one or more doc tags, each of which contains
fields from documents that match the query. You can use standard XML transformation techniques to mold Solr's
results into a form that is suitable for displaying to users. Alternatively, Solr can output the results in JSON, PHP,
Ruby and even user-defined formats.
Just in case you are not running Solr as you read, the following screen shot shows the result of a query (the next
example, actually) as viewed in Mozilla Firefox. The top-level response contains a lst named responseHeader and
a result named response. Inside result, you can see the three docs that represent the search results.
An XML response to a query.
Once you have mastered the basic idea of a query, it is easy to add enhancements to explore the query syntax. This
one is the same as before but the results only contain the ID, name, and price for each returned document. If you
don't specify which fields you want, all of them are returned.
http://localhost:8983/solr/select?q=video&fl=id,name,price
Here is another example which searches for "black" in the name field only. If you do not tell Solr which field to
search, it will search default fields, as specified in the schema.
http://localhost:8983/solr/select?q=name:black
You can provide ranges for fields. The following query finds every document whose price is between $0 and $400.
http://localhost:8983/solr/select?q=price:[0%20TO%20400]&fl=id,name,price
Faceted browsing is one of Solr's key features. It allows users to narrow search results in ways that are meaningful
to your application. For example, a shopping site could provide facets to narrow search results by manufacturer or
price.
Faceting information is returned as a third part of Solr's query response. To get a taste of this power, take a look at
the following query. It adds facet=true and facet.field=cat.
http://localhost:8983/solr/select?q=price:[0%20TO%20400]&fl=id,name,price&facet=true&
facet.field=cat
In addition to the familiar responseHeader and response from Solr, a facet_counts element is also present.
Here is a view of the XML response with faceting, with the responseHeader and response collapsed so you can
see the faceting information clearly:
<response>
<lst name="responseHeader">
...
</lst>
<result name="response" numFound="9" start="0">
<doc>
<str name="id">SOLR1000</str>
<str name="name">Solr, the Enterprise Search Server</str>
<float name="price">0.0</float></doc>
...
</result>
<lst name="facet_counts">
<lst name="facet_queries"/>
<lst name="facet_fields">
<lst name="cat">
<int name="electronics">6</int>
<int name="memory">3</int>
<int name="search">2</int>
<int name="software">2</int>
<int name="camera">1</int>
<int name="copier">1</int>
<int name="multifunction printer">1</int>
<int name="music">1</int>
<int name="printer">1</int>
<int name="scanner">1</int>
<int name="connector">0</int>
<int name="currency">0</int>
<int name="graphics card">0</int>
<int name="hard drive">0</int>
<int name="monitor">0</int>
</lst>
</lst>
<lst name="facet_dates"/>
<lst name="facet_ranges"/>
</lst>
</response>
The facet information shows how many of the query results have each possible value of the cat field. You could
easily use this information to provide users with a quick way to narrow their query results. You can filter results by
adding one or more filter queries to the Solr request. Here is a request further constraining the request to documents
with a category of "software".
http://localhost:8983/solr/select?q=price:[0%20TO%20400]&fl=id,name,price&facet=true&
facet.field=cat&fq=cat:software
A Quick Overview
Having had some fun with Solr, you will now learn about all the cool things it can do.
Here is a typical configuration:
In the scenario above, Solr runs alongside another application in a Web server. For example, an online store
application would provide a user interface, a shopping cart, and a way to make purchases. The store items would be
kept in some kind of database.
Solr makes it easy to add the capability to search through the online store through the following steps:
1. Define a schema. The schema tells Solr about the contents of documents it will be indexing. In the online
   store example, the schema would define fields for the product name, description, price, manufacturer, and so
   on. Solr's schema is powerful and flexible and allows you to tailor Solr's behavior to your application. See
   Documents, Fields, and Schema Design for all the details (a hedged sketch of such field definitions follows
   this list).
2. Deploy Solr to your application server.
3. Feed Solr the documents for which your users will search.
4. Expose search functionality in your application.
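For instance, a few field definitions for the hypothetical online store could look like the sketch below. The field
names and types are illustrative only and are not the definitions shipped in the example schema.xml:

<!-- Illustrative schema.xml field definitions for an online store -->
<field name="id"           type="string"       indexed="true" stored="true" required="true"/>
<field name="name"         type="text_general" indexed="true" stored="true"/>
<field name="description"  type="text_general" indexed="true" stored="true"/>
<field name="manufacturer" type="string"       indexed="true" stored="true"/>
<field name="price"        type="float"        indexed="true" stored="true"/>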
Because Solr is based on open standards, it is highly extensible. Solr queries are RESTful, which means, in
essence, that a query is a simple HTTP request URL and the response is a structured document: mainly XML, but it
could also be JSON, CSV, or some other format. This means that a wide variety of clients will be able to use Solr,
from other web applications to browser clients, rich client applications, and mobile devices. Any platform capable of
HTTP can talk to Solr. See Client APIs for details on client APIs.
Solr is based on the Apache Lucene project, a high-performance, full-featured search engine. Solr offers support for
the simplest keyword searching through to complex queries on multiple fields and faceted search results. Searching
has more information about searching and queries.
If Solr's capabilities are not impressive enough, its ability to handle very high-volume applications should do the trick.
A relatively common scenario is that you have so many queries that the server is unable to respond fast enough to
each one. In this case, you can make copies of an index. This is called replication. Then you can distribute incoming
queries among the copies in any way you see fit. A round-robin mechanism is one simple way to do this.
Another useful technique is sharding. If you have so many documents that you simply cannot fit them all on a single
box for RAM or index size reasons, you can split an index into multiple pieces, called shards. Each shard lives on its
own physical server. An incoming query is sent to all the shard servers, which respond with matching results.
If you have huge numbers of documents and users, you might need to combine the techniques of sharding and
replication. In this case, Solr's new SolrCloud functionality may be more effective for your needs. SolrCloud includes
a number of features to simplify the process of distributing the index and the queries, and manage the resulting
nodes.
For full details on sharding and replication, see Legacy Scaling and Distribution. We've split the SolrCloud
information into its own section, called SolrCloud.
Best of all, this talk about high-volume applications is not just hypothetical: some of the famous Internet sites that
use Solr today are Macy's, EBay, and Zappo's.
For more information, take a look at .https://wiki.apache.org/solr/PublicServers
A Step Closer
You already have some idea of Solr's schema. This section describes Solr's home directory and other configuration
options.
When Solr runs in an application server, it needs access to a home directory. The home directory contains important
configuration information and is the place where Solr will store its index.
The crucial parts of the Solr home directory are shown here:
<solr-home-directory>/
solr.xml
conf/
solrconfig.xml
schema.xml
data/
You supply solr.xml, solrconfig.xml, and schema.xml to tell Solr how to behave. By default, Solr stores its
index inside data.
solr.xml specifies configuration options for your Solr core, and also allows you to configure multiple cores. For
more information on solr.xml, see The Well-Configured Solr Instance.
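As an illustration, a minimal legacy-style solr.xml that defines two cores might look like the sketch below (the
core names are hypothetical; Solr 4.x also supports the newer "discovery" style, in which cores are found through
core.properties files):

<!-- Sketch of a legacy-style solr.xml; core names are hypothetical -->
<solr persistent="true">
  <cores adminPath="/admin/cores" defaultCoreName="collection1">
    <core name="collection1" instanceDir="collection1"/>
    <core name="products" instanceDir="products"/>
  </cores>
</solr>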
solrconfig.xml controls high-level behavior. You can, for example, specify an alternate location for the data
directory. For more information on solrconfig.xml, see The Well-Configured Solr Instance.
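For example, the index location can be changed with the <dataDir> element in solrconfig.xml; the path below
is illustrative only:

<!-- Illustrative: keep the index outside the instance directory -->
<dataDir>${solr.data.dir:/var/data/solr/collection1}</dataDir>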
schema.xml describes the documents you will ask Solr to index. Inside schema.xml, you define a document as a
collection of fields. You get to define both the field types and the fields themselves. Field type definitions are
powerful and include information about how Solr processes incoming field values and query values. For more
information on schema.xml, see Documents, Fields, and Schema Design.
Upgrading Solr
If you are already using Solr 4.9, Solr 4.10 should not present any major problems. However, you should review the
CHANGES.txt file found in your Solr package for changes and updates that may affect your existing
implementation.
Upgrading from 4.9.x
In Solr 3.6, all primitive field types were changed to omit norms by default when the schema version is 1.5 or
greater (SOLR-3140), but TrieDateField's default was mistakenly not changed. As of Solr 4.10, TrieDateField
omits norms by default (see SOLR-6211).
Creating a SolrCore via CoreContainer.create() no longer requires an additional call to
CoreContainer.register() to make it available to clients (see SOLR-6170).
CoreContainer.remove() has been removed. You should now use CoreContainer.unload() to delete a
SolrCore (see SOLR-6232).
solr.xml parsing has been improved to better account for the expected data types of various options. As
part of this fix, additional error checking has also been added to provide errors in the event of duplicated
options, or unknown option names that may indicate a typo. Users who have modified their solr.xml in the
past and now upgrade may get errors on startup if they have typos or unexpected options specified in their
solr.xml file. (See SOLR-5746 for more information.)
Upgrading from Older Versions of Solr
This is a summary of some of the key issues related to upgrading in previous versions of Solr. Users upgrading from
older versions are strongly encouraged to consult CHANGES.txt for the details of all changes since the version they
are upgrading from.
In Solr 4.9, support for DiskDocValuesFormat (i.e., fieldTypes configured with docValuesFormat="Disk")
was removed due to poor performance. If you have existing fieldTypes using DiskDocValuesFormat, please
modify your schema.xml to remove the docValuesFormat attribute, and optimize your index to rewrite it into
the default codec prior to upgrading to 4.9 or later. See LUCENE-5761 for more details.
Beginning with Solr 4.8, Java 7 or greater is required. When using Oracle Java 7 or OpenJDK 7, be sure to not
use the GA build 147 or update versions u40, u45, and u51! We recommend using u55 or later. An overview
of known JVM bugs can be found at http://wiki.apache.org/lucene-java/JavaBugs.
Prior to Solr 4.8, terms that exceeded Lucene's MAX_TERM_LENGTH were silently ignored when indexing
documents. Beginning with Solr 4.8, an error will be generated when attempting to index a document with a term
that is too large. If you wish to continue to have large terms ignored, use solr.LengthFilterFactory in all of
your Analyzers. See LUCENE-5472 for more details.
The <fields> and <types> tags in schema.xml were deprecated in Solr 4.8. There is no longer any reason to
keep them in the schema file; they may be safely removed. This allows intermixing of <fieldType>, <field>,
and <copyField> definitions if desired. Currently, these tags are supported so either style may be implemented.
They may be deprecated formally in 5.0. See SOLR-5228 for more details.
In Solr 4.7, due to a bug in previous versions, the default value of the discountOverlap property of
DefaultSimilarity was not being set appropriately if you were using the implicit DefaultSimilarityFactory
instead of explicitly configuring it. To preserve consistent behavior for people who upgrade, the implicit behavior
is now contingent on the <luceneMatchVersion/>: discountOverlap=false for 4.6 and below,
discountOverlap=true for 4.7 and above. See SOLR-5561 for more information.
In Solr 4.6, the "file" attribute of infoStream in solrconfig.xml was removed. Control this via your logging
configuration (org.apache.solr.update.LoggingInfoStream) instead.
In Solr 4.5, XML configuration parsing was made more strict about situations where a single setting is allowed
but multiple values are found. Configuration parsing now fails with an error in situations like this. Also,
schema.xml parsing was made more strict: "default" or "required" options specified on <dynamicField/>
declarations will cause an init error. You can safely remove these attributes.
In Solr 4.5, CloudSolrServer can now use multiple threads to add documents by default. This is a small change
in runtime semantics when using the bulk add method: you will still end up with the same exception on a failure,
but some documents beyond the one that failed may have made it in. To get the old, single threaded behavior, set
parallel updates to false on the CloudSolrServer instance.
Beginning with 4.4, the use of the Compound File Format is determined by IndexWriter configuration, and not the
Merge Policy. If you have explicitly configured a <mergePolicy> with the setUseCompoundFile configuration
option, you should change this to use the useCompoundFile configuration option directly in the <indexConfig>
block. Specifying setUseCompoundFile on the Merge Policy will no longer work in Solr 5.0 (a hedged
configuration sketch appears after this list).
In Solr 4.4, ByteField and ShortField were deprecated, and will be removed in 5.0. Please switch to using
TrieIntField.
The pre-4.3.0 "legacy" solr.xml mode and format will no longer be supported in Solr 5.0. Users are encouraged
to migrate from "legacy" to "discovery" solr.xml configurations; see Solr Cores and solr.xml.
As of Solr 4.3 the slf4j/logging jars are no longer included in the Solr webapp to allow for more flexibility in
logging.
Minor changes were made to the Schema API response format in Solr 4.3.
In Solr 4.1 the method Solr uses to identify node names for SolrCloud was changed. If you are using
SolrCloud and upgrading from Solr 4.0, you may have issues with unknown or lost nodes. If this occurs, you
can manually set the host parameter either in solr.xml or as a system variable. More information can be found
in the section on SolrCloud.
If you are upgrading from Solr 3.x, you should familiarize yourself with the Major Changes from Solr 3 to Solr 4.
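As a sketch of the compound file change noted above, the deprecated merge-policy setting and its replacement in
the <indexConfig> block of solrconfig.xml compare roughly as follows (the values shown are illustrative):

<!-- Old style (deprecated): set through the merge policy's setUseCompoundFile setter -->
<indexConfig>
  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <bool name="useCompoundFile">true</bool>
  </mergePolicy>
</indexConfig>

<!-- New style (Solr 4.4 and later): set directly on <indexConfig> -->
<indexConfig>
  <useCompoundFile>true</useCompoundFile>
</indexConfig>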
Using the Solr Administration User Interface
This section discusses the Solr Administration User Interface ("Admin UI").
The Overview of the Solr Admin UI explains the features of the user interface that are new with Solr 4, what's
on the initial Admin UI page, and how to configure the interface. In addition, there are pages describing each screen
of the Admin UI:
Getting Assistance shows you how to get
more information about the UI.
Logging explains the various logging levels
available and how to invoke them.
Cloud Screens display information about
nodes when running in SolrCloud mode.
Core Admin explains how to get
management information about each core.
Java Properties shows the Java
information about each core.
Thread Dump lets you see detailed
information about each thread, along with
state information.
Core-Specific Tools is a section explaining additional
screens available for each named core.
Analysis - lets you analyze the data found in
specific fields.
Dataimport - shows you information about the
current status of the Data Import Handler.
Documents - provides a simple form allowing you to
execute various Solr indexing commands directly
from the browser.
Files - shows the current core configuration files
such as solrconfig.xml and schema.xml.
Ping - lets you ping a named core and determine
whether the core is active.
Plugins/Stats - shows statistics for plugins and other
installed components.
Query - lets you submit a structured query about
various elements of a core.
Replication - shows you the current replication
status for the core, and lets you enable/disable
replication.
Schema Browser - displays schema data in a
browser window.
Overview of the Solr Admin UI
Solr features a Web interface that makes it easy for Solr administrators and programmers to view Solr configuration
details, run queries and analyze document fields in order to fine-tune a Solr configuration, and access online
documentation and other help.
With Solr 4, the Solr Admin has been completely redesigned. The redesign was completed with these benefits in
mind:
load pages quicker
access and control functionality from the Dashboard
re-use the same servlets that access Solr-related data from an external interface, and
ignore any differences between working with one or multiple cores.
Accessing the URL http://hostname:8983/solr/ (if running Jetty on the default port of 8983) will show the
main dashboard, which is divided into two parts.
The left side of the screen is a menu under the Solr logo that provides navigation through the screens of the UI.
The first set of links are for system-level information and configuration and provide access to Logging, Core Admin
and Java Properties, among other things. At the end of this information is a list of Solr cores configured for this
instance. Clicking on a core name shows a secondary menu of information and configuration options for the core
specifically. Items in this list include the Schema, Config, Plugins, and an ability to perform Queries on indexed data.
The center of the screen shows the detail of the option selected. This may include a sub-navigation for the option or
text or graphical representation of the requested data. See the sections in this guide for each screen for more
details.
Configuring the Admin UI in solrconfig.xml
You can configure the Solr Admin UI by editing the file solrconfig.xml.
The <admin> block in the solrconfig.xml file determines the default query to be displayed in the Query section
of the core-specific pages. The default is *:*, which is to find all documents. In this example, we have changed the
default to the term solr.
<admin>
<defaultQuery>solr</defaultQuery>
</admin>
Related Topics
Configuring solrconfig.xml
Getting Assistance
At the bottom of each screen of the Admin UI is a set of links that can be used to get more assistance with
configuring and using Solr.
Assistance icons
These icons include the following links.
Link - Description
Documentation - Navigates to the Apache Solr documentation hosted on http://lucene.apache.org/solr/.
Issue Tracker - Navigates to the JIRA issue tracking server for the Apache Solr project. This server resides at
http://issues.apache.org/jira/browse/SOLR.
IRC Channel - Connects you to the web interface for Solr's IRC channel. This channel is found on
irc.freenode.net, Port 7000, channel #solr.
Community forum - Connects you to the Solr community forum, which at the current time is a set of mailing lists
and their archives.
Solr Query Syntax - Navigates to the Apache Wiki page describing the Solr query syntax:
http://wiki.apache.org/solr/SolrQuerySyntax.
These links cannot be modified without editing the admin.html in the solr.war that contains the Admin UI files.
Logging
The Logging page shows messages from Solr's log files.
When you click the link for "Logging", a page similar to the one below will be displayed:
The Main Logging Screen
While this example shows logged messages for only one core, if you have multiple cores in a single instance, they
will each be listed, with the level for each.
Selecting a Logging Level
When you select the Level link on the left, you see the hierarchy of classpaths and classnames for your instance. A
row highlighted in yellow indicates that the class has logging capabilities. Click on a highlighted row, and a menu will
appear to allow you to change the log level for that class. Characters in boldface indicate that the class will not be
affected by level changes to root.
For an explanation of the various logging levels, see Configuring Logging.
Cloud Screens
When running in SolrCloud mode, an option will appear in the Admin UI between Logging and Core Admin for
Cloud. It's not possible at the current time to manage the nodes of the SolrCloud cluster, but you can view them and
open the Solr Admin UI on each node to view the status and statistics for the node and each core on each node.
Click on the Cloud option in the left-hand navigation, and a small sub-menu appears with options called "Tree",
"Graph", "Graph (Radial)" and "Dump". The default view (which is "Graph") shows a graph of each core and the
addresses of each node. This example shows a very simple two-node cluster with a single core:
The "Graph (Radial)" option provides a different visual view of each node. Using the same simple two-node cluster,
the radial graph view looks like:
The "Tree" option shows a directory structure of the files in ZooKeeper, including clusterstate.json,
configuration files, and other status and information files. In this example, we show the leader definition files for the
core named "collection1":
The final option is "Dump", which allows you to download an XML file with all the ZooKeeper configuration files.
Core Admin
The Core Admin screen lets you manage your cores.
The buttons at the top of the screen let you add a new core, unload the core displayed, rename the currently
displayed core, swap the existing core with one that you specify in a drop-down box, reload the current core, and
optimize the current core.
The main display and available actions correspond to the commands used with the CoreAdminHandler, but provide
another way of working with your cores.
Java Properties
The Java Properties screen provides easy access to one of the most essential components of a top-performing Solr
system. With the Java Properties screen, you can see all the properties of the JVM running Solr, including the class
paths, file encodings, JVM memory settings, operating system, and more.
Thread Dump
The Thread Dump screen lets you inspect the currently active threads on your server. Each thread is listed and
access to the stacktraces is available where applicable. Icons to the left indicate the state of the thread: for example,
threads with a green check-mark in a green circle are in a "RUNNABLE" state. On the right of the thread name, a
down-arrow means you can expand to see the stacktrace for that thread.
When you move your cursor over a thread name, a box floats over the name with the state for that thread. Thread
states can be:
State Meaning
NEW A thread that has not yet started.
RUNNABLE A thread executing in the Java virtual machine.
BLOCKED A thread that is blocked waiting for a monitor lock.
WAITING A thread that is waiting indefinitely for another thread to perform a particular action.
TIMED_WAITING A thread that is waiting for another thread to perform an action for up to a specified waiting
time.
TERMINATED A thread that has exited.
When you click on one of the threads that can be expanded, you'll see the stacktrace, as in the example below:
Inspecting a thread
You can also check the Show all Stacktraces button to automatically enable expansion for all threads.
Core-Specific Tools
In the left-hand navigation bar, you will see a pull-down menu titled "Core Selector". Clicking on the menu will show
a list of Solr cores, with a search box that can be used to find a specific core (handy if you have a lot of cores).
When you select a core, such as collection1 in the example, a secondary menu opens under the core name with
the administration options available for that particular core.
After selecting the core, the central part of the screen shows Statistics and other information about the core you
chose. You can define a file called admin-extra.html that includes links or other information you would like to
display in the "Admin Extra" part of this main screen.
On the left side, under the core name, are links to other screens that display information or provide options for the
specific core chosen. The core-specific options are listed below, with a link to the section of this Guide to find out
more:
Analysis - lets you analyze the data found in specific fields.
Dataimport - shows you information about the current status of the Data Import Handler.
Documents - provides a simple form allowing you to execute various Solr indexing commands directly from
the browser.
Files - shows the current core configuration files such as solrconfig.xml and schema.xml.
Ping - lets you ping a named core and determine whether the core is active.
Plugins/Stats - shows statistics for plugins and other installed components.
Query - lets you submit a structured query about various elements of a core.
Replication - shows you the current replication status for the core, and lets you enable/disable replication.
Schema Browser - displays schema data in a browser window.
Analysis Screen
The Analysis screen lets you inspect how data will be handled according to the field, field type and dynamic rule
configurations found in schema.xml. You can analyze how content would be handled during indexing or during
query processing and view the results separately or at the same time. Ideally, you would want content to be handled
consistently, and this screen allows you to validate the settings in the field type or field analysis chains.
Enter content in one or both boxes at the top of the screen, and then choose the field or field type definitions to use
for analysis.
The standard output (shown if "Verbose Output" is not checked) will display the step of the analysis and the output
based on the current settings. If you click the Verbose Output check box, you see more information, including
transformations to the input (such as, convert to lower case, strip extra characters, etc.) and the bytes, type and
detailed position information. The information displayed will vary depending on the settings of the field or field type.
Each step of the process is displayed in a separate section, with an abbreviation for the tokenizer or filter that is
applied in that step. Hover or click on the abbreviation, and you'll see the name and path of the tokenizer or filter.
In the example screenshot above, several transformations are applied to the text string "Running is a sport." We've
used the field "text", which has rules that remove the "is" and "a", and the word "running" has been changed to its
basic form, "run". This is because we have defined the field type text_en, in this scenario, to remove stop words
(small words that usually do not provide a great deal of context) and "stem" terms when possible to find more
possible matches (this is particularly helpful with plural forms of words). If you click the question mark next to the
Analyze Fieldname/Field Type pull-down menu, the Schema Browser window will open, showing you the settings
for the field specified.
The Understanding Analyzers, Tokenizers, and Filters section describes in detail what each option is and how it
may transform your data, and the Running Your Analyzer section has specific examples for using the Analysis
screen.
Dataimport Screen
The Dataimport screen shows the configuration of the DataImportHandler (DIH) and allows you to start indexing
data, as defined by the options selected on the screen and defined in the configuration file.
The configuration file defines the location of the data and how to perform the SQL queries for the data you want. The
options on the screen control how the data is imported to Solr. For more information about data importing with DIH,
see the section on Uploading Structured Data Store Data with the Data Import Handler.
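As an illustration, a DIH configuration file pairs a data source with one or more entities whose SQL queries produce
the documents. The file name, driver, URL, and table and column names below are hypothetical:

<!-- Hypothetical db-data-config.xml; adjust the driver, URL, and query for your database -->
<dataConfig>
  <dataSource driver="org.hsqldb.jdbcDriver"
              url="jdbc:hsqldb:/path/to/exampledb" user="sa"/>
  <document>
    <entity name="item" query="SELECT id, name, price FROM item">
      <field column="id" name="id"/>
      <field column="name" name="name"/>
      <field column="price" name="price"/>
    </entity>
  </document>
</dataConfig>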
Documents Screen
The Documents screen provides a simple form allowing you to execute various Solr indexing commands in a variety
of formats directly from the browser.
The screen allows you to:
Copy documents in JSON, CSV or XML and submit them to the index
Upload documents (in JSON, CSV or XML)
Construct documents by selecting fields and field values
The first step is to define the RequestHandler to use (aka, 'qt'). By default /update will be defined. To use Solr Cell,
for example, change the request handler to /update/extract.
Then choose the Document Type to define the type of document to load. The remaining parameters will change
depending on the document type selected.
JSON
When using the JSON document type, the functionality is similar to using a requestHandler on the command line.
Instead of putting the documents in a curl command, they can instead be input into the Document entry box. The
document structure should still be in proper JSON format.
Then you can choose when documents should be added to the index (Commit Within), whether existing documents
should be overwritten with incoming documents with the same id (if this is not true, then the incoming documents
will be dropped), and, finally, if a document boost should be applied.
This option will only add or overwrite documents to the index; for other update tasks, see the Solr Command
option.
CSV
When using the CSV document type, the functionality is similar to using a requestHandler on the command line.
Instead of putting the documents in a curl command, they can instead be input into the Document entry box. The
document structure should still be in proper CSV format, with columns delimited and one row per document.
Then you can choose when documents should be added to the index (Commit Within), and whether existing
documents should be overwritten with incoming documents with the same id (if this is not true, then the incoming
documents will be dropped).
Document Builder
The Document Builder provides a wizard-like interface to enter fields of a document.
File Upload
The File Upload option allows choosing a prepared file and uploading it. If using only /update for the
Request-Handler option, you will be limited to XML, CSV, and JSON.
However, to use the ExtractingRequestHandler (aka Solr Cell), you can modify the Request-Handler to
/update/extract. You must have this defined in your solrconfig.xml file, with your desired defaults (a hedged
example definition is sketched at the end of this section). You should also update the &literal.id shown in the
Extracting Req. Handler Params so the file chosen is given a unique id.
Then you can choose when documents should be added to the index (Commit Within), and whether existing
documents should be overwritten with incoming documents with the same id (if this is not true, then the incoming
documents will be dropped).
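A sketch of such a definition, loosely modeled on the example solrconfig.xml, is shown below; the mapped field
names are illustrative, and the handler requires the Solr Cell (extraction) libraries to be on the classpath:

<!-- Sketch of an ExtractingRequestHandler (Solr Cell) definition in solrconfig.xml -->
<requestHandler name="/update/extract"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="fmap.content">text</str>
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>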
Solr Command
The Solr Command option allows you use XML or JSON to perform specific actions on documents, such as defining
documents to be added or deleted, updating only certain fields of documents, or commit and optimize commands on
the index.
The documents should be structured as they would be if using /update on the command line.
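For instance, a delete command that removes one document by id and any documents matching a query could be
entered roughly as follows (the id and query values are hypothetical); a <commit/> can then be submitted as a
separate command:

<!-- Hypothetical delete command for the Solr Command option -->
<delete>
  <id>SP2514N</id>
  <query>category:discontinued</query>
</delete>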
XML
When using the XML document type, the functionality is similar to using a requestHandler on the command line.
Instead of putting the documents in a curl command, they can instead be input into the Document entry box. The
document structure should still be in proper Solr XML format, with each document separated by <doc> tags and
each field defined.
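A sketch of this format with two hypothetical documents is shown below; the field names must exist in your
schema:

<!-- Hypothetical documents in Solr XML format -->
<doc>
  <field name="id">BOOK-001</field>
  <field name="name">A Book About Search</field>
  <field name="price">19.95</field>
</doc>
<doc>
  <field name="id">BOOK-002</field>
  <field name="name">Another Book About Search</field>
  <field name="price">24.50</field>
</doc>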
Then you can choose when documents should be added to the index (Commit Within), and whether existing
documents should be overwritten with incoming documents with the same id (if this is not true, then the incoming
documents will be dropped).
This option will only add or overwrite documents to the index; for other update tasks, see the Solr Command
option.
Related Topics
Uploading Data with Index Handlers
Uploading Data with Solr Cell using Apache Tika
Files Screen
The Files screen lets you browse & view the various configuration files (such as solrconfig.xml and schema.xml)
for the core you selected.
While solrconfig.xml defines the behaviour of Solr as it indexes content and responds to queries, schema.xml
allows you to define the types of data in your content (field types), the fields your documents will be broken into,
and any dynamic fields that should be generated based on patterns of field names in the incoming documents.
Any other configuration files are used depending on how they are referenced in either solrconfig.xml or
schema.xml.
Configuration files cannot be edited with this screen, so a text editor of some kind must be used.
This screen is related to the Schema Browser Screen, in that they both can display information from the schema, but
the Schema Browser provides a way to drill into the analysis chain and displays linkages between field types, fields,
and dynamic field rules.
Many of the options defined in solrconfig.xml and schema.xml are described throughout the rest of this Guide.
In particular, you will want to review these sections:
Indexing and Basic Data Operations
Searching
The Well-Configured Solr Instance
Documents, Fields, and Schema Design
Ping
Choosing Ping under a core name issues a ping request to check whether a server is up.
Ping is configured using a requestHandler in the solrconfig.xml file:
<!-- ping/healthcheck -->
<requestHandler name="/admin/ping" class="solr.PingRequestHandler">
<lst name="invariants">
<str name="q">solrpingquery</str>
</lst>
<lst name="defaults">
<str name="echoParams">all</str>
</lst>
<!-- An optional feature of the PingRequestHandler is to configure the
handler with a "healthcheckFile" which can be used to enable/disable
the PingRequestHandler.
relative paths are resolved against the data dir
-->
<!-- <str name="healthcheckFile">server-enabled.txt</str> -->
</requestHandler>
The Ping option doesn't open a page, but the status of the request can be seen on the core overview page shown
when clicking on a collection name. The length of time the request has taken is displayed next to the Ping option, in
milliseconds.
Plugins & Stats Screen
The Plugins screen shows information and statistics about Solr's status and performance. You can find information
about the performance of Solr's caches, the state of Solr's searchers, and the configuration of searchHandlers and
requestHandlers.
Choose an area of interest on the right, and then drill down into more specifics by clicking on one of the names that
appear in the central part of the window. In this example, we've chosen to look at the Searcher stats, from the Core
area:
Searcher Statistics
The display is a snapshot taken when the page is loaded. You can get updated status by choosing either Watch
Changes or Refresh Values. Watching the changes will highlight those areas that have changed, while refreshing
the values will reload the page with updated information.
Query Screen
You can use Query, shown under the name of each core, to submit a search query to a Solr server and analyze the
results. In the example in the screenshot, a query has been submitted, and the screen shows the query results sent
to the browser as JSON.
The query was sent to a core named "collection1". We used Solr's default query for this screen (as defined in
solrconfig.xml), which is *:*. This query will find all records in the index for this core. We kept the other defaults, but
the table below explains these options, which are also covered in detail in later parts of this Guide.
The response is shown to the right of the form. Requests to Solr are simply HTTP requests, and the query submitted
is shown in light type above the results; if you click on this it will open a new browser window with just this request
and response (without the rest of the Solr Admin UI). The rest of the response is shown in JSON, which is part of the
request (see the wt=json part at the end).
The response has at least two sections, but may have several more depending on the options chosen. The two
sections it always has are the responseHeader and the response. The responseHeader includes the status of
the search (status), the processing time (QTime), and the parameters (params) that were used to process the
query.
The response includes the documents that matched the query, in doc sub-sections. The fields returned depend on
the parameters of the query (and the defaults of the request handler used). The number of results is also included in
this section.
This screen allows you to experiment with different query options, and inspect how your documents were indexed.
The query parameters available on the form are some basic options that most users want to have available, but
there are dozens more available which could be simply added to the basic request by hand (if opened in a browser).
The table below explains the parameters available:
Field - Description

Request-handler (qt) - Specifies the query handler for the request. If a query handler is not specified, Solr
processes the response with the standard query handler.

q - The query event. See Searching for an explanation of this parameter.

fq - The filter queries. See Common Query Parameters for more information on this parameter.

sort - Sorts the response to a query in either ascending or descending order based on the response's score or
another specified characteristic.

start, rows - start is the offset into the query result starting at which documents should be returned. The default
value is 0, meaning that the query should return results starting with the first document that matches. This field
accepts the same syntax as the start query parameter, which is described in Searching. rows is the number of
rows to return.

fl - Defines the fields to return for each document. You can explicitly list the stored fields you want to have returned
by separating them with either a comma or a space. In Solr 4, the results of functions can also be included in the
fl list.

wt - Specifies the Response Writer to be used to format the query response. Defaults to XML if not specified.

indent - Click this button to request that the Response Writer use indentation to make the responses more
readable.

debugQuery - Click this button to augment the query response with debugging information, including "explain info"
for each document returned. This debugging information is intended to be intelligible to the administrator or
programmer.

dismax - Click this button to enable the Dismax query parser. See The DisMax Query Parser for further
information.

edismax - Click this button to enable the Extended query parser. See The Extended DisMax Query Parser for
further information.

hl - Click this button to enable highlighting in the query response. See Highlighting for more information.

facet - Enables faceting, the arrangement of search results into categories based on indexed terms. See Faceting
for more information.

spatial - Click to enable using location data for use in spatial or geospatial searches. See Spatial Search for more
information.

spellcheck - Click this button to enable the Spellchecker, which provides inline query suggestions based on other,
similar, terms. See Spell Checking for more information.
Related Topics
Searching
Replication Screen
The Replication screen shows you the current replication state for the named core you have specified. In Solr,
replication is for the index only. SolrCloud has supplanted much of this functionality, but if you are still using index
replication, you can use this screen to see the replication state:
In this example, replication is enabled and will be done after each commit. Because this server is the Master, it is
showing only the config settings for the master. On the master, you can disable replication by clicking the Disable
Replication button.
In Solr, replication is initiated by the slave servers, so there is more value in looking at the Replication screen on
the slave nodes. This screenshot shows the Replication screen for a slave:
You can click the Refresh Status button to show the most current replication status, or choose to get a new
snapshot from the master server.
More details on how to configure replication are available in the section called Index Replication.
Schema Browser Screen
The Schema Browser screen lets you see schema data in a browser window. If you have accessed this window
from the Analysis screen, it will be opened to a specific field, dynamic field rule or field type. If there is nothing
chosen, use the pull-down menu to choose the field or field type.
The screen provides a great deal of useful information about each particular field. In the example above, we have
chosen the text field. On the right side of the center window, we see the field name, and a list of fields that
populate this field because they are defined to be copied to the text field. Click on one of those field names, and
you can see the definitions for that field. We can also see the field type, which would allow us to inspect the type
definitions as well.
In the left part of the center window, we see the field type again, and the defined properties for the field. We can also
see how many documents have populated this field. Then we see the analyzer used for indexing and query
processing. Click the icon to the left of either of those, and you'll see the definitions for the tokenizers and/or filters
that are used. The output of these processes is the information you see when testing how content is handled for a
particular field with the Analysis Screen.
Under the analyzer information is a button to Load Term Info. Clicking that button will show the top N terms that are
in the index for that field. Click on a term, and you will be taken to the Query Screen to see the results of a query of
that term in that field. If you want to always see the term information for a field, choose Autoload and it will always
appear when there are terms for a field. A histogram shows the number of terms with a given frequency in the field.
Documents, Fields, and Schema Design
This section discusses how Solr organizes its data into documents and fields, as well as how to work with the Solr
schema file, schema.xml. It includes the following topics:
Overview of Documents, Fields, and Schema Design: An introduction to the concepts covered in this section.
Solr Field Types: Detailed information about field types in Solr, including the field types in the default Solr schema.
Defining Fields: Describes how to define fields in Solr.
Copying Fields: Describes how to populate fields with data copied from another field.
Dynamic Fields: Information about using dynamic fields in order to catch and index fields that do not exactly conform
to other field definitions in your schema.
Schema API: Use curl commands to read various parts of a schema or create new fields and copyField rules.
Other Schema Elements: Describes other important elements in the Solr schema: Unique Key, Default Search Field,
and the Query Parser Operator.
Putting the Pieces Together: A higher-level view of the Solr schema and how its elements work together.
DocValues: Describes how to create a docValues index for faster lookups.
Schemaless Mode: Automatically add previously unknown schema fields using value-based field type guessing.
Overview of Documents, Fields, and Schema Design
The fundamental premise of Solr is simple. You give it a lot of information, then later you can ask it questions and
find the piece of information you want. The part where you feed in all the information is called indexing or updating.
When you ask a question, it's called a query.
One way to understand how Solr works is to think of a loose-leaf book of recipes. Every time you add a recipe to the
book, you update the index at the back. You list each ingredient and the page number of the recipe you just added.
Suppose you add one hundred recipes. Using the index, you can very quickly find all the recipes that use garbanzo
beans, or artichokes, or coffee, as an ingredient. Using the index is much faster than looking through each recipe
one by one. Imagine a book of one thousand recipes, or one million.
Solr allows you to build an index with many different fields, or types of entries. The example above shows how to
build an index with just one field, ingredients. You could have other fields in the index for the recipe's cooking
style, like Asian, Cajun, or vegan, and you could have an index field for preparation times. Solr can answer
questions like "What Cajun-style recipes that have blood oranges as an ingredient can be prepared in fewer than 30
minutes?"
The schema is the place where you tell Solr how it should build indexes from input documents.
How Solr Sees the World
Solr's basic unit of information is a document, which is a set of data that describes something. A recipe document
would contain the ingredients, the instructions, the preparation time, the cooking time, the tools needed, and so on.
A document about a person, for example, might contain the person's name, biography, favorite color, and shoe size.
A document about a book could contain the title, author, year of publication, number of pages, and so on.
In the Solr universe, documents are composed of fields, which are more specific pieces of information. Shoe size
could be a field. First name and last name could be fields.
Fields can contain different kinds of data. A name field, for example, is text (character data). A shoe size field might
be a floating point number so that it could contain values like 6 and 9.5. Obviously, the definition of fields is flexible
(you could define a shoe size field as a text field rather than a floating point number, for example), but if you define
your fields correctly, Solr will be able to interpret them correctly and your users will get better results when they
perform a query.
You can tell Solr about the kind of data a field contains by specifying its field type. The field type tells Solr how to
interpret the field and how it can be queried.
When you add a document, Solr takes the information in the document's fields and adds that information to an
index. When you perform a query, Solr can quickly consult the index and return the matching documents.
Field Analysis
Field analysis tells Solr what to do with incoming data when building an index. A more accurate name for this
process would be processing or even digestion, but the official name is analysis.
Consider, for example, a biography field in a person document. Every word of the biography must be indexed so that
you can quickly find people whose lives have had anything to do with ketchup, or dragonflies, or cryptography.
However, a biography will likely contain lots of words you don't care about and don't want clogging up your
index—words like "the", "a", "to", and so forth. Furthermore, suppose the biography contains the word "Ketchup",
capitalized at the beginning of a sentence. If a user makes a query for "ketchup", you want Solr to tell you about the
person even though the biography contains the capitalized word.
The solution to both these problems is field analysis. For the biography field, you can tell Solr how to break apart the
biography into words. You can tell Solr that you want to make all the words lower case, and you can tell Solr to
remove accent marks.
Field analysis is an important part of a field type. Understanding Analyzers, Tokenizers, and Filters is a detailed
description of field analysis.
Solr Field Types
The field type defines how Solr should interpret data in a field and how the field can be queried. There are many
field types included with Solr by default, and they can also be defined locally.
Topics covered in this section:
Field Type Definitions and Properties
Field Types Included with Solr
Working with Currencies and Exchange Rates
Working with Dates
Working with Enum Fields
Working with External Files and Processes
Field Properties by Use Case
Related Topics
SchemaXML-DataTypes
FieldType Javadoc
Field Type Definitions and Properties
A field type definition can include four types of information:
The name of the field type (mandatory)
An implementation class name (mandatory)
If the field type is TextField, a description of the field analysis for the field type
Field type properties - depending on the implementation class, some properties may be mandatory.
Field Type Definitions in schema.xml
Field types are defined in schema.xml, with the types element. Each field type is defined between fieldType
elements. Here is an example of a field type definition for a type called text_general:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"
enablePositionIncrements="true" />
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
ignoreCase="true" expand="false"/>
-->
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"
enablePositionIncrements="true" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
The first line in the example above contains the field type name, text_general, and the name of the implementing
class, solr.TextField. The rest of the definition is about field analysis, described in Understanding Analyzers,
Tokenizers, and Filters.
The implementing class is responsible for making sure the field is handled correctly. In the class names in
schema.xml, the string solr is shorthand for org.apache.solr.schema or org.apache.solr.analysis. Therefore,
solr.TextField is really org.apache.solr.schema.TextField.
Field Type Properties
The field type class determines most of the behavior of a field type, but optional properties can also be defined.
For example, the following definition of a date field type defines two properties, sortMissingLast and
omitNorms.
<fieldType name="date" class="solr.TrieDateField"
sortMissingLast="true" omitNorms="true"/>
The properties that can be specified for a given field type fall into three major categories:
Properties specific to the field type's class.
General Properties Solr supports for any field type.
Field Default Properties that can be specified on the field type that will be inherited by fields that use this type
instead of the default behavior.
General Properties
Property Description Values
name The name of the fieldType. This value gets used in field definitions, in
the "type" attribute. It is strongly recommended that names consist of
alphanumeric or underscore characters only and not start with a digit.
This is not currently strictly enforced.
class The class name that gets used to store and index the data for this type.
Note that you may prefix included class names with "solr." and Solr will
automatically figure out which packages to search for the class - so
"solr.TextField" will work. If you are using a third-party class, you will
probably need to have a fully qualified class name. The fully qualified
equivalent for "solr.TextField" is "org.apache.solr.schema.TextField".
positionIncrementGap For multivalued fields, specifies a distance between multiple values,
which prevents spurious phrase matches
integer
autoGeneratePhraseQueries For text fields. If true, Solr automatically generates phrase queries for
adjacent terms. If false, terms must be enclosed in double-quotes to be
treated as phrases.
true or
false
docValuesFormat Defines a custom DocValuesFormat to use for fields of this type. This
requires that a schema-aware codec, such as the SchemaCodecFactory,
has been configured in solrconfig.xml.
n/a
postingsFormat Defines a custom PostingsFormat to use for fields of this type. This
requires that a schema-aware codec, such as the SchemaCodecFactory,
has been configured in solrconfig.xml.
n/a
Field Default Properties
Property Description Values
indexed If true, the value of the field can be used in queries to retrieve matching
documents
true or
false
stored If true, the actual value of the field can be retrieved by queries true or
false
Lucene index back-compatibility is only supported for the default codec. If you choose to customize the
postingsFormat or docValuesFormat in your schema.xml, upgrading to a future version of Solr may
require you to either switch back to the default codec and optimize your index to rewrite it into the default
codec before upgrading, or re-build your entire index from scratch after upgrading.
docValues If true, the value of the field will be put in a column-oriented DocValues
structure
true or
false
sortMissingFirst
sortMissingLast
Control the placement of documents when a sort field is not present. As of
Solr 3.5, these work for all numeric fields, including Trie and date fields.
true or
false
multiValued If true, indicates that a single document might contain multiple values for
this field type
true or
false
omitNorms If true, omits the norms associated with this field (this disables length
normalization and index-time boosting for the field, and saves some
memory). Defaults to true for all primitive (non-analyzed) field types, such
as int, float, date, bool, and string. Only full-text fields or fields that need
an index-time boost need norms.
true or
false
omitTermFreqAndPositions If true, omits term frequency, positions, and payloads from postings for
this field. This can be a performance boost for fields that don't require that
information. It also reduces the storage space required for the index.
Queries that rely on position that are issued on a field with this option will
silently fail to find documents. This property defaults to true for all fields
that are not text fields.
true or
false
omitPositions Similar to omitTermFreqAndPositions but preserves term frequency
information
true or
false
termVectors
termPositions
termOffsets
These options instruct Solr to maintain full term vectors for each
document, optionally including the position and offset information for each
term occurrence in those vectors. These can be used to accelerate
highlighting and other ancillary functionality, but impose a substantial cost
in terms of index size. They are not necessary for typical uses of Solr
true or
false
required Instructs Solr to reject any attempts to add a document which does not
have a value for this field. This property defaults to false.
true or
false
Field Types Included with Solr
The following table lists the field types that are available in Solr. The org.apache.solr.schema package
includes all the classes listed in this table.
Class Description
BCDIntField Binary-coded decimal (BCD) integer. BCD is a relatively inefficient
encoding that offers the benefits of quick decimal calculations and quick
conversion to a string. This field has been deprecated and will be
removed in Solr 5.0, use TrieIntField instead.
BCDLongField Binary-coded decimal long integer. This field has been deprecated and
will be removed in Solr 5.0, use TrieLongField instead.
BCDStrField Binary-coded decimal string. This field has been deprecated and will be
removed in Solr 5.0, use TrieIntField instead.
BinaryField Binary data.
BoolField Contains either true or false. Values of "1", "t", or "T" in the first character
are interpreted as true. Any other values in the first character are
interpreted as false.
ByteField Contains a byte (an 8-bit signed integer). This field has been deprecated
and will be removed in Solr 5.0, use TrieIntField instead.
CollationField Supports Unicode collation for sorting and range queries.
ICUCollationField is a better choice if you can use ICU4J. See the section
Unicode Collation.
CurrencyField Supports currencies and exchange rates. See the section Working with
Currencies and Exchange Rates.
DateField Represents a point in time with millisecond precision. See the section
Working with Dates. This field has been deprecated and will be removed in
Solr 5.0, use TrieDateField instead.
DoubleField Double (64-bit IEEE floating point). This field has been deprecated and
will be removed in Solr 5.0, use TrieDoubleField instead.
ExternalFileField Pulls values from a file on disk. See the section Working with External
Files and Processes.
EnumField Allows defining an enumerated set of values which may not be easily
sorted by either alphabetic or numeric order (such as a list of severities,
for example). This field type takes a configuration file, which lists the
proper order of the field values. See the section Working with Enum
Fields for more information.
FloatField Floating point (32-bit IEEE floating point). This field has been deprecated
and will be removed in Solr 5.0, use TrieFloatField instead.
ICUCollationField Supports Unicode collation for sorting and range queries. See the section
Unicode Collation.
IntField Integer (32-bit signed integer). This field has been deprecated and will be
removed in Solr 5.0, use TrieIntField instead.
LatLonType Spatial Search: a latitude/longitude coordinate pair. The latitude is
specified first in the pair.
LongField Long integer (64-bit signed integer). This field has been deprecated and
will be removed in Solr 5.0, use TrieLongField instead.
PointType Spatial Search: An arbitrary n-dimensional point, useful for searching
sources such as blueprints or CAD drawings.
PreAnalyzedField Provides a way to send to Solr serialized token streams, optionally with
independent stored values of a field, and have this information stored and
indexed without any additional text processing. Useful if you want to
submit field content that was already processed by some existing external
text processing pipeline (e.g. tokenized, annotated, stemmed, inserted
synonyms, etc.), while using all the rich attributes that Lucene's
TokenStream provides via token attributes.
RandomSortField Does not contain a value. Queries that sort on this field type will return
results in random order. Use a dynamic field to use this feature.
ShortField Short integer. This field has been deprecated and will be removed in Solr
5.0, use TrieIntField instead.
SortableDoubleField The Sortable fields provide correct numeric sorting. This field has been
deprecated and will be removed in Solr 5.0, use TrieDoubleField instead.
SortableFloatField Numerically sorted floating point. This field has been deprecated and will
be removed in Solr 5.0, use TrieFloatField instead.
SortableIntField Numerically sorted integer. This field has been deprecated and will be
removed in Solr 5.0, use TrieIntField instead.
SortableLongField Numerically sorted long integer. This field has been deprecated and will
be removed in Solr 5.0, use TrieLongField instead.
SpatialRecursivePrefixTreeFieldType (RPT for short) Spatial Search: Accepts latitude comma longitude strings
or other shapes in WKT format.
StrField String (UTF-8 encoded string or Unicode).
TextField Text, usually multiple words or tokens.
TrieDateField Date field. Represents a point in time with millisecond precision. See the
section Working with Dates. precisionStep="0" enables efficient date
sorting and minimizes index size; precisionStep="8" (the default)
enables efficient range queries.
TrieDoubleField Double field (64-bit IEEE floating point). precisionStep="0" enables
efficient numeric sorting and minimizes index size; precisionStep="8"
(the default) enables efficient range queries.
TrieField If this field type is used, a "type" attribute must also be specified, valid
values are: integer, long, float, double, date. Using this field is the
same as using any of the Trie fields. precisionStep="0" enables
efficient numeric sorting and minimizes index size; precisionStep="8"
(the default) enables efficient range queries.
TrieFloatField Floating point field (32-bit IEEE floating point). precisionStep="0"
enables efficient numeric sorting and minimizes index size;
precisionStep="8" (the default) enables efficient range queries.
TrieIntField Integer field (32-bit signed integer). precisionStep="0" enables
efficient numeric sorting and minimizes index size; precisionStep="8"
(the default) enables efficient range queries.
TrieLongField Long field (64-bit signed integer). precisionStep="0" enables efficient
numeric sorting and minimizes index size; precisionStep="8" (the
default) enables efficient range queries.
UUIDField Universally Unique Identifier (UUID). Pass in a value of "NEW" and Solr
will create a new UUID. Note: configuring a UUIDField instance with a
default value of "NEW" is not advisable for most users when using
SolrCloud (and not possible if the UUID value is configured as the unique
key field) since the result will be that each replica of each document will
get a unique UUID value. Using UUIDUpdateProcessorFactory to
generate UUID values when documents are added is recommended
instead.
The MultiTermAwareComponent has been added to relevant solr.TextField entries in schema.xml (e.g.,
wildcards, regex, prefix, range, etc.) to allow automatic lowercasing for multi-term queries.
Further, you can optionally specify a multi-term analyzer in field types in your schema: <analyzer
type="multiterm">; if you don't do this, analyzer will process the fields according to their specific attributes.
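As an illustrative sketch (not the stock example schema), a field type with an explicit multi-term analyzer could be declared as follows; the tokenizer and filter choices here are examples only:

<fieldType name="text_multiterm_example" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- applied to wildcard, prefix, regex and range queries -->
  <analyzer type="multiterm">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>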
Working with Currencies and Exchange Rates
The currency FieldType provides support for monetary values to Solr/Lucene with query-time currency conversion
and exchange rates. The following features are supported:
Point queries
Range queries
Function range queries (new in Solr 4.2)
Sorting
Currency parsing by either currency code or symbol
Symmetric & asymmetric exchange rates (asymmetric exchange rates are useful if there are fees associated
with exchanging the currency)
Configuring Currencies
The currency field type is defined in schema.xml. This is the default configuration of this type:
<fieldType name="currency" class="solr.CurrencyField" precisionStep="8"
defaultCurrency="USD" currencyConfig="currency.xml" />
In this example, we have defined the name and class of the field type, and defined the defaultCurrency as
"USD", for U.S. Dollars. We have also defined a currencyConfig to use a file called "currency.xml". This is a file
of exchange rates between our default currency and other currencies. There is an alternate implementation that
would allow regular downloading of currency data. See Exchange Rates below for more.
At indexing time, money fields can be indexed in a native currency. For example, if a product on an e-commerce site
is listed in Euros, indexing the price field as "1000,EUR" will index it appropriately. The price should be separated
from the currency by a comma, and the price must be encoded with a floating point value (a decimal point).
During query processing, range and point queries are both supported.
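For example, assuming a field named price that uses this field type (the field name and amounts are hypothetical), a point query and a range query could look like this:

price:10.00,USD
price:[10.00,USD TO 100.00,USD]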
Exchange Rates
You configure exchange rates by specifying a provider. Natively, two provider types are supported:
FileExchangeRateProvider or OpenExchangeRatesOrgProvider.
FileExchangeRateProvider
This provider requires you to provide a file of exchange rates. It is the default, meaning that to use this provider you
only need to specify the file path and name as a value for currencyConfig in the definition for this type.
There is a sample currency.xml file included with Solr, found in the same directory as the schema.xml file. Here
is a small snippet from this file:
<currencyConfig version="1.0">
<rates>
<!-- Updated from http://www.exchangerate.com/ at 2011-09-27 -->
<rate from="USD" to="ARS" rate="4.333871" comment="ARGENTINA Peso" />
<rate from="USD" to="AUD" rate="1.025768" comment="AUSTRALIA Dollar" />
<rate from="USD" to="EUR" rate="0.743676" comment="European Euro" />
<rate from="USD" to="CAD" rate="1.030815" comment="CANADA Dollar" />
<!-- Cross-rates for some common currencies -->
<rate from="EUR" to="GBP" rate="0.869914" />
<rate from="EUR" to="NOK" rate="7.800095" />
<rate from="GBP" to="NOK" rate="8.966508" />
<!-- Asymmetrical rates -->
<rate from="EUR" to="USD" rate="0.5" />
</rates>
</currencyConfig>
OpenExchangeRatesOrgProvider
With Solr 4, you can configure Solr to download exchange rates from OpenExchangeRates.Org, with rates between
USD and 158 currencies updated hourly. These rates are symmetrical only.
In this case, you need to specify the providerClass in the definitions for the field type. Here is an example:
<fieldType name="currency" class="solr.CurrencyField" precisionStep="8"
providerClass="solr.OpenExchangeRatesOrgProvider"
refreshInterval="60"
ratesFileLocation="http://internal.server/rates.json"/>
The refreshInterval is in minutes, so the above example will download the newest rates every 60 minutes.
Working with Dates
Date Formatting
Solr's TrieDateField (and deprecated DateField) represents a point in time with millisecond precision. The
format used is a restricted form of the canonical representation of dateTime in the XML Schema specification:
YYYY-MM-DDThh:mm:ssZ
YYYY is the year.
MM is the month.
DD is the day of the month.
hh is the hour of the day as on a 24-hour clock.
mm is minutes.
ss is seconds.
Z is a literal 'Z' character indicating that this string representation of the date is in UTC
Note that no time zone can be specified; the String representations of dates are always expressed in Coordinated
Universal Time (UTC). Here is an example value:
1972-05-20T17:33:18Z
You can optionally include fractional seconds if you wish, although any precision beyond milliseconds will be
ignored. Here are some example values with sub-seconds included:
1972-05-20T17:33:18.772Z
1972-05-20T17:33:18.77Z
1972-05-20T17:33:18.7Z
Date Math
Solr's date field types also support date math expressions, which make it easy to create times relative to fixed
moments in time, including the current time, which can be represented using the special value of "NOW".
Date Math Syntax
Date math expressions consist of either adding some quantity of time in a specified unit, or rounding the current time
by a specified unit. Expressions can be chained and are evaluated left to right.
For example: this represents a point in time two months from now:
NOW+2MONTHS
This is one day ago:
NOW-1DAY
A slash is used to indicate rounding. This represents the beginning of the current hour:
NOW/HOUR
The following example computes (with millisecond precision) the point in time six months and three days into the
future and then rounds that time to the beginning of that day:
NOW+6MONTHS+3DAYS/DAY
Note that while date math is most commonly used relative to NOW, it can be applied to any fixed moment in time as
well:
1972-05-20T17:33:18.772Z+6MONTHS+3DAYS/DAY
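As a sketch of how date math is typically used in a request, the following filter query restricts results to documents whose (hypothetical) timestamp field falls within roughly the last 30 days, rounded to the start of the day:

fq=timestamp:[NOW/DAY-30DAYS TO NOW/DAY+1DAY]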
Request Parameters That Affect Date Math
NOW
The NOW parameter is used internally by Solr to ensure consistent date math expression parsing across multiple
nodes in a distributed request. But it can be specified to instruct Solr to use an arbitrary moment in time (past or
future) to override for all situations where the special value of "NOW" would impact date math expressions.
It must be specified as a (long valued) number of milliseconds since epoch.
Example:
q=solr&fq=start_date:[* TO NOW]&NOW=1384387200000
TZ
By default, all date math expressions are evaluated relative to the UTC TimeZone, but the TZ parameter can be
specified to override this behaviour, by forcing all date based addition and rounding to be relative to the specified
time zone.
For example, the following request will use range faceting to facet over the current month, "per day", relative to UTC:
http://localhost:8983/solr/select?q=*:*&facet.range=my_date_field&facet=true&facet.ran
ge.start=NOW/MONTH&facet.range.end=NOW/MONTH%2B1MONTH&facet.range.gap=%2B1DAY
<int name="2013-11-01T00:00:00Z">0</int>
<int name="2013-11-02T00:00:00Z">0</int>
<int name="2013-11-03T00:00:00Z">0</int>
<int name="2013-11-04T00:00:00Z">0</int>
<int name="2013-11-05T00:00:00Z">0</int>
<int name="2013-11-06T00:00:00Z">0</int>
<int name="2013-11-07T00:00:00Z">0</int>
...
While in this example, the "days" will be computed relative to the specified time zone - including any applicable
Daylight Savings Time adjustments:
http://localhost:8983/solr/select?q=*:*&facet.range=my_date_field&facet=true&facet.ran
ge.start=NOW/MONTH&facet.range.end=NOW/MONTH%2B1MONTH&facet.range.gap=%2B1DAY&TZ=Ameri
ca/Los_Angeles
<int name="2013-11-01T07:00:00Z">0</int>
<int name="2013-11-02T07:00:00Z">0</int>
<int name="2013-11-03T07:00:00Z">0</int>
<int name="2013-11-04T08:00:00Z">0</int>
<int name="2013-11-05T08:00:00Z">0</int>
<int name="2013-11-06T08:00:00Z">0</int>
<int name="2013-11-07T08:00:00Z">0</int>
...
Working with Enum Fields
The EnumField type allows defining a field whose values are a closed set, and the sort order is pre-determined but
is neither alphabetic nor numeric. Examples of this are severity lists, or risk definitions.
Defining an EnumField in schema.xml
The EnumField type definition is quite simple, as in this example defining field types for "priorityLevel" and
"riskLevel" enumerations:
<fieldType name="priorityLevel" class="solr.EnumField" enumsConfig="enumsConfig.xml"
enumName="severity"/>
<fieldType name="riskLevel" class="solr.EnumField" enumsConfig="enumsConfig.xml"
enumName="risk" />
Besides the name and the class, which are common to all field types, this type also takes two additional
parameters:
enumsConfig: the name of a configuration file that contains the <enum/> list of field values and their order
that you wish to use with this field type. If a path to the file is not specified, the file should be in the conf
directory for the collection.
enumName: the name of the specific enumeration in the enumsConfig file to use for this type.
Defining the EnumField configuration file
The file named with the enumsConfig parameter can contain multiple enumeration value lists with different names
if there are multiple uses for enumerations in your Solr schema.
In this example, there are two value lists defined. Each list is between opening and closing enum tags:
<?xml version="1.0" ?>
<enumsConfig>
<enum name="priority">
<value>Not Available</value>
<value>Low</value>
<value>Medium</value>
<value>High"</value>
<value>Urgent</value>
</enum>
<enum name="risk">
<value>Unknown</value>
<value>Very Low</value>
<value>Low</value>
<value>Medium</value>
<value>High</value>
<value>Critical</value>
</enum>
</enumsConfig>
Working with External Files and Processes
The ExternalFileField Type
The ExternalFileField type makes it possible to specify the values for a field in a file outside the Solr index. For
such a field, the file contains mappings from a key field to the field value. Another way to think of this is that, instead
of specifying the field in documents as they are indexed, Solr finds values for this field in the external file.
Changing Values
You cannot change the order, or remove, existing values in an <enum/> without reindexing.
You can, however, add new values to the end.
External fields are not searchable. They can be used only for function queries or display. For more
information on function queries, see the section on Function Queries.
The ExternalFileField type is handy for cases where you want to update a particular field in many documents
more often than you want to update the rest of the documents. For example, suppose you have implemented a
document rank based on the number of views. You might want to update the rank of all the documents daily or
hourly, while the rest of the contents of the documents might be updated much less frequently. Without
ExternalFileField, you would need to update each document just to change the rank. Using ExternalFileField is
much more efficient because all document values for a particular field are stored in an external file that can be
updated as frequently as you wish.
In schema.xml, the definition of this field type might look like this:
<fieldType name="entryRankFile" keyField="pkId" defVal="0" stored="false"
indexed="false" class="solr.ExternalFileField" valType="pfloat"/>
The keyField attribute defines the key that will be defined in the external file. It is usually the unique key for the
index, but it doesn't need to be as long as the keyField can be used to identify documents in the index. A defVal
defines a default value that will be used if there is no entry in the external file for a particular document.
The valType attribute specifies the actual type of values that will be found in the file. The type specified must be
a float field type, so valid values for this attribute are pfloat, float, or tfloat. This attribute can be
omitted.
Format of the External File
The file itself is located in Solr's index directory, which by default is . The name of the file should$SOLR_HOME/data
be or . For the example above, then, the file could be named external_fieldname external_ .*fieldname ex
or .ternal_entryRankFile external_entryRankFile.txt
The file contains entries that map a key field, on the left of the equals sign, to a value, on the right. Here are a few
example entries:
doc33=1.414
doc34=3.14159
doc40=42
The keys listed in this file do not need to be unique. The file does not need to be sorted, but Solr will be able to
perform the lookup faster if it is.
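Because external fields can be used only in function queries (see the note above), one hedged sketch of putting the entryRankFile field from the earlier definition to work is as a multiplicative boost with the eDisMax query parser; the query term is illustrative only:

q=ipod&defType=edismax&boost=field(entryRankFile)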
Reloading an External File
As of Solr 4.1, it's possible to define an event listener to reload an external file when either a searcher is reloaded or
when a new searcher is started. See the section Query-Related Listeners for more information, but a sample
definition in solrconfig.xml might look like this:
<listener event="newSearcher"
class="org.apache.solr.schema.ExternalFileFieldReloader"/>
<listener event="firstSearcher"
class="org.apache.solr.schema.ExternalFileFieldReloader"/>
Pre-Analyzing a Field Type
The PreAnalyzedField type provides a way to send to Solr serialized token streams, optionally with independent
stored values of a field, and have this information stored and indexed without any additional text processing applied
in Solr. This is useful if the user wants to submit field content that was already processed by some existing external
text processing pipeline (e.g., it has been tokenized, annotated, stemmed, synonyms inserted, etc.), while using all
the rich attributes that Lucene's TokenStream provides (per-token attributes).
The serialization format is pluggable using implementations of the PreAnalyzedParser interface. There are two
out-of-the-box implementations:
JsonPreAnalyzedParser: as the name suggests, it parses content that uses JSON to represent the field's content.
This is the default parser to use if the field type is not configured otherwise.
SimplePreAnalyzedParser: uses a simple strict plain text format, which in some situations may be easier to
create than JSON.
There is only one configuration parameter, parserImpl. The value of this parameter should be a fully qualified
class name of a class that implements the PreAnalyzedParser interface. The default value of this parameter is
org.apache.solr.schema.JsonPreAnalyzedParser.
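A minimal sketch of such a field type declaration follows; the type name is arbitrary, and parserImpl can be omitted to fall back to the JSON parser:

<fieldType name="pre_analyzed_example" class="solr.PreAnalyzedField"
           parserImpl="org.apache.solr.schema.JsonPreAnalyzedParser"/>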
Field Properties by Use Case
Here is a summary of common use cases, and the attributes the fields or field types should have to support the
case. An entry of true or false below indicates that the option must be set to the given value for the use case to
function correctly. If no entry is provided, the setting of that attribute has no impact on the case.
The properties considered are: indexed, stored, multiValued, omitNorms, termVectors, termPositions, and docValues.
search within field: indexed=true
retrieve contents: stored=true
use as unique key: indexed=true, multiValued=false
sort on field: indexed=true [7], multiValued=false, omitNorms=true [1], docValues=true [7]
use field boosts [5]: omitNorms=false
document boosts affect searches within field: omitNorms=false
highlighting: indexed=true [4], stored=true, termVectors=true [2], termPositions=true [3]
faceting [5]: indexed=true [7], docValues=true [7]
add multiple values, maintaining order: multiValued=true
field length affects doc score: omitNorms=false
MoreLikeThis [5]: termVectors=true [6]
Notes:
1. Recommended but not necessary.
2. Will be used if present, but not necessary.
3. (if termVectors=true)
4. A tokenizer must be defined for the field, but it doesn't need to be indexed.
5. Described in Understanding Analyzers, Tokenizers, and Filters.
6. Term vectors are not mandatory here. If not true, then a stored field is analyzed. So term vectors are
recommended, but only required if stored=false.
7. Either indexed or docValues must be true, but both are not required. DocValues can be more efficient in many
cases.
Defining Fields
Fields are defined in the fields element of schema.xml. Once you have the field types set up, defining the fields
themselves is simple.
Example
The following example defines a field named price with a type named float and a default value of 0.0; the
indexed and stored properties are explicitly set to true, while any other properties specified on the float field
type are inherited.
<field name="price" type="float" default="0.0" indexed="true" stored="true"/>
Field Properties
Property Description
name The name of the field. Field names should consist of alphanumeric or underscore characters only and
not start with a digit. This is not currently strictly enforced, but other field names will not have first
class support from all components and back compatibility is not guaranteed. Names with both leading
and trailing underscores (e.g. _version_) are reserved. Every field must have a name.
type The name of the fieldType for this field. This will be found in the "name" attribute on the fieldType
definition. Every field must have a type.
default A default value that will be added automatically to any document that does not have a value in this
field when it is indexed. If this property is not specified, there is no default.
Optional Field Type Override Properties
Fields can have the same options as field types. The field type options serve as defaults which can be overridden by
options defined per field. Included below is the table of field type properties from the section Field Type Definitions
and Properties:
Property Description Values
indexed If true, the value of the field can be used in queries to retrieve matching
documents
true or
false
stored If true, the actual value of the field can be retrieved by queries true or
false
docValues If true, the value of the field will be put in a column-oriented DocValues
structure
true or
false
sortMissingFirst
sortMissingLast
Control the placement of documents when a sort field is not present. As of
Solr 3.5, these work for all numeric fields, including Trie and date fields.
true or
false
multiValued If true, indicates that a single document might contain multiple values for
this field type
true or
false
omitNorms If true, omits the norms associated with this field (this disables length
normalization and index-time boosting for the field, and saves some
memory). Defaults to true for all primitive (non-analyzed) field types, such
as int, float, date, bool, and string. Only full-text fields or fields that need
an index-time boost need norms.
true or
false
omitTermFreqAndPositions If true, omits term frequency, positions, and payloads from postings for
this field. This can be a performance boost for fields that don't require that
information. It also reduces the storage space required for the index.
Queries that rely on position that are issued on a field with this option will
silently fail to find documents. This property defaults to true for all fields
that are not text fields.
true or
false
omitPositions Similar to omitTermFreqAndPositions but preserves term frequency
information
true or
false
termVectors
termPositions
termOffsets
These options instruct Solr to maintain full term vectors for each
document, optionally including the position and offset information for each
term occurrence in those vectors. These can be used to accelerate
highlighting and other ancillary functionality, but impose a substantial cost
in terms of index size. They are not necessary for typical uses of Solr
true or
false
required Instructs Solr to reject any attempts to add a document which does not
have a value for this field. This property defaults to false.
true or
false
Related Topics
SchemaXML-Fields
Field Options by Use Case
Copying Fields
You might want to interpret some document fields in more than one way. Solr has a mechanism for making copies of
fields so that you can apply several distinct field types to a single piece of incoming information.
The name of the field you want to copy is the source, and the name of the copy is the destination. In
schema.xml, it's very simple to make copies of fields:
<copyField source="cat" dest="text" maxChars="30000" />
If the destination text field has data of its own in the input documents, the contents of the cat field will be added
as additional values – just as if all of the values had originally been specified by the client. Remember to configure
your fields as multivalued="true" if they will ultimately get multiple values (either from a multivalued source, or
multiple copyField directives, etc.)
The maxChars parameter, an int parameter, establishes an upper limit for the number of characters to be copied
from the source value when constructing the value added to the destination field. This limit is useful for situations in
which you want to copy some data from the source field, but also control the size of index files.
Both the source and the destination of copyField can contain either leading or trailing asterisks, which will match
anything. For example, the following line will copy the contents of all incoming fields that match the wildcard pattern
*_t to the text field:
<copyField source="*_t" dest="text" maxChars="25000" />
Related Topics
SchemaXML-Copy Fields
Dynamic Fields
Dynamic fields allow Solr to index fields that you did not explicitly define in your schema. This is useful if you
discover you have forgotten to define one or more fields. Dynamic fields can make your application less brittle by
providing some flexibility in the documents you can add to Solr.
A dynamic field is just like a regular field except it has a name with a wildcard in it. When you are indexing
documents, a field that does not match any explicitly defined fields can be matched with a dynamic field.
For example, suppose your schema includes a dynamic field with a name of *_i. If you attempt to index a
document with a cost_i field, but no explicit cost_i field is defined in the schema, then the cost_i field will have
the field type and analysis defined for *_i.
Dynamic fields are also defined in the fields element of schema.xml. Like fields, they have a name, a field type,
and options.
<dynamicField name="*_i" type="int" indexed="true" stored="true"/>
It is recommended that you include basic dynamic field mappings (like that shown above) in your schema.xml. The
mappings can be very useful.
The copyField command can use a wildcard (*) character in the dest parameter only if the source parameter
contains one as well. copyField uses the matching glob from the source field for the dest field
name into which the source content is copied.
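As a hypothetical sketch of that behavior, the following directive copies every incoming field ending in _t into a field with the same base name ending in _s, so a title_t field would be copied into title_s:

<copyField source="*_t" dest="*_s"/>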
Related Topics
SchemaXML-Dynamic Fields
Other Schema Elements
This section describes several other important elements of schema.xml.
Unique Key
The uniqueKey element specifies which field is a unique identifier for documents. Although uniqueKey is not
required, it is nearly always warranted by your application design. For example, uniqueKey should be used if you
will ever update a document in the index.
You can define the unique key field by naming it:
<uniqueKey>id</uniqueKey>
Starting with Solr 4, schema defaults and copyFields cannot be used to populate the uniqueKey field. You also
can't use UUIDUpdateProcessorFactory to have uniqueKey values generated automatically.
Further, the operation will fail if the uniqueKey field is used, but is multivalued (or inherits the multivalueness from
the fieldtype). However, uniqueKey will continue to work, as long as the field is properly used.
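As a sketch consistent with the example schema, a typical uniqueKey setup pairs a single-valued, required field with the uniqueKey declaration:

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
<uniqueKey>id</uniqueKey>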
Default Search Field
If you are using the Lucene query parser, queries that don't specify a field name will use the defaultSearchField.
The DisMax and Extended DisMax query parsers do not use this value.
For more information about query parsers, see the section on Query Syntax and Parsing.
Use of the defaultSearchField element is deprecated in Solr versions 3.6 and higher. Instead, you should use
the df request parameter. At some point, the defaultSearchField element may be removed.
Query Parser Default Operator
In queries with multiple terms, Solr can either return results where all conditions are met or where one or more
conditions are met. The operator controls this behavior. An operator of AND means that all conditions must be
fulfilled, while an operator of OR means that one or more conditions must be true.
In schema.xml, the solrQueryParser element controls what operator is used if an operator is not specified in
the query. The default operator setting only applies to the Lucene query parser, not the DisMax or Extended DisMax
query parsers, which internally hard-code their operators to OR.
The query parser default operator parameter has been deprecated in Solr versions 3.6 and higher. You are
instead encouraged to specify the query parser q.op parameter in your request handler.
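For reference, a minimal sketch of this (deprecated) element as it might appear in schema.xml:

<solrQueryParser defaultOperator="OR"/>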
Similarity
Similarity is a Lucene class used to score a document in searching. This class can be changed in order to provide a
more custom sorting. With Solr 4, you can configure a different similarity for each field, meaning that scoring a
document will differ depending on what's in each field. However, a global similarity is still configured in the
schema.xml file, where an implicit instance of DefaultSimilarityFactory is used.
A global <similarity> declaration can be used to specify a custom similarity implementation that you want Solr to
use when dealing with your index. A similarity can be specified either by referring directly to the name of a class with
a no-argument constructor:
<similarity class="solr.DefaultSimilarityFactory"/>
or by referencing a SimilarityFactory implementation, which may take optional initialization parameters:
<similarity class="solr.DFRSimilarityFactory">
<str name="basicModel">P</str>
<str name="afterEffect">L</str>
<str name="normalization">H2</str>
<float name="c">7</float>
</similarity>
Beginning with Solr 4, similarity factories can be specified on individual field types:
<fieldType name="text_ib">
<analyzer/>
<similarity class="solr.IBSimilarityFactory">
<str name="distribution">SPL</str>
<str name="lambda">DF</str>
<str name="normalization">H2</str>
</similarity>
</fieldType>
This example uses IBSimilarityFactory (using the Information-Based model), but there are several similarity
implementations that can be used. For Solr 4.2, SweetSpotSimilarityFactory has been added. Other options
include BM25SimilarityFactory, DFRSimilarityFactory, SchemaSimilarityFactory, and others. For
details, see the Solr Javadocs for the similarity factories.
Related Topics
SchemaXML-Miscellaneous Settings
UniqueKey
Schema API
The Solr schema API allows using a REST API to get information about the schema.xml for each collection (or
core for standalone Solr), including defined field types, fields, dynamic fields, and copy field declarations. In Solr 4.2
and 4.3, it only allows GET (read-only) access, but in Solr 4.4, new fields and copyField directives may be added to
the schema. Future Solr releases will extend this functionality to allow more schema elements to be updated.
To enable schema modification with this API, the schema will need to be managed and mutable. See the section
Managed Schema Definition in SolrConfig for more information.
The API allows two output modes for all calls: JSON or XML. When requesting the complete schema, there is
another output mode which is XML modeled after the schema.xml file itself.
The base address for the API is http://<host>:<port>/<context-path>, where <context-path> is
usually solr, though you may have configured it differently. Example base address:
http://localhost:8983/solr.
In the API entry points and example URLs below, you may alternatively specify a Solr core name where it says
collection.
API Entry Points
Retrieve schema information
Retrieve the Entire Schema
List Fields
List a Specific Field
List Dynamic Fields
List a Specific Dynamic Field Rule
List Field Types
List a Specific Field Type
List Copy Fields
Show Schema Name
Show the Schema Version
List UniqueKey
Show Global Similarity
Get the Default Query Operator
Modify the schema
Create new schema fields
Create one new schema field
Create new copyField directives
Manage Resource Data
Related Topics
API Entry Points
/collection/schema: retrieve the entire schema
/collection/schema/fields: retrieve information about all defined fields, or create new fields with optional
copyField directives
/collection/schema/fields/name: retrieve information about a named field, or create a new named field
with optional copyField directives
/collection/schema/dynamicfields: retrieve information about dynamic field rules
/collection/schema/dynamicfields/name: retrieve information about a named dynamic rule
/collection/schema/fieldtypes: retrieve information about field types
/collection/schema/fieldtypes/name: retrieve information about a named field type
/collection/schema/copyfields: retrieve information about copy fields, or create new copyField directives
/collection/schema/name: retrieve the schema name
/collection/schema/version: retrieve the schema version
/collection/schema/uniquekey: retrieve the defined uniqueKey
/collection/schema/similarity: retrieve the global similarity definition
/collection/schema/solrqueryparser/defaultoperator: retrieve the default operator
/collection/schema/(managed resource paths): Manipulate managed resource data
Retrieve schema information
Retrieve the Entire Schema
GET /collection/schema
Input
Path Parameters
Key Description
collection The collection (or core) name.
Query Parameters
The query parameters can be added to the API request after a '?'.
Key Type Required Default Description
wt string No json Defines the format of the response. The options are json, xml, or
schema.xml. If not specified, JSON will be returned by default.
Output
Output Content
The output will include all fields, field types, dynamic rules and copy field rules. The schema name and version are
also included.
Examples
Input
Get the entire schema in JSON.
curl http://localhost:8983/solr/collection1/schema?wt=json
Get the entire schema in XML.
curl http://localhost:8983/solr/collection1/schema?wt=xml
Get the entire schema in "schema.xml" format.
curl http://localhost:8983/solr/collection1/schema?wt=schema.xml
Output
The samples below have been truncated to only show a few snippets of the output.
Example output in JSON:
{
"responseHeader":{
"status":0,
"QTime":5},
"schema":{
"name":"example",
"version":1.5,
"uniqueKey":"id",
"fieldTypes":[{
"name":"alphaOnlySort",
"class":"solr.TextField",
"sortMissingLast":true,
"omitNorms":true,
"analyzer":{
"tokenizer":{
"class":"solr.KeywordTokenizerFactory"},
"filters":[{
"class":"solr.LowerCaseFilterFactory"},
{
"class":"solr.TrimFilterFactory"},
{
"class":"solr.PatternReplaceFilterFactory",
"replace":"all",
"replacement":"",
"pattern":"([^a-z])"}]}},
...
"fields":[{
"name":"_version_",
"type":"long",
"indexed":true,
"stored":true},
{
"name":"author",
"type":"text_general",
"indexed":true,
"stored":true},
{
"name":"cat",
"type":"string",
"multiValued":true,
"indexed":true,
"stored":true},
...
"copyFields":[{
"source":"author",
"dest":"text"},
{
"source":"cat",
"dest":"text"},
{
"source":"content",
"dest":"text"},
...
{
"source":"author",
"dest":"author_s"}]}}
Example output in XML:
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">5</int>
</lst>
<lst name="schema">
<str name="name">example</str>
<float name="version">1.5</float>
<str name="uniqueKey">id</str>
<arr name="fieldTypes">
<lst>
<str name="name">alphaOnlySort</str>
<str name="class">solr.TextField</str>
<bool name="sortMissingLast">true</bool>
<bool name="omitNorms">true</bool>
<lst name="analyzer">
<lst name="tokenizer">
<str name="class">solr.KeywordTokenizerFactory</str>
</lst>
<arr name="filters">
<lst>
<str name="class">solr.LowerCaseFilterFactory</str>
</lst>
<lst>
<str name="class">solr.TrimFilterFactory</str>
</lst>
<lst>
<str name="class">solr.PatternReplaceFilterFactory</str>
<str name="replace">all</str>
<str name="replacement"/>
<str name="pattern">([^a-z])</str>
</lst>
</arr>
</lst>
</lst>
...
<lst>
<str name="source">author</str>
<str name="dest">author_s</str>
</lst>
</arr>
</lst>
</response>
Example output in schema.xml format:
<schema name="example" version="1.5">
<uniqueKey>id</uniqueKey>
<types>
<fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true"
omitNorms="true">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.TrimFilterFactory"/>
<filter class="solr.PatternReplaceFilterFactory" replace="all" replacement=""
pattern="([^a-z])"/>
</analyzer>
</fieldType>
...
<copyField source="url" dest="text"/>
<copyField source="price" dest="price_c"/>
<copyField source="author" dest="author_s"/>
</schema>
List Fields
GET /collection/schema/fields
Input
Path Parameters
Key Description
collection The collection (or core) name.
Query Parameters
The query parameters can be added to the API request after a '?'.
Key Type Required Default Description
wt string No json Defines the format of the response. The options are json or xml. If not
specified, JSON will be returned by default.
Output
Output Content
The output will include each field and any defined configuration for each field. The defined configuration can vary for
each field, but will minimally include the field name, the type, if it is indexed and if it is stored. If multiValued
is defined as either true or false (most likely true), that will also be shown. See the section Defining Fields for more
information about each parameter.
Examples
Input
Get a list of all fields.
curl http://localhost:8983/solr/collection1/schema/fields?wt=json
Output
The sample output below has been truncated to only show a few fields.
{
"fields": [
{
"indexed": true,
"name": "_version_",
"stored": true,
"type": "long"
},
{
"indexed": true,
"name": "author",
"stored": true,
"type": "text_general"
},
{
"indexed": true,
"multiValued": true,
"name": "cat",
"stored": true,
"type": "string"
},
...
],
"responseHeader": {
"QTime": 1,
"status": 0
}
}
List a Specific Field
GET /collection/schema/fields/fieldname
Input
Path Parameters
Key Description
collection The collection (or core) name.
fieldname The specific field name.
Query Parameters
The query parameters can be added to the API request after a '?'.
Key Type Required Default Description
wt string No json Defines the format of the response. The options are json or xml. If not specified, JSON will be returned by default.
Output
Output Content
The output will include each field and any defined configuration for the field. The defined configuration can vary for a
field, but will minimally include the field name, the type, if it is indexed and if it is stored. If multiValued is defined as either true or false (most likely true), that will also be shown. See the section Defining Fields for more
information about each parameter.
Examples
Input
Get the 'author' field.
curl http://localhost:8983/solr/collection1/schema/fields/author?wt=json
Output
{
"field": {
"indexed": true,
"name": "author",
"stored": true,
"type": "text_general"
},
"responseHeader": {
"QTime": 2,
"status": 0
}
}
List Dynamic Fields
GET /collection/schema/dynamicfields
Input
Path Parameters
Key Description
collection The collection (or core) name.
Query Parameters
The query parameters can be added to the API request after a '?'.
Key Type Required Default Description
wt string No json Defines the format of the response. The options are json or xml. If not specified, JSON will be returned by default.
Output
Output Content
The output will include each dynamic field rule and the defined configuration for each rule. The defined configuration
can vary for each rule, but will minimally include the dynamic field name, the type, if it is indexed and if it is stored. See the section Dynamic Fields for more information about each parameter.
Examples
Input
Get a list of all dynamic field declarations
curl http://localhost:8983/solr/collection1/schema/dynamicfields?wt=json
Output
The sample output below has been truncated.
{
"dynamicFields": [
{
"indexed": true,
"name": "*_coordinate",
"stored": false,
"type": "tdouble"
},
{
"multiValued": true,
"name": "ignored_*",
"type": "ignored"
},
{
"name": "random_*",
"type": "random"
},
{
"indexed": true,
"multiValued": true,
"name": "attr_*",
"stored": true,
"type": "text_general"
},
{
"indexed": true,
"multiValued": true,
"name": "*_txt",
"stored": true,
"type": "text_general"
}
...
],
"responseHeader": {
"QTime": 1,
"status": 0
}
}
List a Specific Dynamic Field Rule
GET /collection/schema/dynamicfields/name
Input
Path Parameters
Key Description
collection The collection (or core) name.
name The name of the dynamic field rule.
Query Parameters
The query parameters can be added to the API request after a '?'.
Key Type Required Default Description
wt string No json Defines the format of the response. The options are json or xml. If not specified, JSON will be returned by default.
Output
Output Content
The output will include the requested dynamic field rule and any defined configuration for the rule. The defined
configuration can vary for each rule, but will minimally include the dynamic field name, the type, if it is indexed and if it is stored. See the section Dynamic Fields for more information about each parameter.
Examples
Input
Get the details of the "*_s" rule.
curl http://localhost:8983/solr/collection1/schema/dynamicfields/*_s?wt=json
Output
{
"dynamicfield": {
"indexed": true,
"name": "*_s",
"stored": true,
"type": "string"
},
"responseHeader": {
"QTime": 1,
"status": 0
}
}
List Field Types
GET /collection/schema/fieldtypes
Input
Path Parameters
Key Description
collection The collection (or core) name.
Query Parameters
The query parameters can be added to the API request after a '?'.
Key Type Required Default Description
wt string No json Defines the format of the response. The options are json or xml. If not specified, JSON will be returned by default.
Output
Output Content
The output will include each field type and any defined configuration for the type. The defined configuration can vary
for each type, but will minimally include the field type name and the class. If query or index analyzers, tokenizers, or filters are defined, those will also be shown with other defined parameters. See the section Solr Field Types for more information about how to configure various types of fields.
Examples
Input
Get a list of all field types.
curl http://localhost:8983/solr/collection1/schema/fieldtypes?wt=json
Output
The sample output below has been truncated to show a few different field types from different parts of the list.
{
"fieldTypes": [
{
"analyzer": {
"class": "solr.TokenizerChain",
"filters": [
{
"class": "solr.LowerCaseFilterFactory"
},
{
"class": "solr.TrimFilterFactory"
},
{
"class": "solr.PatternReplaceFilterFactory",
"pattern": "([^a-z])",
"replace": "all",
"replacement": ""
}
],
"tokenizer": {
"class": "solr.KeywordTokenizerFactory"
}
},
"class": "solr.TextField",
"dynamicFields": [],
"fields": [],
"name": "alphaOnlySort",
"omitNorms": true,
"sortMissingLast": true
},
...
{
"class": "solr.TrieFloatField",
"dynamicFields": [
"*_fs",
"*_f"
],
"fields": [
"price",
"weight"
],
"name": "float",
"positionIncrementGap": "0",
"precisionStep": "0"
},
...
}
List a Specific Field Type
GET /collection/schema/fieldtypes/name
Input
Path Parameters
Key Description
collection The collection (or core) name.
name The name of the field type.
Query Parameters
The query parameters can be added to the API request after a '?'.
Key Type Required Default Description
wt string No json Defines the format of the response. The options are json or xml. If not specified, JSON will be returned by default.
Output
Output Content
The output will include each field type and any defined configuration for the type. The defined configuration can vary
for each type, but will minimally include the field type name and the class. If query and/or index analyzers, tokenizers, or filters are defined, those will be shown with other defined parameters. See the section Solr Field Types for more information about how to configure various types of fields.
Examples
Input
Get details of the "date" field type.
curl http://localhost:8983/solr/collection1/schema/fieldtypes/date?wt=json
Output
The sample output below has been truncated.
{
"fieldType": {
"class": "solr.TrieDateField",
"dynamicFields": [
"*_dts",
"*_dt"
],
"fields": [
"last_modified"
],
"name": "date",
"positionIncrementGap": "0",
"precisionStep": "0"
},
"responseHeader": {
"QTime": 2,
"status": 0
}
}
List Copy Fields
GET /collection/schema/copyfields
Input
Path Parameters
Key Description
collection The collection (or core) name.
Query Parameters
The query parameters can be added to the API request after a '?'.
Key Type Required Default Description
wt string No json Defines the format of the response. The options are json or xml. If not specified, JSON will be returned by default.
Output
Output Content
The output will include the source and destination of each copy field rule defined in schema.xml. For more information about copying fields, see the section Copying Fields.
Examples
Input
Get a list of all copyfields.
curl http://localhost:8983/solr/collection1/schema/copyfields?wt=json
Output
The sample output below has been truncated to the first few copy definitions.
{
"copyFields": [
{
"dest": "text",
"source": "author"
},
{
"dest": "text",
"source": "cat"
},
{
"dest": "text",
"source": "content"
},
{
"dest": "text",
"source": "content_type"
},
...
],
"responseHeader": {
"QTime": 3,
"status": 0
}
}
Show Schema Name
GET /collection/schema/name
Input
Path Parameters
Key Description
collection The collection (or core) name.
Query Parameters
The query parameters can be added to the API request after a '?'.
Key Type Required Default Description
wt string No json Defines the format of the response. The options are json or xml. If not specified, JSON will be returned by default.
Output
Output Content
The output will be simply the name given to the schema.
Examples
Input
Get the schema name.
curl http://localhost:8983/solr/collection1/schema/name?wt=json
Output
{
"responseHeader":{
"status":0,
"QTime":1},
"name":"example"}
Show the Schema Version
GET /collection/schema/version
Input
Path Parameters
Key Description
collection The collection (or core) name.
Query Parameters
The query parameters can be added to the API request after a '?'.
Key Type Required Default Description
wt string No json Defines the format of the response. The options are json or xml. If not specified, JSON will be returned by default.
Output
Output Content
The output will simply be the schema version in use.
Examples
Input
Get the schema version
curl http://localhost:8983/solr/collection1/schema/version?wt=json
Output
{
"responseHeader":{
"status":0,
"QTime":2},
"version":1.5}
List UniqueKey
GET /collection/schema/uniquekey
Input
Path Parameters
Key Description
collection The collection (or core) name.
Query Parameters
The query parameters can be added to the API request after a '?'.
Key Type Required Default Description
wt string No json Defines the format of the response. The options are json or xml. If not specified, JSON will be returned by default.
Output
Output Content
The output will include simply the field name that is defined as the uniqueKey for the index.
Examples
Input
List the uniqueKey.
curl http://localhost:8983/solr/collection1/schema/uniquekey?wt=json
Output
{
"responseHeader":{
"status":0,
"QTime":2},
"uniqueKey":"id"}
Show Global Similarity
GET /collection/schema/similarity
Input
Path Parameters
Key Description
collection The collection (or core) name.
Query Parameters
The query parameters can be added to the API request after a '?'.
Key Type Required Default Description
wt string No json Defines the format of the response. The options are json or xml. If not specified, JSON will be returned by default.
Output
Output Content
The output will include the class name of the global similarity defined (if any).
Examples
Input
Get the similarity implementation.
curl http://localhost:8983/solr/collection1/schema/similarity?wt=json
Output
{
"responseHeader":{
"status":0,
"QTime":1},
"similarity":{
"class":"org.apache.solr.search.similarities.DefaultSimilarityFactory"}}
Get the Default Query Operator
GET /collection/schema/solrqueryparser/defaultoperator
Input
Path Parameters
Key Description
collection The collection (or core) name.
Query Parameters
The query parameters can be added to the API request after a '?'.
Key Type Required Default Description
wt string No json Defines the format of the response. The options are json or xml. If not specified, JSON will be returned by default.
Output
Output Content
The output will simply include the default query operator, even if none has been explicitly defined by the user.
Examples
Input
Get the default operator.
curl
http://localhost:8983/solr/collection1/schema/solrqueryparser/defaultoperator?wt=json
Output
{
"responseHeader":{
"status":0,
"QTime":2},
"defaultOperator":"OR"}
Modify the schema
Create new schema fields
POST /collection/schema/fields
To enable schema modification, the schema will need to be managed and mutable. See the section Managed Schema Definition in SolrConfig for more information.
Input
Path Parameters
Key Description
collection The collection (or core) name.
Query Parameters
The query parameters can be added to the API request after a '?'.
Key Type Required Default Description
wt string No json Defines the format of the response. The options are json or xml. If not specified, json will be returned by default.
Request body
Only JSON format is supported in the request body. The JSON must contain an array of one or more new field
specifications, each of which must include mappings for the new field's name and type. All attributes specifiable on a schema <field name="..." ... /> declaration may be specified here - see Defining Fields.
Additionally, copyField destination(s) may optionally be specified. Note that each specified copyField destination must be an existing schema field (and not a dynamic field). In particular, since the new fields specified in a new field creation request are defined all at once, you cannot specify a copyField that targets another new field in the same request - instead, you have to make two requests, defining the copyField destination field in the first new field creation request, then specifying that field as a copyField destination in the second new field creation request.
The curl utility can provide the request body via its --data-binary option.
Output
Output Content
The output will be the response header, containing a status code, and if there was a problem, an associated error
message.
Example output in the default JSON format:
{
"responseHeader":{
"status":0,
"QTime":8}}
Examples
Input
Add two new fields:
curl http://localhost:8983/solr/collection1/schema/fields -X POST -H
'Content-type:application/json' --data-binary '
[
{
"name":"sell-by",
"type":"tdate",
"stored":true
},
{
"name":"catchall",
"type":"text_general",
"stored":false
}
]'
Add a third new field and copy it to the "catchall" field created above:
curl http://localhost:8983/solr/collection1/schema/fields -X POST -H
'Content-type:application/json' --data-binary '
[
{
"name":"department",
"type":"string",
"docValues":"true",
"default":"no department",
"copyFields": [ "catchall" ]
}
]'
Create one new schema field
PUT /collection/schema/fields/name
To enable schema modification, the schema will need to be managed and mutable. See the section Managed Schema Definition in SolrConfig for more information.
Input
Path Parameters
Key Description
collection The collection (or core) name.
name The new field name.
Query Parameters
The query parameters can be added to the API request after a '?'.
Key Type Required Default Description
wt string No json Defines the format of the response. The options are json or xml. If not specified, json will be returned by default.
Request body
Only JSON format is supported in the request body. The body must include a set of mappings, minimally for the new
field's name and type. All attributes specifiable on a schema <field name="..." ... /> declaration may be specified here - see Defining Fields.
Additionally, copyField destination(s) may optionally be specified. Note that each specified copyField destination must be an existing schema field (and not a dynamic field).
The curl utility can provide the request body via its --data-binary option.
Output
Output Content
The output will be the response header, containing a status code, and if there was a problem, an associated error
message.
Example output in the default JSON format:
{
"responseHeader":{
"status":0,
"QTime":4}}
Examples
Input
Add a new field named "narrative":
curl http://localhost:8983/solr/collection1/schema/fields/narrative -X PUT -H
'Content-type:application/json' --data-binary '
{
"type":"text_general",
"stored":true,
"termVectors":true,
"termPositions":true,
"termOffsets":true
}'
Add a new field named "color" and copy it to two fields, named "narrative" and "catchall", which must already exist in
the schema:
curl http://localhost:8983/solr/collection1/schema/fields/color -X PUT -H
'Content-type:application/json' --data-binary '
{
"type":"string",
"stored":true,
"copyFields": [
"narrative",
"catchall"
]
}'
Create new copyField directives
POST /collection/schema/copyfields
To enable schema modification, the schema will need to be managed and mutable. See the section Managed Schema Definition in SolrConfig for more information.
Input
Path Parameters
Key Description
collection The collection (or core) name.
Query Parameters
The query parameters can be added to the API request after a '?'.
Key Type Required Default Description
wt string No json Defines the format of the response. The options are json or xml. If not specified, json will be returned by default.
Request body
Only JSON format is supported in the request body. The body must contain an array of zero or more copyField
directives, each containing a mapping from source to the source field name, and from dest to an array of destination field name(s).
source field names must either be an existing field, or be a field name glob (with an asterisk either at the beginning or the end, or consisting entirely of a single asterisk). dest field names must either be existing fields, or, if source is a glob, dest fields may be globs that match an existing dynamic field.
The curl utility can provide the request body via its --data-binary option.
Output
Output Content
The output will be the response header, containing a status code, and if there was a problem, an associated error
message.
Example output in the default JSON format:
{
"responseHeader":{
"status":0,
"QTime":2}}
Examples
Input
Copy the "affiliations" field to the "relations" field, and the "shelf" field to the "location" and "catchall" fields:
curl http://localhost:8983/solr/collection1/schema/copyfields -X POST -H
'Content-type:application/json' --data-binary '
[
{
"source":"affiliations",
"dest": [
"relations"
]
},
{
"source":"shelf",
"dest": [
"location",
"catchall"
]
}
]'
Copy all field names matching "finance_*" to the "*_s" dynamic field:
curl http://localhost:8983/solr/collection1/schema/copyfields -X POST -H
'Content-type:application/json' --data-binary '
[
{
"source":"finance_*",
"dest": [
"*_s"
]
}
]'
Manage Resource Data
The Managed Resources REST API provides a mechanism for any Solr plugin to expose resources that should support CRUD (Create, Read, Update, Delete) operations. Depending on what Field Types and Analyzers are configured in your Schema, additional /schema/ REST API paths may exist. See the section Managed Resources for more information and examples.
Related Topics
Managed Schema Definition in SolrConfig
Putting the Pieces Together
At the highest level, schema.xml is structured as follows. This example is not real XML, but it gives you an idea of the structure of the file.
<schema>
<types>
<fields>
<uniqueKey>
<defaultSearchField>
<solrQueryParser defaultOperator>
<copyField>
</schema>
Obviously, most of the excitement is in types and fields, where the field types and the actual field definitions live.
These are supplemented by copyFields. Sandwiched between fields and the copyField section are the unique
key, default search field, and the default query operator.
Choosing Appropriate Numeric Types
For general numeric needs, use TrieIntField, TrieLongField, TrieFloatField, and TrieDoubleField with precisionStep="0".
If you expect users to make frequent range queries on numeric types, use the default precisionStep (by not specifying it) or specify it as precisionStep="8" (which is the default). This offers faster speed for range queries at the expense of increasing index size.
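As an illustrative sketch (the field type names "int_exact" and "int_range" are invented for this example and are not part of the example schema), the two configurations described above might look like this in schema.xml:
<!-- Exact lookups only: precisionStep="0" indexes a single term per value -->
<fieldType name="int_exact" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
<!-- Frequent range queries: precisionStep="8" indexes extra terms per value,
     speeding up range queries at the cost of a larger index -->
<fieldType name="int_range" class="solr.TrieIntField" precisionStep="8" positionIncrementGap="0"/>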
Working With Text
Handling text properly will make your users happy by providing them with the best possible results for text searches.
One technique is using a text field as a catch-all for keyword searching. Most users are not sophisticated about their searches and the most common search is likely to be a simple keyword search. You can use copyField to take a variety of fields and funnel them all into a single text field for keyword searches. In the example schema representing a store, copyField is used to dump the contents of cat, name, manu, features, and includes into a single field, text. In addition, it could be a good idea to copy ID into text in case users wanted to search for a particular product by passing its product number to a keyword search.
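A minimal sketch of that catch-all setup, using the field names from the example store schema described above (the exact attributes of the "text" field may differ in your schema):
<!-- Destination field for keyword searches; multiValued because several sources feed it -->
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
<!-- Funnel the individual fields into the catch-all field -->
<copyField source="cat" dest="text"/>
<copyField source="name" dest="text"/>
<copyField source="manu" dest="text"/>
<copyField source="features" dest="text"/>
<copyField source="includes" dest="text"/>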
Another technique is using copyField to use the same field in different ways. Suppose you have a field that is a list of authors, like this:
Schildt, Herbert; Wolpert, Lewis; Davies, P.
For searching by author, you could tokenize the field, convert to lower case, and strip out punctuation:
schildt / herbert / wolpert / lewis / davies / p
For sorting, just use an untokenized field, converted to lower case, with punctuation stripped:
schildt herbert wolpert lewis davies p
Finally, for faceting, use the primary author only via a StringField:
Schildt, Herbert
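A hedged sketch of how the three treatments could be declared (the field names "author_sort" and "author_facet" and the types used are illustrative, not taken from the example schema; the faceting field would be populated with the primary author by the indexing client, since copyField always copies the full value):
<!-- Tokenized, lowercased field for searching by author -->
<field name="author" type="text_general" indexed="true" stored="true"/>
<!-- Untokenized, lowercased, punctuation-stripped copy for sorting,
     e.g. a type built from KeywordTokenizer + LowerCaseFilter such as alphaOnlySort -->
<field name="author_sort" type="alphaOnlySort" indexed="true" stored="false"/>
<copyField source="author" dest="author_sort"/>
<!-- Plain string field holding only the primary author, for faceting -->
<field name="author_facet" type="string" indexed="true" stored="false"/>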
Related Topics
SchemaXML
DocValues
An exciting addition to Solr functionality was introduced in Solr 4.2. This functionality has been around in Lucene for
a while, but is now available to Solr users.
DocValues are a way of building the index that is more efficient for some purposes.
Why DocValues?
The standard way that Solr builds the index is with an inverted index. This style builds a list of terms found in all the
documents in the index and next to each term is a list of documents that the term appears in (as well as how many
times the term appears in that document). This makes search very fast - since users search by terms, having a
ready list of term-to-document values makes the query process faster.
For other features that we now commonly associate with search, such as sorting, faceting, and highlighting, this
approach is not very efficient. The faceting engine, for example, must look up each term that appears in each
document that will make up the result set and pull the document IDs in order to build the facet list. In Solr, this is
maintained in memory, and can be slow to load (depending on the number of documents, terms, etc.).
In Lucene 4.0, a new approach was introduced. DocValue fields are now column-oriented fields with a
document-to-value mapping built at index time. This approach promises to relieve some of the memory
requirements of the fieldCache and make lookups for faceting, sorting, and grouping much faster.
How to Use DocValues
To use docValues, you only need to enable it for a field that you will use it with. As with all schema design, you need
to define a field type and then define fields of that type with docValues enabled. All of these actions are done in schema.xml.
Enabling a field for docValues only requires adding docValues="true" to the field definition, as in this example (from Solr's default schema.xml):
<field name="manu_exact" type="string" indexed="false" stored="false" docValues="true"
/>
Prior to Solr 4.5, a field could not be empty to be used with docValues; in Solr 4.5, that restriction is removed.
If you have already indexed data into your Solr index, you will need to completely re-index your content after
changing your field definitions in schema.xml in order to successfully use docValues.
DocValues are only available for specific field types. The types chosen determine the underlying Lucene docValue type that will be used. The available Solr field types are:
1. String fields of type StrField.
If the field is single-valued (i.e., multi-valued is false), Lucene will use the SORTED type.
If the field is multi-valued, Lucene will use the SORTED_SET type.
2. Any Trie* fields.
If the field is single-valued (i.e., multi-valued is false), Lucene will use the NUMERIC type.
If the field is multi-valued, Lucene will use the SORTED_SET type.
3. UUID fields
These Lucene types are related to how the values are sorted and stored.
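For illustration only (these field names are hypothetical; "string" and "tint" refer to StrField- and TrieIntField-based types like those in the example schema), the mapping described above plays out as follows:
<!-- Single-valued StrField: Lucene uses the SORTED docValues type -->
<field name="dept_code" type="string" indexed="true" stored="true" docValues="true"/>
<!-- Multi-valued StrField: Lucene uses the SORTED_SET docValues type -->
<field name="tags" type="string" indexed="true" stored="true" multiValued="true" docValues="true"/>
<!-- Single-valued Trie field: Lucene uses the NUMERIC docValues type -->
<field name="unit_count" type="tint" indexed="true" stored="true" docValues="true"/>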
There is an additional configuration option available, which is to modify the docValuesFormat used by the field type. The default implementation employs a mixture of loading some things into memory and keeping some on disk. In some cases, however, you may choose to specify an alternative DocValuesFormat implementation. For example, you could choose to keep everything in memory by specifying docValuesFormat="Memory" on a field type:
<fieldType name="string_in_mem_dv" class="solr.StrField" docValues="true"
docValuesFormat="Memory" />
Please note that the docValuesFormat option may change in future releases.
Related Topics
DocValues are quite new to Solr. For more background see:
Introducing Lucene Index Doc Values, by Simon Willnauer, at SearchWorkings.org
Fun with DocValues in Solr 4.2, by David Arthur, at SearchHub.org
The old wiki page on DocValues (note, that page is now obsoleted by this one)
Schemaless Mode
Schemaless Mode is a set of Solr features that, when used together, allow users to rapidly construct an effective
schema by simply indexing sample data, without having to manually edit the schema. These Solr features, all
specified in solrconfig.xml, are:
Managed schema: Schema modifications are made through Solr APIs rather than manual edits - see Managed Schema Definition in SolrConfig.
Field value class guessing: Previously unseen fields are run through a cascading set of value-based parsers,
which guess the Java class of field values - parsers for Boolean, Integer, Long, Float, Double, and Date are
currently available.
Automatic schema field addition, based on field value class(es): Previously unseen fields are added to the
schema, based on field value Java classes, which are mapped to schema field types - see Solr Field Types.
Using the Schemaless Example
Lucene index back-compatibility is only supported for the default codec. If you choose to customize the docValuesFormat in your schema.xml, upgrading to a future version of Solr may require you to either switch back to the default codec and optimize your index to rewrite it into the default codec before upgrading, or re-build your entire index from scratch after upgrading.
The three features of schemaless mode are pre-configured in the example/example-schemaless/solr/ directory in the Solr distribution. To start Solr in this pre-configured schemaless mode, go to the example/ directory and start up Solr, setting the solr.solr.home system property to this directory on the command line:
java -Dsolr.solr.home=example-schemaless/solr -jar start.jar
The schema in example-schemaless/solr/collection1/conf/ is shipped with only two fields, id and _version_, as can be seen from calling the /schema/fields Schema API - curl http://localhost:8983/solr/schema/fields outputs:
{
"responseHeader":{
"status":0,
"QTime":1},
"fields":[{
"name":"_version_",
"type":"long",
"indexed":true,
"stored":true},
{
"name":"id",
"type":"string",
"multiValued":false,
"indexed":true,
"required":true,
"stored":true,
"uniqueKey":true}]}
Configuring Schemaless Mode
As described above, there are three configuration elements that need to be in place to use Solr in schemaless
mode. If you use the solrconfig.xml from example/example-schemaless these elements are configured
already. If, however, you would like to implement schemaless on your own, you should make the following changes.
Enable Managed Schema
As described in the section Managed Schema Definition in SolrConfig, changing the schemaFactory will allow the schema to be modified by the Schema API. Your solrconfig.xml should have a section like the one below (and
the ClassicIndexSchemaFactory should be commented out or removed).
<schemaFactory class="ManagedIndexSchemaFactory">
<bool name="mutable">true</bool>
<str name="managedSchemaResourceName">managed-schema</str>
</schemaFactory>
Define an UpdateRequestProcessorChain
The UpdateRequestProcessorChain allows Solr to guess field types, and you can define the default field type
classes to use. To start, you should define it as follows (see the javadoc links below for update processor factory
documentation):
<updateRequestProcessorChain name="add-unknown-fields-to-the-schema">
<!-- UUIDUpdateProcessorFactory will generate an id if none is present in the
incoming document -->
<processor class="solr.UUIDUpdateProcessorFactory" />
<processor class="solr.LogUpdateProcessorFactory"/>
<processor class="solr.DistributedUpdateProcessorFactory"/>
<processor class="solr.RemoveBlankFieldUpdateProcessorFactory"/>
<processor class="solr.ParseBooleanFieldUpdateProcessorFactory"/>
<processor class="solr.ParseLongFieldUpdateProcessorFactory"/>
<processor class="solr.ParseDoubleFieldUpdateProcessorFactory"/>
<processor class="solr.ParseDateFieldUpdateProcessorFactory">
<arr name="format">
<str>yyyy-MM-dd'T'HH:mm:ss.SSSZ</str>
<str>yyyy-MM-dd'T'HH:mm:ss,SSSZ</str>
<str>yyyy-MM-dd'T'HH:mm:ss.SSS</str>
<str>yyyy-MM-dd'T'HH:mm:ss,SSS</str>
<str>yyyy-MM-dd'T'HH:mm:ssZ</str>
<str>yyyy-MM-dd'T'HH:mm:ss</str>
<str>yyyy-MM-dd'T'HH:mmZ</str>
<str>yyyy-MM-dd'T'HH:mm</str>
<str>yyyy-MM-dd HH:mm:ss.SSSZ</str>
<str>yyyy-MM-dd HH:mm:ss,SSSZ</str>
<str>yyyy-MM-dd HH:mm:ss.SSS</str>
<str>yyyy-MM-dd HH:mm:ss,SSS</str>
<str>yyyy-MM-dd HH:mm:ssZ</str>
<str>yyyy-MM-dd HH:mm:ss</str>
<str>yyyy-MM-dd HH:mmZ</str>
<str>yyyy-MM-dd HH:mm</str>
<str>yyyy-MM-dd</str>
</arr>
</processor>
<processor class="solr.AddSchemaFieldsUpdateProcessorFactory">
<str name="defaultFieldType">text_general</str>
<lst name="typeMapping">
<str name="valueClass">java.lang.Boolean</str>
<str name="fieldType">booleans</str>
</lst>
<lst name="typeMapping">
<str name="valueClass">java.util.Date</str>
<str name="fieldType">tdates</str>
</lst>
<lst name="typeMapping">
<str name="valueClass">java.lang.Long</str>
<str name="valueClass">java.lang.Integer</str>
<str name="fieldType">tlongs</str>
</lst>
<lst name="typeMapping">
<str name="valueClass">java.lang.Number</str>
<str name="fieldType">tdoubles</str>
</lst>
</processor>
<processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
Javadocs for update processor factories mentioned above:
UUIDUpdateProcessorFactory
ParseBooleanFieldUpdateProcessorFactory
ParseLongFieldUpdateProcessorFactory
ParseDoubleFieldUpdateProcessorFactory
ParseDateFieldUpdateProcessorFactory
AddSchemaFieldsUpdateProcessorFactory
Make the UpdateRequestProcessorChain the Default for the UpdateRequestHandler
Once the UpdateRequestProcessorChain has been defined, you must instruct your UpdateRequestHandler to use it
when working with index updates (i.e., adding, removing, replacing documents). Here is an example using the /update requestHandler:
<requestHandler name="/update" class="solr.UpdateRequestHandler">
<lst name="defaults">
<str name="update.chain">add-unknown-fields-to-the-schema</str>
</lst>
</requestHandler>
Examples of Indexed Documents
Once the schemaless mode has been enabled (whether you configured it manually or are using example-schemaless), documents that include fields that are not defined in your schema should be added to the index, and the new
fields added to the schema.
For example, adding a CSV document will cause its fields that are not in the schema to be added, with fieldTypes
based on values:
curl "http://localhost:8983/solr/update?commit=true" -H "Content-type:application/csv"
-d '
id,Artist,Album,Released,Rating,FromDistributor,Sold
44C,Old Shews,Mead for Walking,1988-08-13,0.01,14,0'
Output indicating success:
<response>
<lst name="responseHeader"><int name="status">0</int><int
name="QTime">106</int></lst>
</response>
The fields now in the schema (output from curl http://localhost:8983/solr/schema/fields):
After each of these changes has been made, Solr should be restarted (or, you can reload the cores to load the new solrconfig.xml definitions).
{
"responseHeader":{
"status":0,
"QTime":1},
"fields":[{
"name":"Album",
"type":"text_general"}, // Field value guessed as String -> text_general
fieldType
{
"name":"Artist",
"type":"text_general"}, // Field value guessed as String -> text_general
fieldType
{
"name":"FromDistributor",
"type":"tlongs"}, // Field value guessed as Long -> tlongs fieldType
{
"name":"Rating",
"type":"tdoubles"}, // Field value guessed as Double -> tdoubles fieldType
{
"name":"Released",
"type":"tdates"}, // Field value guessed as Date -> tdates fieldType
{
"name":"Sold",
"type":"tlongs"}, // Field value guessed as Long -> tlongs fieldType
{
"name":"_version_",
...
},
{
"name":"id",
...
}]}
Once a field has been added to the schema, its field type is fixed. As a consequence, adding documents with field
value(s) that conflict with the previously guessed field type will fail. For example, after adding the above document,
the Sold field has the fieldType tlongs, but the document below has a non-integral decimal value in this field:
curl "http://localhost:8983/solr/update?commit=true" -H "Content-type:application/csv"
-d '
id,Description,Sold
19F,Cassettes by the pound,4.93'
This document will fail, as shown in this output:
You Can Still Be Explicit
Even if you want to use schemaless mode for most fields, you can still use the Schema API to pre-emptively
create some fields, with explicit types, before you index documents that use them.
Internally, the Schema REST API and the Schemaless Update Processors both use the same Managed Schema functionality.
<response>
<lst name="responseHeader">
<int name="status">400</int>
<int name="QTime">7</int>
</lst>
<lst name="error">
<str name="msg">ERROR: [doc=19F] Error adding field 'Sold'='4.93' msg=For input
string: "4.93"</str>
<int name="code">400</int>
</lst>
</response>
Understanding Analyzers, Tokenizers, and Filters
The following sections describe how Solr breaks down and works with textual data. There are three main concepts
to understand: analyzers, tokenizers, and filters.
Field analyzers are used both during ingestion, when a document is indexed, and at query time. An analyzer
examines the text of fields and generates a token stream. Analyzers may be a single class or they may be
composed of a series of tokenizer and filter classes.
Tokenizers break field data into lexical units, or tokens.
Filters examine a stream of tokens and keep them, transform or discard them, or create new ones. Tokenizers and
filters may be combined to form pipelines, or chains, where the output of one is input to the next. Such a sequence of tokenizers and filters is called an analyzer and the resulting output of an analyzer is used to match query results
or build indices.
Using Analyzers, Tokenizers and Filters
Although the analysis process is used for both indexing and querying, the same analysis process need not be used
for both operations. For indexing, you often want to simplify, or normalize, words. For example, setting all letters to
lowercase, eliminating punctuation and accents, mapping words to their stems, and so on. Doing so can increase
recall because, for example, "ram", "Ram" and "RAM" would all match a query for "ram". To increase query-time
precision, a filter could be employed to narrow the matches by, for example, ignoring all-cap acronyms if you're
interested in male sheep, but not Random Access Memory.
The tokens output by the analysis process define the values, or terms, of that field and are used either to build an index of those terms when a new document is added, or to identify which documents contain the terms you are
querying for.
For More Information
These sections will show you how to configure field analyzers and also serve as a reference for the details of configuring each of the available tokenizer and filter classes. They also serve as a guide so that you can configure your
own analysis classes if you have special needs that cannot be met with the included filters or tokenizers.
For Analyzers, see:
Analyzers: Detailed conceptual information about Solr analyzers.
Running Your Analyzer: Detailed information about testing and running your Solr analyzer.
For Tokenizers, see:
About Tokenizers: Detailed conceptual information about Solr tokenizers.
Tokenizers: Information about configuring tokenizers, and about the tokenizer factory classes included in this
distribution of Solr.
For Filters, see:
About Filters: Detailed conceptual information about Solr filters.
Filter Descriptions: Information about configuring filters, and about the filter factory classes included in this
distribution of Solr.
CharFilterFactories: Information about filters for pre-processing input characters.
To find out how to use Tokenizers and Filters with various languages, see:
Language Analysis: Information about tokenizers and filters for character set conversion or for use with
specific languages.
Analyzers
An analyzer examines the text of fields and generates a token stream. Analyzers are specified as a child of the <fieldType> element in the schema.xml configuration file that can be found in the solr/conf directory, or wherever solrconfig.xml is located.
In normal usage, only fields of type solr.TextField will specify an analyzer. The simplest way to configure an analyzer is with a single <analyzer> element whose class attribute is a fully qualified Java class name. The named class must derive from org.apache.lucene.analysis.Analyzer. For example:
<fieldType name="nametext" class="solr.TextField">
<analyzer class="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
</fieldType>
In this case a single class, WhitespaceAnalyzer, is responsible for analyzing the content of the named text field
and emitting the corresponding tokens. For simple cases, such as plain English prose, a single analyzer class like
this may be sufficient. But it's often necessary to do more complex analysis of the field content.
Even the most complex analysis requirements can usually be decomposed into a series of discrete, relatively simple
processing steps. As you will soon discover, the Solr distribution comes with a large selection of tokenizers and
filters that covers most scenarios you are likely to encounter. Setting up an analyzer chain is very straightforward;
you specify a simple <analyzer> element (no class attribute) with child elements that name factory classes for the
tokenizer and filters to use, in the order you want them to run.
For example:
<fieldType name="nametext" class="solr.TextField">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory"/>
</analyzer>
</fieldType>
Note that classes in the org.apache.solr.analysis package may be referred to here with the shorthand solr. prefix.
In this case, no Analyzer class was specified on the <analyzer> element. Rather, a sequence of more specialized classes are wired together and collectively act as the Analyzer for the field. The text of the field is passed to the first item in the list (solr.StandardTokenizerFactory), and the tokens that emerge from the last one (solr.EnglishPorterFilterFactory) are the terms that are used for indexing or querying any fields that use the "nametext" fieldType.
Analysis Phases
Analysis takes place in two contexts. At index time, when a field is being created, the token stream that results from
analysis is added to an index and defines the set of terms (including positions, sizes, and so on) for the field. At
query time, the values being searched for are analyzed and the terms that result are matched against those that are
stored in the field's index.
In many cases, the same analysis should be applied to both phases. This is desirable when you want to query for
exact string matches, possibly with case-insensitivity, for example. In other cases, you may want to apply slightly
different analysis steps during indexing than those used at query time.
If you provide a simple <analyzer> definition for a field type, as in the examples above, then it will be used for both indexing and queries. If you want distinct analyzers for each phase, you may include two <analyzer> definitions
distinguished with a type attribute. For example:
<fieldType name="nametext" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeepWordFilterFactory" words="keepwords.txt"/>
<filter class="solr.SynonymFilterFactory" synonyms="syns.txt"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
In this theoretical example, at index time the text is tokenized, the tokens are set to lowercase, any that are not listed
in keepwords.txt are discarded and those that remain are mapped to alternate values as defined by the synonym rules in the file syns.txt. This essentially builds an index from a restricted set of possible values and then
normalizes them to values that may not even occur in the original text.
At query time, the only normalization that happens is to convert the query terms to lowercase. The filtering and
mapping steps that occur at index time are not applied to the query terms. Queries must then, in this example, be
very precise, using only the normalized terms that were stored at index time.
About Tokenizers
The job of a tokenizer is to break up a stream of text into tokens, where each token is (usually) a sub-sequence of
the characters in the text. An analyzer is aware of the field it is configured for, but a tokenizer is not. Tokenizers read
from a character stream (a Reader) and produce a sequence of Token objects (a TokenStream).
Characters in the input stream may be discarded, such as whitespace or other delimiters. They may also be added
to or replaced, such as mapping aliases or abbreviations to normalized forms. A token contains various metadata in
addition to its text value, such as the location at which the token occurs in the field. Because a tokenizer may
produce tokens that diverge from the input text, you should not assume that the text of the token is the same text
that occurs in the field, or that its length is the same as the original text. It's also possible for more than one token to
have the same position or refer to the same offset in the original text. Keep this in mind if you use token metadata
for things like highlighting search results in the field text.
<fieldType name="text" class="solr.TextField">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>
</fieldType>
The class named in the tokenizer element is not the actual tokenizer, but rather a class that implements the org.apache.solr.analysis.TokenizerFactory interface. This factory class will be called upon to create new tokenizer instances as needed. Objects created by the factory must derive from org.apache.lucene.analysis.TokenStream, which indicates that they produce sequences of tokens. If the tokenizer produces tokens that are
usable as is, it may be the only component of the analyzer. Otherwise, the tokenizer's output tokens will serve as
input to the first filter stage in the pipeline.
A TypeTokenFilterFactory is available that creates a TypeTokenFilter that filters tokens based on their TypeAttribute, which is set in factory.getStopTypes.
For a complete list of the available TokenFilters, see the section Tokenizers.
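As a rough sketch (the field type name and the types file name are hypothetical; the file would list token types such as <NUM>, one per line, to be removed, or kept if useWhitelist="true" is set):
<fieldType name="text_no_numbers" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Remove tokens whose type appears in stoptypes.txt -->
    <filter class="solr.TypeTokenFilterFactory" types="stoptypes.txt"/>
  </analyzer>
</fieldType>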
When To use a CharFilter vs. a TokenFilter
There are several pairs of CharFilters and TokenFilters that have related (ie: MappingCharFilter and ASCIIFoldingFilter) or nearly identical (ie: PatternReplaceCharFilterFactory and PatternReplaceFilterFactory) functionality and it may not always be obvious which is the best choice.
The decision about which to use depends largely on which Tokenizer you are using, and whether you need to
preprocess the stream of characters.
For example, suppose you have a tokenizer such as StandardTokenizer and although you are pretty happy with
how it works overall, you want to customize how some specific characters behave. You could modify the rules and
re-build your own tokenizer with JFlex, but it might be easier to simply map some of the characters before
tokenization with a CharFilter.
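A hedged sketch of that approach: a MappingCharFilterFactory rewrites selected characters before the StandardTokenizer sees them (the field type name is illustrative; the mapping file referenced here is one of the sample mapping files shipped with the Solr example configuration, and could be replaced by your own):
<fieldType name="text_mapped" class="solr.TextField">
  <analyzer>
    <!-- CharFilters operate on the raw character stream, before tokenization -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>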
About Filters
Like tokenizers, filters consume input and produce a stream of tokens. Filters also derive from org.apache.lucene.analysis.TokenStream. Unlike tokenizers, a filter's input is another TokenStream. The job of a filter is usually
easier than that of a tokenizer since in most cases a filter looks at each token in the stream sequentially and decides
whether to pass it along, replace it or discard it.
A filter may also do more complex analysis by looking ahead to consider multiple tokens at once, although this is
less common. One hypothetical use for such a filter might be to normalize state names that would be tokenized as
two words. For example, the single token "california" would be replaced with "CA", while the token pair "rhode"
followed by "island" would become the single token "RI".
Because filters consume one TokenStream and produce a new TokenStream, they can be chained one after
another indefinitely. Each filter in the chain in turn processes the tokens produced by its predecessor. The order in
which you specify the filters is therefore significant. Typically, the most general filtering is done first, and later filtering
stages are more specialized.
<fieldType name="text" class="solr.TextField">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory"/>
</analyzer>
</fieldType>
This example starts with Solr's standard tokenizer, which breaks the field's text into tokens. Those tokens then pass
through Solr's standard filter, which removes dots from acronyms, and performs a few other common operations. All
the tokens are then set to lowercase, which will facilitate case-insensitive matching at query time.
The last filter in the above example is a stemmer filter that uses the Porter stemming algorithm. A stemmer is
basically a set of mapping rules that maps the various forms of a word back to the base, or stem, word from which
they derive. For example, in English the words "hugs", "hugging" and "hugged" are all forms of the stem word "hug".
The stemmer will replace all of these terms with "hug", which is what will be indexed. This means that a query for
"hug" will match the term "hugged", but not "huge".
Conversely, applying a stemmer to your query terms will allow queries containing non stem terms, like "hugging", to
match documents with different variations of the same stem word, such as "hugged". This works because both the
indexer and the query will map to the same stem ("hug").
Word stemming is, obviously, very language specific. Solr includes several language-specific stemmers created by
the Snowball generator that are based on the Porter stemming algorithm. The generic Snowball Porter Stemmer
Filter can be used to configure any of these language stemmers. Solr also includes a convenience wrapper for the
English Snowball stemmer. There are also several purpose-built stemmers for non-English languages. These
stemmers are described in Language Analysis.
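For instance, a minimal sketch of an analyzer chain that applies the generic Snowball Porter stemmer for English (the field type name is illustrative; other languages can be selected with the language attribute):
<fieldType name="text_en_stem" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- language chooses which Snowball-generated stemmer to apply -->
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>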
Tokenizers
You configure the tokenizer for a text field type in schema.xml with a <tokenizer> element, as a child of <analyzer>:
<fieldType name="text" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
</analyzer>
</fieldType>
The class attribute names a factory class that will instantiate a tokenizer object
when needed. Tokenizer factory classes implement the org.apache.solr.analysis.TokenizerFactory. A TokenizerFactory's create() method
accepts a Reader and returns a TokenStream. When Solr creates the tokenizer it
passes a Reader object that provides the content of the text field.
Tokenizers discussed in this section:
Standard Tokenizer
Classic Tokenizer
Keyword Tokenizer
Letter Tokenizer
Lower Case Tokenizer
N-Gram Tokenizer
Edge N-Gram Tokenizer
ICU Tokenizer
Path Hierarchy Tokenizer
Regular Expression Pattern Tokenizer
UAX29 URL Email Tokenizer
White Space Tokenizer
Related Topics
Arguments may be passed to tokenizer factories by setting attributes on the <tokenizer> element.
<fieldType name="semicolonDelimited" class="solr.TextField">
<analyzer type="query">
<tokenizer class="solr.PatternTokenizerFactory" pattern="; "/>
</analyzer>
</fieldType>
The following sections describe the tokenizer factory classes included in this release of Solr.
For more information about Solr's tokenizers, see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters.
Standard Tokenizer
This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Delimiter characters
are discarded, with the following exceptions:
Periods (dots) that are not followed by whitespace are kept as part of the token, including Internet domain
names.
The "@" character is among the set of token-splitting punctuation, so email addresses are preserved asnot
single tokens.
Note that words are split at hyphens.
The Standard Tokenizer supports Unicode standard annex UAX#29 word boundaries with the following token types: <ALPHANUM>, <NUM>, <SOUTHEAST_ASIAN>, <IDEOGRAPHIC>, and <HIRAGANA>.
Factory class: solr.StandardTokenizerFactory
Arguments:
maxTokenLength: (integer, default 255) Solr ignores tokens that exceed the number of characters specified by maxTokenLength.
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>
In: "Please, email john.doe@foo.com by 03-09, re: m37-xq."
Out: "Please", "email", "john.doe", "foo.com", "by", "03", "09", "re", "m37", "xq"
Classic Tokenizer
The Classic Tokenizer preserves the same behavior as the Standard Tokenizer of Solr versions 3.1 and previous. It
does not use the Unicode standard annex UAX#29 word boundary rules that the Standard Tokenizer uses. This
tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Delimiter characters are
discarded, with the following exceptions:
Periods (dots) that are not followed by whitespace are kept as part of the token.
Words are split at hyphens, unless there is a number in the word, in which case the token is not split and the
numbers and hyphen(s) are preserved.
Recognizes Internet domain names and email addresses and preserves them as a single token.
Factory class: solr.ClassicTokenizerFactory
Arguments:
maxTokenLength: (integer, default 255) Solr ignores tokens that exceed the number of characters specified by maxTokenLength.
Example:
<analyzer>
<tokenizer class="solr.ClassicTokenizerFactory"/>
</analyzer>
In: "Please, email john.doe@foo.com by 03-09, re: m37-xq."
Out: "Please", "email", "john.doe@foo.com", "by", "03-09", "re", "m37-xq"
Keyword Tokenizer
This tokenizer treats the entire text field as a single token.
Factory class: solr.KeywordTokenizerFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>
In: "Please, email john.doe@foo.com by 03-09, re: m37-xq."
Out: "Please, email john.doe@foo.com by 03-09, re: m37-xq."
Letter Tokenizer
This tokenizer creates tokens from strings of contiguous letters, discarding all non-letter characters.
Factory class: solr.LetterTokenizerFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.LetterTokenizerFactory"/>
</analyzer>
In: "I can't."
Out: "I", "can", "t"
Lower Case Tokenizer
Tokenizes the input stream by delimiting at non-letters and then converting all letters to lowercase. Whitespace and
non-letters are discarded.
Factory class: solr.LowerCaseTokenizerFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.LowerCaseTokenizerFactory"/>
</analyzer>
In: "I just my iPhone!"LOVE
Out: "i", "just", "love", "my", "iphone"
N-Gram Tokenizer
Reads the field text and generates n-gram tokens of sizes in the given range.
Factory class: solr.NGramTokenizerFactory
Arguments:
minGramSize: (integer, default 1) The minimum n-gram size, must be > 0.
maxGramSize: (integer, default 2) The maximum n-gram size, must be >= minGramSize.
Example:
Default behavior. Note that this tokenizer operates over the whole field. It does not break the field at whitespace. As
a result, the space character is included in the encoding.
<analyzer>
<tokenizer class="solr.NGramTokenizerFactory"/>
</analyzer>
In: "hey man"
Out: "h", "e", "y", " ", "m", "a", "n", "he", "ey", "y ", " m", "ma", "an"
Example:
With an n-gram size range of 4 to 5:
<analyzer>
<tokenizer class="solr.NGramTokenizerFactory" minGramSize="4" maxGramSize="5"/>
</analyzer>
In: "bicycle"
Out: "bicy", "bicyc", "icyc", "icycl", "cycl", "cycle", "ycle"
Edge N-Gram Tokenizer
Reads the field text and generates edge n-gram tokens of sizes in the given range.
Factory class: solr.EdgeNGramTokenizerFactory
Arguments:
minGramSize: (integer, default is 1) The minimum n-gram size, must be > 0.
maxGramSize: (integer, default is 1) The maximum n-gram size, must be >= minGramSize.
side: ("front" or "back", default is "front") Whether to compute the n-grams from the beginning (front) of the text or
from the end (back).
Example:
Default behavior (min and max default to 1):
<analyzer>
<tokenizer class="solr.EdgeNGramTokenizerFactory"/>
</analyzer>
In: "babaloo"
Out: "b"
Example:
Edge n-gram range of 2 to 5
<analyzer>
<tokenizer class="solr.EdgeNGramTokenizerFactory" minGramSize="2" maxGramSize="5"/>
</analyzer>
In: "babaloo"
Out:"ba", "bab", "baba", "babal"
Example:
Edge n-gram range of 2 to 5, from the back side:
<analyzer>
<tokenizer class="solr.EdgeNGramTokenizerFactory" minGramSize="2" maxGramSize="5"
side="back"/>
</analyzer>
In: "babaloo"
Out: "oo", "loo", "aloo", "baloo"
ICU Tokenizer
This tokenizer processes multilingual text and tokenizes it appropriately based on its script attribute.
You can customize this tokenizer's behavior by specifying per-script rule files. To add per-script rules, add a rulefiles argument, which should contain a comma-separated list of code:rulefile pairs in the following format: four-letter ISO 15924 script code, followed by a colon, then a resource path. For example, to specify rules for Latin (script code "Latn") and Cyrillic (script code "Cyrl"), you would enter Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi.
The default solr.ICUTokenizerFactory provides UAX#29 word break rules tokenization (like solr.StandardTokenizer), but also includes custom tailorings for Hebrew (specializing handling of double and single quotation
marks), and for syllable tokenization for Khmer, Lao, and Myanmar.
Factory class: solr.ICUTokenizerFactory
Arguments:
rulefiles: a comma-separated list of code:rulefile pairs in the following format: four-letter ISO 15924 script code, followed by a colon, then a resource path.
Example:
<analyzer>
<!-- no customization -->
<tokenizer class="solr.ICUTokenizerFactory"/>
</analyzer>
<analyzer>
<tokenizer class="solr.ICUTokenizerFactory"
rulefiles="Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi"
/>
</analyzer>
Path Hierarchy Tokenizer
This tokenizer creates synonyms from file path hierarchies.
Factory class: solr.PathHierarchyTokenizerFactory
Arguments:
delimiter: (character, no default) You can specify the file path delimiter and replace it with a delimiter you
provide. This can be useful for working with backslash delimiters.
replace: (character, no default) Specifies the delimiter character Solr uses in the tokenized output.
Example:
<fieldType name="text_path" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="\" replace="/"/>
</analyzer>
</fieldType>
In: "c:\usr\local\apache"
Out: "c:", "c:/usr", "c:/usr/local", "c:/usr/local/apache"
Regular Expression Pattern Tokenizer
This tokenizer uses a Java regular expression to break the input text stream into tokens. The expression provided
by the pattern argument can be interpreted either as a delimiter that separates tokens, or to match patterns that
should be extracted from the text as tokens.
See the Javadocs for java.util.regex.Pattern for more information on Java regular expression syntax.
Factory class: solr.PatternTokenizerFactory
Arguments:
pattern: (Required) The regular expression, as defined by java.util.regex.Pattern.
group: (Optional, default -1) Specifies which regex group to extract as the token(s). The value -1 means the regex
should be treated as a delimiter that separates tokens. Non-negative group numbers (>= 0) indicate that character
sequences matching that regex group should be converted to tokens. Group zero refers to the entire regex, groups
greater than zero refer to parenthesized sub-expressions of the regex, counted from left to right.
Example:
A comma separated list. Tokens are separated by a sequence of zero or more spaces, a comma, and zero or more
spaces.
<analyzer>
<tokenizer class="solr.PatternTokenizerFactory" pattern="\s*,\s*"/>
</analyzer>
In: "fee,fie, foe , fum, foo"
Out: "fee", "fie", "foe", "fum", "foo"
Example:
Extract simple, capitalized words. A sequence of at least one capital letter followed by zero or more letters of either
case is extracted as a token.
<analyzer>
<tokenizer class="solr.PatternTokenizerFactory" pattern="\[A-Z\]\[A-Za-z\]"
group="0"/>
</analyzer>
In: "Hello. My name is Inigo Montoya. You killed my father. Prepare to die."
Out: "Hello", "My", "Inigo", "Montoya", "You", "Prepare"
Example:
Extract part numbers which are preceded by "SKU", "Part" or "Part Number", case sensitive, with an optional
semi-colon separator. Part numbers must be all numeric digits, with an optional hyphen. Regex capture groups are
numbered by counting left parenthesis from left to right. Group 3 is the subexpression "[0-9-]+", which matches one
or more digits or hyphens.
<analyzer>
<tokenizer class="solr.PatternTokenizerFactory"
pattern="(SKU|Part(\sNumber)?):?\s(\[0-9-\]+)" group="3"/>
</analyzer>
In: "SKU: 1234, Part Number 5678, Part: 126-987"
Out: "1234", "5678", "126-987"
UAX29 URL Email Tokenizer
This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Delimiter characters
are discarded, with the following exceptions:
Periods (dots) that are not followed by whitespace are kept as part of the token.
Words are split at hyphens, unless there is a number in the word, in which case the token is not split and the
numbers and hyphen(s) are preserved.
Recognizes top-level Internet domain names (validated against the white list in the IANA Root Zone Database when the tokenizer was generated); email addresses; file://, http(s)://, and ftp:// addresses; IPv4 and IPv6 addresses; and preserves them as a single token.
The UAX29 URL Email Tokenizer supports Unicode standard annex UAX#29 word boundaries with the following token types: <ALPHANUM>, <NUM>, <URL>, <EMAIL>, <SOUTHEAST_ASIAN>, <IDEOGRAPHIC>, and <HIRAGANA>.
Factory class: solr.UAX29URLEmailTokenizerFactory
Arguments:
maxTokenLength: (integer, default 255) Solr ignores tokens that exceed the number of characters specified by maxTokenLength.
Example:
<analyzer>
<tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
</analyzer>
In: "Visit http://accarol.com/contact.htm?from=external&a=10 or e-mail bob.cratchet@accarol.com"
Out: "Visit", "http://accarol.com/contact.htm?from=external&a=10", "or", "email", "bob.cratchet@accarol.com"
White Space Tokenizer
Simple tokenizer that splits the text stream on whitespace and returns sequences of non-whitespace characters as
tokens. Note that any punctuation will be included in the tokenization.
Factory class: solr.WhitespaceTokenizerFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
In: "To be, or what?"
Out: "To", "be,", "or", "what?"
Related Topics
TokenizerFactories
Filter Descriptions
Filters discussed in this section:
ASCII Folding Filter
Beider-Morse Filter
Classic Filter
Common Grams Filter
Collation Key Filter
Edge N-Gram Filter
English Minimal Stem Filter
Hunspell Stem Filter
Hyphenated Words Filter
ICU Folding Filter
ICU Normalizer 2 Filter
ICU Transform Filter
Keep Words Filter
KStem Filter
Length Filter
Lower Case Filter
Managed Stop Filter
Managed Synonym Filter
N-Gram Filter
Numeric Payload Token Filter
Pattern Replace Filter
Phonetic Filter
Porter Stem Filter
Position Filter Factory
Remove Duplicates Token Filter
Reversed Wildcard Filter
Shingle Filter
Snowball Porter Stemmer Filter
Standard Filter
Stop Filter
Synonym Filter
Token Offset Payload Filter
Trim Filter
Type As Payload Filter
Type Token Filter
Word Delimiter Filter
You configure each filter with a <filter> element in schema.xml as a child of <analyzer>, following the <tokenizer> element. Filter definitions should follow a tokenizer or another filter definition because they take a TokenStream as input. For example:
<fieldType name="text" class="solr.TextField">
<analyzer type="index">
<tokenizer
class="solr.StandardTokenizerFactory"/>
<filter
class="solr.LowerCaseFilterFactory"/>...
</analyzer>
</fieldType>
The class attribute names a factory class that will instantiate a filter object as needed. Filter factory classes must implement the org.apache.solr.analysis.TokenFilterFactory interface. Like tokenizers, filters are also instances of TokenStream and thus are producers of tokens. Unlike tokenizers, filters also consume tokens from a TokenStream. This allows you to mix and match filters, in any order you prefer, downstream of a tokenizer.
Arguments may be passed to filter factories to modify their behavior by setting attributes on the <filter> element. For example:
<fieldType name="semicolonDelimited"
class="solr.TextField">
<analyzer type="query">
<tokenizer class="solr.PatternTokenizerFactory"
pattern="; " />
<filter class="solr.LengthFilterFactory" min="2"
max="7"/>
</analyzer>
</fieldType>
The following sections describe the filter factories that are included in this release of Solr.
For more information about Solr's filters, see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters.
ASCII Folding Filter
This filter converts alphabetic, numeric, and symbolic Unicode characters which are not in the Basic Latin Unicode
block (the first 127 ASCII characters) to their ASCII equivalents, if one exists. This filter converts characters from the
following Unicode blocks:
C1 Controls and Latin-1 Supplement (PDF)
Latin Extended-A (PDF)
Latin Extended-B (PDF)
Latin Extended Additional (PDF)
Latin Extended-C (PDF)
Latin Extended-D (PDF)
IPA Extensions (PDF)
Phonetic Extensions (PDF)
Phonetic Extensions Supplement (PDF)
General Punctuation (PDF)
Superscripts and Subscripts (PDF)
Enclosed Alphanumerics (PDF)
Dingbats (PDF)
Supplemental Punctuation (PDF)
Alphabetic Presentation Forms (PDF)
Halfwidth and Fullwidth Forms (PDF)
Factory class: solr.ASCIIFoldingFilterFactory
Arguments: None
Example:
<analyzer>
<filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
In: "á" (Unicode character 00E1)
Out: "a" (ASCII character 97)
Beider-Morse Filter
Implements the Beider-Morse Phonetic Matching (BMPM) algorithm, which allows identification of similar names,
even if they are spelled differently or in different languages. More information about how this works is available in
the section on Phonetic Matching.
Factory class: solr.BeiderMorseFilterFactory
Arguments:
nameType: Types of names. Valid values are GENERIC, ASHKENAZI, or SEPHARDIC. If not processing
Ashkenazi or Sephardic names, use GENERIC.
ruleType: Types of rules to apply. Valid values are APPROX or EXACT.
concat: Defines if multiple possible matches should be combined with a pipe ("|").
languageSet: The language set to use. The value "auto" will allow the Filter to identify the language, or a
comma-separated list can be supplied.
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.BeiderMorseFilterFactory" nameType="GENERIC" ruleType="APPROX"
concat="true" languageSet="auto">
</filter>
</analyzer>
Classic Filter
This filter takes the output of the Classic Tokenizer and strips periods from acronyms and "'s" from possessives.
Factory class: solr.ClassicFilterFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.ClassicTokenizerFactory"/>
<filter class="solr.ClassicFilterFactory"/>
</analyzer>
In: "I.B.M. cat's can't"
Tokenizer to Filter: "I.B.M", "cat's", "can't"
Out: "IBM", "cat", "can't"
Common Grams Filter
This filter creates word shingles by combining common tokens such as stop words with regular tokens. This is useful
for creating phrase queries containing common words, such as "the cat." Solr normally ignores stop words in queried
phrases, so searching for "the cat" would return all matches for the word "cat."
Factory class: solr.CommonGramsFilterFactory
Arguments:
words: (a common word file in .txt format) Provide the name of a common word file, such as stopwords.txt.
format: (optional) If the stopwords list has been formatted for Snowball, you can specify format="snowball" so Solr can read the stopwords file.
ignoreCase: (boolean) If true, the filter ignores the case of words when comparing them to the common word file.
The default is false.
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.CommonGramsFilterFactory" words="stopwords.txt"
ignoreCase="true"/>
</analyzer>
In: "the Cat"
Tokenizer to Filter: "the", "Cat"
Out: "the_cat"
Collation Key Filter
Collation allows sorting of text in a language-sensitive way. It is usually used for sorting, but can also be used with
advanced searches. We've covered this in much more detail in the section on Unicode Collation.
Edge N-Gram Filter
This filter generates edge n-gram tokens of sizes within the given range.
Factory class: solr.EdgeNGramFilterFactory
Arguments:
minGramSize: (integer, default 1) The minimum gram size.
maxGramSize: (integer, default 1) The maximum gram size.
Example:
Default behavior.
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.EdgeNGramFilterFactory"/>
</analyzer>
In: "four score and twenty"
Tokenizer to Filter: "four", "score", "and", "twenty"
Out: "f", "s", "a", "t"
Example:
A range of 1 to 4.
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="4"/>
</analyzer>
In: "four score"
Tokenizer to Filter: "four", "score"
Out: "f", "fo", "fou", "four", "s", "sc", "sco", "scor"
Example:
A range of 4 to 6.
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="4" maxGramSize="6"/>
</analyzer>
In: "four score and twenty"
Tokenizer to Filter: "four", "score", "and", "twenty"
Out: "four", "scor", "score", "twen", "twent", "twenty"
English Minimal Stem Filter
This filter stems plural English words to their singular form.
Factory class: solr.EnglishMinimalStemFilterFactory
Arguments: None
Example:
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory "/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>
</analyzer>
In: "dogs cats"
Tokenizer to Filter: "dogs", "cats"
Out: "dog", "cat"
Hunspell Stem Filter
The Hunspell Stem Filter provides support for several languages. You must provide the dictionary (.dic) and rules (.aff) files for each language you wish to use with the Hunspell Stem Filter. You can download those language files here. Be aware that your results will vary widely based on the quality of the provided dictionary and rules files. For
example, some languages have only a minimal word list with no morphological information. On the other hand, for
languages that have no stemmer but do have an extensive dictionary file, the Hunspell stemmer may be a good
choice.
Factory class: solr.HunspellStemFilterFactory
Arguments:
dictionary: (required) The path of a dictionary file.
affix: (required) The path of a rules file.
ignoreCase: (boolean) Controls whether matching is case sensitive or not. The default is false.
strictAffixParsing: (boolean) Controls whether the affix parsing is strict or not. If true, an error while reading an affix rule causes a ParseException; otherwise it is ignored. The default is true.
Example:
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.HunspellStemFilterFactory"
dictionary="en_GB.dic"
affix="en_GB.aff"
ignoreCase="true"
strictAffixParsing="true" />
</analyzer>
In: "jump jumping jumped"
Tokenizer to Filter: "jump", "jumping", "jumped"
Out: "jump", "jump", "jump"
Hyphenated Words Filter
This filter reconstructs hyphenated words that have been tokenized as two tokens because of a line break or other
intervening whitespace in the field text. If a token ends with a hyphen, it is joined with the following token and the
hyphen is discarded. Note that for this filter to work properly, the upstream tokenizer must not remove trailing
hyphen characters. This filter is generally only useful at index time.
Factory class: solr.HyphenatedWordsFilterFactory
Arguments: None
Example:
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.HyphenatedWordsFilterFactory"/>
</analyzer>
In: "A hyphen- ated word"
Tokenizer to Filter: "A", "hyphen-", "ated", "word"
Out: "A", "hyphenated", "word"
ICU Folding Filter
This filter is a custom Unicode normalization form that applies the foldings specified in Unicode Technical Report 30 in addition to the NFKC_Casefold normalization form as described in ICU Normalizer 2 Filter. This filter is a better substitute for the combined behavior of the ASCII Folding Filter, Lower Case Filter, and ICU Normalizer 2 Filter.
To use this filter, see solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your solr_home/lib.
Factory class: solr.ICUFoldingFilterFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ICUFoldingFilterFactory"/>
</analyzer>
For detailed information on this normalization form, see http://www.unicode.org/reports/tr30/tr30-4.html.
ICU Normalizer 2 Filter
This filter factory normalizes text according to one of five Unicode Normalization Forms as described in Unicode
:Standard Annex #15
NFC: (name="nfc" mode="compose") Normalization Form C, canonical decomposition, followed by canonical composition
NFD: (name="nfc" mode="decompose") Normalization Form D, canonical decomposition
NFKC: (name="nfkc" mode="compose") Normalization Form KC, compatibility decomposition, followed by canonical composition
NFKD: (name="nfkc" mode="decompose") Normalization Form KD, compatibility decomposition
NFKC_Casefold: (name="nfkc_cf" mode="compose") Normalization Form KC, with additional Unicode case folding. Using the ICU Normalizer 2 Filter with this form is a better-performing substitute for the Lower Case Filter and NFKC normalization.
Factory class: solr.ICUNormalizer2FilterFactory
Arguments:
name: (string) The name of the normalization form; , , , , nfc nfd nfkc nfkd nfkc_cf
mode: (string) The mode of Unicode character composition and decomposition; or compose decompose
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ICUNormalizer2FilterFactory" name="nkc_cf" mode="compose"/>
</analyzer>
For detailed information about these Unicode Normalization Forms, see http://unicode.org/reports/tr15/.
To use this filter, see solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your solr_home/lib.
ICU Transform Filter
This filter applies ICU Transforms to text. This filter supports only ICU System Transforms. Custom rule sets are not supported.
Factory class: solr.ICUTransformFilterFactory
Arguments:
id: (string) The identifier for the ICU System Transform you wish to apply with this filter. For a full list of ICU System
Transforms, see http://demo.icu-project.org/icu-bin/translit?TEMPLATE_FILE=data/translit_rule_main.html.
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
</analyzer>
For detailed information about ICU Transforms, see http://userguide.icu-project.org/transforms/general.
To use this filter, see solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your solr_home/lib.
Keep Words Filter
This filter discards all tokens except those that are listed in the given word list. This is the inverse of the Stop Words
Filter. This filter can be useful for building specialized indices for a constrained set of terms.
Factory class: solr.KeepWordFilterFactory
Arguments:
words: (required) Path of a text file containing the list of keep words, one per line. Blank lines and lines that begin
with "#" are ignored. This may be an absolute path, or a simple filename in the Solr config directory.
ignoreCase: (true/false) If true, then comparisons are done case-insensitively. If this argument is true, then the words file is assumed to contain only lowercase words. The default is false.
Example:
Where keepwords.txt contains:
happy
funny
silly
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.KeepWordFilterFactory" words="keepwords.txt"/>
</analyzer>
In: "Happy, sad or funny"
Tokenizer to Filter: "Happy", "sad", "or", "funny"
Out: "funny"
Example:
Same keepwords.txt, case insensitive:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.KeepWordFilterFactory" words="keepwords.txt" ignoreCase="true"/>
</analyzer>
In: "Happy, sad or funny"
Tokenizer to Filter: "Happy", "sad", "or", "funny"
Out: "Happy", "funny"
Example:
Using LowerCaseFilterFactory before filtering for keep words, no ignoreCase flag.
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeepWordFilterFactory" words="keepwords.txt"/>
</analyzer>
In: "Happy, sad or funny"
Tokenizer to Filter: "Happy", "sad", "or", "funny"
Filter to Filter: "happy", "sad", "or", "funny"
Out: "happy", "funny"
KStem Filter
KStem is an alternative to the Porter Stem Filter for developers looking for a less aggressive stemmer. KStem was
written by Bob Krovetz, ported to Lucene by Sergio Guzman-Lara (UMASS Amherst). This stemmer is only
appropriate for English language text.
Factory class: solr.KStemFilterFactory
Arguments: None
Example:
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory "/>
<filter class="solr.KStemFilterFactory"/>
</analyzer>
In: "jump jumping jumped"
Tokenizer to Filter: "jump", "jumping", "jumped"
Out: "jump", "jump", "jump"
Length Filter
This filter passes tokens whose length falls within the min/max limit specified. All other tokens are discarded.
Factory class: solr.LengthFilterFactory
Arguments:
min: (integer, required) Minimum token length. Tokens shorter than this are discarded.
max: (integer, required, must be >= min) Maximum token length. Tokens longer than this are discarded.
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LengthFilterFactory" min="3" max="7"/>
</analyzer>
In: "turn right at Albuquerque"
Tokenizer to Filter: "turn", "right", "at", "Albuquerque"
Out: "turn", "right"
Lower Case Filter
Converts any uppercase letters in a token to the equivalent lowercase token. All other characters are left
unchanged.
Factory class: solr.LowerCaseFilterFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
In: "Down With CamelCase"
Tokenizer to Filter: "Down", "With", "CamelCase"
Out: "down", "with", "camelcase"
Managed Stop Filter
This is a specialized version of the Stop Words Filter Factory that uses a set of stop words that are managed from a REST API.
Arguments:
managed: The name that should be used for this set of stop words in the managed REST API.
Example:
With this configuration the set of words is named "english" and can be managed via /solr/[collection]/schema/analysis/stopwords/english
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ManagedStopFilterFactory" managed="english"/>
</analyzer>
See Stop Filter for example input/output.
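For illustration only (the host, port, collection name, and words below are assumptions, not values defined by this configuration), the managed word set can be read and extended over HTTP; changes typically require a core reload before analyzers use them:

curl "http://localhost:8983/solr/collection1/schema/analysis/stopwords/english"
curl -X PUT -H 'Content-type:application/json' --data-binary '["foo","bar"]' "http://localhost:8983/solr/collection1/schema/analysis/stopwords/english"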
Managed Synonym Filter
This is a specialized version of the Synonym Filter Factory that uses a mapping of synonyms that is managed from a REST API.
Arguments:
managed: The name that should be used for this mapping of synonyms in the managed REST API.
Example:
With this configuration the set of mappings is named "english" and can be managed via /solr/[collection]/schema/analysis/synonyms/english
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ManagedSynonymFilterFactory" managed="english"/>
</analyzer>
See Synonym Filter for example input/output.
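As with the managed stop words, the synonym mapping can be read and updated over HTTP. The sketch below assumes a collection named collection1 on the default port; the PUT body maps a term to its synonyms, and changes typically require a core reload to take effect:

curl "http://localhost:8983/solr/collection1/schema/analysis/synonyms/english"
curl -X PUT -H 'Content-type:application/json' --data-binary '{"mad":["angry","upset"]}' "http://localhost:8983/solr/collection1/schema/analysis/synonyms/english"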
N-Gram Filter
Generates n-gram tokens of sizes in the given range. Note that tokens are ordered by position and then by gram
size.
Factory class: solr.NGramFilterFactory
Arguments:
minGramSize: (integer, default 1) The minimum gram size.
maxGramSize: (integer, default 2) The maximum gram size.
Example:
Default behavior.
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.NGramFilterFactory"/>
</analyzer>
In: "four score"
Tokenizer to Filter: "four", "score"
Out: "f", "o", "u", "r", "fo", "ou", "ur", "s", "c", "o", "r", "e", "sc", "co", "or", "re"
Example:
A range of 1 to 4.
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="4"/>
</analyzer>
In: "four score"
Tokenizer to Filter: "four", "score"
Out: "f", "fo", "fou", "four", "s", "sc", "sco", "scor"
Example:
A range of 3 to 5.
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="5"/>
</analyzer>
In: "four score"
Tokenizer to Filter: "four", "score"
Out: "fou", "four", "our", "sco", "scor", "score", "cor", "core", "ore"
Numeric Payload Token Filter
This filter adds a numeric floating point payload value to tokens that match a given type. Refer to the Javadoc for the org.apache.lucene.analysis.Token class for more information about token types and payloads.
Factory class: solr.NumericPayloadTokenFilterFactory
Arguments:
payload: (required) A floating point value that will be added to all matching tokens.
typeMatch: (required) A token type name string. Tokens with a matching type name will have their payload set to
the above floating point value.
Example:
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.NumericPayloadTokenFilterFactory" payload="0.75"
typeMatch="word"/>
</analyzer>
In: "bing bang boom"
Tokenizer to Filter: "bing", "bang", "boom"
Out: "bing"[0.75], "bang"[0.75], "boom"[0.75]
Pattern Replace Filter
This filter applies a regular expression to each token and, for those that match, substitutes the given replacement
string in place of the matched pattern. Tokens which do not match are passed through unchanged.
Factory class: solr.PatternReplaceFilterFactory
Arguments:
pattern: (required) The regular expression to test against each token, as per java.util.regex.Pattern.
replacement: (required) A string to substitute in place of the matched pattern. This string may contain references
to capture groups in the regex pattern. See the Javadoc for java.util.regex.Matcher.
replace: ("all" or "first", default "all") Indicates whether all occurrences of the pattern in the token should be
replaced, or only the first.
Example:
Simple string replace:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="cat" replacement="dog"/>
</analyzer>
In: "cat concatenate catycat"
Tokenizer to Filter: "cat", "concatenate", "catycat"
Out: "dog", "condogenate", "dogydog"
Example:
String replacement, first occurrence only:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="cat" replacement="dog"
replace="first"/>
</analyzer>
In: "cat concatenate catycat"
Tokenizer to Filter: "cat", "concatenate", "catycat"
Out: "dog", "condogenate", "dogycat"
Example:
More complex pattern with capture group reference in the replacement. Tokens that start with non-numeric
characters and end with digits will have an underscore inserted before the numbers. Otherwise the token is passed
through.
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="(\D+)(\d+)$"
replacement="$1_$2"/>
</analyzer>
In: "cat foo1234 9987 blah1234foo"
Tokenizer to Filter: "cat", "foo1234", "9987", "blah1234foo"
Out: "cat", "foo_1234", "9987", "blah1234foo"
Phonetic Filter
This filter creates tokens using one of the phonetic encoding algorithms in the org.apache.commons.codec.language package.
Factory class: solr.PhoneticFilterFactory
Arguments:
encoder: (required) The name of the encoder to use. The encoder name must be one of the following (case insensitive): "DoubleMetaphone", "Metaphone", "Soundex", "RefinedSoundex", "Caverphone", or "ColognePhonetic".
inject: (true/false) If true (the default), then new phonetic tokens are added to the stream. Otherwise, tokens are
replaced with the phonetic equivalent. Setting this to false will enable phonetic matching, but the exact spelling of
the target word may not match.
maxCodeLength: (integer) The maximum length of the code to be generated by the Metaphone or Double
Metaphone encoders.
Example:
Default behavior for DoubleMetaphone encoding.
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone"/>
</analyzer>
In: "four score and twenty"
Tokenizer to Filter: "four"(1), "score"(2), "and"(3), "twenty"(4)
Out: "four"(1), "FR"(1), "score"(2), "SKR"(2), "and"(3), "ANT"(3), "twenty"(4), "TNT"(4)
The phonetic tokens have a position increment of 0, which indicates that they are at the same position as the token
they were derived from (immediately preceding).
Example:
Discard original token.
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone"
inject="false"/>
</analyzer>
In: "four score and twenty"
Tokenizer to Filter: "four"(1), "score"(2), "and"(3), "twenty"(4)
Out: "FR"(1), "SKR"(2), "ANT"(3), "TWNT"(4)
Example:
Default Soundex encoder.
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.PhoneticFilterFactory" encoder="Soundex"/>
</analyzer>
In: "four score and twenty"
Tokenizer to Filter: "four"(1), "score"(2), "and"(3), "twenty"(4)
Out: "four"(1), "F600"(1), "score"(2), "S600"(2), "and"(3), "A530"(3), "twenty"(4), "T530"(4)
Porter Stem Filter
This filter applies the Porter Stemming Algorithm for English. The results are similar to using the Snowball Porter Stemmer with the language="English" argument. But this stemmer is coded directly in Java and is not based on Snowball. It does not accept a list of protected words and is only appropriate for English language text. However, it has been benchmarked as four times faster than the English Snowball stemmer, so it can provide a performance enhancement.
Factory class: solr.PorterStemFilterFactory
Arguments: None
Example:
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory "/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
In: "jump jumping jumped"
Tokenizer to Filter: "jump", "jumping", "jumped"
Out: "jump", "jump", "jump"
Position Filter Factory
This filter sets the position increment values of all tokens in a token stream except the first, which retains its original position increment value. This filter has been deprecated and will be removed in Solr 5.
Factory class: solr.PositionFilterFactory
Arguments:
positionIncrement: (integer, default = 0) The position increment value to apply to all tokens in a token stream
except the first.
Example:
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.PositionFilterFactory" positionIncrement="1"/>
</analyzer>
In: "hello world"
Tokenizer to Filter: "hello", "world"
Out: "hello" (token position 1), "world" (token position 2)
Remove Duplicates Token Filter
The filter removes duplicate tokens in the stream. Tokens are considered to be duplicates if they have the same text
and position values.
Factory class: solr.RemoveDuplicatesTokenFilterFactory
Arguments: None
Example:
One example of where RemoveDuplicatesTokenFilterFactory is useful is in situations where a synonym file used in conjunction with a stemmer causes some synonyms to be reduced to the same stem. Consider the following entry from a synonyms.txt file:
Television, Televisions, TV, TVs
When used in the following configuration:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
In: "Watch TV"
Tokenizer to Synonym Filter: "Watch"(1) "TV"(2)
Synonym Filter to Stem Filter: "Watch"(1) "Television"(2) "Televisions"(2) "TV"(2) "TVs"(2)
Stem Filter to Remove Dups Filter: "Watch"(1) "Television"(2) "Television"(2) "TV"(2) "TV"(2)
Out: "Watch"(1) "Television"(2) "TV"(2)
Reversed Wildcard Filter
This filter reverses tokens to provide faster leading wildcard and prefix queries. Tokens without wildcards are not
reversed.
Factory class: solr.ReversedWildcardFilterFactory
Arguments:
withOriginal (boolean) If true, the filter produces both original and reversed tokens at the same positions. If
false, produces only reversed tokens.
maxPosAsterisk (integer, default = 2) The maximum position of the asterisk wildcard ('*') that triggers the reversal
of the query term. Terms with asterisks at positions above this value are not reversed.
maxPosQuestion (integer, default = 1) The maximum position of the question mark wildcard ('?') that triggers the
reversal of the query term. To reverse only pure suffix queries (queries with a single leading asterisk), set this to 0 and maxPosAsterisk to 1.
maxFractionAsterisk (float, default = 0.0) An additional parameter that triggers the reversal if asterisk ('*')
position is less than this fraction of the query token length.
minTrailing (integer, default = 2) The minimum number of trailing characters in a query token after the last
wildcard character. For good performance this should be set to a value larger than 1.
Example:
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
maxPosAsterisk="2" maxPosQuestion="1" minTrailing="2" maxFractionAsterisk="0"/>
</analyzer>
In: "*foo *bar"
Tokenizer to Filter: "*foo", "*bar"
Out: "oof*", "rab*"
Shingle Filter
This filter constructs shingles, which are token n-grams, from the token stream. It combines runs of tokens into a
single token.
Factory class: solr.ShingleFilterFactory
Arguments:
minShingleSize: (integer, default 2) The minimum number of tokens per shingle.
maxShingleSize: (integer, must be >= 2, default 2) The maximum number of tokens per shingle.
outputUnigrams: (true/false) If true (the default), then each individual token is also included at its original position.
outputUnigramsIfNoShingles: (true/false) If true, then individual tokens will be output if no shingles are possible. The default is false.
tokenSeparator: (string, default is " ") The default string to use when joining adjacent tokens to form a shingle.
Example:
Default behavior.
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ShingleFilterFactory"/>
</analyzer>
In: "To be, or what?"
Tokenizer to Filter: "To"(1), "be"(2), "or"(3), "what"(4)
Out: "To"(1), "To be"(1), "be"(2), "be or"(2), "or"(3), "or what"(3), "what"(4)
Example:
A shingle size of four, do not include original token.
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ShingleFilterFactory" maxShingleSize="4"
outputUnigrams="false"/>
</analyzer>
In: "To be, or not to be."
Tokenizer to Filter: "To"(1), "be"(2), "or"(3), "not"(4), "to"(5), "be"(6)
Out: "To be"(1), "To be or"(1), "To be or not"(1), "be or"(2), "be or not"(2), "be or not to"(2), "or not"(3), "or not to"(3),
"or not to be"(3), "not to"(4), "not to be"(4), "to be"(5)
Snowball Porter Stemmer Filter
This filter factory instantiates a language-specific stemmer generated by Snowball. Snowball is a software package
that generates pattern-based word stemmers. This type of stemmer is not as accurate as a table-based stemmer,
but is faster and less complex. Table-driven stemmers are labor intensive to create and maintain and so are typically
commercial products.
Solr contains Snowball stemmers for Armenian, Basque, Catalan, Danish, Dutch, English, Finnish, French, German,
Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish and Turkish. For more
information on Snowball, visit .http://snowball.tartarus.org/
StopFilterFactory, CommonGramsFilterFactory, and CommonGramsQueryFilterFactory can optionally read stopwords in Snowball format (specify format="snowball" in the configuration of those FilterFactories).
Factory class: solr.SnowballPorterFilterFactory
Arguments:
language: (default "English") The name of a language, used to select the appropriate Porter stemmer to use. Case
is significant. This string is used to select a package name in the "org.tartarus.snowball.ext" class hierarchy.
protected: Path of a text file containing a list of protected words, one per line. Protected words will not be
stemmed. Blank lines and lines that begin with "#" are ignored. This may be an absolute path, or a simple file name
in the Solr config directory.
Example:
Default behavior:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SnowballPorterFilterFactory"/>
</analyzer>
In: "flip flipped flipping"
Tokenizer to Filter: "flip", "flipped", "flipping"
Out: "flip", "flip", "flip"
Example:
French stemmer, English words:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="French"/>
</analyzer>
In: "flip flipped flipping"
Tokenizer to Filter: "flip", "flipped", "flipping"
Out: "flip", "flipped", "flipping"
Example:
Spanish stemmer, Spanish words:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>
</analyzer>
In: "cante canta"
Tokenizer to Filter: "cante", "canta"
Out: "cant", "cant"
Standard Filter
This filter removes dots from acronyms and the substring "'s" from the end of tokens. This filter depends on the
tokens being tagged with the appropriate term-type to recognize acronyms and words with apostrophes.
Factory class: solr.StandardFilterFactory
Arguments: None
This filter is no longer operational in Solr when the luceneMatchVersion (in solrconfig.xml) is higher than "3.1".
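No sample configuration is given for this filter in this section; as a sketch only, it is declared like any other filter, for instance after the Classic Tokenizer (an illustrative choice, since that tokenizer emits the term types this filter relies on). With a luceneMatchVersion higher than "3.1" the filter passes tokens through unchanged.

<analyzer>
<tokenizer class="solr.ClassicTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
</analyzer>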
Stop Filter
This filter discards, or stops analysis of, tokens that are on the given stop words list. A standard stop words list is included in the Solr config directory, named stopwords.txt, which is appropriate for typical English language text.
Factory class: solr.StopFilterFactory
Arguments:
words: (optional) The path to a file that contains a list of stop words, one per line. Blank lines and lines that begin
with "#" are ignored. This may be an absolute path, or path relative to the Solr config directory.
format: (optional) If the stopwords list has been formatted for Snowball, you can specify soformat="snowball"
Solr can read the stopwords file.
ignoreCase: (true/false, default false) Ignore case when testing for stop words. If true, the stop list should contain
lowercase words.
Example:
Case-sensitive matching, capitalized words not stopped. Token positions skip stopped words.
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt"/>
</analyzer>
In: "To be or what?"
Tokenizer to Filter: "To"(1), "be"(2), "or"(3), "what"(4)
Out: "To"(1), "what"(4)
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
</analyzer>
In: "To be or what?"
Tokenizer to Filter: "To"(1), "be"(2), "or"(3), "what"(4)
Out: "what"(4)
As of Solr 4.4, the enablePositionIncrements argument is no longer supported.
Synonym Filter
This filter does synonym mapping. Each token is looked up in the list of synonyms and if a match is found, then the
synonym is emitted in place of the token. The position value of the new tokens is set such that they all occur at the
same position as the original token.
Factory class: solr.SynonymFilterFactory
Arguments:
synonyms: (required) The path of a file that contains a list of synonyms, one per line. Blank lines and lines that
begin with "#" are ignored. This may be an absolute path, or path relative to the Solr config directory.There are two
ways to specify synonym :mappings
A comma-separated list of words. If the token matches any of the words, then all the words in the list are
substituted, which will include the original token.
Two comma-separated lists of words with the symbol "=>" between them. If the token matches any word on
the left, then the list on the right is substituted. The original token will not be included unless it is also in the
list on the right.
For the following examples, assume a synonyms file named mysynonyms.txt:
couch,sofa,divan
teh => the
huge,ginormous,humungous => large
small => tiny,teeny,weeny
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="mysynonyms.txt"/>
</analyzer>
In: "teh small couch"
Tokenizer to Filter: "teh"(1), "small"(2), "couch"(3)
Out: "the"(1), "tiny"(2), "teeny"(2), "weeny"(2), "couch"(3), "sofa"(3), "divan"(3)
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory "/>
<filter class="solr.SynonymFilterFactory" synonyms="mysynonyms.txt"/>
</analyzer>
In: "teh ginormous, humungous sofa"
Tokenizer to Filter: "teh"(1), "ginormous"(2), "humungous"(3), "sofa"(4)
Out: "the"(1), "large"(2), "large"(3), "couch"(4), "sofa"(4), "divan"(4)
Token Offset Payload Filter
This filter adds the numeric character offsets of the token as a payload value for that token.
Factory class: solr.TokenOffsetPayloadTokenFilterFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.TokenOffsetPayloadTokenFilterFactory"/>
</analyzer>
In: "bing bang boom"
Tokenizer to Filter: "bing", "bang", "boom"
Out: "bing"[0,4], "bang"[5,9], "boom"[10,14]
Trim Filter
This filter trims leading and/or trailing whitespace from tokens. Most tokenizers break tokens at whitespace, so this
filter is most often used for special situations.
Factory class: solr.TrimFilterFactory
Arguments: None
As of Solr 4.4, the updateOffsets argument is no longer supported.
Example:
The PatternTokenizerFactory configuration used here splits the input on simple commas; it does not remove whitespace.
<analyzer>
<tokenizer class="solr.PatternTokenizerFactory" pattern=","/>
<filter class="solr.TrimFilterFactory"/>
</analyzer>
In: "one, two , three ,four "
Tokenizer to Filter: "one", " two ", " three ", "four "
Out: "one", "two", "three", "four"
Type As Payload Filter
This filter adds the token's type, as an encoded byte sequence, as its payload.
Factory class: solr.TypeAsPayloadTokenFilterFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.TypeAsPayloadTokenFilterFactory"/>
</analyzer>
In: "Pay Bob's I.O.U."
Tokenizer to Filter: "Pay", "Bob's", "I.O.U."
Out: "Pay"[<ALPHANUM>], "Bob's"[<APOSTROPHE>], "I.O.U."[<ACRONYM>]
Type Token Filter
This filter blacklists or whitelists a specified list of token types, assuming the tokens have type metadata associated
with them. For example, the UAX29 URL Email Tokenizer emits "<URL>" and "<EMAIL>" typed tokens, as well as other types. This filter would allow you to pull out only e-mail addresses from text as tokens, if you wish.
Factory class: solr.TypeTokenFilterFactory
Arguments:
types: Defines the location of a file of types to filter.
useWhitelist: If true, the file defined in types should be used as an include list. If false, or undefined, the file defined in types is used as a blacklist.
Example:
<analyzer>
<filter class="solr.TypeTokenFilterFactory" types="stoptypes.txt"
useWhitelist="true"/>
</analyzer>
Word Delimiter Filter
This filter splits tokens at word delimiters. The rules for determining delimiters are determined as follows:
A change in case within a word: "CamelCase" -> "Camel", "Case". This can be disabled by setting splitOnCaseChange="0".
A transition from alpha to numeric characters or vice versa: "Gonzo5000" -> "Gonzo", "5000"; "4500XL" -> "4500", "XL". This can be disabled by setting splitOnNumerics="0".
Non-alphanumeric characters (discarded): "hot-spot" -> "hot", "spot"
A trailing "'s" is removed: "O'Reilly's" -> "O", "Reilly"
Any leading or trailing delimiters are discarded: "--hot-spot--" -> "hot", "spot"
Factory class: solr.WordDelimiterFilterFactory
Arguments:
As of Solr 4.4, the argument is no longer supported.enablePositionIncrements
generateWordParts: (integer, default 1) If non-zero, splits words at delimiters. For example: "CamelCase", "hot-spot" -> "Camel", "Case", "hot", "spot"
generateNumberParts: (integer, default 1) If non-zero, splits numeric strings at delimiters: "1947-32" -> "1947", "32"
splitOnCaseChange: (integer, default 1) If 0, words are not split on camel-case changes: "BugBlaster-XL" -> "BugBlaster", "XL". Example 1 below illustrates the default (non-zero) splitting behavior.
splitOnNumerics: (integer, default 1) If 0, don't split words on transitions from alpha to numeric: "FemBot3000" -> "Fem", "Bot3000"
catenateWords: (integer, default 0) If non-zero, maximal runs of word parts will be joined: "hot-spot-sensor's" -> "hotspotsensor"
catenateNumbers: (integer, default 0) If non-zero, maximal runs of number parts will be joined: "1947-32" -> "194732"
catenateAll: (0/1, default 0) If non-zero, runs of word and number parts will be joined: "Zap-Master-9000" -> "ZapMaster9000"
preserveOriginal: (integer, default 0) If non-zero, the original token is preserved: "Zap-Master-9000" -> "Zap-Master-9000", "Zap", "Master", "9000"
protected: (optional) The pathname of a file that contains a list of protected words that should be passed through
without splitting.
stemEnglishPossessive: (integer, default 1) If 1, strips the possessive "'s" from each subword.
Example:
Default behavior. The whitespace tokenizer is used here to preserve non-alphanumeric characters.
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory"/>
</analyzer>
In: "hot-spot RoboBlaster/9000 100XL"
Tokenizer to Filter: "hot-spot", "RoboBlaster/9000", "100XL"
Out: "hot", "spot", "Robo", "Blaster", "9000", "100", "XL"
Example:
Do not split on case changes, and do not generate number parts. Note that by not generating number parts, tokens
containing only numeric parts are ultimately discarded.
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateNumberParts="0"
splitOnCaseChange="0"/>
</analyzer>
In: "hot-spot RoboBlaster/9000 100-42"
Tokenizer to Filter: "hot-spot", "RoboBlaster/9000", "100-42"
Out: "hot", "spot", "RoboBlaster", "9000"
Example:
Concatenate word parts and number parts, but not word and number parts that occur in the same token.
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" catenateWords="1"
catenateNumbers="1"/>
</analyzer>
In: "hot-spot 100+42 XL40"
Tokenizer to Filter: "hot-spot"(1), "100+42"(2), "XL40"(3)
Out: "hot"(1), "spot"(2), "hotspot"(2), "100"(3), "42"(4), "10042"(4), "XL"(5), "40"(6)
Example:
Concatenate all. Word and/or number parts are joined together.
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" catenateAll="1"/>
</analyzer>
In: "XL-4000/ES"
Tokenizer to Filter: "XL-4000/ES"(1)
Out: "XL"(1), "4000"(2), "ES"(3), "XL4000ES"(3)
Example:
Using a protected words list that contains "AstroBlaster" and "XL-5000" (among others).
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" protected="protwords.txt"/>
</analyzer>
In: "FooBar AstroBlaster XL-5000 ==ES-34-"
Tokenizer to Filter: "FooBar", "AstroBlaster", "XL-5000", "==ES-34-"
Out: "FooBar", "FooBar", "AstroBlaster", "XL-5000", "ES", "34"
Related Topics
TokenFilterFactories
CharFilterFactories
Char Filter is a component that pre-processes input characters. Char Filters can be chained like Token Filters and placed in front of a Tokenizer. Char Filters can add, change, or remove characters while preserving the original character offsets to support features like highlighting.
Topics discussed in this section:
solr.MappingCharFilterFactory
solr.HTMLStripCharFilterFactory
solr.ICUNormalizer2CharFilterFactory
solr.PatternReplaceCharFilterFactory
Related Topics
solr.MappingCharFilterFactory
This filter creates org.apache.lucene.analysis.MappingCharFilter, which can be used for changing one character to another (for example, for normalizing é to e).
This filter requires specifying a mapping argument, which is the path and name of a file containing the mappings to perform.
Example:
<analyzer>
<charFilter class="solr.MappingCharFilterFactory"
mapping="mapping-FoldToASCII.txt"/>
<tokenizer ...>
[...]
</analyzer>
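The format of the mapping file itself is not shown above. As an illustration (these particular mappings are assumed for the example, not quoted from mapping-FoldToASCII.txt), each line maps a quoted source string to a quoted target string, and lines beginning with "#" are comments:

# Map some accented characters to their unaccented equivalents
"á" => "a"
"é" => "e"
"ö" => "o"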
solr.HTMLStripCharFilterFactory
This filter creates org.apache.solr.analysis.HTMLStripCharFilter. This Char Filter strips HTML
from the input stream and passes the result to another Char Filter or a Tokenizer.
This filter:
Removes HTML/XML tags while preserving other content.
Removes attributes within tags and supports optional attribute quoting.
Removes XML processing instructions, such as: <?foo bar?>
Removes XML comments.
Removes XML elements starting with <!>.
Removes contents of <script> and <style> elements.
Handles XML comments inside these elements (normal comment processing will not always work).
Replaces numeric character entity references like &#65; or &#x7f; with the corresponding character.
The terminating ';' is optional if the entity reference is at the end of the input; otherwise the terminating ';' is mandatory, to avoid false matches on something like "Alpha&Omega Corp".
Replaces all named character entity references with the corresponding character.
&nbsp; is replaced with a space instead of the 0xa0 character.
Newlines are substituted for block-level elements.
<CDATA> sections are recognized.
Inline tags, such as <b>, <i>, or <span> will be removed.
Uppercase character entities like quot, gt, lt and amp are recognized and handled as lowercase.
The input need not be an HTML document. The filter removes only constructs that look like HTML. If the input doesn't include anything that looks like HTML, the filter won't remove any input.
The table below presents examples of HTML stripping.

Input: my <a href="www.foo.bar">link</a>
Output: my link

Input: <br>hello<!--comment-->
Output: hello

Input: hello<script><!-- f('<!--internal--></script>'); --></script>
Output: hello

Input: if a<b then print a;
Output: if a<b then print a;

Input: hello <td height=22 nowrap align="left">
Output: hello

Input: a<b &#65 Alpha&Omega
Output: a<b A Alpha&Omega
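No configuration example is given for this char filter in this section. As a sketch, it is declared before the tokenizer in the same way as the other char filters (the Standard Tokenizer here is only an illustrative choice):

<analyzer>
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>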
solr.ICUNormalizer2CharFilterFactory
This filter performs pre-tokenization Unicode normalization using .ICU4J
Arguments:
name: A Unicode Normalization Form, one of nfc, nfkc, or nfkc_cf. Default is nfkc_cf.
mode: Either compose or decompose. Default is compose. Use decompose with name="nfc" or name="nfkc" to get NFD or NFKD, respectively.
filter: A UnicodeSet pattern. Codepoints outside the set are always left unchanged. Default is [] (the null set, no filtering - all codepoints are subject to normalization).
Example:
<analyzer>
<charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
<tokenizer ...>
[...]
</analyzer>
solr.PatternReplaceCharFilterFactory
This filter uses regular expressions to replace or change character patterns.
Arguments:
pattern: the regular expression pattern to apply to the incoming text.
replacement: the text to use to replace matching patterns.
You can configure this filter in schema.xml like this:
<analyzer>
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="([nN][oO]\.)\s*(\d+)" replacement="$1$2"/>
<tokenizer ...>
[...]
</analyzer>
The table below presents examples of regex-based pattern replacement:

Input: "see-ing looking"
Pattern: (\w+)(ing)
Replacement: $1
Output: "see-ing look"
Description: Removes "ing" from the end of a word.

Input: "see-ing looking"
Pattern: (\w+)ing
Replacement: $1
Output: "see-ing look"
Description: Same as above. The second set of parentheses can be omitted.

Input: "No.1 NO. no. 543"
Pattern: [nN][oO]\.\s*(\d+)
Replacement: #$1
Output: "#1 NO. #543"
Description: Replaces some string literals.

Input: "abc=1234=5678"
Pattern: (\w+)=(\d+)=(\d+)
Replacement: $3=$1=$2
Output: "5678=abc=1234"
Description: Changes the order of the groups.
Related Topics
CharFilterFactories
Language Analysis
This section contains information about tokenizers and filters
related to character set conversion or for use with specific
languages. For the European languages, tokenization is fairly
straightforward. Tokens are delimited by white space and/or a
relatively small set of punctuation characters. In other languages
the tokenization rules are often not so simple. Some European
languages may require special tokenization rules as well, such
as rules for decompounding German words.
For information about language detection at index time, see Detecting Languages During Indexing.
KeyWordMarkerFilterFactory
Protects words from being modified by stemmers. A customized protected word list may be specified with the
"protected" attribute in the schema. Any words in the protected word list will not be modified by any stemmer in Solr.
A sample Solr protwords.txt with comments can be found in the /solr/conf/ directory:
<fieldtype name="myfieldtype" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt" />
<filter class="solr.PorterStemFilterFactory" />
</analyzer>
</fieldtype>
StemmerOverrideFilterFactory
Overrides stemming algorithms by applying a custom mapping, then protecting these terms from being modified by
stemmers.
A customized mapping of words to stems, in a tab-separated file, can be specified to the "dictionary" attribute in the
schema. Words in this mapping will be stemmed to the stems from the file, and will not be further changed by any
stemmer.
A sample stemdict.txt with comments can be found in the Source Repository.
<fieldtype name="myfieldtype" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict.txt" />
<filter class="solr.PorterStemFilterFactory" />
</analyzer>
</fieldtype>
Dictionary Compound Word Token Filter
This filter splits, or decompounds, compound words into individual words using a dictionary of the component words.
Each input token is passed through unchanged. If it can also be decompounded into subwords, each subword is
also added to the stream at the same logical position.
Compound words are most commonly found in Germanic languages.
Factory class: solr.DictionaryCompoundWordTokenFilterFactory
Arguments:
dictionary: (required) The path of a file that contains a list of simple words, one per line. Blank lines and lines
that begin with "#" are ignored. This path may be an absolute path, or path relative to the Solr config directory.
minWordSize: (integer, default 5) Any token shorter than this is not decompounded.
minSubwordSize: (integer, default 2) Subwords shorter than this are not emitted as tokens.
maxSubwordSize: (integer, default 15) Subwords longer than this are not emitted as tokens.
onlyLongestMatch: (true/false) If true (the default), only the longest matching subwords will generate new tokens.
Example:
Assume that germanwords.txt contains at least the following words: dumm kopf donau dampf schiff
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.DictionaryCompoundWordTokenFilterFactory"
dictionary="germanwords.txt"/>
</analyzer>
In: "Donaudampfschiff dummkopf"
Tokenizer to Filter: "Donaudampfschiff"(1), "dummkopf"(2),
Out: "Donaudampfschiff"(1), "Donau"(1), "dampf"(1), "schiff"(1), "dummkopf"(2), "dumm"(2), "kopf"(2)
Unicode Collation
Unicode Collation is a language-sensitive method of sorting text that can also be used for advanced search
purposes.
Unicode Collation in Solr is fast, because all the work is done at index time.
Rather than specifying an analyzer within <fieldtype ... class="solr.TextField">, the solr.CollationField and solr.ICUCollationField field type classes provide this functionality. solr.ICUCollationField, which is backed by the ICU4J library, provides more flexible configuration, has more locales, is significantly faster, and requires less memory and less index space, since its keys are smaller than those produced by the JDK implementation that backs solr.CollationField.
solr.ICUCollationField is included in the Solr analysis-extras contrib - see solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your SOLR_HOME/lib in order to use it.
solr.ICUCollationField and solr.CollationField fields can be created in two ways:
Based upon a system collator associated with a Locale.
Based upon a tailored RuleBasedCollator ruleset.
Arguments for solr.ICUCollationField, specified as attributes within the <fieldtype> element:
Using a System collator:
locale: (required) RFC 3066 locale ID. See the ICU locale explorer for a list of supported locales.
strength: Valid values are primary, secondary, tertiary, quaternary, or identical. See Comparison Levels in ICU Collation Concepts for more information.
decomposition: Valid values are no or canonical. See Normalization in ICU Collation Concepts for more information.
Using a Tailored ruleset:
custom: (required) Path to a UTF-8 text file containing rules supported by the ICU RuleBasedCollator
strength: Valid values are primary, secondary, tertiary, quaternary, or identical. See Comparison Levels in ICU Collation Concepts for more information.
decomposition: Valid values are no or canonical. See Normalization in ICU Collation Concepts for more information.
CollationKeyFilterFactory and ICUCollationKeyFilterFactory are deprecated token filter implementations of the same functionality as solr.CollationField and solr.ICUCollationField, respectively. These classes will no longer be available in Solr 5.0.
Expert options:
alternate: Valid values are shifted or non-ignorable. Can be used to ignore punctuation/whitespace.
caseLevel: (true/false) If true, in combination with strength="primary", accents are ignored but case is taken into account. The default is false. See CaseLevel in ICU Collation Concepts for more information.
caseFirst: Valid values are lower or upper. Useful to control which is sorted first when case is not ignored.
numeric: (true/false) If true, digits are sorted according to numeric value, e.g. foobar-9 sorts before foobar-10. The default is false.
variableTop: Single character or contraction. Controls what is variable for alternate.
Sorting Text for a Specific Language
In this example, text is sorted according to the default German rules provided by ICU4J.
Locales are typically defined as a combination of language and country, but you can specify just the language if you
want. For example, if you specify "de" as the language, you will get sorting that works well for the German language.
If you specify "de" as the language and "CH" as the country, you will get German sorting specifically tailored for
Switzerland.
<!-- Define a field type for German collation -->
<fieldType name="collatedGERMAN" class="solr.ICUCollationField"
locale="de"
strength="primary" />
...
<!-- Define a field to store the German collated manufacturer names. -->
<field name="manuGERMAN" type="collatedGERMAN" indexed="false" stored="false"
docValues="true"/>
...
<!-- Copy the text to this field. We could create French, English, Spanish versions too,
     and sort differently for different users! -->
<copyField source="manu" dest="manuGERMAN"/>
In the example above, we defined the strength as "primary". The strength of the collation determines how strict the
sort order will be, but it also depends upon the language. For example, in English, "primary" strength ignores
differences in case and accents.
Another example:
<fieldType name="polishCaseInsensitive" class="solr.ICUCollationField"
locale="pl_PL"
strength="secondary" />
...
<field name="city" type="text_general" indexed="true" stored="true"/>
...
<field name="city_sort" type="polishCaseInsensitive" indexed="true" stored="false"/>
...
<copyField source="city" dest="city_sort"/>
The polishCaseInsensitive type will be used for fields where the data contains Polish text. The "secondary" strength will ignore case differences, but, unlike "primary" strength, a letter with diacritic(s) will be sorted differently from the same base letter without diacritics.
An example using the "city_sort" field to sort:
q=*:*&fl=city&sort=city_sort+asc
Sorting Text for Multiple Languages
There are two approaches to supporting multiple languages: if there is a small list of languages you wish to support, consider defining collated fields for each language and using copyField. However, adding a large number of sort fields can increase disk and indexing costs. An alternative approach is to use the Unicode default collator.
The Unicode default or ROOT locale has rules that are designed to work well for most languages. To use the default locale, simply define the locale as the empty string. This Unicode default sort is still significantly more advanced than the standard Solr sort.
<fieldType name="collatedROOT" class="solr.ICUCollationField"
locale=""
strength="primary" />
Sorting Text with Custom Rules
You can define your own set of sorting rules. It's easiest to take existing rules that are close to what you want and
customize them.
In the example below, we create a custom rule set for German called DIN 5007-2. This rule set treats umlauts in
German differently: it treats ö as equivalent to oe, ä as equivalent to ae, and ü as equivalent to ue. For more information, see the ICU RuleBasedCollator javadocs.
This example shows how to create a custom rule set for solr.ICUCollationField and dump it to a file:
// requires com.ibm.icu.text.Collator, com.ibm.icu.text.RuleBasedCollator,
// com.ibm.icu.util.ULocale, and org.apache.commons.io.IOUtils

// get the default rules for Germany
// these are called DIN 5007-1 sorting
RuleBasedCollator baseCollator = (RuleBasedCollator) Collator.getInstance(new ULocale("de", "DE"));

// define some tailorings, to make it DIN 5007-2 sorting.
// For example, this makes ö equivalent to oe
String DIN5007_2_tailorings =
    "& ae , a\u0308 & AE , A\u0308" +
    "& oe , o\u0308 & OE , O\u0308" +
    "& ue , u\u0308 & UE , U\u0308";

// concatenate the default rules to the tailorings, and dump it to a String
RuleBasedCollator tailoredCollator = new RuleBasedCollator(baseCollator.getRules() + DIN5007_2_tailorings);
String tailoredRules = tailoredCollator.getRules();

// write these to a file, be sure to use UTF-8 encoding!
FileOutputStream os = new FileOutputStream(new File("/solr_home/conf/customRules.dat"));
IOUtils.write(tailoredRules, os, "UTF-8");
This rule set can now be used for custom collation in Solr:
<fieldType name="collatedCUSTOM" class="solr.ICUCollationField"
custom="customRules.dat"
strength="primary" />
JDK Collation
As mentioned above, ICU Unicode Collation is better in several ways than JDK Collation, but if you cannot use ICU4J for some reason, you can use solr.CollationField.
The principles of JDK Collation are the same as those of ICU Collation; you just specify language, country, and variant arguments instead of the combined locale argument.
Arguments for solr.CollationField, specified as attributes within the <fieldtype> element:
Using a System collator (see Oracle's list of locales supported in Java 7):
language: (required) ISO-639 language code
country: ISO-3166 country code
variant: Vendor or browser-specific code
strength: Valid values are primary, secondary, tertiary, or identical. See the Oracle Java 7 Collator javadocs for more information.
decomposition: Valid values are no, canonical, or full. See the Oracle Java 7 Collator javadocs for more information.
Using a Tailored ruleset:
custom: (required) Path to a UTF-8 text file containing rules supported by the JDK RuleBasedCollator
strength: Valid values are primary, secondary, tertiary, or identical. See the Oracle Java 7 Collator javadocs for more information.
decomposition: Valid values are no, canonical, or full. See the Oracle Java 7 Collator javadocs for more information.
A solr.CollationField example:
<fieldType name="collatedGERMAN" class="solr.CollationField"
language="de"
country="DE"
strength="primary" /> <!-- ignore Umlauts and letter case when sorting -->
...
<field name="manuGERMAN" type="collatedGERMAN" indexed="false" stored="false"
docValues="true" />
...
<copyField source="manu" dest="manuGERMAN"/>
ASCII Folding Filter
This filter converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if one exists. Only those characters with reasonable ASCII alternatives are converted.
This can increase recall by causing more matches. On the other hand, it can reduce precision because
language-specific character differences may be lost.
Factory class: solr.ASCIIFoldingFilterFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
In: "Björn Ångström"
Tokenizer to Filter: "Björn", "Ångström"
Out: "Bjorn", "Angstrom"
Language-Specific Factories
These factories are each designed to work with specific languages. The languages covered here are:
Arabic
Brazilian Portuguese
Bulgarian
Catalan
Chinese
Simplified Chinese
CJK
Czech
Danish
Dutch
Finnish
French
Galician
German
Greek
Hebrew, Lao, Myanmar, Khmer
Hindi
Indonesian
Italian
Irish
Kuromoji (Japanese)
Latvian
Norwegian
Persian
Polish
Portuguese
Romanian
Russian
Spanish
Swedish
Thai
Turkish
Arabic
Solr provides support for the Light-10 (PDF) stemming algorithm, and Lucene includes an example stopword list.
This algorithm defines both character normalization and stemming, so these are split into two filters to provide more flexibility.
Factory classes: solr.ArabicStemFilterFactory, solr.ArabicNormalizationFilterFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ArabicNormalizationFilterFactory"/>
<filter class="solr.ArabicStemFilterFactory"/>
</analyzer>
Brazilian Portuguese
This is a Java filter written specifically for stemming the Brazilian dialect of the Portuguese language. It uses the Lucene class org.apache.lucene.analysis.br.BrazilianStemmer. Although that stemmer can be configured to use a list of protected words (which should not be stemmed), this factory does not accept any arguments to specify such a list.
Factory class: solr.BrazilianStemFilterFactory
Arguments: None
Example:
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.BrazilianStemFilterFactory"/>
</analyzer>
In: "praia praias"
Tokenizer to Filter: "praia", "praias"
Out: "pra", "pra"
Bulgarian
Solr includes a light stemmer for Bulgarian, following this algorithm (PDF), and Lucene includes an example stopword list.
Factory class: solr.BulgarianStemFilterFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.BulgarianStemFilterFactory"/>
</analyzer>
Catalan
Solr can stem Catalan using the Snowball Porter Stemmer with an argument of language="Catalan". Solr includes a set of contractions for Catalan, which can be stripped using solr.ElisionFilterFactory.
Factory class: solr.SnowballPorterFilterFactory
Arguments:
language: (required) stemmer language, "Catalan" in this case
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ElisionFilterFactory"
articles="lang/contractions_ca.txt"/>
<filter class="solr.SnowballPorterFilterFactory" language="Catalan" />
</analyzer>
In: "llengües llengua"
Tokenizer to Filter: "llengües"(1) "llengua"(2),
Out: "llengu"(1), "llengu"(2)
Chinese
Chinese Tokenizer
The Chinese Tokenizer is deprecated as of Solr 3.4. Use the solr.StandardTokenizerFactory instead.
Factory class: solr.ChineseTokenizerFactory
Arguments: None
Example:
<analyzer type="index">
<tokenizer class="solr.ChineseTokenizerFactory"/>
</analyzer>
Chinese Filter Factory
The Chinese Filter Factory is deprecated as of Solr 3.4. Use the solr.StopFilterFactory instead.
Factory class: solr.ChineseFilterFactory
Arguments: None
Example:
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ChineseFilterFactory"/>
</analyzer>
Simplified Chinese
For Simplified Chinese, Solr provides support for Chinese sentence and word segmentation with the solr.HMMChineseTokenizerFactory in the analysis-extras contrib module. This component includes a large dictionary and segments Chinese text into words with the Hidden Markov Model. To use this filter, see solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your solr_home/lib.
Factory class: solr.HMMChineseTokenizerFactory
Arguments: None
Examples:
To use the default setup with fallback to English Porter stemmer for English words, use:
<analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>
Or to configure your own analysis setup, use the solr.HMMChineseTokenizerFactory along with your custom filter setup.
<analyzer>
  <tokenizer class="solr.HMMChineseTokenizerFactory"/>
  <filter class="solr.StopFilterFactory"
          words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
  <filter class="solr.PorterStemFilterFactory"/>
</analyzer>
CJK
This tokenizer breaks Chinese, Japanese and Korean language text into tokens. These are not whitespace delimited
languages. The tokens generated by this tokenizer are "doubles", overlapping pairs of CJK characters found in the
field text.
Factory class: solr.CJKTokenizerFactory
Arguments: None
Example:
<analyzer type="index">
<tokenizer class="solr.CJKTokenizerFactory"/>
</analyzer>
Czech
Solr includes a light stemmer for Czech, following this algorithm, and Lucene includes an example stopword list.
Factory class: solr.CzechStemFilterFactory
Arguments: None
Example:
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.CzechStemFilterFactory"/>
</analyzer>
In: "prezidenští, prezidenta, prezidentského"
Tokenizer to Filter: "prezidenští", "prezidenta", "prezidentského"
Out: "preziden", "preziden", "preziden"
Danish
Solr can stem Danish using the Snowball Porter Stemmer with an argument of language="Danish".
Factory class: solr.SnowballPorterFilterFactory
Arguments:
language: (required) stemmer language, "Danish" in this case
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="Danish" />
</analyzer>
In: "undersøg undersøgelse"
Tokenizer to Filter: "undersøg"(1) "undersøgelse"(2),
Out: "undersøg"(1), "undersøg"(2)
Dutch
This is a Java filter written specifically for stemming the Dutch language. It uses the Lucene class org.apache.lucene.analysis.nl.DutchStemmer. Although that stemmer can be configured to use a list of protected words (which should not be stemmed), this factory does not accept any arguments to specify such a list.
Another option for stemming Dutch words is to use the Snowball Porter Stemmer with an argument of language="Dutch".
Factory class: solr.DutchStemFilterFactory
Arguments: None
Example:
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.DutchStemFilterFactory"/>
</analyzer>
In: "kanaal kanalen"
Tokenizer to Filter: "kanaal", "kanalen"
Out: "kanal", "kanal"
Finnish
Solr includes support for stemming Finnish, and Lucene includes an example stopword list.
Factory class: solr.FinnishLightStemFilterFactory
Arguments: None
Example:
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.FinnishLightStemFilterFactory"/>
</analyzer>
In: "kala kalat"
Tokenizer to Filter: "kala", "kalat"
Out: "kala", "kala"
French
Elision Filter
Removes article elisions from a token stream. This filter can be useful for languages such as French, Catalan,
Italian, and Irish.
Factory class: solr.ElisionFilterFactory
Arguments:
articles: The pathname of a file that contains a list of articles, one per line, to be stripped. Articles are words such as "le", which are commonly abbreviated, such as in l'avion (the plane). This file should include the abbreviated form, which precedes the apostrophe. In this case, simply "l". If no articles attribute is specified, a default set of French articles is used.
ignoreCase: (boolean) If true, the filter ignores the case of words when comparing them to the common word file. Defaults to false.
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ElisionFilterFactory"
ignoreCase="true"
articles="lang/contractions_fr.txt"/>
</analyzer>
In: "L'histoire d'art"
Tokenizer to Filter: "L'histoire", "d'art"
Out: "histoire", "art"
French Light Stem Filter
Solr includes three stemmers for French: one in the solr.SnowballPorterFilterFactory, a lighter stemmer called solr.FrenchLightStemFilterFactory, and an even less aggressive stemmer called solr.FrenchMinimalStemFilterFactory. Lucene includes an example stopword list.
Factory classes: solr.FrenchLightStemFilterFactory, solr.FrenchMinimalStemFilterFactory
Arguments: None
Examples:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ElisionFilterFactory"
articles="lang/contractions_fr.txt"/>
<filter class="solr.FrenchLightStemFilterFactory"/>
</analyzer>
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ElisionFilterFactory"
articles="lang/contractions_fr.txt"/>
<filter class="solr.FrenchMinimalStemFilterFactory"/>
</analyzer>
In: "le chat, les chats"
Tokenizer to Filter: "le", "chat", "les", "chats"
Out: "le", "chat", "le", "chat"
Galician
Solr includes a stemmer for Galician following this algorithm, and Lucene includes an example stopword list.
Factory class: solr.GalicianStemFilterFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.GalicianStemFilterFactory"/>
</analyzer>
In: "felizmente Luzes"
Tokenizer to Filter: "felizmente", "luzes"
Out: "feliz", "luz"
German
Solr includes four stemmers for German: one in the solr.SnowballPorterFilterFactory language="German", a stemmer called solr.GermanStemFilterFactory, a lighter stemmer called solr.GermanLightStemFilterFactory, and an even less aggressive stemmer called solr.GermanMinimalStemFilterFactory. Lucene includes an example stopword list.
Factory classes: solr.GermanStemFilterFactory, solr.GermanLightStemFilterFactory, solr.GermanMinimalStemFilterFactory
Arguments: None
Examples:
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory "/>
<filter class="solr.GermanStemFilterFactory"/>
</analyzer>
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.GermanLightStemFilterFactory"/>
</analyzer>
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory "/>
<filter class="solr.GermanMinimalStemFilterFactory"/>
</analyzer>
In: "hund hunden"
Tokenizer to Filter: "hund", "hunden"
Out: "hund", "hund"
Greek
This filter converts uppercase letters in the Greek character set to the equivalent lowercase character.
Factory class: solr.GreekLowerCaseFilterFactory
Arguments: None
Example:
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.GreekLowerCaseFilterFactory"/>
</analyzer>
Hindi
Solr includes support for stemming Hindi following this algorithm (PDF), support for common spelling differences through the solr.HindiNormalizationFilterFactory, support for encoding differences through the solr.IndicNormalizationFilterFactory following this algorithm, and Lucene includes an example stopword list.
Factory classes: solr.IndicNormalizationFilterFactory, solr.HindiNormalizationFilterFactory, solr.HindiStemFilterFactory
Arguments: None
Example:
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.IndicNormalizationFilterFactory"/>
<filter class="solr.HindiNormalizationFilterFactory"/>
<filter class="solr.HindiStemFilterFactory"/>
</analyzer>
Indonesian
Solr includes support for stemming Indonesian (Bahasa Indonesia) following this algorithm (PDF), and Lucene
includes an example stopword list.
Factory class: solr.IndonesianStemFilterFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.IndonesianStemFilterFactory" stemDerivational="true" />
</analyzer>
In: "sebagai sebagainya"
Tokenizer to Filter: "sebagai", "sebagainya"
Out: "bagai", "bagai"
Use of custom charsets is no longer supported as of Solr 3.1. If you need to index text in these encodings, please use Java's character set conversion facilities (InputStreamReader, and so on) during I/O, so that Lucene can analyze this text as Unicode instead.
Italian
Solr includes two stemmers for Italian: one in the solr.SnowballPorterFilterFactory language="Italian", and a lighter stemmer called solr.ItalianLightStemFilterFactory. Lucene includes an example stopword list.
Factory class: solr.ItalianStemFilterFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ElisionFilterFactory"
articles="lang/contractions_it.txt"/>
<filter class="solr.ItalianLightStemFilterFactory"/>
</analyzer>
In: "propaga propagare propagamento"
Tokenizer to Filter: "propaga", "propagare", "propagamento"
Out: "propag", "propag", "propag"
Irish
Solr can stem Irish using the Snowball Porter Stemmer with an argument of language="Irish". Solr includes solr.IrishLowerCaseFilter, which can handle Irish-specific constructs. Solr also includes a set of contractions for Irish which can be stripped using solr.ElisionFilterFactory.
Factory class: solr.SnowballPorterFilterFactory
Arguments:
language: (required) stemmer language, "Irish" in this case
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ElisionFilterFactory"
articles="lang/contractions_ga.txt"/>
<filter class="solr.IrishLowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="Irish" />
</analyzer>
In: "siopadóireacht síceapatacha b'fhearr m'athair"
Tokenizer to Filter: "siopadóireacht", "síceapatacha", "b'fhearr", "m'athair"
Out: "siopadóir", "síceapaite", "fearr", "athair"
Kuromoji (Japanese)
Solr includes support for stemming Kuromoji (Japanese), and Lucene includes an example stopword list. Kuromoji has a search mode (default) that does segmentation useful for search. A heuristic is used to segment compounds into their parts, and the compound itself is kept as a synonym.
With Solr 4, the JapaneseIterationMarkCharFilterFactory is now included to normalize Japanese iteration marks.
You can also make discarding punctuation configurable in the JapaneseTokenizerFactory, by setting discardPunctuation to false (to show punctuation) or true (to discard punctuation).
Factory class: solr.KuromojiStemFilterFactory
Arguments:
mode: Use search-mode to get a noun-decompounding effect useful for search. Search mode improves
segmentation for search at the expense of part-of-speech accuracy. Valid values for mode are:
normal: default segmentation
search: segmentation useful for search (extra compound splitting)
extended: search mode with unigramming of unknown words (experimental)
For some applications it might be good to use search mode for indexing and normal mode for queries to reduce
recall and prevent parts of compounds from being matched and highlighted.
Kuromoji also has a convenient user dictionary feature that allows overriding the statistical model with your own
entries for segmentation, part-of-speech tags and readings without a need to specify weights. Note that user
dictionaries have not been subject to extensive testing. User dictionary attributes are:
userDictionary: user dictionary filename
userDictionaryEncoding: user dictionary encoding (default is UTF-8)
See lang/userdict_ja.txt for a sample user dictionary file.
Punctuation characters are discarded by default. Use discardPunctuation="false" to keep them.
Example:
<fieldType name="text_ja" positionIncrementGap="100"
autoGeneratePhraseQueries="false">
<analyzer>
<tokenizer class="solr.JapaneseTokenizerFactory" mode="search"
userDictionary="lang/userdict_ja.txt"/>
<filter class="solr.JapaneseBaseFormFilterFactory"/>
<filter class="solr.JapanesePartOfSpeechStopFilterFactory"
tags="lang/stoptags_ja.txt" enablePositionIncrements="true"/>
<filter class="solr.CJKWidthFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="lang/stopwords_ja.txt" enablePositionIncrements="true" />
<filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
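If you need to keep punctuation tokens, the tokenizer line from the example above can be adjusted as in this sketch (the rest of the analyzer chain is unchanged):

<tokenizer class="solr.JapaneseTokenizerFactory" mode="search"
           userDictionary="lang/userdict_ja.txt"
           discardPunctuation="false"/>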
Hebrew, Lao, Myanmar, Khmer
Lucene provides support, in addition to UAX#29 word break rules, for Hebrew's use of the double and single quote characters, and for segmenting Lao, Myanmar, and Khmer into syllables with the solr.ICUTokenizerFactory in the analysis-extras contrib module. To use this tokenizer, see solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your solr_home/lib.
See the ICUTokenizer for more information.
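A minimal analyzer sketch using this tokenizer (assuming the analysis-extras jars have been added to solr_home/lib):

<analyzer>
  <tokenizer class="solr.ICUTokenizerFactory"/>
</analyzer>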
Latvian
Solr includes support for stemming Latvian, and Lucene includes an example stopword list.
Factory class: solr.LatvianStemFilterFactory
Arguments: None
Example:
<fieldType name="text_lvstem" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.LatvianStemFilterFactory"/>
</analyzer>
</fieldType>
In: "tirgiem tirgus"
Tokenizer to Filter: "tirgiem", "tirgus"
Out: "tirg", "tirg"
Norwegian
Solr includes two classes for stemming Norwegian, NorwegianLightStemFilterFactory and NorwegianMinimalStemFilterFactory. Lucene includes an example stopword list.
Another option is to use the Snowball Porter Stemmer with an argument of language="Norwegian".
Norwegian Light Stemmer
The NorwegianLightStemFilterFactory requires a "two-pass" sort for the -dom and -het endings. This means that in the first pass the word "kristendom" is stemmed to "kristen", and then all the general rules apply so it will be further stemmed to "krist". The effect of this is that "kristen," "kristendom," "kristendommen," and "kristendommens" will all be stemmed to "krist."
The second pass is to pick up -dom and -het endings. Consider this example:
                One pass                        Two passes
Before          After           Before          After
forlegen        forleg          forlegen        forleg
forlegenhet     forlegen        forlegenhet     forleg
forlegenheten   forlegen        forlegenheten   forleg
forlegenhetens  forlegen        forlegenhetens  forleg
firkantet       firkant         firkantet       firkant
firkantethet    firkantet       firkantethet    firkant
firkantetheten  firkantet       firkantetheten  firkant
Factory class: solr.NorwegianLightStemFilterFactory
Arguments: variant: Choose the Norwegian language variant to use. Valid values are:
nb: Bokmål (default)
nn: Nynorsk
no: both
Example:
<fieldType name="text_no" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="lang/stopwords_no.txt" format="snowball" enablePositionIncrements="true"/>
<filter class="solr.NorwegianLightStemFilterFactory"/>
</analyzer>
</fieldType>
In: "Forelskelsen"
Tokenizer to Filter: "forelskelsen"
Out: "forelske"
Norwegian Minimal Stemmer
The NorwegianMinimalStemFilterFactory stems plural forms of Norwegian nouns only.
Factory class: solr.NorwegianMinimalStemFilterFactory
Arguments: variant: Choose the Norwegian language variant to use. Valid values are:
nb: Bokmål (default)
nn: Nynorsk
no: both
Example:
<fieldType name="text_no" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="lang/stopwords_no.txt" format="snowball" enablePositionIncrements="true"/>
<filter class="solr.NorwegianMinimalStemFilterFactory"/>
</analyzer>
</fieldType>
In: "Bilens"
Tokenizer to Filter: "bilens"
Out: "bil"
Persian
Persian Filter Factories
Solr includes support for normalizing Persian, and Lucene includes an example stopword list.
Factory class: solr.PersianNormalizationFilterFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ArabicNormalizationFilterFactory"/>
<filter class="solr.PersianNormalizationFilterFactory">
</analyzer>
Polish
Solr provides support for Polish stemming with the solr.StempelPolishStemFilterFactory, and solr.MorfologikFilterFactory for lemmatization, in the contrib/analysis-extras module. The solr.StempelPolishStemFilterFactory component includes an algorithmic stemmer with tables for Polish. To use either of these filters, see solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your solr_home/lib.
Factory classes: solr.StempelPolishStemFilterFactory and solr.MorfologikFilterFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StempelPolishStemFilterFactory"/>
</analyzer>
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.MorfologikFilterFactory" dictionary-resource="pl"/>
</analyzer>
In: ""studenta studenci"
Tokenizer to Filter: "studenta", "studenci"
Out: "student", "student"
More information about the Stempel stemmer is available in the Lucene javadocs.
The Morfologik dictionary-resource param value is a constant specifying which dictionary to choose. The dictionary resource must be named morfologik/dictionaries/{dictionaryResource}.dict and have an associated .info metadata file. See the Morfologik project for details.
Portuguese
Solr includes four stemmers for Portuguese: one in the solr.SnowballPorterFilterFactory, an alternative stemmer called solr.PortugueseStemFilterFactory, a lighter stemmer called solr.PortugueseLightStemFilterFactory, and an even less aggressive stemmer called solr.PortugueseMinimalStemFilterFactory. Lucene includes an example stopword list.
Factory classes: solr.PortugueseStemFilterFactory, solr.PortugueseLightStemFilterFactory, solr.PortugueseMinimalStemFilterFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PortugueseStemFilterFactory"/>
</analyzer>
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PortugueseLightStemFilterFactory"/>
</analyzer>
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PortugueseMinimalStemFilterFactory"/>
</analyzer>
In: "praia praias"
Tokenizer to Filter: "praia", "praias"
Out: "pra", "pra"
Romanian
Solr can stem Romanian using the Snowball Porter Stemmer with an argument of language="Romanian".
Factory class: solr.SnowballPorterFilterFactory
Arguments:
language: (required) stemmer language, "Romanian" in this case
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="Romanian" />
</analyzer>
Russian
Russian Stem Filter
Solr includes two stemmers for Russian: one in the solr.SnowballPorterFilterFactory language="Russian", and a lighter stemmer called solr.RussianLightStemFilterFactory. Lucene includes an example stopword list.
Factory class: solr.RussianLightStemFilterFactory
Arguments: None
Example:
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RussianLightStemFilterFactory"/>
</analyzer>
Spanish
Solr includes two stemmers for Spanish: one in the solr.SnowballPorterFilterFactory language="Spanish", and a lighter stemmer called solr.SpanishLightStemFilterFactory. Lucene includes an example stopword list.
Factory class: solr.SpanishStemFilterFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SpanishLightStemFilterFactory"/>
</analyzer>
In: "torear toreara torearlo"
Tokenizer to Filter: "torear", "toreara", "torearlo"
Out: "tor", "tor", "tor"
Use of custom charsets is no longer supported as of Solr 3.4. If you need to index text in these encodings, please use Java's character set conversion facilities (InputStreamReader, and so on) during I/O, so that Lucene can analyze this text as Unicode instead.
Swedish
Swedish Stem Filter
Solr includes two stemmers for Swedish: one in the solr.SnowballPorterFilterFactory language="Swedish", and a lighter stemmer called solr.SwedishLightStemFilterFactory. Lucene includes an example stopword list.
Factory class: solr.SwedishStemFilterFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SwedishLightStemFilterFactory"/>
</analyzer>
In: "kloke klokhet klokheten"
Tokenizer to Filter: "kloke", "klokhet", "klokheten"
Out: "klok", "klok", "klok"
Thai
This tokenizer converts sequences of Thai characters into individual Thai words. Unlike European languages, Thai does
not use whitespace to delimit words.
Factory class: solr.ThaiTokenizerFactory
Arguments: None
Example:
<analyzer type="index">
<tokenizer class="solr.ThaiTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
Turkish
Solr includes support for stemming Turkish through the solr.SnowballPorterFilterFactory; support for case-insensitive search through the solr.TurkishLowerCaseFilterFactory; support for stripping apostrophes and following suffixes through solr.ApostropheFilterFactory (see Role of Apostrophes in Turkish Information Retrieval); support for a form of stemming that truncates tokens at a configurable maximum length through the solr.TruncateTokenFilterFactory (see Information Retrieval on Turkish Texts); and Lucene includes an example stopword list.
Factory class: solr.TurkishLowerCaseFilterFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ApostropheFilterFactory"/>
<filter class="solr.TurkishLowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="Turkish"/>
</analyzer>
Another example, illustrating diacritics-insensitive search:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ApostropheFilterFactory"/>
<filter class="solr.TurkishLowerCaseFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
<filter class="solr.KeywordRepeatFilterFactory"/>
<filter class="solr.TruncateTokenFilterFactory" prefixLength="5"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
Related Topics
LanguageAnalysis
Phonetic Matching
Introduced with Solr v3.6, Beider-Morse Phonetic Matching (BMPM) is a "soundalike" tool that lets you search using
a new phonetic matching system. BMPM helps you search for personal names (or just surnames) in a Solr/Lucene
index, and is far superior to the existing phonetic codecs, such as regular soundex, metaphone, caverphone, etc.
In general, phonetic matching lets you search a name list for names that are phonetically equivalent to the desired
name. BMPM is similar to a soundex search in that an exact spelling is not required. Unlike soundex, it does not
generate a large quantity of false hits.
From the spelling of the name, BMPM attempts to determine the language. It then applies phonetic rules for that particular language to transliterate the name into a phonetic alphabet. If it is not possible to determine the language with a fair degree of certainty, it uses generic phonetic rules instead. Finally, it applies language-independent rules regarding such things as voiced and unvoiced consonants and vowels to further ensure the reliability of the matches.
For example, assume that the matches found when searching for Stephen in a database are "Stefan", "Steph",
"Stephen", "Steve", "Steven", "Stove", and "Stuffin". "Stefan", "Stephen", and "Steven" are probably relevant, and
are names that you want to see. "Stuffin", however, is probably not relevant. Also rejected were "Steph", "Steve",
and "Stove". Of those, "Stove" is probably not one that we would have wanted. But "Steph" and "Steve" are possibly
ones that you might be interested in.
For Solr, BMPM searching is available for the following languages:
English
French
German
Greek
Hebrew written in Hebrew letters
Hungarian
Italian
Lithuanian and Latvian
Polish
Romanian
Russian written in Cyrillic letters
Russian transliterated into English letters
Spanish
Turkish
The name matching is also applicable to non-Jewish surnames from the countries in which those languages are
spoken.
For more information, see http://stevemorse.org/phoneticinfo.htm and http://stevemorse.org/phonetics/bmpm.htm.
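BMPM is exposed in Solr as a token filter. A minimal analyzer sketch, assuming the solr.BeiderMorseFilterFactory and the attribute values shown here (which are illustrative, not recommendations):

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.BeiderMorseFilterFactory" nameType="GENERIC" ruleType="APPROX"
          concat="true" languageSet="auto"/>
</analyzer>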
Running Your Analyzer
Once you've defined a field type in schema.xml and specified the analysis steps that you want applied to it, you should test it out to make sure that it behaves the way you expect it to. Luckily, there is a very handy page in the Solr admin interface that lets you do just that. You can invoke the analyzer for any text field, provide sample input, and display the resulting token stream.
For example, assume that the following field type definition has been added to schema.xml:
<fieldType name="mytextfield" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.HyphenatedWordsFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
The objective here (during indexing) is to reconstruct hyphenated words, which may have been split across lines in
the text, then to set all words to lowercase. For queries, you want to skip the de-hyphenation step.
To test this out, point your browser at the Analysis Screen of the Solr Admin Web interface. By default, this will be at the following URL (adjust the hostname and/or port to match your configuration): http://localhost:8983/solr/#/collection1/analysis. You should see a page like this.
Empty Analysis screen
We want to test the field type definition for "mytextfield", defined above. The drop-down labeled "Analyse Fieldname/FieldType" allows choosing the field or field type to use for the analysis.
There are two "Field Value" boxes, one for how text will be analyzed during indexing and a second for how text will be analyzed for query processing. In the "Field Value (Index)" box enter some sample text ("Super-computer" in this example) to be processed by the analyzer. We will leave the query field value empty for now.
The result we expect is that HyphenatedWordsFilter will join the hyphenated pair "Super-" and "computer" into the single word "Supercomputer", and then LowerCaseFilter will set it to "supercomputer". Let's see what happens:
Running index-time analyzer, verbose output.
The result is two distinct tokens rather than the one we expected. What went wrong? Looking at the first token that came out of StandardTokenizer, we can see the trailing hyphen has been stripped off of "Super-". Checking the documentation for StandardTokenizer, we see that it treats all punctuation characters as delimiters and discards them. What we really want in this case is a whitespace tokenizer that will preserve the hyphen character when it breaks the text into tokens.
Let's make this change and try again:
<fieldType name="mytextfield" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.HyphenatedWordsFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Re-submitting the form by clicking "Analyse Values" again, we see the result in the screen shot below.
Using WhitespaceTokenizer, expected results.
That's more like it. Because the whitespace tokenizer preserved the trailing hyphen on the first token, HyphenatedWordsFilter was able to reconstruct the hyphenated word, which was then passed on to LowerCaseFilter, where capital letters are set to lowercase.
Now let's see what happens when invoking the analyzer for query processing. For query terms, we don't want to do de-hyphenation and we do want to discard punctuation, so let's try the same input on it. We'll copy the same text to the "Field Value (Query)" box and clear the one for index analysis. We'll also include the full, unhyphenated word as another term to make sure it is processed to lower case as we expect. Submitting again yields these results:
Query-time analyzer, good results.
We can see that for queries the analyzer behaves the way we want it to. Punctuation is stripped out, HyphenatedWordsFilter doesn't run, and we wind up with the three tokens we expected.
Indexing and Basic Data Operations
This section describes how Solr adds data to its index. It covers the following topics:
Introduction to Solr Indexing: An overview of Solr's indexing process.
Simple Post Tool: Information about using post.jar to quickly upload some content to your system.
Uploading Data with Index Handlers: Information about using Solr's Index Handlers to upload XML/XSLT, JSON and CSV data.
Uploading Data with Solr Cell using Apache Tika: Information about using the Solr Cell framework to upload data for indexing.
Uploading Structured Data Store Data with the Data Import Handler: Information about uploading and indexing data from a structured data store.
Updating Parts of Documents: Information about how to use atomic updates and optimistic concurrency with Solr.
Detecting Languages During Indexing: Information about using language identification during the indexing process.
De-Duplication: Information about configuring Solr to mark duplicate documents as they are indexed.
Content Streams: Information about streaming content to Solr Request Handlers.
UIMA Integration: Information about integrating Solr with Apache's Unstructured Information Management Architecture (UIMA). UIMA lets you define custom pipelines of Analysis Engines that incrementally add metadata to your documents as annotations.
Indexing Using Client APIs
Using client APIs, such as SolrJ, from your applications is an important option for updating Solr indexes. See the Client APIs section for more information.
Introduction to Solr Indexing
This section describes the process of indexing: adding content to a Solr index and, if necessary, modifying that
content or deleting it. By adding content to an index, we make it searchable by Solr.
A Solr index can accept data from many different sources, including XML files, comma-separated value (CSV) files,
data extracted from tables in a database, and files in common file formats such as Microsoft Word or PDF.
Here are the three most common ways of loading data into a Solr index:
Using the Solr Cell framework built on Apache Tika for ingesting binary files or structured files such as Office, Word, PDF, and other proprietary formats.
Uploading XML files by sending HTTP requests to the Solr server from any environment where such requests
can be generated.
Writing a custom Java application to ingest data through Solr's Java Client API (which is described in more detail in Client APIs). Using the Java API may be the best choice if you're working with an application, such as a Content Management System (CMS), that offers a Java API.
Regardless of the method used to ingest data, there is a common basic data structure for data being fed into a Solr index: a document containing multiple fields, each with a name and containing content, which may be empty. One of the fields is usually designated as a unique ID field (analogous to a primary key in a database), although the use of a unique ID field is not strictly required by Solr.
If the field name is defined in the schema.xml file that is associated with the index, then the analysis steps associated with that field will be applied to its content when the content is tokenized. Fields that are not explicitly defined in the schema will either be ignored or mapped to a dynamic field definition (see Documents, Fields, and Schema Design), if one matching the field name exists.
For more information on indexing in Solr, see the Solr Wiki.
The Solr Example Directory
The example/ directory includes a sample Solr implementation, along with sample documents for uploading into an index. You will find the example docs in $SOLR_HOME/example/exampledocs.
The curl Utility for Transferring Files
Many of the instructions and examples in this section make use of the curl utility for transferring content through a URL. curl posts and retrieves data over HTTP, FTP, and many other protocols. Most Linux distributions include a copy of curl. You'll find curl downloads for Linux, Windows, and many other operating systems at http://curl.haxx.se/download.html. Documentation for curl is available here: http://curl.haxx.se/docs/manpage.html.
Using curl or other command line tools for posting data is just fine for examples or tests, but it's not the recommended method for achieving the best performance for updates in production environments. You will achieve better performance with Solr Cell or the other methods described in this section.
Instead of curl, you can use utilities such as GNU wget (http://www.gnu.org/software/wget/) or manage GETs and POSTS with Perl, although the command line options will differ.
Simple Post Tool
Solr includes a simple command line tool for POSTing raw XML to a Solr port. XML data can be read from files
specified as command line arguments, as raw commandline argument strings, or via STDIN.
The tool is called post.jar and is found in the 'exampledocs' directory: $SOLR/example/exampledocs/post.jar includes a cross-platform Java tool for POST-ing XML documents.
To run it, open a window and enter:
java -jar post.jar <list of files with messages>
By default, this will contact the server at localhost:8983. The '-help' (or simply '-h') option will output information on its usage (i.e., java -jar post.jar -help).
Using the Simple Post Tool
Options controlled by System Properties include the Solr URL to post to, the Content-Type of the data, whether a commit or optimize should be executed, and whether the response should be written to STDOUT. You may override any other request parameter through the -Dparams property.
The supported system properties, their valid values, and their defaults are listed below:
-Ddata: args, stdin, files, or web (default: files). Use args to pass arguments along the command line (such as a command to delete a document). Use files to pass a filename or regex pattern indicating paths and filenames. Use stdin to use standard input. Use web for a very simple web crawler (arguments for this would be the URL to crawl).
-Dtype: <content-type> (default: application/xml). Defines the content-type, if -Dauto is not used.
-Durl: <solr-update-url> (default: http://localhost:8983/solr/update). The Solr URL to send the updates to.
-Dauto: yes or no (default: no). If yes, the tool will guess the file type from the file name suffix, and set type and url accordingly. It also sets the ID and file name automatically.
-Drecursive: yes or no (default: no). Will recurse into sub-folders and index all files.
-Dfiletypes: <type>[,<type>,..] (default: xml, json, csv, pdf, doc, docx, ppt, pptx, xls, xlsx, odt, odp, ods, rtf, htm, html). Specifies the file types to consider when indexing folders.
-Dparams: "<key>=<value>[&<key>=<value>...]" (default: none). HTTP GET params to add to the request, so you don't need to write the whole URL again. Values must be URL-encoded.
-Dcommit: yes or no (default: yes). Perform a commit after adding the documents.
-Doptimize: yes or no (default: no). Perform an optimize after adding the documents.
-Dout: yes or no (default: no). Write the response to an output file.
Examples
There are several ways to use post.jar. Here are a few examples:
Add all documents with file extension .xml.
java -jar post.jar *.xml
Send XML arguments to delete a document from the index.
java -Ddata=args -jar post.jar '<delete><id>42</id></delete>'
Index all CSV files.
java -Dtype=text/csv -jar post.jar *.csv
Index all JSON files.
java -Dtype=application/json -jar post.jar *.json
Use the extracting request handler to index a PDF file.
java -Durl=http://localhost:8983/solr/update/extract -Dparams=literal.id=a -Dtype=application/pdf -jar post.jar a.pdf
Automatically detect the content type based on the file extension.
java -Dauto=yes -jar post.jar a.pdf
Automatically detect content types in a folder, and recursively scan it for documents.
java -Dauto=yes -Drecursive=yes -jar post.jar afolder
Automatically detect content types in a folder, but limit it to PPT and HTML files.
java -Dauto=yes -Dfiletypes=ppt,html -jar post.jar afolder
Uploading Data with Index Handlers
Index Handlers are Request Handlers designed to add, delete and update documents in the index. In addition to having plugins for importing rich documents using Tika or from structured data sources using the Data Import Handler, Solr natively supports indexing structured documents in XML, CSV and JSON.
The recommended way to configure and use request handlers is with path based names that map to paths in the request url - but request handlers can also be specified with the qt (query type) parameter if the requestDispatcher is appropriately configured.
The example URLs given here reflect the handler configuration in the supplied solrconfig.xml. If the name associated with the handler is changed then the URLs will need to be modified. It is possible to access the same handler using more than one name, which can be useful if you wish to specify different sets of default options.
Topics covered in this section:
UpdateRequestHandler Configuration
XML Formatted Index Updates
JSON Formatted Index Updates
CSV Formatted Index Updates
Nested Child Documents
The Combined UpdateRequestHandler
Prior to Solr 4, uploading content with an update request handler required declaring a unique request handler for the format of the content in the request. Now, there is a unified update request handler that supports XML, CSV, JSON, and javabin update requests, delegating to the appropriate ContentStreamLoader based on the Content-Type of the ContentStream.
UpdateRequestHandler Configuration
The default configuration file has the update request handler configured by default.
<requestHandler name="/update" class="solr.UpdateRequestHandler" />
XML Formatted Index Updates
Index update commands can be sent as XML messages to the update handler using Content-type: application/xml or Content-type: text/xml.
Adding Documents
The XML schema recognized by the update handler for adding documents is very straightforward:
The <add> element introduces one or more documents to be added.
The <doc> element introduces the fields making up a document.
The <field> element presents the content for a specific field.
For example:
<add>
<doc>
<field name="authors">Patrick Eagar</field>
<field name="subject">Sports</field>
<field name="dd">796.35</field>
<field name="numpages">128</field>
<field name="desc"></field>
<field name="price">12.40</field>
<field name="title" boost="2.0">Summer of the all-rounder: Test and championship
cricket in England 1982</field>
<field name="isbn">0002166313</field>
<field name="yearpub">1982</field>
<field name="publisher">Collins</field>
</doc>
<doc boost="2.5">
...
</doc>
</add>
Each element has certain optional attributes which may be specified.
<add> commitWithin=number - Add the document within the specified number of milliseconds.
<add> overwrite=boolean - Default is true. Indicates if the unique key constraints should be checked to overwrite previous versions of the same document (see below).
<doc> boost=float - Default is 1.0. Sets a boost value for the document. To learn more about boosting, see Searching.
<field> boost=float - Default is 1.0. Sets a boost value for the field.
If the document schema defines a unique key, then by default an /update operation to add a document will overwrite (i.e., replace) any document in the index with the same unique key. If no unique key has been defined, indexing performance is somewhat faster, as no check has to be made for an existing document to replace.
If you have a unique key field, but you feel confident that you can safely bypass the uniqueness check (e.g., you build your indexes in batch, and your indexing code guarantees it never adds the same document more than once), you can specify the overwrite="false" option when adding your documents.
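For example, the commitWithin, overwrite, and boost attributes described above could be combined as in this sketch (the values shown are arbitrary illustrations):

<add commitWithin="10000" overwrite="false">
  <doc boost="2.5">
    <field name="isbn">0002166313</field>
    <field name="title" boost="2.0">Summer of the all-rounder</field>
  </doc>
</add>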
Commit and Optimize Operations
The <commit> operation writes all documents loaded since the last commit to one or more segment files on the disk. Before a commit has been issued, newly indexed content is not visible to searches. The commit operation opens a new searcher, and triggers any event listeners that have been configured.
Commits may be issued explicitly with a <commit/> message, and can also be triggered from <autocommit> parameters in solrconfig.xml.
The <optimize> operation requests Solr to merge internal data structures in order to improve search performance. For a large index, optimization will take some time to complete, but by merging many small segment files into a larger one, search performance will improve. If you are using Solr's replication mechanism to distribute searches across many systems, be aware that after an optimize, a complete index will need to be transferred. In contrast, post-commit transfers are usually much smaller.
The <commit> and <optimize> elements accept these optional attributes:
waitSearcher - Default is true. Blocks until a new searcher is opened and registered as the main query searcher, making the changes visible.
expungeDeletes - (commit only) Default is false. Merges segments that have more than 10% deleted docs, expunging them in the process.
maxSegments - (optimize only) Default is 1. Merges the segments down to no more than this number of segments.
Here are examples of <commit> and <optimize> using optional attributes:
<commit waitSearcher="false"/>
<commit waitSearcher="false" expungeDeletes="true"/>
<optimize waitSearcher="false"/>
Delete Operations
Documents can be deleted from the index in two ways. "Delete by ID" deletes the document with the specified ID, and can be used only if a UniqueID field has been defined in the schema. "Delete by Query" deletes all documents matching a specified query, although commitWithin is ignored for a Delete by Query. A single delete message can contain multiple delete operations.
<delete>
<id>0002166313</id>
<id>0031745983</id>
<query>subject:sport</query>
<query>publisher:penguin</query>
</delete>
Rollback Operations
The rollback command rolls back all adds and deletes made to the index since the last commit. It neither calls any event listeners nor creates a new searcher. Its syntax is simple: <rollback/>.
Using curl to Perform Updates with the Update Request Handler
You can use the curl utility to perform any of the above commands, using its --data-binary option to append the XML message to the command, and generating an HTTP POST request. For example:
curl http://localhost:8983/solr/update -H "Content-Type: text/xml" --data-binary '
<add>
<doc>
<field name="authors">Patrick Eagar</field>
<field name="subject">Sports</field>
<field name="dd">796.35</field>
<field name="isbn">0002166313</field>
<field name="yearpub">1982</field>
<field name="publisher">Collins</field>
</doc>
</add>'
For posting XML messages contained in a file, you can use the alternative form:
curl http://localhost:8983/solr/update -H "Content-Type: text/xml" --data-binary @myfile.xml
Short requests can also be sent using a HTTP GET command, URL-encoding the request, as in the following. Note
the escaping of "<" and ">":
curl http://localhost:8983/solr/update?stream.body=%3Ccommit/%3E
Responses from Solr take the form shown here:
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">127</int>
</lst>
</response>
The status field will be non-zero in case of failure. The servlet container will generate an appropriate
HTML-formatted message in the case of an error at the HTTP layer.
Using XSLT to Transform XML Index Updates
The UpdateRequestHandler allows you to index any arbitrary XML using the <tr> parameter to apply an XSL transformation. You must have an XSLT stylesheet in the solr/conf/xslt directory that can transform the incoming data to the expected <add><doc/></add> format, and use the tr parameter to specify the name of that stylesheet.
Here is an example XSLT stylesheet:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:template match="/">
<add>
<xsl:apply-templates select="/random/document"/>
</add>
</xsl:template>
<xsl:template match="document">
<doc boost="5.5">
<xsl:apply-templates select="*"/>
</doc>
</xsl:template>
<xsl:template match="node">
<field name="{@name}">
<xsl:if test="@enhance!=''">
<xsl:attribute name="boost"><xsl:value-of select="@enhance"/></xsl:attribute>
</xsl:if>
<xsl:value-of select="@value"/>
</field>
</xsl:template>
</xsl:stylesheet>
This stylesheet transforms Solr's XML search result format into Solr's Update XML syntax. One example usage is to copy a Solr 1.3 index (which does not have the CSV response writer) into a format which can be indexed into another Solr file (provided that all fields are stored):
http://localhost:8983/solr/select?q=*:*&wt=xslt&tr=updateXml.xsl&rows=1000
You can also use the stylesheet in XsltUpdateRequestHandler to transform an index when updating:
curl "http://localhost:8983/solr/update?commit=true&tr=updateXml.xsl" -H
"Content-Type: text/xml" --data-binary @myexporteddata.xml
For more information about the XML Update Request Handler, see https://wiki.apache.org/solr/UpdateXmlMessages.
JSON Formatted Index Updates
JSON formatted update requests may be sent to Solr's /update handler using Content-Type: application/json or text/json.
JSON formatted updates can take 3 basic forms, described in depth below:
A sequence of update commands, expressed as a top level JSON Object (aka: Map)
A list of documents to add, expressed as a top level JSON Array containing a JSON Object per document
A single document to add, expressed as a top level JSON Object. To differentiate this from a set of
commands, the json.command=false request parameter is required.
Adding a Single JSON Document
The simplest way to add documents via JSON is to send each document individually as a JSON Object, using the
json.command=false request parameter:
curl -X POST -H 'Content-Type: application/json'
'http://localhost:8983/solr/collection1/update?json.command=false' --data-binary '
{
"id": "1",
"title": "Doc 1"
}'
Adding Multiple JSON Documents
Adding multiple documents at one time via JSON can be done via a JSON Array of JSON Objects, where each
object represents a document:
curl -X POST -H 'Content-Type: application/json'
'http://localhost:8983/solr/collection1/update' --data-binary '
[
{
"id": "1",
"title": "Doc 1"
},
{
"id": "2",
"title": "Doc 2"
}
]'
A sample JSON file is provided at example/exampledocs/books.json that you can use to add some
documents to the Solr example server using an Array of objects:
cd example/exampledocs
curl 'http://localhost:8983/solr/collection1/update?commit=true'
--data-binary @books.json -H 'Content-type:application/json'
Sending Arbitrary JSON Update Commands
In general, the JSON update syntax accepts all of the update commands that the XML update handler
supports, through a straightforward mapping. Multiple commands, adding and deleting documents, may be
contained in one message:
curl -X POST -H 'Content-Type: application/json'
'http://localhost:8983/solr/collection1/update' --data-binary '
{
"add": {
"doc": {
"id": "DOC1",
"my_boosted_field": { /* use a map with boost/value for a boosted field
*/
"boost": 2.3,
"value": "test"
},
"my_multivalued_field": [ "aaa", "bbb" ] /* Can use an array for a
multi-valued field */
}
},
"add": {
"commitWithin": 5000, /* commit this document within 5 seconds */
"overwrite": false, /* don't check for existing documents with the same
uniqueKey */
"boost": 3.45, /* a document boost */
"doc": {
"f1": "v1", /* Can use repeated keys for a multi-valued field
*/
"f1": "v2"
}
},
"commit": {},
"optimize": { "waitSearcher":false },
"delete": { "id":"ID" }, /* delete by ID */
"delete": { "query":"QUERY" } /* delete by query */
}'
As with other update handlers, parameters such as commit, commitWithin, optimize, and overwrite may be
specified in the URL instead of in the body of the message.
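For instance, the following request (a minimal sketch with hypothetical document IDs) passes commitWithin and overwrite in the URL while the body contains only the documents:
curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/collection1/update?commitWithin=5000&overwrite=true' \
  --data-binary '[{"id": "5", "title": "Doc 5"}, {"id": "6", "title": "Doc 6"}]'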
The JSON update format allows for a simple delete-by-id. The value of a delete can be an array which contains a
list of zero or more specific document IDs (not a range) to be deleted. For example:
{ "delete":"myid" }
{ "delete":["id1","id2"] }
Comments are not allowed in JSON, but duplicate names are. The comments in the example above are for
illustrative purposes only, and cannot be included in actual commands sent to Solr.
You can also specify a _version_ with each "delete":
{
  "delete": { "id": 50, "_version_": 12345 }
}
You can specify the version of deletes in the body of the update request as well.
JSON Update Convenience Paths
In addition to the handler, there are a few additional JSON specific request handler paths available by/update
default in Solr, that implicitly override the behavior of some request parameters:
Path Default Parameters
/update/json stream.contentType=application/json
/update/json/docs stream.contentType=application/json
json.command=false
The /update/json path may be useful for clients sending JSON formatted update commands from applications
where setting the Content-Type proves difficult, while the /update/json/docs path can be particularly convenient
for clients that always want to send in documents, either individually or as a list, without needing to worry about
the full JSON command syntax.
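For example, a single document can be posted straight to the /update/json/docs path with no command wrapper (a minimal sketch using the collection1 example core and a hypothetical document):
curl 'http://localhost:8983/solr/collection1/update/json/docs?commit=true' \
  -H 'Content-type:application/json' --data-binary '
{
  "id": "42",
  "title": "Posted straight to /update/json/docs"
}'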
For more information about the JSON Update Request Handler, see https://wiki.apache.org/solr/UpdateJSON.
CSV Formatted Index Updates
CSV formatted update requests may be sent to Solr's /update handler using a Content-Type of
"application/csv" or "text/csv".
A sample CSV file is provided at example/exampledocs/books.csv that you can use to add some documents
to the Solr example server:
cd example/exampledocs
curl 'http://localhost:8983/solr/collection1/update?commit=true'
--data-binary @books.csv -H 'Content-type:application/csv'
CSV Update Parameters
The CSV handler allows the specification of many parameters in the URL in the form:
f.parameter.optional_fieldname=value.
The table below describes the parameters for the update handler.
Parameter    Usage    Global (g) or Per Field (f)    Example
separator Character used as field separator; default is "," g,(f:
see
split)
separator=%
trim If true, remove leading and trailing whitespace from
values. Default=false.
g,f f.isbn.trim=true
trim=false
header Set to true if first line of input contains field names. These will be used if the
field_name parameter is absent.
g
field_name Comma separated list of field names to use when
adding documents.
g field_name=isbn,price,title
literal.<field_name> Comma separated list of field names to use when
processing literal values.
g literal.color=red,blue,black
skip Comma separated list of field names to skip. g skip=uninteresting,shoesize
skipLines Number of lines to discard in the input stream
before the CSV data starts, including the header, if
present. Default=0.
g skipLines=5
encapsulator The character optionally used to surround values to
preserve characters such as the CSV separator or
whitespace. This standard CSV format handles the
encapsulator itself appearing in an encapsulated
value by doubling the encapsulator.
g,(f:
see
split)
encapsulator="
escape The character used for escaping CSV separators or
other reserved characters. If an escape is
specified, the encapsulator is not used unless also
explicitly specified since most formats use either
encapsulation or escaping, not both
g escape=\
keepEmpty Keep and index zero length (empty) fields.
Default=false.
g,f f.price.keepEmpty=true
map Map one value to another. Format is
value:replacement (which can be empty.)
g,f map=left:right
f.subject.map=history:bunk
split If true, split a field into multiple values by a
separate parser.
f
overwrite If true (the default), check for and overwrite
duplicate documents, based on the uniqueKey field
declared in the Solr schema. If you know the
documents you are indexing do not contain any
duplicates then you may see a considerable speed
up setting this to false.
g
commit Issues a commit after the data has been ingested. g
commitWithin Add the document within the specified number of
milliseconds.
g commitWithin=10000
rowid Map the rowid (line number) to a field specified by
the value of the parameter, for instance if your CSV
doesn't have a unique key and you want to use the
row id as such.
g rowid=id
rowidOffset Add the given offset (as an int) to the rowid before
adding it to the document. Default is 0
g rowidOffset=10
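For example, combining several of the global and per-field parameters above, the following request (a minimal sketch; books.tsv is a hypothetical tab-separated file with a header line) loads tab-delimited data while trimming whitespace from the isbn field:
curl 'http://localhost:8983/solr/collection1/update?commit=true&separator=%09&header=true&f.isbn.trim=true' \
  --data-binary @books.tsv -H 'Content-type:application/csv'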
CSV Update Convenience Paths
In addition to the /update handler, there is an additional CSV-specific request handler path available by default in
Solr, which implicitly overrides the behavior of some request parameters:
Path Default Parameters
/update/csv stream.contentType=application/csv
The /update/csv path may be useful for clients sending CSV formatted update commands from applications
where setting the Content-Type proves difficult.
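For example, the sample file used earlier can be posted to this path with a generic content type, since the default stream.contentType shown above tells Solr to treat the body as CSV (a minimal sketch using the collection1 example core):
cd example/exampledocs
curl 'http://localhost:8983/solr/collection1/update/csv?commit=true' \
  --data-binary @books.csv -H 'Content-type:text/plain;charset=utf-8'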
For more information on the CSV Update Request Handler, see https://wiki.apache.org/solr/UpdateCSV.
Nested Child Documents
Solr indexes nested documents using a "Block Join" as a way to model documents containing other
documents, such as a blog post parent document and comments as child documents, or products as parent
documents and sizes, colors, or other variations as child documents. At query time, the Block Join Query Parsers
can be used to search against these relationships. In terms of performance, indexing the relationships between
documents may be more efficient than attempting to do joins only at query time, since the relationships are already
stored in the index and do not need to be computed.
Nested documents may be indexed via either the XML or JSON data syntax (or using SolrJ), but regardless of
syntax, you must include a field that identifies the parent document as a parent; it can be any field that suits this
purpose, and it will be used as input for the block join query parsers.
XML Examples
For example, here are two documents and their child documents:
<add>
<doc>
<field name="id">1</field>
<field name="title">Solr adds block join support</field>
<field name="content_type">parentDocument</field>
<doc>
<field name="id">2</field>
<field name="comments">SolrCloud supports it too!</field>
</doc>
</doc>
<doc>
<field name="id">3</field>
<field name="title">Lucene and Solr 4.5 is out</field>
<field name="content_type">parentDocument</field>
<doc>
<field name="id">4</field>
<field name="comments">Lots of new features</field>
</doc>
</doc>
</add>
In this example, we have indexed the parent documents with the field content_type, which has the value
"parentDocument". We could have also used a boolean field, such as isParent, with a value of "true", or any other
similar approach.
JSON Examples
This example is equivalent to the XML example above; note the special _childDocuments_ key needed to indicate
the nested documents in JSON.
[
{
"id": "1",
"title": "Solr adds block join support",
"content_type": "parentDocument",
"_childDocuments_": [
{
"id": "2",
"comments": "SolrCloud supports it too!"
}
]
},
{
"id": "3",
"title": "Lucene and Solr 4.5 is out",
"content_type": "parentDocument",
"_childDocuments_": [
{
"id": "4",
"comments": "Lots of new features"
}
]
}
]
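Once indexed this way, the relationships can be searched with the block join query parsers mentioned above. The request below is a minimal sketch (the parser syntax is covered in the Searching section); it asks for parent documents whose child comments mention SolrCloud:
curl 'http://localhost:8983/solr/collection1/select' \
  --data-urlencode 'q={!parent which="content_type:parentDocument"}comments:SolrCloud'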
Transforming and Indexing custom JSON data
This feature helps index arbitrary JSON into valid Solr documents according to the user's configuration. It lets the user split a
single JSON file into one or more Solr documents. The fields of the final indexed documents can be controlled using the mapping
passed along with the request. One or more valid JSON documents can be sent to the /update/json/docs path with the
configuration parameters.
Mapping params
split: This parameter is required if you wish to transform the input JSON. It is the path at which the JSON
must be split. If the entire JSON makes a single Solr document, the path must be "/".
f: This is a multivalued mapping parameter. At least one field mapping must be provided. The format of the
parameter is {target-field-name}:{json-path}. The json-path part is required; target-field-name is the name
of the field in the resulting Solr document, is optional, and is automatically derived from the input JSON if omitted.
echo: This is for debugging purposes only. Set it to true if you want the docs to be returned as a response;
nothing will be indexed.
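As a quick illustration of the echo parameter, a request like the following (a minimal sketch with a hypothetical two-field document) returns the transformed document in the response without indexing anything:
curl 'http://localhost:8983/solr/collection1/update/json/docs?split=/&f=first:/first&f=last:/last&echo=true' \
  -H 'Content-type:application/json' -d '{"first": "John", "last": "Doe"}'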
example 1:
curl 'http://localhost:8983/solr/collection1/update/json/docs'\
'?split=/exams'\
'&f=first:/first'\
'&f=last:/last'\
'&f=grade:/grade'\
'&f=subject:/exams/subject'\
'&f=test:/exams/test'\
'&f=marks:/exams/marks'\
-H 'Content-type:application/json' -d '
{
"first": "John",
"last": "Doe",
"grade": 8,
"exams": [
{
"subject": "Maths",
"test" : "term1",
"marks" : 90},
{
"subject": "Biology",
"test" : "term1",
"marks" : 86}
]
}'
This indexes the following two documents:
{
"first":"John",
"last":"Doe",
"marks":90,
"test":"term1",
"subject":"Maths",
"grade":8
}
{
"first":"John",
"last":"Doe",
"marks":86,
"test":"term1",
"subject":"Biology",
"grade":8
}
As the final field names are the same as the input document fields, the request can be simplified as follows.
example 2:
curl 'http://localhost:8983/solr/collection1/update/json/docs'\
'?split=/exams'\
'&f=/first'\
'&f=/last'\
'&f=/grade'\
'&f=/exams/subject'\
'&f=/exams/test'\
'&f=/exams/marks'\
-H 'Content-type:application/json' -d '
{
"first": "John",
"last": "Doe",
"grade": 8,
"exams": [
{
"subject": "Maths",
"test" : "term1",
"marks" : 90},
{
"subject": "Biology",
"test" : "term1",
"marks" : 86}
]
}'
Wildcards
Instead of specifying all the field names explicitly, it is possible to specify wildcards to map fields automatically.
There are two restrictions: wildcards can only be used at the end of the json-path; and the split path cannot use
wildcards. A single asterisk "*" maps only to direct children, and a double asterisk "**" maps recursively to all
descendants. The following are example wildcard path mappings:
f=/docs/* : maps all the fields directly under /docs, using the field names as given in the JSON
f=/docs/** : maps all the fields under /docs and its descendants, using the field names as given in the JSON
f=searchField:/docs/* : maps all fields directly under /docs to a single field called 'searchField'
f=searchField:/docs/** : maps all fields under /docs and its descendants to searchField
With wildcards we can simplify our previous example as follows.
example 3:
curl 'http://localhost:8983/solr/collection1/update/json/docs'\
'?split=/exams'\
'&f=/**'\
-H 'Content-type:application/json' -d '
{
"first": "John",
"last": "Doe",
"grade": 8,
"exams": [
{
"subject": "Maths",
"test" : "term1",
"marks" : 90},
{
"subject": "Biology",
"test" : "term1",
"marks" : 86}
]
}'
It is also possible to send all the values to a single field and do a full-text search on that field. This is a good
option for blindly indexing and querying JSON documents without worrying about fields and schema.
example 4:
curl 'http://localhost:8983/solr/collection1/update/json/docs'\
'?split=/'\
'&f=txt:/**'\
-H 'Content-type:application/json' -d '
{
"first": "John",
"last": "Doe",
"grade": 8,
"exams": [
{
"subject": "Maths",
"test" : "term1",
"marks" : 90},
{
"subject": "Biology",
"test" : "term1",
"marks" : 86}
]
}'
Uploading Data with Solr Cell using Apache Tika
Solr uses code from the Apache Tika project to provide a
framework for incorporating many different file-format parsers,
such as Apache PDFBox and Apache POI, into Solr itself.
Working with this framework, Solr's ExtractingRequestHandler
can use Tika to support uploading binary files,
including files in popular formats such as Word and PDF, for
data extraction and indexing.
When this framework was under development, it was called
the Solr Content Extraction Library or CEL; from that
abbreviation came this framework's name: Solr Cell.
If you want to supply your own ContentHandler for Solr to
use, you can extend the ExtractingRequestHandler and
override the createFactory() method. This factory is
responsible for constructing the SolrContentHandler that
interacts with Tika, and allows literals to override Tika-parsed
values. Set the literalsOverride parameter, which
normally defaults to true, to false to append Tika-parsed
values to literal values.
For more information on Solr's Extracting Request Handler,
see https://wiki.apache.org/solr/ExtractingRequestHandler.
Topics covered in this section:
Key Concepts
Trying out Tika with the Solr Example Directory
Input Parameters
Order of Operations
Configuring the Solr ExtractingRequestHandler
Indexing Encrypted Documents with the ExtractingUpdateRequestHandler
Examples
Sending Documents to Solr with a POST
Sending Documents to Solr with Solr Cell and SolrJ
Related Topics
Key Concepts
When using the Solr Cell framework, it is helpful to keep the following in mind:
Tika will automatically attempt to determine the input document type (Word, PDF, HTML) and extract the
content appropriately. If you like, you can explicitly specify a MIME type for Tika with the stream.type parameter.
Tika works by producing an XHTML stream that it feeds to a SAX ContentHandler. SAX is a common
interface implemented for many different XML parsers. For more information, see http://www.saxproject.org/quickstart.html.
Solr then responds to Tika's SAX events and creates the fields to index.
Tika produces metadata such as Title, Subject, and Author according to specifications such as the
DublinCore. See http://tika.apache.org/1.5/formats.html for the file types supported.
Tika adds all the extracted text to the content field. This field is defined as "stored" in schema.xml. It is
also copied to the text field with a copyField rule.
You can map Tika's metadata fields to Solr fields. You can also boost these fields.
You can pass in literals for field values. Literals will override Tika-parsed values, including fields in the Tika
metadata object, the Tika content field, and any "captured content" fields.
You can apply an XPath expression to the Tika XHTML to restrict the content that is produced.
As of version 4.8, Solr uses Apache Tika v1.5.
Trying out Tika with the Solr Example Directory
You can try out the Tika framework using the example application included in Solr.
Start the Solr example server:
cd example
java -jar start.jar
In a separate window go to the docs/ directory (which contains some nice example docs), or the site directory if
you built Solr from source, and send Solr a file via HTTP POST:
curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true' -F
"myfile=@tutorial.html"
The URL above calls the Extraction Request Handler, uploads the file tutorial.html, and assigns it the unique ID
doc1. Here's a closer look at the components of this command:
The literal.id=doc1 parameter provides the necessary unique ID for the document being indexed.
The commit=true parameter causes Solr to perform a commit after indexing the document, making it
immediately searchable. For optimum performance when loading many documents, don't call the commit
command until you are done.
The -F flag instructs curl to POST data using the Content-Type multipart/form-data and supports the
uploading of binary files. The @ symbol instructs curl to upload the attached file.
The argument myfile=@tutorial.html needs a valid path, which can be absolute or relative (for
example, myfile=@../../site/tutorial.html if you are still in the exampledocs directory).
Now you should be able to execute a query and find that document (open the following link in your browser):
http://localhost:8983/solr/select?q=tutorial.
You may notice that although you can search on any of the text in the sample document, you may not be able to see
that text when the document is retrieved. This is simply because the "content" field generated by Tika is mapped to
the Solr field called text, which is indexed but not stored. This operation is controlled by the default map rule in the
/update/extract handler in solrconfig.xml, and its behavior can be easily changed or overridden. For
example, to store and see all metadata and content, execute the following:
curl
'http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=
attr_content&commit=true' -F "myfile=@tutorial.html"
In this command, the uprefix=attr_ parameter causes all generated fields that aren't defined in the schema to
be prefixed with attr_, which is a dynamic field that is stored.
The fmap.content=attr_content parameter overrides the default fmap.content=text, causing the content
to be added to the attr_content field instead.
While Apache Tika is quite powerful, it is not perfect and fails on some files. PDF files are particularly
problematic, mostly due to the PDF format itself. In case of a failure processing any file, the
ExtractingRequestHandler does not have a secondary mechanism to try to extract some text from the file; it will throw
an exception and fail.
Then run this command to query the document: http://localhost:8983/solr/select?q=attr_content:tutorial
Input Parameters
The table below describes the parameters accepted by the Extraction Request Handler.
Parameter Description
boost.<fieldname> Boosts the specified field by the defined float amount. (Boosting a field alters its
importance in a query response. To learn about boosting fields, see Searching.)
capture Captures XHTML elements with the specified name for a supplementary addition to
the Solr document. This parameter can be useful for copying chunks of the XHTML
into a separate field. For instance, it could be used to grab paragraphs (<p>) and
index them into a separate field. Note that content is still also captured into the overall
"content" field.
captureAttr Indexes attributes of the Tika XHTML elements into separate fields, named after the
element. If set to true, for example, when extracting from HTML, Tika can return the
href attributes in <a> tags as fields named "a". See the examples below.
commitWithin Add the document within the specified number of milliseconds.
date.formats Defines the date format patterns to identify in the documents.
defaultField If the uprefix parameter (see below) is not specified and a field cannot be determined,
the default field will be used.
extractOnly Default is false. If true, returns the extracted content from Tika without indexing the
document. This literally includes the extracted XHTML as a string in the response.
When viewing manually, it may be useful to use a response format other than XML to
aid in viewing the embedded XHTML tags. For an example, see
http://wiki.apache.org/solr/TikaExtractOnlyExampleOutput.
extractFormat Default is "xml", but the other option is "text". Controls the serialization format of the
extracted content. The xml format is actually XHTML, the same format that results from
passing the -x command to the Tika command line application, while the text format
is like that produced by Tika's -t command. This parameter is valid only if extractOnly
is set to true.
fmap.<source_field> Maps (moves) one field name to another. The source_field must be a field in
incoming documents, and the value is the Solr field to map to. Example:
fmap.content=text causes the data in the content field generated by Tika to be moved to
Solr's text field.
literal.<fieldname> Populates a field with the name supplied with the specified value for each document.
The data can be multivalued if the field is multivalued.
literalsOverride If true (the default), literal field values will override other values with the same field
name. If false, literal values defined with literal.<fieldname> will be appended
to data already in the fields extracted from Tika. If setting literalsOverride to
"false", the field must be multivalued.
lowernames Values are "true" or "false". If true, all field names will be mapped to lowercase with
underscores, if needed. For example, "Content-Type" would be mapped to
"content_type."
multipartUploadLimitInKB Useful if uploading very large documents, this defines the KB size of documents to
allow.
passwordsFile Defines a file path and name for a file of file name to password mappings.
resource.name Specifies the optional name of the file. Tika can use it as a hint for detecting a file's
MIME type.
resource.password Defines a password to use for a password-protected PDF or OOXML file
tika.config Defines a file path and name to a customized Tika configuration file. This is only
required if you have customized your Tika implementation.
uprefix Prefixes all fields that are not defined in the schema with the given prefix. This is very
useful when combined with dynamic field definitions. Example: uprefix=ignored_
would effectively ignore all unknown fields generated by Tika given the example
schema contains <dynamicField name="ignored_*" type="ignored"/>
xpath When extracting, only return Tika XHTML content that satisfies the given XPath
expression. See http://tika.apache.org/1.5/index.html for details on the format of Tika
XHTML. See also http://wiki.apache.org/solr/TikaExtractOnlyExampleOutput.
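As an illustration of the extractOnly and extractFormat parameters above, the following request (a minimal sketch reusing the tutorial.html file from the example directory) returns Tika's plain-text extraction in the response without indexing anything:
curl "http://localhost:8983/solr/update/extract?extractOnly=true&extractFormat=text" \
  --data-binary @tutorial.html -H 'Content-type:text/html'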
Order of Operations
Here is the order in which the Solr Cell framework, using the Extraction Request Handler and Tika, processes its
input.
1. Tika generates fields or passes them in as literals specified by literal.<fieldname>=<value>. If
literalsOverride=false, literals will be appended as multi-value to the Tika-generated field.
2. If lowernames=true, Tika maps fields to lowercase.
3. Tika applies the mapping rules specified by fmap.<source>=<target> parameters.
4. If uprefix is specified, any unknown field names are prefixed with that value; else if defaultField is
specified, any unknown fields are copied to the default field.
Configuring the Solr ExtractingRequestHandler
If you are not working in the supplied example/solr directory, you must copy all libraries from example/solr/libs
into a libs directory within your own solr directory, or to a directory you've specified in solrconfig.xml using
the new libs directive. The ExtractingRequestHandler is not incorporated into the Solr WAR file, so you have
to install it separately.
Here is an example of configuring the ExtractingRequestHandler in solrconfig.xml.
<requestHandler name="/update/extract"
class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
<lst name="defaults">
<str name="fmap.Last-Modified">last_modified</str>
<str name="uprefix">ignored_</str>
</lst>
<!--Optional. Specify a path to a tika configuration file. See the Tika docs for
details.-->
<str name="tika.config">/my/path/to/tika.config</str>
<!-- Optional. Specify one or more date formats to parse. See
DateUtil.DEFAULT_DATE_FORMATS
for default date formats -->
<lst name="date.formats">
<str>yyyy-MM-dd</str>
</lst>
</requestHandler>
In the defaults section, we are mapping Tika's Last-Modified metadata attribute to a field named last_modified.
We are also telling it to ignore undeclared fields. These default parameters can be overridden on individual requests.
The tika.config entry points to a file containing a Tika configuration. The date.formats entry allows you to specify
various java.text.SimpleDateFormat date formats for working with transforming extracted input to a Date.
Solr comes configured with the following date formats (see the DateUtil class in Solr):
yyyy-MM-dd'T'HH:mm:ss'Z'
yyyy-MM-dd'T'HH:mm:ss
yyyy-MM-dd
yyyy-MM-dd hh:mm:ss
yyyy-MM-dd HH:mm:ss
EEE MMM d hh:mm:ss z yyyy
EEE, dd MMM yyyy HH:mm:ss zzz
EEEE, dd-MMM-yy HH:mm:ss zzz
EEE MMM d HH:mm:ss yyyy
You may also need to adjust the multipartUploadLimitInKB attribute as follows if you are submitting very large
documents.
<requestDispatcher handleSelect="true" >
<requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="20480" />
...
Multi-Core Configuration
For a multi-core configuration, specify sharedLib='lib' in the <solr/> section of solr.xml in order for Solr to
find the JAR files in example/solr/lib.
For more information about Solr cores, see The Well-Configured Solr Instance.
Indexing Encrypted Documents with the ExtractingUpdateRequestHandler
The ExtractingRequestHandler will decrypt encrypted files and index their content if you supply a password, either
in resource.password on the request or in a passwordsFile file.
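For example, a password can be supplied directly on the request (a minimal sketch; the file name and password here are hypothetical):
curl "http://localhost:8983/solr/update/extract?literal.id=doc6&resource.password=myPdfPassword&commit=true" \
  -F "myfile=@encrypted.pdf"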
In the case of passwordsFile, the file supplied must be formatted so there is one line per rule. Each rule contains
a file name regular expression, followed by "=", then the password in clear-text. Because the passwords are in
clear-text, the file should have strict access restrictions.
# This is a comment
myFileName = myPassword
.*\.docx$ = myWordPassword
.*\.pdf$ = myPdfPassword
Examples
Metadata
As mentioned before, Tika produces metadata about the document. Metadata describes different aspects of a
document, such as the author's name, the number of pages, the file size, and so on. The metadata produced
depends on the type of document submitted. For instance, PDFs have different metadata than Word documents do.
In addition to Tika's metadata, Solr adds the following metadata (defined in ExtractingMetadataConstants):
Solr Metadata Description
stream_name The name of the Content Stream as uploaded to Solr. Depending on how the file is
uploaded, this may or may not be set
stream_source_info Any source info about the stream. (See the section on Content Streams later in this
section.)
stream_size The size of the stream in bytes.
stream_content_type The content type of the stream, if available.
Examples of Uploads Using the Extraction Request Handler
Capture and Mapping
The command below captures <div> tags separately, and then maps all the instances of that field to a dynamic
field named foo_t.
curl
"http://localhost:8983/solr/update/extract?literal.id=doc2&captureAttr=true&defaultFie
ld=text&fmap.div=foo_t&capture=div" -F "tutorial=@tutorial.pdf"
Capture, Mapping, and Boosting
The command below captures <div> tags separately, maps the field to a dynamic field named foo_t, then boosts
foo_t by 3.
We recommend that you try using the extractOnly option to discover which values Solr is setting for
these metadata elements.
curl
"http://localhost:8983/solr/update/extract?literal.id=doc3&captureAttr=true&defaultFie
ld=text&capture=div&fmap.div=foo_t&boost.foo_t=3" -F "tutorial=@tutorial.pdf"
Using Literals to Define Your Own Metadata
To add in your own metadata, pass in the literal parameter along with the file:
curl
"http://localhost:8983/solr/update/extract?literal.id=doc4&captureAttr=true&defaultFie
ld=text&capture=div&fmap.div=foo_t&boost.foo_t=3&literal.blah_s=Bah" -F
"tutorial=@tutorial.pdf"
XPath
The example below passes in an XPath expression to restrict the XHTML returned by Tika:
curl
"http://localhost:8983/solr/update/extract?literal.id=doc5&captureAttr=true&defaultFie
ld=text&capture=div&fmap.div=foo_t&boost.foo_t=3&literal.id=id&xpath=/xhtml:html/xhtml
:body/xhtml:div/descendant:node()" -F "tutorial=@tutorial.pdf"
Extracting Data without Indexing It
Solr allows you to extract data without indexing. You might want to do this if you're using Solr solely as an extraction
server or if you're interested in testing Solr extraction.
The example below sets the parameter to extract data without indexing it.extractOnly=true
curl "http://localhost:8983/solr/update/extract?&extractOnly=true" --data-binary
@tutorial.html -H 'Content-type:text/html'
The output includes XML generated by Tika (and further escaped by Solr's XML response). The following request uses a
different output format to make the embedded XHTML easier to read:
curl "http://localhost:8983/solr/update/extract?&extractOnly=true&wt=ruby&indent=true"
--data-binary @tutorial.html -H 'Content-type:text/html'
Sending Documents to Solr with a POST
The example below streams the file as the body of the POST, which means Solr does not receive information
about the name of the file.
curl "http://localhost:8983/solr/update/extract?literal.id=doc5&defaultField=text"
--data-binary @tutorial.html -H 'Content-type:text/html'
Sending Documents to Solr with Solr Cell and SolrJ
SolrJ is a Java client that you can use to add documents to the index, update the index, or query the index. You'll
find more information on SolrJ in .Client APIs
Here's an example of using Solr Cell and SolrJ to add documents to a Solr index.
First, let's use SolrJ to create a new SolrServer, then we'll construct a request containing a ContentStream
(essentially a wrapper around a file) and send it to Solr:
import java.io.File;
import java.io.IOException;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.handler.extraction.ExtractingParams;

public class SolrCellRequestDemo {
  public static void main(String[] args) throws IOException, SolrServerException {
    SolrServer server = new HttpSolrServer("http://localhost:8983/solr");
    // Wrap the file in a content-stream request aimed at the Extraction Request Handler.
    ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
    req.addFile(new File("apache-solr/site/features.pdf"));
    req.setParam(ExtractingParams.EXTRACT_ONLY, "true"); // extract only, don't index
    NamedList<Object> result = server.request(req);
    System.out.println("Result: " + result);
  }
}
This operation streams the file into the Solr index.features.pdf
The sample code above calls the extract command, but you can easily substitute other commands that are
supported by Solr Cell. The key class to use is the , which makes sure theContentStreamUpdateRequest
ContentStreams are set properly. SolrJ takes care of the rest.
Note that the ContentStreamUpdateRequest is not just specific to Solr Cell. You can send CSV to the CSV
Update handler and to any other Request Handler that works with Content Streams for updates.
Related Topics
ExtractingRequestHandler
Uploading Structured Data Store Data with the Data Import Handler
Many search applications store the content to be indexed in a structured data
store, such as a relational database. The Data Import Handler (DIH) provides a
mechanism for importing content from a data store and indexing it. In addition to
relational databases, DIH can index content from HTTP based data sources such
as RSS and ATOM feeds, e-mail repositories, and structured XML where an
XPath processor is used to generate fields.
For more information about the Data Import Handler, see https://wiki.apache.org/solr/DataImportHandler.
The DataImportHandler jars are no longer included in the Solr WAR. You should add them to Solr's lib
directory, or reference them via the <lib> directive in solrconfig.xml.
Topics covered in this section:
Concepts and Terminology
Configuration
Data Import Handler Commands
Property Writer
Data Sources
Entity Processors
Transformers
Special Commands for the Data Import Handler
Concepts and Terminology
Descriptions of the Data Import Handler use several familiar terms, such as entity and processor, in specific ways,
as explained in the table below.
Term Definition
Datasource As its name suggests, a datasource defines the location of the data of interest. For a database, it's
a DSN. For an HTTP datasource, it's the base URL.
Entity Conceptually, an entity is processed to generate a set of documents, containing multiple fields,
which (after optionally being transformed in various ways) are sent to Solr for indexing. For a
RDBMS data source, an entity is a view or table, which would be processed by one or more SQL
statements to generate a set of rows (documents) with one or more columns (fields).
Processor An entity processor does the work of extracting content from a data source, transforming it, and
adding it to the index. Custom entity processors can be written to extend or replace the ones
supplied.
Transformer Each set of fields fetched by the entity may optionally be transformed. This process can modify the
fields, create new fields, or generate multiple rows/documents from a single row. There are several
built-in transformers in the DIH, which perform functions such as modifying dates and stripping
HTML. It is possible to write custom transformers using the publicly available interface.
Configuration
Configuring solrconfig.xml
The Data Import Handler has to be registered in . For example:solrconfig.xml
<requestHandler name="/dataimport"
class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">/path/to/my/DIHconfigfile.xml</str>
</lst>
</requestHandler>
The only required parameter is the config parameter, which specifies the location of the DIH configuration file that
contains specifications for the data source, how to fetch data, what data to fetch, and how to process it to generate
the Solr documents to be posted to the index.
You can have multiple DIH configuration files. Each file would require a separate definition in the solrconfig.xml
file, specifying a path to the file.
Configuring the DIH Configuration File
There is a sample DIH application distributed with Solr in the directory example/example-DIH. This accesses a
small hsqldb database. Details of how to run this example can be found in the README.txt file. The sample DIH
configuration can be found in example/example-DIH/solr/db/conf/db-data-config.xml.
An annotated configuration file, based on the sample, is shown below. It extracts fields from the four tables defining
a simple product database, with this schema. More information about the parameters and options shown here are
described in the sections following.
<dataConfig>
<!-- The first element is the dataSource, in this case an HSQLDB database.
The path to the JDBC driver and the JDBC URL and login credentials are all
specified here.
Other permissible attributes include whether or not to autocommit to Solr,the
batchsize
used in the JDBC connection, a 'readOnly' flag -->
<dataSource driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:./example-DIH/hsqldb/ex"
user="sa" />
<!-- a 'document' element follows, containing multiple 'entity' elements.
Note that 'entity' elements can be nested, and this allows the entity
relationships in the sample database to be mirrored here, so that we can
generate a denormalized Solr record which may include multiple features
for one item, for instance -->
<document>
<!-- The possible attributes for the entity element are described below.
Entity elements may contain one or more 'field' elements, which map
the data source field names to Solr fields, and optionally specify
per-field transformations -->
<!-- this entity is the 'root' entity. -->
<entity name="item" query="select * from item"
deltaQuery="select id from item where last_modified >
'${dataimporter.last_index_time}'">
<field column="NAME" name="name" />
<!-- This entity is nested and reflects the one-to-many relationship between an item
and its multiple features.
Note the use of variables; ${item.ID} is the value of the column 'ID' for the
current item
('item' referring to the entity name) -->
<entity name="feature"
query="select DESCRIPTION from FEATURE where ITEM_ID='${item.ID}'"
deltaQuery="select ITEM_ID from FEATURE where last_modified >
'${dataimporter.last_index_time}'"
parentDeltaQuery="select ID from item where ID=${feature.ITEM_ID}">
<field name="features" column="DESCRIPTION" />
</entity>
<entity name="item_category"
query="select CATEGORY_ID from item_category where ITEM_ID='${item.ID}'"
deltaQuery="select ITEM_ID, CATEGORY_ID from item_category where
last_modified > '${dataimporter.last_index_time}'"
parentDeltaQuery="select ID from item where
ID=${item_category.ITEM_ID}">
<entity name="category"
query="select DESCRIPTION from category where ID =
'${item_category.CATEGORY_ID}'"
deltaQuery="select ID from category where last_modified >
'${dataimporter.last_index_time}'"
parentDeltaQuery="select ITEM_ID, CATEGORY_ID from item_category where
CATEGORY_ID=${category.ID}">
<field column="description" name="cat" />
</entity>
</entity>
</entity>
</document>
</dataConfig>
Datasources can still be specified in solrconfig.xml. These must be specified in the defaults section of the
handler in solrconfig.xml. However, these are not parsed until the main configuration is loaded.
The entire configuration itself can be passed as a request parameter using the dataConfig parameter rather than
using a file. When configuration errors are encountered, the error message is returned in XML format.
In Solr 4.1, a new property was added, the propertyWriter element, which allows defining the date format and
locale for use with delta queries. It also allows customizing the name and location of the properties file.
The reload-config command is still supported, which is useful for validating a new configuration file, or if you
want to specify a file, load it, and not have it reloaded again on import. If there is a mistake in the configuration XML,
a user-friendly message is returned in XML format. You can then fix the problem and do a reload-config.
Data Import Handler Commands
DIH commands are sent to Solr via an HTTP request. The following operations are supported.
Command Description
abort Aborts an ongoing operation. The URL is
http://<host>:<port>/solr/dataimport?command=abort.
delta-import For incremental imports and change detection. The command is of the form
http://<host>:<port>/solr/dataimport?command=delta-import. It supports the same clean,
commit, optimize and debug parameters as the full-import command.
full-import A Full Import operation can be started with a URL of the form
http://<host>:<port>/solr/dataimport?command=full-import. The command returns immediately. The
operation will be started in a new thread and the status attribute in the response should be
shown as busy. The operation may take some time depending on the size of the dataset. Queries
to Solr are not blocked during full-imports.
When a full-import command is executed, it stores the start time of the operation in a file
located at conf/dataimport.properties. This stored timestamp is used when a
delta-import operation is executed.
For a list of parameters that can be passed to this command, see below.
reload-config If the configuration file has been changed and you wish to reload it without restarting Solr, run
the command http://<host>:<port>/solr/dataimport?command=reload-config.
status The URL is http://<host>:<port>/solr/dataimport?command=status. It returns
statistics on the number of documents created, deleted, queries run, rows fetched, status,
and so on.
Parameters for the full-import Command
The full-import command accepts the following parameters:
Parameter Description
clean Default is true. Tells whether to clean up the index before the indexing is started.
You can also view the DIH configuration in the Solr Admin UI. There is also an interface to import content.
commit Default is true. Tells whether to commit after the operation.
debug Default is false. Runs the command in debug mode. It is used by the interactive development mode.
Note that in debug mode, documents are never committed automatically. If you want to run debug
mode and commit the results too, add commit=true as a request parameter.
entity The name of an entity directly under the <document> tag in the configuration file. Use this to
execute one or more entities selectively. Multiple "entity" parameters can be passed on to run
multiple entities at once. If nothing is passed, all entities are executed.
optimize Default is true. Tells Solr whether to optimize after the operation.
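Putting the command and its parameters together, a full import that keeps the existing index, commits at the end, and runs only a single entity might look like the following (a minimal sketch; the entity name item comes from the sample configuration above):
curl 'http://localhost:8983/solr/dataimport?command=full-import&clean=false&commit=true&entity=item'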
Property Writer
The propertyWriter element defines the date format and locale for use with delta queries. It is an optional
configuration. Add the element to the DIH configuration file, directly under the dataConfig element.
<propertyWriter dateFormat="yyyy-MM-dd HH:mm:ss" type="SimplePropertiesWriter"
directory="data" filename="my_dih.properties" locale="en_US" />
The parameters available are:
Parameter Description
dateFormat A java.text.SimpleDateFormat to use when converting the date to text. The default is "yyyy-MM-dd
HH:mm:ss".
type The implementation class. Use SimplePropertiesWriter for non-SolrCloud installations. If
using SolrCloud, use ZKPropertiesWriter. If this is not specified, it will default to the
appropriate class depending on if SolrCloud mode is enabled.
directory Used with the SimplePropertiesWriter only. The directory for the properties file. If not
specified, the default is "conf".
filename Used with the SimplePropertiesWriter only. The name of the properties file. If not specified,
the default is the requestHandler name (as defined in solrconfig.xml), appended by
".properties" (i.e., "dataimport.properties").
locale The locale. If not defined, the ROOT locale is used. It must be specified as language-country. For
example, en-US.
Data Sources
A data source specifies the origin of data and its type. Somewhat confusingly, some data sources are configured
within the associated entity processor. Data sources can also be specified in solrconfig.xml, which is useful
when you have multiple environments (for example, development, QA, and production) differing only in their data
sources.
You can create a custom data source by writing a class that extends
org.apache.solr.handler.dataimport.DataSource.
The mandatory attributes for a data source definition are its name and type. The name identifies the data source to
an Entity element.
The types of data sources available are described below.
ContentStreamDataSource
This takes the POST data as the data source. This can be used with any EntityProcessor that uses a
DataSource<Reader>.
FieldReaderDataSource
This can be used where a database field contains XML which you wish to process using the XPathEntityProcessor.
You would set up a configuration with both JDBC and FieldReader data sources, and two entities, as follows:
<dataSource name="a1" driver="org.hsqldb.jdbcDriver" ... />
<dataSource name="a2" type=FieldReaderDataSource" />
<document>
<!-- processor for database -->
<entity name ="e1" dataSource="a1" processor="SQLEntityProcessor" pk="docid"
query="select * from t1 ...">
<!-- nested XpathEntity; the field in the parent which is to be used for
Xpath is set in the "datafield" attribute in place of the "url" attribute -->
<entity name="e2" dataSource="a2" processor="XPathEntityProcessor"
dataField="e1.fieldToUseForXPath">
<!-- Xpath configuration follows -->
...
</entity>
</entity>
The FieldReaderDataSource can take an encoding parameter, which will default to "UTF-8" if not specified.
FileDataSource
This can be used like a URLDataSource, but is used to fetch content from files on disk. The only difference from
URLDataSource, when accessing disk files, is how a pathname is specified.
This data source accepts these optional attributes.
Optional Attribute Description
basePath The base path relative to which the value is evaluated if it is not absolute.
encoding Defines the character encoding to use. If not defined, UTF-8 is used.
JdbcDataSource
This is the default datasource. It's used with the SQLEntityProcessor. See the example in the
FieldReaderDataSource section for details on configuration.
URLDataSource
This data source is often used with XPathEntityProcessor to fetch content from an underlying file:// or http://
location. Here's an example:
<dataSource name="a"
type="URLDataSource"
baseUrl="http://host:port/"
encoding="UTF-8"
connectionTimeout="5000"
readTimeout="10000"/>
The URLDataSource type accepts these optional parameters:
Optional
Parameter
Description
baseURL Specifies a new baseURL for pathnames. You can use this to specify host/port changes
between Dev/QA/Prod environments. Using this attribute isolates the changes to be made
to the solrconfig.xml
connectionTimeout Specifies the length of time in milliseconds after which the connection should time out. The
default value is 5000ms.
encoding By default the encoding in the response header is used. You can use this property to
override the default encoding.
readTimeout Specifies the length of time in milliseconds after which a read operation should time out. The
default value is 10000ms.
Entity Processors
Entity processors extract data, transform it, and add it to a Solr index. Examples of entities include views or tables in
a data store.
Each processor has its own set of attributes, described in its own section below. In addition, there are non-specific
attributes common to all entities which may be specified.
Attribute Use
datasource The name of a data source. Used if there are multiple data sources specified, in which
case each one must have a name.
name Required. The unique name used to identify an entity.
pk The primary key for the entity. It is optional, and required only when using delta-imports.
It has no relation to the uniqueKey defined in schema.xml but they can both be the
same. It is mandatory if you do delta-imports and then refers to the column name in
${dataimporter.delta.<column-name>} which is used as the primary key.
processor Default is SQLEntityProcessor. Required only if the datasource is not RDBMS.
onError Permissible values are (abort|skip|continue) . The default value is 'abort'. 'Skip' skips the
current document. 'Continue' ignores the error and processing continues.
preImportDeleteQuery Before a full-import command, use this query to clean up the index instead of using
'*:*'. This is honored only on an entity that is an immediate sub-child of <document>.
postImportDeleteQuery Similar to the above, but executed after the import has completed.
rootEntity By default the entities immediately under the <document> element are root entities. If this
attribute is set to false, the entity directly falling under that entity will be treated as the
root entity (and so on). For every row returned by the root entity, a document is created
in Solr.
transformer Optional. One or more transformers to be applied on this entity.
cacheImpl Optional. A class (which must implement DIHCache) to use for caching this entity when
doing lookups from an entity which wraps it. The provided implementation is
"SortedMapBackedCache".
cacheKey The name of a property of this entity to use as a cache key if cacheImpl is specified.
cacheLookup An entity + property name that will be used to look up cached instances of this entity if
cacheImpl is specified.
Caching of entities in DIH is provided to avoid repeated lookups for the same entities again and again. The default
SortedMapBackedCache is a HashMap where a key is a field in the row and the value is a bunch of rows for that same
key.
In the example below, each manufacturer entity is cached using the 'id' property as a cache key. Cache lookups
will be performed for each product entity based on the product's "manu" property. When the cache has no data for
a particular key, the query is run and the cache is populated.
<entity name="product" query="select description,sku, manu from product" >
<entity name="manufacturer" query="select id, name from manufacturer" cacheKey="id"
cacheLookup="product.manu" cacheImpl="SortedMapBackedCache"/>
</entity>
The SQL Entity Processor
The SqlEntityProcessor is the default processor. The associated data source should be a JDBC URL.
The entity attributes specific to this processor are shown in the table below.
Attribute Use
query Required. The SQL query used to select rows.
deltaQuery SQL query used if the operation is delta-import. This query selects the primary keys of the
rows which will be parts of the delta-update. The pks will be available to the deltaImportQuery
through the variable ${dataimporter.delta.<column-name>}.
parentDeltaQuery SQL query used if the operation is delta-import.
deletedPkQuery SQL query used if the operation is delta-import.
deltaImportQuery SQL query used if the operation is delta-import. If this is not present, DIH tries to construct
the import query by (after identifying the delta) modifying the 'query' (this is error prone).
There is a namespace ${dataimporter.delta.<column-name>} which can be used in
this query. For example, select * from tbl where id=${dataimporter.delta.id}.
The XPathEntityProcessor
This processor is used when indexing XML formatted data. The data source is typically URLDataSource or
FileDataSource. XPath can also be used with the FileListEntityProcessor described below, to generate a document from
each file.
The entity attributes unique to this processor are shown below.
Attribute Use
Processor Required. Must be set to "XpathEntityProcessor".
url Required. HTTP URL or file location.
stream Optional: Set to true for a large file or download.
forEach Required unless you define useSolrAddSchema. The XPath expression which
demarcates each record. This will be used to set up the processing loop.
xsl Optional: Its value (a URL or filesystem path) is the name of a resource used as a
preprocessor for applying the XSL transformation.
useSolrAddSchema Set this to true if the content is in the form of the standard Solr update XML schema.
flatten Optional: If set true, then text from under all the tags is extracted into one field.
Each field element in the entity can have the following attributes as well as the default ones.
Attribute Use
xpath Required. The XPath expression which will extract the content from the record for this field. Only
a subset of Xpath syntax is supported.
commonField Optional. If true, then when this field is encountered in a record it will be copied to future records
when creating a Solr document.
Example:
<!-- slashdot RSS Feed -->
<dataConfig>
<dataSource type="HttpDataSource" />
<document>
<entity name="slashdot"
pk="link"
url="http://rss.slashdot.org/Slashdot/slashdot"
processor="XPathEntityProcessor"
<!-- forEach sets up a processing loop ; here there are two
expressions-->
forEach="/RDF/channel | /RDF/item"
transformer="DateFormatTransformer">
<field column="source" xpath="/RDF/channel/title" commonField="true" />
<field column="source-link" xpath="/RDF/channel/link" commonField="true"/>
<field column="subject" xpath="/RDF/channel/subject" commonField="true" />
<field column="title" xpath="/RDF/item/title" />
<field column="link" xpath="/RDF/item/link" />
<field column="description" xpath="/RDF/item/description" />
<field column="creator" xpath="/RDF/item/creator" />
<field column="item-subject" xpath="/RDF/item/subject" />
<field column="date" xpath="/RDF/item/date"
dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss" />
<field column="slash-department" xpath="/RDF/item/department" />
<field column="slash-section" xpath="/RDF/item/section" />
<field column="slash-comments" xpath="/RDF/item/comments" />
</entity>
</document>
</dataConfig>
The MailEntityProcessor
The MailEntityProcessor uses the Java Mail API to index email messages using the IMAP protocol. The
MailEntityProcessor works by connecting to a specified mailbox using a username and password, fetching the email
headers for each message, and then fetching the full email contents to construct a document (one document for
each mail message). The example-DIH/solr/mail/conf directory in Solr's example directory includes an
example mail-data-config.xml used to configure a MailEntityProcessor:
<dataConfig>
<document>
<entity processor="MailEntityProcessor"
user="email@gmail.com"
password="password"
host="imap.gmail.com"
protocol="imaps"
fetchMailsSince="2009-09-20 00:00:00"
batchSize="20"
folders="inbox"
processAttachement="false"
name="sample_entity"/>
</document>
</dataConfig>
The entity attributes unique to the MailEntityProcessor are shown below.
Attribute Use
processor Required. Must be set to "MailEntityProcessor".
user Required. Username for authenticating to the IMAP server; this is typically the email
address of the mailbox owner
password Required. Password for authenticating to the IMAP server
host Required. The IMAP server to connect to
protocol Required. The IMAP protocol to use, valid values are: imap, imaps, gimap, and gimaps
fetchMailsSince Optional. Date/time used to set a filter to import messages that occur after the specified
date; expected format is: yyyy-MM-dd HH:mm:ss
folders Required. Comma-delimited list of folder names to pull messages from, such as "inbox"
recurse Optional (default is true). Flag to indicate if the processor should recurse all child folders
when looking for messages to import
include Optional. Comma-delimited list of folder patterns to include when processing folders (can
be a literal value or regular expression)
exclude Optional. Comma-delimited list of folder patterns to exclude when processing folders (can
be a literal value or regular expression);
excluded folder patterns take precedence over include folder patterns.
processAttachement
or
processAttachments
Optional (default is true). Use Tika to process message attachments.
includeContent Optional (default is true). Include the message body when constructing Solr documents for
indexing
Importing New Emails Only
After running a full import, the MailEntityProcessor keeps track of the timestamp of the previous import so that
subsequent imports can use the fetchMailsSince filter to only pull new messages from the mail server. This occurs
automatically using the Data Import Handler dataimport.properties file (stored in conf). For instance, if you set
fetchMailsSince=2014-08-22 00:00:00 in your mail-data-config.xml, then all mail messages that occur after this date
will be imported on the first run of the importer. Subsequent imports will use the date of the previous import as the
fetchMailsSince filter, so that only new emails since the last import are indexed each time.
GMail Extensions
When connecting to a GMail account, you can improve the efficiency of the MailEntityProcessor by setting the
protocol to gimap or gimaps. This allows the processor to send the fetchMailsSince filter to the GMail server to
have the date filter applied on the server, which means the processor only receives new messages from the server.
However, GMail only supports date granularity, so the server-side filter may return previously seen messages if run
more than once a day.
The TikaEntityProcessor
The TikaEntityProcessor uses Apache Tika to process incoming documents. This is similar to Uploading Data with
Solr Cell using Apache Tika, but using the DataImportHandler options instead.
The example-DIH directory in Solr's example directory shows one option for using the TikaEntityProcessor. Here
is the sample data-config.xml file:
<dataConfig>
<dataSource type="BinFileDataSource" />
<document>
<entity name="tika-test" processor="TikaEntityProcessor"
url="../contrib/extraction/src/test-files/extraction/solr-word.pdf"
format="text">
<field column="Author" name="author" meta="true"/>
<field column="title" name="title" meta="true"/>
<field column="text" name="text"/>
</entity>
</document>
</dataConfig>
The parameters for this processor are described in the table below:
Attribute Use
dataSource This parameter defines the data source and an optional name which can be referred to in later
parts of the configuration if needed. This is the same dataSource explained in the description of
general entity processor attributes above.
The available data source types for this processor are:
BinURLDataSource: used for HTTP resources, but can also be used for files.
BinContentStreamDataSource: used for uploading content as a stream.
BinFileDataSource: used for content on the local filesystem.
url The path to the source file(s), as a file path or a traditional internet URL. This parameter is required.
htmlMapper Allows control of how Tika parses HTML. The "default" mapper strips much of the HTML from
documents while the "identity" mapper passes all HTML as-is with no modifications. If this
parameter is defined, it must be either default or identity; if it is absent, "default" is assumed.
format The output format. The options are text, xml, html, or none. The default is "text" if not defined. The
format "none" can be used if only metadata should be indexed and not the body of the documents.
parser The default parser is org.apache.tika.parser.AutoDetectParser. If a custom or other
parser should be used, it should be entered as a fully-qualified name of the class and path.
fields The list of fields from the input documents and how they should be mapped to Solr fields. If the
meta attribute is defined as "true", the field will be obtained from the metadata of the document and
not parsed from the body of the main text.
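As a hedged illustration (the file path and Solr field names are hypothetical, not part of the shipped example), the optional htmlMapper and format attributes can be combined when indexing a local HTML file while preserving its markup:
<dataConfig>
  <dataSource type="BinFileDataSource" />
  <document>
    <!-- keep the HTML markup as-is and also capture Tika metadata -->
    <entity name="tika-html" processor="TikaEntityProcessor"
            url="/data/docs/report.html"
            htmlMapper="identity"
            format="html">
      <field column="Author" name="author" meta="true"/>
      <field column="title" name="title" meta="true"/>
      <field column="text" name="text"/>
    </entity>
  </document>
</dataConfig>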
The FileListEntityProcessor
This processor is basically a wrapper, and is designed to generate a set of files satisfying conditions specified in the
attributes which can then be passed to another processor, such as the XPathEntityProcessor. The entity information
for this processor would be nested within the FileListEntity entry. It generates four implicit fields: fileAbsolutePath,
fileSize, fileLastModified, and fileName, which can be used in the nested processor. This processor does
not use a data source.
The attributes specific to this processor are described in the table below:
Attribute Use
fileName Required. A regular expression pattern to identify files to be included.
basedir Required. The base directory (absolute path).
recursive Whether to search directories recursively. Default is 'false'.
excludes A regular expression pattern to identify files which will be excluded.
newerThan A date in the format yyyy-MM-ddHH:mm:ss or a date math expression (NOW - 2YEARS).
olderThan A date, using the same formats as newerThan.
rootEntity This should be set to false. This ensures that each row (filepath) emitted by this processor is
considered to be a document.
dataSource Must be set to null.
The example below shows the combination of the FileListEntityProcessor with another processor which will generate
a set of fields from each file found.
<dataConfig>
<dataSource type="FileDataSource"/>
<document>
<!-- this outer processor generates a list of files satisfying the conditions
specified in the attributes -->
<entity name="f" processor="FileListEntityProcessor"
fileName=".*xml"
newerThan="'NOW-30DAYS'"
recursive="true"
rootEntity="false"
dataSource="null"
baseDir="/my/document/directory">
<!-- this processor extracts content using Xpath from each file found -->
<entity name="nested" processor="XPathEntityProcessor"
forEach="/rootelement" url="${f.fileAbsolutePath}" >
<field column="name" xpath="/rootelement/name"/>
<field column="number" xpath="/rootelement/number"/>
</entity>
</entity>
</document>
</dataConfig>
LineEntityProcessor
This EntityProcessor reads all content from the data source on a line by line basis and returns a field called
rawLine for each line read. The content is not parsed in any way; however, you may add transformers to
manipulate the data within the rawLine field, or to create other additional fields.
The lines read can be filtered by two regular expressions specified with the acceptLineRegex and
omitLineRegex attributes. The table below describes the LineEntityProcessor's attributes:
Attribute Description
url A required attribute that specifies the location of the input file in a way that is compatible with
the configured data source. If this value is relative and you are using FileDataSource or
URLDataSource, it is assumed to be relative to baseLoc.
acceptLineRegex An optional attribute that if present discards any line which does not match the regExp.
omitLineRegex An optional attribute that is applied after any acceptLineRegex and that discards any line
which matches this regExp.
For example:
<entity name="jc"
processor="LineEntityProcessor"
acceptLineRegex="^.*\.xml$"
omitLineRegex="/obsolete"
url="file:///Volumes/ts/files.lis"
rootEntity="false"
dataSource="myURIreader1"
transformer="RegexTransformer,DateFormatTransformer">
...
While there are use cases where you might need to create a Solr document for each line read from a file, it is
expected that in most cases the lines read by this processor will consist of a pathname, which in turn will be
consumed by another EntityProcessor, such as XPathEntityProcessor.
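A hedged sketch of that pattern (the file names, data source names, and XPath expressions are illustrative): the outer LineEntityProcessor emits one pathname per line, and the nested XPathEntityProcessor opens each path through the rawLine variable:
<dataConfig>
  <dataSource name="fileList" type="FileDataSource"/>
  <dataSource name="files" type="FileDataSource"/>
  <document>
    <!-- outer entity: one row per pathname listed in files.lis -->
    <entity name="lines" processor="LineEntityProcessor"
            url="/data/files.lis"
            dataSource="fileList"
            rootEntity="false">
      <!-- inner entity: parse the XML file named by each rawLine value -->
      <entity name="record" processor="XPathEntityProcessor"
              url="${lines.rawLine}"
              dataSource="files"
              forEach="/record">
        <field column="id" xpath="/record/id"/>
        <field column="title" xpath="/record/title"/>
      </entity>
    </entity>
  </document>
</dataConfig>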
PlainTextEntityProcessor
This EntityProcessor reads all content from the data source into a single implicit field called plainText. The
content is not parsed in any way; however, you may add transformers to manipulate the data within the plainText
field as needed, or to create other additional fields.
For example:
<entity processor="PlainTextEntityProcessor" name="x" url="http://abc.com/a.txt"
dataSource="data-source-name">
<!-- copies the text to a field called 'text' in Solr-->
<field column="plainText" name="text"/>
</entity>
Ensure that the dataSource is of type DataSource<Reader> (FileDataSource, URLDataSource).
Transformers
Transformers manipulate the fields in a document returned by an entity. A transformer can create new fields or
modify existing ones. You must tell the entity which transformers your import operation will be using, by adding an
attribute containing a comma-separated list to the <entity> element.
<entity name="abcde" transformer="org.apache.solr....,my.own.transformer,..." />
Specific transformation rules are then added to the attributes of a <field> element, as shown in the examples
below. The transformers are applied in the order in which they are specified in the transformer attribute.
The Data Import Handler contains several built-in transformers. You can also write your own custom transformers,
as described in the Solr Wiki (see http://wiki.apache.org/solr/DIHCustomTransformer). The ScriptTransformer
(described below) offers an alternative method for writing your own transformers.
Solr includes the following built-in transformers:
Transformer Name Use
ClobTransformer Used to create a String out of a Clob type in a database.
DateFormatTransformer Parse date/time instances.
HTMLStripTransformer Strip HTML from a field.
LogTransformer Used to log data to log files or a console.
NumberFormatTransformer Uses the NumberFormat class in Java to parse a string into a number.
RegexTransformer Use regular expressions to manipulate fields.
ScriptTransformer Write transformers in Javascript or any other scripting language supported by Java.
TemplateTransformer Transform a field using a template.
These transformers are described below.
ClobTransformer
You can use the ClobTransformer to create a string out of a CLOB in a database. A CLOB is a character large
object: a collection of character data typically stored in a separate location that is referenced in the database. See
http://en.wikipedia.org/wiki/Character_large_object. Here's an example of invoking the ClobTransformer.
<entity name="e" transformer="ClobTransformer" ...>
<field column="hugeTextField" clob="true" />
...
</entity>
The ClobTransformer accepts these attributes:
Attribute Description
clob Boolean value to signal if ClobTransformer should process this field or not. If this attribute is
omitted, then the corresponding field is not transformed.
sourceColName The source column to be used as input. If this is absent, the source and target are the same.
The DateFormatTransformer
This transformer converts dates from one format to another. This would be useful, for example, in a situation where
you wanted to convert a field with a fully specified date/time into a less precise date format, for use in faceting.
DateFormatTransformer applies only to fields with a dateTimeFormat attribute. Other fields are not modified.
This transformer recognizes the following attributes:
Attribute Description
dateTimeFormat The format used for parsing this field. This must comply with the syntax of the Java
SimpleDateFormat class.
sourceColName The column on which the dateFormat is to be applied. If this is absent, the source and target are
the same.
locale The locale to use for date transformations. If not specified, the ROOT locale will be used. It
must be specified as language-country. For example, en-US.
Here is example code that returns the date rounded to the month, for example "2007-JUL":
<entity name="en" pk="id" transformer="DateFormatTransformer" ... >
...
<field column="date" sourceColName="fulldate" dateTimeFormat="yyyy-MMM"/>
</entity>
The HTMLStripTransformer
You can use this transformer to strip HTML out of a field. For example:
<entity name="e" transformer="HTMLStripTransformer" ... >
<field column="htmlText" stripHTML="true" />
...
</entity>
There is one attribute for this transformer, stripHTML, which is a boolean value (true/false) to signal whether the
HTMLStripTransformer should process the field or not.
The LogTransformer
You can use this transformer to log data to the console or log files. For example:
<entity ...
transformer="LogTransformer"
logTemplate="The name is ${e.name}" logLevel="debug">
....
</entity>
Unlike other transformers, the LogTransformer does not apply to any field, so the attributes are applied on the entity
itself.
The NumberFormatTransformer
Use this transformer to parse a number from a string, converting it into the specified format, and optionally using a
different locale.
NumberFormatTransformer will be applied only to fields with a formatStyle attribute.
This transformer recognizes the following attributes:
Attribute Description
formatStyle The format used for parsing this field. The value of the attribute must be one of
(number|percent|integer|currency). This uses the semantics of the Java NumberFormat class.
sourceColName The column on which the NumberFormat is to be applied. If this attribute is absent, the
source column and the target column are the same.
locale The locale to be used for parsing the strings. If this is absent, the ROOT locale is used. It must
be specified as language-country. For example, en-US.
For example:
<entity name="en" pk="id" transformer="NumberFormatTransformer" ...>
...
<!-- treat this field as UK pounds -->
<field name="price_uk" column="price" formatStyle="currency" locale="en-UK"/>
</entity>
The RegexTransformer
The regex transformer helps in extracting or manipulating values from fields (from the source) using Regular
Expressions. The actual class name is org.apache.solr.handler.dataimport.RegexTransformer. But as
it belongs to the default package, the package name can be omitted.
The table below describes the attributes recognized by the regex transformer.
Attribute Description
regex The regular expression that is used to match against the column or sourceColName's value(s).
If replaceWith is absent, each regex group is taken as a value and a list of values is returned.
sourceColName The column on which the regex is to be applied. If not present, then the source and target are
identical.
splitBy Used to split a string. It returns a list of values.
groupNames A comma separated list of field column names, used where the regex contains groups and
each group is to be saved to a different field. If some groups are not to be named, leave a
space between commas.
replaceWith Used along with regex. It is equivalent to the method new
String(<sourceColVal>).replaceAll(<regex>, <replaceWith>).
Here is an example of configuring the regex transformer:
<entity name="foo" transformer="RegexTransformer"
query="select full_name, emailids from foo">
<field column="full_name"/>
<field column="firstName" regex="Mr(\w*)\b.*" sourceColName="full_name"/>
<field column="lastName" regex="Mr.*?\b(\w*)" sourceColName="full_name"/>
<!-- another way of doing the same -->
<field column="fullName" regex="Mr(\w*)\b(.*)" groupNames="firstName,lastName"/>
<field column="mailId" splitBy="," sourceColName="emailids"/>
</entity>
In this example, regex and sourceColName are custom attributes used by the transformer. The transformer reads
the full_name field from the result set and transforms it to two new target fields, firstName and lastName. Even
though the query returned only one column, full_name, in the result set, the Solr document gets two extra fields,
firstName and lastName, which are "derived" fields. These new fields are only created if the regexp matches.
The emailids field in the table can be a comma-separated value. It ends up producing one or more email IDs, and
we expect the mailId to be a multivalued field in Solr.
Note that this transformer can either be used to split a string into tokens based on a splitBy pattern, or to perform a
string substitution as per replaceWith, or it can assign groups within a pattern to a list of groupNames. It decides
what to do based upon the attributes splitBy, replaceWith, and groupNames, which are looked for in order.
The first one found is acted upon and other unrelated attributes are ignored.
The ScriptTransformer
The script transformer allows arbitrary transformer functions to be written in any scripting language supported by
Java, such as Javascript, JRuby, Jython, Groovy, or BeanShell. Javascript is integrated into Java 7; you'll need to
integrate other languages yourself.
Each function you write must accept a row variable (which corresponds to a Java Map<String,Object>, thus
permitting get, put, and remove operations). You can modify the value of an existing field or add new fields. The
return value of the function is the returned object.
The script is inserted into the DIH configuration file at the top level and is called once for each row.
Here is a simple example.
<dataConfig>
<!-- simple script to generate a new row, converting a temperature from Fahrenheit
to Centigrade -->
<script><![CDATA[
function f2c(row) {
var tempf, tempc;
tempf = row.get('temp_f');
if (tempf != null) {
tempc = (tempf - 32.0)*5.0/9.0;
row.put('temp_c', tempc);
}
return row;
}
]]>
</script>
<document>
<!-- the function is specified as an entity attribute -->
<entity name="e1" pk="id" transformer="script:f2c" query="select * from X">
....
</entity>
</document>
</dataConfig>
The TemplateTransformer
You can use the template transformer to construct or modify a field value, perhaps using the value of other fields.
You can insert extra text into the template.
<entity name="en" pk="id" transformer="TemplateTransformer" ...>
...
<!-- generate a full address from fields containing the component parts -->
<field column="full_address" template="${en.street},${en.city},${en.zip}" />
</entity>
Special Commands for the Data Import Handler
You can pass special commands to the DIH by adding any of the variables listed below to any row returned by any
component:
Variable Description
$skipDoc Skip the current document; that is, do not add it to Solr. The value can be the string true|false.
$skipRow Skip the current row. The document will be added with rows from other entities. The value
can be the string true|false.
$docBoost Boost the current document. The boost value can be a number or the toString conversion
of a number.
$deleteDocById Delete a document from Solr with this ID. The value has to be the uniqueKey value of the
document.
$deleteDocByQuery Delete documents from Solr using this query. The value must be a Solr Query.
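For instance, a minimal sketch (the status column and the rule itself are hypothetical) of a ScriptTransformer that sets $skipDoc on rows that should not be indexed:
<dataConfig>
  <script><![CDATA[
    /* illustrative rule: skip rows whose status column is "draft" */
    function skipDrafts(row) {
      var status = row.get('status');
      if (status != null && String(status) == 'draft') {
        row.put('$skipDoc', 'true');
      }
      return row;
    }
  ]]></script>
  <document>
    <entity name="docs" transformer="script:skipDrafts"
            query="select id, status, title from docs">
      ...
    </entity>
  </document>
</dataConfig>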
Updating Parts of Documents
Once you have indexed the content you need in your Solr index, you will want to start thinking about your strategy
for dealing with changes to those documents. Solr supports two approaches to updating documents that have only
partially changed.
The first is atomic updates. This approach allows changing only one or more fields of a document without having to
re-index the entire document.
The second approach is known as optimistic concurrency or optimistic locking. It is a feature of many NoSQL
databases, and allows conditionally updating a document based on its version. This approach includes semantics and
rules for how to deal with version matches or mis-matches.
Atomic Updates and Optimistic Concurrency may be used as independent strategies for managing changes to
documents, or they may be combined: you can use optimistic concurrency to conditionally apply an atomic update.
Atomic Updates
Solr supports several modifiers that atomically update values of a document. This allows updating only specific
fields, which can help speed indexing processes in an environment where speed of index additions is critical to the
application.
To use atomic updates, add a modifier to the field that needs to be updated. The content can be updated, added to,
or incrementally increased if a number.
Modifier Usage
set Set or replace the field value(s) with the specified value(s), or remove the values if 'null' or empty list is
specified as the new value.
May be specified as a single value, or as a list for multivalued fields
add Adds the specified values to a multivalued field.
May be specified as a single value, or as a list.
remove Removes (all occurrences of) the specified values from a multivalued field.
May be specified as a single value, or as a list.
inc Increments a numeric value by a specific amount.
Must be specified as a single numeric value.
For example, if the following document exists in our collection:
All original source fields must be stored for field modifiers to work correctly, which is the Solr default.
{"id":"mydoc",
"price":10,
"popularity":42,
"categories":["kids"],
"promo_ids":["a123x"],
"tags":["free_to_try","buy_now","clearance","on_sale"]
}
And we apply the following update command:
{"id":"mydoc",
"price":{"set":99},
"popularity":{"inc":20},
"categories":{"add":["toys","games"]},
"promo_ids":{"remove":"a123x"},
"tags":{"remove":["free_to_try","on_sale"]}
}
The resulting document in our collection will be:
{"id":"mydoc",
"price":999,
"popularity":62,
"categories":["kids","toys","games"],
"tags":["buy_now","clearance"]
}
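The same modifiers can also be expressed in Solr's XML update format through the update attribute on <field>. A hedged sketch of the command above in XML form (assuming the /update handler is receiving XML update messages):
<add>
  <doc>
    <field name="id">mydoc</field>
    <field name="price" update="set">99</field>
    <field name="popularity" update="inc">20</field>
    <field name="categories" update="add">toys</field>
    <field name="categories" update="add">games</field>
    <field name="promo_ids" update="remove">a123x</field>
    <field name="tags" update="remove">free_to_try</field>
    <field name="tags" update="remove">on_sale</field>
  </doc>
</add>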
Optimistic Concurrency
Optimistic Concurrency is a feature of Solr that can be used by client applications which update/replace documents
to ensure that the document they are replacing/updating has not been concurrently modified by another client
application. This feature works by requiring a _version_ field on all documents in the index, and comparing that to
a _version_ specified as part of the update command. By default, Solr's schema.xml includes a _version_
field, and this field is automatically added to each new document.
In general, using optimistic concurrency involves the following work flow:
1. A client reads a document. In Solr, one might retrieve the document with the /get handler to be sure to have
the latest version.
2. A client changes the document locally.
3. The client resubmits the changed document to Solr, for example, perhaps with the /update handler.
4. If there is a version conflict (HTTP error code 409), the client starts the process over.
When the client resubmits a changed document to Solr, the _version_ can be included with the update to invoke
optimistic concurrency control. Specific semantics are used to define when the document should be updated or
when to report a conflict.
If the content in the _version_ field is greater than '1' (i.e., '12345'), then the _version_ in the document
must match the _version_ in the index.
If the content in the _version_ field is equal to '1', then the document must simply exist. In this case, no
version matching occurs, but if the document does not exist, the updates will be rejected.
If the content in the _version_ field is less than '0' (i.e., '-1'), then the document must not exist. In this case,
no version matching occurs, but if the document exists, the updates will be rejected.
If the content in the _version_ field is equal to '0', then it doesn't matter if the versions match or if the
document exists or not. If it exists, it will be overwritten; if it does not exist, it will be added.
If the document being updated does not include the _version_ field, and atomic updates are not being used, the
document will be treated by normal Solr rules, which is usually to discard it.
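As an illustrative sketch of the first rule (the document values and version number are placeholders; the version would typically be the value previously returned by the /get handler), the expected version is sent as the _version_ field of the updated document, and Solr responds with HTTP error code 409 if the stored version differs:
<add>
  <doc>
    <field name="id">mydoc</field>
    <!-- placeholder: the version value read earlier from /get -->
    <field name="_version_">1632740120480382976</field>
    <field name="price">150</field>
  </doc>
</add>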
For more information, please also see Yonik Seeley's presentation on NoSQL features in Solr 4 from Apache
Lucene EuroCon 2012.
Document Centric Versioning Constraints
Optimistic Concurrency is extremely powerful, and works very efficiently because it uses internally assigned,
globally unique values for the _version_ field. However, in some situations users may want to configure their own
document specific version field, where the version values are assigned on a per-document basis by an external
system, and have Solr reject updates that attempt to replace a document with an "older" version. In situations like
this, the DocBasedVersionConstraintsProcessorFactory can be useful.
The basic usage of DocBasedVersionConstraintsProcessorFactory is to configure it in solrconfig.xml
as part of the UpdateRequestProcessorChain and specify the name of the versionField in your schema that
should be checked when validating updates:
<processor class="solr.DocBasedVersionConstraintsProcessorFactory">
<str name="versionField">my_version_l</str>
</processor>
Once configured, this update processor will reject (HTTP error code 409) any attempt to update an existing
document where the value of the my_version_l field in the "new" document is not greater than the value of that
field in the existing document.
DocBasedVersionConstraintsProcessorFactory supports two additional configuration params which are
optional:
ignoreOldUpdates - A boolean option which defaults to false. If set to true, then instead of rejecting
updates where the versionField is too low, the update will be silently ignored (and return a status 200 to
the client).
deleteVersionParam - A String parameter that can be specified to indicate that this processor should also
inspect Delete By Id commands. The value of this configuration option should be the name of a request
parameter that the processor will now consider mandatory for all attempts to Delete By Id, and must be
used by clients to specify a value for the versionField which is greater than the existing value of the
document to be deleted. When using this request param, any Delete By Id command with a high enough
document version number to succeed will be internally converted into an Add Document command that
replaces the existing document with a new one which is empty except for the Unique Key and
versionField, keeping a record of the deleted version so future Add Document commands will fail if their
"new" version is not high enough.

Power Tip
The _version_ field is by default stored in the inverted index (indexed="true"). However, for some
systems with a very large number of documents, the increase in FieldCache memory requirements may be
too costly. A solution can be to declare the _version_ field as DocValues. Sample field definition:
<field name="_version_" type="long" indexed="false" stored="true"
 required="true" docValues="true"/>
Please consult the processor javadocs and test configs for additional information and example usages.
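A fuller sketch (the chain name, field name, and request parameter name are illustrative) showing the processor wired into an update chain together with the optional parameters described above:
<updateRequestProcessorChain name="external-versions">
  <processor class="solr.DocBasedVersionConstraintsProcessorFactory">
    <str name="versionField">my_version_l</str>
    <!-- silently ignore stale updates instead of returning a 409 error -->
    <bool name="ignoreOldUpdates">true</bool>
    <!-- require a del_version request parameter on Delete By Id commands -->
    <str name="deleteVersionParam">del_version</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>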
De-Duplication
Preventing duplicate or near duplicate documents from entering an index or tagging documents with a
signature/fingerprint for duplicate field collapsing can be efficiently achieved with a low collision or fuzzy hash
algorithm. Solr natively supports de-duplication techniques of this type via the <Signature> class and allows for
the easy addition of new hash/signature implementations. A Signature can be implemented several ways:
Method Description
MD5Signature 128 bit hash used for exact duplicate detection.
Lookup3Signature 64 bit hash used for exact duplicate detection; much faster than MD5 and smaller to index.
TextProfileSignature Fuzzy hashing implementation from Nutch for near duplicate detection. It's tunable but
works best on longer text.
Other, more sophisticated algorithms for fuzzy/near hashing can be added later.
Configuration Options
In solrconfig.xml
The SignatureUpdateProcessorFactory has to be registered in solrconfig.xml as part of the
UpdateRequestProcessorChain:
<updateRequestProcessorChain name="dedupe">
<processor class="solr.processor.SignatureUpdateProcessorFactory">
<bool name="enabled">true</bool>
<str name="signatureField">id</str>
<bool name="overwriteDupes">false</bool>
<str name="fields">name,features,cat</str>
<str name="signatureClass">solr.processor.Lookup3Signature</str>
</processor>
</updateRequestProcessorChain>
Adding in the deduplication process will change the allowDups setting so that it applies to an update Term
(with signatureField in this case) rather than the unique field Term. Of course the signatureField
could be the unique field, but generally you want the unique field to be unique. When a document is added, a
signature will automatically be generated and attached to the document in the specified signatureField.

Setting Default Description
signatureClass org.apache.solr.update.processor.Lookup3Signature A Signature implementation for
generating a signature hash.
fields all fields The fields to use to generate the
signature hash in a comma separated
list. By default, all fields on the document
will be used.
signatureField signatureField The name of the field used to hold the
fingerprint/signature. Be sure the field is
defined in schema.xml.
enabled true Enable/disable deduplication factory
processing
In schema.xml
If you are using a separate field for storing the signature you must have it indexed:
<field name="signature" type="string" stored="true" indexed="true" multiValued="false"
/>
Be sure to change your update handlers to use the defined chain, i.e.
<requestHandler name="/update" >
<lst name="defaults">
<str name="update.chain">dedupe</str>
</lst>
</requestHandler>
The update processor can also be specified per request with a parameter of update.chain=dedupe.
Detecting Languages During Indexing
Solr can identify languages and map text to language-specific fields during indexing using the langid
UpdateRequestProcessor. Solr supports two implementations of this feature:
Tika's language detection feature: http://tika.apache.org/0.10/detection.html
LangDetect language detection: http://code.google.com/p/language-detection/
You can see a comparison between the two implementations here:
http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html. In general, the LangDetect
implementation supports more languages with higher performance.
For specific information on each of these language identification implementations, including a list of supported
languages for each, see the relevant project websites. For more information about the langid
UpdateRequestProcessor, see the Solr wiki: http://wiki.apache.org/solr/LanguageDetection. For more information
about language analysis in Solr, see Language Analysis.
Configuring Language Detection
You can configure the langid UpdateRequestProcessor in solrconfig.xml. Both implementations take the
same parameters, which are described in the following section. At a minimum, you must specify the fields for
language identification and a field for the resulting language code.
Configuring Tika Language Detection
Here is an example of a minimal Tika langid configuration in solrconfig.xml:
<processor
class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
<lst name="defaults">
<str name="langid.fl">title,subject,text,keywords</str>
<str name="langid.langField">language_s</str>
</lst>
</processor>
Configuring LangDetect Language Detection
Here is an example of a minimal LangDetect langid configuration in solrconfig.xml:
<processor
class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFac
tory">
<lst name="defaults">
<str name="langid.fl">title,subject,text,keywords</str>
<str name="langid.langField">language_s</str>
</lst>
</processor>
langid Parameters
As previously mentioned, both implementations of the langid UpdateRequestProcessor take the same parameters.
Parameter Type Default Required Description
langid Boolean true no Enables and disables language detection.
langid.fl string none yes A comma- or space-delimited list of fields
to be processed by langid.
langid.langField string none yes Specifies the field for the returned
language code.
langid.langsField multivalued string none no Specifies the field for a list of returned
language codes. If you use langid.map.individual, each detected language will
be added to this field.
langid.overwrite Boolean false no Specifies whether the content of the langField
and langsField fields will be overwritten if they already contain values.
langid.lcmap string none no A space-separated list specifying colon
delimited language code mappings to apply to the detected languages. For
example, you might use this to map Chinese, Japanese, and Korean to a
common cjk code, and map both American and British English to a single en
code by using langid.lcmap=ja:cjk zh:cjk ko:cjk en_GB:en en_US:en.
This affects both the values put into the langField and langsField fields,
as well as the field suffixes when using langid.map, unless overridden by
langid.map.lcmap.
langid.threshold float 0.5 no Specifies a threshold value between 0 and
1 that the language identification score must reach before langid accepts it.
With longer text fields, a high threshold such as 0.8 will give good results. For
shorter text fields, you may need to lower the threshold for language
identification, though you will be risking somewhat lower quality results. We
recommend experimenting with your data to tune your results.
langid.whitelist string none no Specifies a list of allowed language
identification codes. Use this in combination with langid.map to ensure that
you only index documents into fields that are in your schema.
langid.map Boolean false no Enables field name mapping. If true, Solr
will map field names for all fields listed in langid.fl.
langid.map.fl string none no A comma-separated list of fields for
langid.map that is different than the fields specified in langid.fl.
langid.map.keepOrig Boolean false no If true, Solr will copy the field during the
field name mapping process, leaving the
original field in place.
langid.map.individual Boolean false no If true, Solr will detect and map languages
for each field individually.
langid.map.individual.fl string none no A comma-separated list of fields for use
with langid.map.individual that is different than the fields specified in
langid.fl.
langid.fallbackFields string none no If no language is detected that meets the
langid.threshold score, or if the detected language is not on the
langid.whitelist, this field specifies language codes to be used as fallback
values. If no appropriate fallback languages are found, Solr will use the
language code specified in langid.fallback.
langid.fallback string none no Specifies a language code to use if no
language is detected or specified in langid.fallbackFields.
langid.map.lcmap string determined by
langid.lcmap
no A space-separated list specifying colon delimited language code mappings to
use when mapping field names. For example, you might use this to make
Chinese, Japanese, and Korean language fields use a common *_cjk suffix,
and map both American and British English fields to a single *_en by using
langid.map.lcmap=ja:cjk zh:cjk ko:cjk en_GB:en en_US:en.
langid.map.pattern Java
regular
expression
none no By default, fields are mapped as
<field>_<language>. To change this
pattern, you can specify a Java regular
expression in this parameter.
langid.map.replace Java
replace
none no By default, fields are mapped as
<field>_<language>. To change this
pattern, you can specify a Java replace in
this parameter.
langid.enforceSchema Boolean true no If false, the langid processor does not
validate field names against your schema. This may be useful if you plan to
rename or delete fields later in the UpdateChain.
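Putting several of these parameters together, here is a hedged sketch (the field names and language codes are illustrative) of a LangDetect configuration that maps fields per language, restricts detection to a whitelist, and falls back to English:
<processor
class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
  <lst name="defaults">
    <str name="langid.fl">title,text</str>
    <str name="langid.langField">language_s</str>
    <bool name="langid.map">true</bool>
    <str name="langid.whitelist">en,fr,de</str>
    <str name="langid.fallback">en</str>
    <str name="langid.map.lcmap">en_GB:en en_US:en</str>
  </lst>
</processor>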
Content Streams
When Solr RequestHandlers are accessed using path based URLs, the SolrQueryRequest object containing the
parameters of the request may also contain a list of ContentStreams containing bulk data for the request. (The name
SolrQueryRequest is a bit misleading: it is involved in all requests, regardless of whether it is a query request or an
update request.)
Stream Sources
Currently RequestHandlers can get content streams in a variety of ways:
For multipart file uploads, each file is passed as a stream.
For POST requests where the content-type is not application/x-www-form-urlencoded, the raw
POST body is passed as a stream. The full POST body is parsed as parameters and included in the Solr
parameters.
The contents of the stream.body parameter is passed as a stream.
If remote streaming is enabled and URL content is called for during request handling, the contents of each
stream.url and stream.file parameter are fetched and passed as a stream.
By default, curl sends a contentType="application/x-www-form-urlencoded" header. If you need to test
a SolrContentHeader content stream, you will need to set the content type with the "-H" flag.
RemoteStreaming
Remote streaming lets you send the contents of a URL as a stream to a given SolrRequestHandler. You could use
remote streaming to send a remote or local file to an update plugin. For security reasons, remote streaming is
disabled in the solrconfig.xml included in the example directory.
<!--Make sure your system has authentication before enabling remote streaming!-->
<requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="2048" />
Debugging Requests
The example solrconfig.xml includes a "dump" RequestHandler:
<requestHandler name="/debug/dump" class="solr.DumpRequestHandler" />
This handler simply outputs the contents of the SolrQueryRequest using the specified writer type wt. This is a useful
tool to help understand what streams are available to the RequestHandlers.
If you enable streaming, be aware that this allows anyone to send a request to any URL or local file. If dump
is enabled, it will allow anyone to view any file on your system.
UIMA Integration
You can integrate the Apache Unstructured Information Management Architecture (UIMA) with Solr. UIMA lets you
define custom pipelines of Analysis Engines that incrementally add metadata to your documents as annotations.
For more information about Solr UIMA integration, see https://wiki.apache.org/solr/SolrUIMA.
Configuring UIMA
The SolrUIMA UpdateRequestProcessor is a custom update request processor that takes documents being
indexed, sends them to a UIMA pipeline, and then returns the documents enriched with the specified metadata. To
configure UIMA for Solr, follow these steps:
1. Copy solr-uima-4.x.y.jar (under /solr-4.x.y/dist/) and its libraries (under contrib/uima/lib)
to a Solr libraries directory, or set <lib/> tags in solrconfig.xml appropriately to point to those jar files:
<lib dir="../../contrib/uima/lib" />
<lib dir="../../dist/" regex="solr-uima-\d.*\.jar" />
2. Modify schema.xml, adding your desired metadata fields, specifying proper values for type, indexed, stored,
and multiValued options. For example:
<field name="language" type="string" indexed="true" stored="true"
required="false"/>
<field name="concept" type="string" indexed="true" stored="true"
multiValued="true" required="false"/>
<field name="sentence" type="text" indexed="true" stored="true"
multiValued="true" required="false" />
3. Add the following snippet to solrconfig.xml:
<updateRequestProcessorChain name="uima">
<processor
class="org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory">
<lst name="uimaConfig">
<lst name="runtimeParameters">
<str name="keyword_apikey">VALID_ALCHEMYAPI_KEY</str>
<str name="concept_apikey">VALID_ALCHEMYAPI_KEY</str>
<str name="lang_apikey">VALID_ALCHEMYAPI_KEY</str>
<str name="cat_apikey">VALID_ALCHEMYAPI_KEY</str>
<str name="entities_apikey">VALID_ALCHEMYAPI_KEY</str>
<str name="oc_licenseID">VALID_OPENCALAIS_KEY</str>
</lst>
<str
name="analysisEngine">/org/apache/uima/desc/OverridingParamsExtServicesAE.xml</st
r>
<!-- Set to true if you want to continue indexing even if text processing
fails.
Default is false. That is, Solr throws RuntimeException and
never indexed documents entirely in your session. -->
<bool name="ignoreErrors">true</bool>
<!-- This is optional. It is used for logging when text processing fails.
If logField is not specified, uniqueKey will be used as logField.
<str name="logField">id</str>
-->
<lst name="analyzeFields">
<bool name="merge">false</bool>
<arr name="fields">
<str>text</str>
</arr>
</lst>
<lst name="fieldMappings">
<lst name="type">
<str name="name">org.apache.uima.alchemy.ts.concept.ConceptFS</str>
<lst name="mapping">
<str name="feature">text</str>
<str name="field">concept</str>
</lst>
</lst>
<lst name="type">
<str name="name">org.apache.uima.alchemy.ts.language.LanguageFS</str>
<lst name="mapping">
<str name="feature">language</str>
<str name="field">language</str>
</lst>
</lst>
<lst name="type">
<str name="name">org.apache.uima.SentenceAnnotation</str>
<lst name="mapping">
<str name="feature">coveredText</str>
<str name="field">sentence</str>
</lst>
</lst>
</lst>
</lst>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
4. In your solrconfig.xml, replace the existing default UpdateRequestHandler or create a new
UpdateRequestHandler:
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
<lst name="defaults">
<str name="update.processor">uima</str>
</lst>
</requestHandler>
Once you are done with the configuration, your documents will be automatically enriched with the specified fields
when you index them.
VALID_ALCHEMYAPI_KEY is your AlchemyAPI Access Key. You need to register an AlchemyAPI
Access key to use AlchemyAPI services: http://www.alchemyapi.com/api/register.html.
VALID_OPENCALAIS_KEY is your Calais Service Key. You need to register a Calais Service key to
use the Calais services: http://www.opencalais.com/apikey.
analysisEngine must contain an AE descriptor inside the specified path in the classpath.
analyzeFields must contain the input fields that need to be analyzed by UIMA. If merge=true,
then their content will be merged and analyzed only once.
Field mapping describes which features of which types should go in a field.
Searching
This section describes how Solr works with search requests. It covers the following topics:
Overview of Searching in Solr: An introduction to searching with Solr.
Velocity Search UI: A sample search UI in the example configuration using the VelocityResponseWriter.
Relevance: Conceptual information about understanding relevance in search results.
Query Syntax and Parsing: A brief conceptual overview of query syntax and parsing. It also contains the following
sub-sections:
Common Query Parameters: No matter the query parser, there are several parameters that are common to all of
them.
The Standard Query Parser: Detailed information about the standard Lucene query parser.
The DisMax Query Parser: Detailed information about Solr's DisMax query parser.
The Extended DisMax Query Parser: Detailed information about Solr's Extended DisMax (eDisMax) Query Parser.
Function Queries: Detailed information about parameters for generating relevancy scores using values from one or
more numeric fields.
Local Parameters in Queries: How to add local arguments to queries.
Other Parsers: More parsers designed for use in specific situations.
Faceting: Detailed information about categorizing search results based on indexed terms.
Highlighting: Detailed information about Solr's highlighting utilities. Sub-sections cover the different types of
highlighters:
Standard Highlighter: Uses the most sophisticated and fine-grained query representation of the three highlighters.
FastVector Highlighter: Optimized for term vector options on fields, and good for large documents and multiple
languages.
Postings Highlighter: Uses similar options as the FastVector highlighter, but is more compact and efficient.
Spell Checking: Detailed information about Solr's spelling checker.
Query Re-Ranking: Detailed information about re-ranking top scoring documents from simple queries using more
complex scores.
RealTime Get: How to get the latest version of a document without opening a searcher.
Transforming Result Documents: Detailed information about using DocTransformers to add computed information
to individual documents.
Suggester: Detailed information about Solr's powerful autosuggest component.
MoreLikeThis: Detailed information about Solr's similar results query component.
Pagination of Results: Detailed information about fetching paginated results for display in a UI, or for fetching all
documents matching a query.
Result Grouping: Detailed information about grouping results based on common field values.
Result Clustering: Detailed information about grouping search results based on cluster analysis applied to text
fields. A bit like "unsupervised" faceting.
Spatial Search: How to use Solr's spatial search capabilities.
The Terms Component: Detailed information about accessing indexed terms and the documents that include them.
The Term Vector Component: How to get term information about specific documents.
The Stats Component: How to return information from numeric fields within a document set.
The Query Elevation Component: How to force documents to the top of the results for certain queries.
Response Writers: Detailed information about configuring and using Solr's response writers.
Near Real Time Searching: How to include documents in search results nearly immediately after they are indexed.
Overview of Searching in Solr
Solr offers a rich, flexible set of features for search. To understand the extent of this flexibility, it's helpful to begin
with an overview of the steps and components involved in a Solr search.
When a user runs a search in Solr, the search query is processed by a request handler. A request handler is a Solr
plug-in that defines the logic to be used when Solr processes a request. Solr supports a variety of request handlers.
Some are designed for processing search queries, while others manage tasks such as index replication.
Search applications select a particular request handler by default. In addition, applications can be configured to
allow users to override the default selection in preference of a different request handler.
To process a search query, a request handler calls a query parser, which interprets the terms and parameters of a
query. Different query parsers support different syntax. The default query parser is the DisMax query parser. Solr
also includes an earlier "standard" (Lucene) query parser, and an Extended DisMax (eDisMax) query parser. The
standard query parser's syntax allows for greater precision in searches, but the DisMax query parser is much more
tolerant of errors. The DisMax query parser is designed to provide an experience similar to that of popular search
engines such as Google, which rarely display syntax errors to users. The Extended DisMax query parser is an
improved version of DisMax that handles the full Lucene query syntax while still tolerating syntax errors. It also
includes several additional features.
In addition, there are common query parameters that are accepted by all query parsers.
Input to a query parser can include:
search strings---that is, terms to search for in the index
parameters for fine-tuning the query by increasing the importance of particular strings or fields, by applying
Boolean logic among the search terms, or by excluding content from the search results
parameters for controlling the presentation of the query response, such as specifying the order in which
results are to be presented or limiting the response to particular fields of the search application's schema.
Search parameters may also specify a query filter. As part of a search response, a query filter runs a query against
the entire index and caches the results. Because Solr allocates a separate cache for filter queries, the strategic use
of filter queries can improve search performance. (Despite their similar names, query filters are not related to
analysis filters. Query filters perform queries at search time against data already in the index, while analysis filters,
such as Tokenizers, parse content for indexing, following specified rules).
A search query can request that certain terms be highlighted in the search response; that is, the selected terms will
be displayed in colored boxes so that they "jump out" on the screen of search results. Highlighting can make it
easier to find relevant passages in long documents returned in a search. Solr supports multi-term highlighting. Solr
includes a rich set of search parameters for controlling how terms are highlighted.
Search responses can also be configured to include snippets (document excerpts) featuring highlighted text.
Popular search engines such as Google and Yahoo! return snippets in their search results: 3-4 lines of text offering
a description of a search result.
To help users zero in on the content they're looking for, Solr supports two special ways of grouping search results to
aid further exploration: faceting and clustering.
Faceting is the arrangement of search results into categories (which are based on indexed terms). Within each
category, Solr reports on the number of hits for each relevant term, which is called a facet constraint. Faceting makes it
easy for users to explore search results on sites such as movie sites and product review sites, where there are
many categories and many items within a category.
The image below shows an example of faceting from the CNET Web site, which was the first site to use Solr.
Faceting makes use of fields defined when the search applications were indexed. In the example above, these fields
include categories of information that are useful for describing digital cameras: manufacturer, resolution, and zoom
range.
Clustering groups search results by similarities discovered when a search is executed, rather than when content is
indexed. The results of clustering often lack the neat hierarchical organization found in faceted search results, but
clustering can be useful nonetheless. It can reveal unexpected commonalities among search results, and it can help
users rule out content that isn't pertinent to what they're really searching for.
Solr also supports a feature called MoreLikeThis, which enables users to submit new queries that focus on particular
terms returned in an earlier query. MoreLikeThis queries can make use of faceting or clustering to provide additional
aid to users.
A Solr component called a response writer manages the final presentation of the query response. Solr includes a
variety of response writers, including an XML Response Writer and a JSON Response Writer.
The diagram below summarizes some key elements of the search process.
Velocity Search UI
Solr includes a sample search UI based on the VelocityResponseWriter (also known as Solritas) that demonstrates
several useful features, such as searching, faceting, highlighting, autocomplete, and geospatial searching.
You can access the Velocity sample Search UI here: http://localhost:8983/solr/browse
The Velocity Search UI
For more information about the Velocity Response Writer, see the Response Writer page.
Relevance
Relevance is the degree to which a query response satisfies a user who is searching for information.
The relevance of a query response depends on the context in which the query was performed. A single search
application may be used in different contexts by users with different needs and expectations. For example, a search
engine of climate data might be used by a university researcher studying long-term climate trends, a farmer
interested in calculating the likely date of the last frost of spring, a civil engineer interested in rainfall patterns and the
frequency of floods, and a college student planning a vacation to a region and wondering what to pack. Because the
motivations of these users vary, the relevance of any particular response to a query will vary as well.
How comprehensive should query responses be? Like relevance in general, the answer to this question depends on
the context of a search. The cost of not finding a particular document in response to a query is high in some
contexts, such as a legal e-discovery search in response to a subpoena, and quite low in others, such as a search
for a cake recipe on a Web site with dozens or hundreds of cake recipes. When configuring Solr, you should weigh
comprehensiveness against other factors such as timeliness and ease-of-use.
The e-discovery and recipe examples demonstrate the importance of two concepts related to relevance:
Precision is the percentage of documents in the returned results that are relevant.
Recall is the percentage of relevant results returned out of all relevant results in the system. Obtaining
perfect recall is trivial: simply return every document in the collection for every query.
Returning to the examples above, it's important for an e-discovery search application to have 100% recall returning
all the documents that are relevant to a subpoena. It's far less important that a recipe application offer this degree of
precision, however. In some cases, returning too many results in casual contexts could overwhelm users. In some
contexts, returning fewer results that have a higher likelihood of relevance may be the best approach.
Using the concepts of precision and recall, it's possible to quantify relevance across users and queries for a
collection of documents. A perfect system would have 100% precision and 100% recall for every user and every
query. In other words, it would retrieve all the relevant documents and nothing else. In practical terms, when talking
about precision and recall in real systems, it is common to focus on precision and recall at a certain number of
results, the most common (and useful) being ten results.
Through faceting, query filters, and other search components, a Solr application can be configured with the flexibility
to help users fine-tune their searches in order to return the most relevant results for users. That is, Solr can be
configured to balance precision and recall to meet the needs of a particular user community.
The configuration of a Solr application should take into account:
the needs of the application's various users (which can include ease of use and speed of response, in
addition to strictly informational needs)
the categories that are meaningful to these users in their various contexts (e.g., dates, product categories, or
regions)
any inherent relevance of documents (e.g., it might make sense to ensure that an official product description
or FAQ is always returned near the top of the search results)
whether or not the age of documents matters significantly (in some contexts, the most recent documents
might always be the most important)
Keeping all these factors in mind, it's often helpful in the planning stages of a Solr deployment to sketch out the
types of responses you think the search application should return for sample queries. Once the application is up and
running, you can employ a series of testing methodologies, such as focus groups, in-house testing, TREC tests, and
A/B testing to fine tune the configuration of the application to best meet the needs of its users.
For more information about relevance, see Grant Ingersoll's tech article Debugging Search Application Relevance
Issues, which is available on SearchHub.org.
Query Syntax and Parsing
Solr supports several query parsers, offering search application designers great flexibility in controlling how queries
are parsed.
This section explains how to specify the query parser to be used. It also describes the syntax and features
supported by the main query parsers included with Solr and describes some other parsers that may be useful for
particular situations. There are some query parameters common to all Solr parsers; these are discussed in the
section Common Query Parameters.
The parsers discussed in this Guide are:
The Standard Query Parser
The DisMax Query Parser
The Extended DisMax Query Parser
Other Parsers
The query parser plugins are all subclasses of QParserPlugin. If you have custom parsing needs, you may want to
extend that class to create your own query parser.
For more detailed information about the many query parsers available in Solr, see
https://wiki.apache.org/solr/SolrQuerySyntax.
Common Query Parameters
The table below summarizes Solr's common query parameters, which are supported by the Standard, DisMax, and
eDisMax Request Handlers.
Parameter Description
defType Selects the query parser to be used to process the query.
sort Sorts the response to a query in either ascending or descending order based on the response's
score or another specified characteristic.
start Specifies an offset (by default, 0) into the responses at which Solr should begin displaying
content.
rows Controls how many rows of responses are displayed at a time (default value: 10)
fq Applies a filter query to the search results.
fl Limits the query's responses to a listed set of fields. As of version 4.0, the field list can also include the
score pseudo-field.
debug Request additional debugging information in the response. Specifying the debug=timing parameter
returns just the timing information; specifying the debug=results parameter returns "explain"
information for each of the documents returned; specifying the debug=query parameter returns all of
the debug information.
explainOther Allows clients to specify a Lucene query to identify a set of documents. If non-blank, the explain
info of each document which matches this query, relative to the main query (specified by the q
parameter) will be returned along with the rest of the debugging information.
timeAllowed Defines the time allowed for the query to be processed. If the time elapses before the query
response is complete, partial information may be returned.
omitHeader Excludes the header from the returned results, if set to true. The header contains information
about the request, such as the time the request took to complete. The default is false.
wt Specifies the Response Writer to be used to format the query response.
logParamsList By default, Solr logs all parameters. From version 4.7, set this parameter to restrict which
parameters are logged. Valid entries are the parameters to be logged, separated by commas
(i.e., logParamsList=param1,param2). An empty list will log no parameters, so if logging all
parameters is desired, do not define this additional parameter at all.
The following sections describe these parameters in detail.
The defType Parameter
The defType parameter selects the query parser that Solr should use to process the main query parameter (q) in the
request. For example:
defType=dismax
By default, the Standard Query Parser is used.
The sort Parameter
The sort parameter arranges search results in either ascending (asc) or descending (desc) order. The parameter
can be used with either numerical or alphabetical content. The directions can be entered in either all lowercase or all
uppercase letters (i.e., both asc and ASC are accepted).
Solr can sort query responses according to document scores or the value of any indexed field with a single value
(that is, any field whose attributes in schema.xml include multiValued="false" and indexed="true"),
provided that:
the field is non-tokenized (that is, the field has no analyzer and its contents have not been parsed into tokens,
which would make the sorting inconsistent), or
the field uses an analyzer (such as the KeywordTokenizer) that produces only a single term.
If you want to be able to sort on a field whose contents you want to tokenize to facilitate searching, use the
<copyField> directive in the schema.xml file to clone the field. Then search on the field and sort on its clone.
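For example, a schema.xml sketch along these lines (the field names and types here are only illustrative, not from a shipped schema) clones a tokenized title field into a single-valued string field that can be sorted on:
<field name="title" type="text_general" indexed="true" stored="true"/>
<field name="title_sort" type="string" indexed="true" stored="false"/>
<copyField source="title" dest="title_sort"/>
Queries would then search on title and sort with sort=title_sort+asc.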
The table below explains how Solr responds to various settings of the sort parameter.
Example Result
If the sort parameter is omitted, sorting is performed as though the parameter were set to score desc.
score desc Sorts in descending order from the highest score to the lowest score.
price asc Sorts in ascending order of the price field
inStock desc, price asc Sorts by the contents of the inStock field in descending order, then within those results sorts in
ascending order by the contents of the price field.
Regarding the sort parameter's arguments:
A sort ordering must include a field name (or score as a pseudo field), followed by whitespace (escaped as
+ or %20 in URL strings), followed by a sort direction (asc or desc).
Multiple sort orderings can be separated by a comma, using this syntax: sort=<field
name>+<direction>,<field name>+<direction>,... (see the example following this list).
When more than one sort criteria is provided, the second entry will only be used if the first entry results
in a tie. If there is a third entry, it will only be used if the first AND second entries are tied. This pattern
continues with further entries.
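For example, assuming the inStock and price fields from the sample schema, a query sorted first by stock status and then by price could be requested as:
http://localhost:8983/solr/select?q=*:*&sort=inStock+desc,price+asc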
The start Parameter
When specified, the start parameter specifies an offset into a query's result set and instructs Solr to begin
displaying results from this offset.
The default value is "0". In other words, by default, Solr returns results without an offset, beginning where the results
themselves begin.
Setting the start parameter to some other number, such as 3, causes Solr to skip over the preceding records and
start at the document identified by the offset.
You can use the start parameter this way for paging. For example, if the rows parameter is set to 10, you could
display three successive pages of results by setting start to 0, then re-issuing the same query and setting start to 10,
then issuing the query again and setting start to 20.
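As a sketch (the query term and host are only examples), those three pages could be requested like this:
http://localhost:8983/solr/select?q=video&rows=10&start=0
http://localhost:8983/solr/select?q=video&rows=10&start=10
http://localhost:8983/solr/select?q=video&rows=10&start=20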
The rows Parameter
You can use the rows parameter to paginate results from a query. The parameter specifies the maximum number of
documents from the complete result set that Solr should return to the client at one time.
The default value is 10. That is, by default, Solr returns 10 documents at a time in response to a query.
The fq (Filter Query) Parameter
The fq parameter defines a query that can be used to restrict the superset of documents that can be returned,
without influencing score. It can be very useful for speeding up complex queries, since the queries specified with fq
are cached independently of the main query. When a later query uses the same filter, there's a cache hit, and filter
results are returned quickly from the cache.
When using the fq parameter, keep in mind the following:
The fq parameter can be specified multiple times in a query. Documents will only be included in the result if
they are in the intersection of the document sets resulting from each instance of the parameter. In the
example below, only documents which have a popularity greater than 10 and have a section of 0 will match.
fq=popularity:[10 TO *]&fq=section:0
Filter queries can involve complicated Boolean queries. The above example could also be written as a single
fq with two mandatory clauses like so:
fq=+popularity:[10 TO *] +section:0
The document sets from each filter query are cached independently. Thus, concerning the previous
examples: use a single fq containing two mandatory clauses if those clauses appear together often, and use
two separate fq parameters if they are relatively independent. (To learn about tuning cache sizes and making
sure a filter cache actually exists, see The Well-Configured Solr Instance.)
As with all parameters: special characters in a URL need to be properly escaped and encoded as hex
values. Online tools are available to help you with URL-encoding. For example:
http://meyerweb.com/eric/tools/dencoder/.
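As a rough illustration, the first filter query shown above, fq=popularity:[10 TO *], would be sent URL-encoded as something like:
fq=popularity%3A%5B10+TO+*%5D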
The fl (Field List) Parameter
The fl parameter limits the information included in a query response to a specified list of fields. The fields need to
have been indexed as stored for this parameter to work correctly.
The field list can be specified as a space-separated or comma-separated list of field names. The string "score" can
be used to indicate that the score of each document for the particular query should be returned as a field. The
wildcard character "*" selects all the stored fields in a document. You can also add psuedo-fields, functions and
transformers to the field list request.
This table shows some basic examples of how to use fl:
Field List Result
id name price Return only the id, name, and price fields.
id,name,price Return only the id, name, and price fields.
id name, price Return only the id, name, and price fields.
id score Return the id field and the score.
* Return all the fields in each document. This is the default value of the fl parameter.
* score Return all the fields in each document, along with each document's score.
Function Values
Functions can be computed for each document in the result and returned as a pseudo-field:
fl=id,title,product(price,popularity)
Document Transformers
Document Transformers can be used to modify the information returned about each document in the results of a
query:
fl=id,title,[explain]
Field Name Aliases
You can change the key used in the response for a field, function, or transformer by prefixing it with a
"displayName:". For example:
fl=id,sales_price:price,secret_sauce:prod(price,popularity),why_score:[explain
style=nl]
"response":{"numFound":2,"start":0,"docs":[
{
"id":"6H500F0",
"secret_sauce":2100.0,
"sales_price":350.0,
"why_score":{
"match":true,
"value":1.052226,
"description":"weight(features:cache in 2) [DefaultSimilarity], result of:",
"details":[{
...
The debug Parameter
In Solr 4, requesting debugging information with results has been simplified from a suite of related parameters to a
single parameter that takes format information as arguments. The parameter is now simply debug, with the
following arguments:
debug=true: return debug information about the query only.
debug=query: return debug information about the query only.
debug=timing: return debug information about how long the query took to process.
debug=results: return debug information about the results (also known as "explain")
The default behavior is not to include debugging information.
The explainOther Parameter
The explainOther parameter specifies a Lucene query in order to identify a set of documents. If this parameter is
included and is set to a non-blank value, the query will return debugging information, along with the "explain info" of
each document that matches the Lucene query, relative to the main query (which is specified by the q parameter).
For example:
q=supervillians&debugQuery=on&explainOther=id:juggernaut
The query above allows you to examine the scoring explain info of the top matching documents, compare it to the
explain info for documents matching id:juggernaut, and determine why the rankings are not as you expect.
The default value of this parameter is blank, which causes no extra "explain info" to be returned.
The timeAllowed Parameter
This parameter specifies the amount of time, in milliseconds, allowed for a search to complete. If this time expires
before the search is complete, any partial results will be returned.
The omitHeader Parameter
This parameter may be set to either true or false.
If set to true, this parameter excludes the header from the returned results. The header contains information about
the request, such as the time it took to complete. The default value for this parameter is false.
The wt Parameter
The wt parameter selects the Response Writer that Solr should use to format the query's response. For detailed
descriptions of Response Writers, see Response Writers.
The cache=false Parameter
Solr caches the results of all queries and filter queries by default. To disable result caching, set the cache=false
parameter.
You can also use the cost option to control the order in which non-cached filter queries are evaluated. This allows
you to order less expensive non-cached filters before expensive non-cached filters.
For very high cost filters, if cache=false and cost>=100 and the query implements the PostFilter interface, a
Collector will be requested from that query and used to filter documents after they have matched the main query and
all other filter queries. There can be multiple post filters; they are also ordered by cost.
For example:
// normal function range query used as a filter, all matching documents
// generated up front and cached
fq={!frange l=10 u=100}mul(popularity,price)
// function range query run in parallel with the main query like a traditional
// lucene filter
fq={!frange l=10 u=100 cache=false}mul(popularity,price)
// function range query checked after each document that already matches the query
// and all other filters. Good for really expensive function queries.
fq={!frange l=10 u=100 cache=false cost=100}mul(popularity,price)
The logParamsList Parameter
By default, Solr logs all parameters of requests. From version 4.7, set this parameter to restrict which parameters of
a request are logged. This may help control logging to only those parameters considered important to your
organization.
For example, you could define this like:
logParamsList=q,fq
And only the 'q' and 'fq' parameters will be logged.
If no parameters should be logged, you can send logParamsList as empty (i.e., logParamsList=).
Note that this parameter does not apply only to query requests, but to any kind of request to Solr.
The Standard Query Parser
Before Solr 1.3, the Standard Request Handler called the standard query parser
as the default query parser. In versions since Solr 1.3, the Standard Request
Handler calls the DisMax query parser as the default query parser. You can
configure Solr to call the standard query parser instead, if you like.
The advantage of the standard query parser is that it enables users to specify
very precise queries. The disadvantage is that it is less tolerant of syntax errors
than the DisMax query parser. The DisMax query parser is designed to throw as
few errors as possible.
Topics covered in this section: Standard Query Parser Parameters, The Standard Query Parser's Response,
Specifying Terms for the Standard Query Parser, Specifying Fields in a Query to the Standard Query Parser,
Boolean Operators Supported by the Standard Query Parser, Grouping Terms to Form Sub-Queries, Differences
between Lucene Query Parser and the Solr Standard Query Parser, Related Topics
Standard Query Parser Parameters
In addition to the Common Query Parameters, Faceting Parameters, Highlighting Parameters, and MoreLikeThis
Parameters, the standard query parser supports the parameters described in the table below.
Parameter Description
q Defines a query using standard query syntax. This parameter is mandatory.
q.op Specifies the default operator for query expressions, overriding the default operator specified in the
schema.xml file. Possible values are "AND" or "OR".
df Specifies a default field, overriding the definition of a default field in the schema.xml file.
Default parameter values are specified in solrconfig.xml, or overridden by query-time values in the request.
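As a sketch of how such defaults might be declared (the handler name and values here are illustrative, not taken from a shipped configuration), a request handler definition in solrconfig.xml could look like this:
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">lucene</str>
    <str name="q.op">AND</str>
    <str name="df">text</str>
  </lst>
</requestHandler>
A q.op or df value supplied on the request URL then overrides these defaults for that request.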
The Standard Query Parser's Response
By default, the response from the standard query parser contains one <result> block, which is unnamed. If the
debug parameter is used, then an additional <lst> block will be returned, using the name "debug". This will contain
useful debugging info, including the original query string, the parsed query string, and explain info for each
document in the <result> block. If the explainOther parameter is also used, then additional explain info will be
provided for all the documents matching that query.
Sample Responses
This section presents examples of responses from the standard query parser.
The URL below submits a simple query and requests the XML Response Writer to use indentation to make the XML
response more readable.
http://yourhost.tld:9999/solr/select?q=id:SP2514N&version=2.1&indent=1
Results:
<?xml version="1.0" encoding="UTF-8"?>
<response>
<responseHeader><status>0</status><QTime>1</QTime></responseHeader>
<result numFound="1" start="0">
<doc>
<arr name="cat"><str>electronics</str><str>hard drive</str></arr>
<arr name="features"><str>7200RPM, 8MB cache, IDE Ultra ATA-133</str>
<str>NoiseGuard, SilentSeek technology, Fluid Dynamic Bearing (FDB)
motor</str></arr>
<str name="id">SP2514N</str>
<bool name="inStock">true</bool>
<str name="manu">Samsung Electronics Co. Ltd.</str>
<str name="name">Samsung SpinPoint P120 SP2514N - hard drive - 250 GB -
ATA-133</str>
<int name="popularity">6</int>
<float name="price">92.0</float>
<str name="sku">SP2514N</str>
</doc>
</result>
</response>
Here's an example of a query with a limited field list.
http://yourhost.tld:9999/solr/select?q=id:SP2514N&version=2.1&indent=1&fl=id+name
Results:
<?xml version="1.0" encoding="UTF-8"?>
<response>
<responseHeader><status>0</status><QTime>2</QTime></responseHeader>
<result numFound="1" start="0">
<doc>
<str name="id">SP2514N</str>
<str name="name">Samsung SpinPoint P120 SP2514N - hard drive - 250 GB -
ATA-133</str>
</doc>
</result>
</response>
Specifying Terms for the Standard Query Parser
A query to the standard query parser is broken up into terms and operators. There are two types of terms: single
terms and phrases.
A single term is a single word such as "test" or "hello"
A phrase is a group of words surrounded by double quotes such as "hello dolly"
Multiple terms can be combined together with Boolean operators to form more complex queries (as described
below).
It is important that the analyzer used for queries parses terms and phrases in a way that is consistent with
the way the analyzer used for indexing parses terms and phrases; otherwise, searches may produce
unexpected results.
Term Modifiers
Solr supports a variety of term modifiers that add flexibility or precision, as needed, to searches. These modifiers
include wildcard characters, characters for making a search "fuzzy" or more general, and so on. The sections below
describe these modifiers in detail.
Wildcard Searches
Solr's standard query parser supports single and multiple character wildcard searches within single terms. Wildcard
characters can be applied to single terms, but not to search phrases.
Wildcard Search Type Special Character Example
Single character (matches a single character) ? The search string te?t would match both test and text.
Multiple characters (matches zero or more sequential characters) * The wildcard search tes* would match test,
testing, and tester.
You can also use wildcard characters in the middle of a term. For example: te*t would match test and text; *est
would match pest and test.
As of Solr 1.4, you can use a * or ? symbol as the first character of a search with the standard query parser.
Fuzzy Searches
Solr's standard query parser supports fuzzy searches based on the Damerau-Levenshtein Distance or Edit Distance
algorithm. Fuzzy searches discover terms that are similar to a specified term without necessarily being an exact
match. To perform a fuzzy search, use the tilde ~ symbol at the end of a single-word term. For example, to search
for a term similar in spelling to "roam," use the fuzzy search:
roam~
This search will match terms like roams, foam, & foams. It will also match the word "roam" itself.
An optional distance parameter specifies the maximum number of edits allowed, between 0 and 2, defaulting to 2.
For example:
roam~1
This will match terms like roams & foam - but not foams since it has an edit distance of "2".
In many cases, stemming (reducing terms to a common stem) can produce similar effects to fuzzy searches
and wildcard searches.
Proximity Searches
A proximity search looks for terms that are within a specific distance from one another.
To perform a proximity search, add the tilde character ~ and a numeric value to the end of a search phrase. For
example, to search for "apache" and "jakarta" within 10 words of each other in a document, use the search:
"jakarta apache"~10
The distance referred to here is the number of term movements needed to match the specified phrase. In the
example above, if "apache" and "jakarta" were 10 spaces apart in a field, but "apache" appeared before "jakarta",
more than 10 term movements would be required to move the terms together and position "apache" to the right of
"jakarta" with a space in between.
Range Searches
A range search specifies a range of values for a field (a range with an upper bound and a lower bound). The query
matches documents whose values for the specified field or fields fall within the range. Range queries can be
inclusive or exclusive of the upper and lower bounds. Sorting is done lexicographically, except on numeric fields. For
example, the range query below matches all documents whose mod_date field has a value between 20020101 and
20030101, inclusive.
mod_date:[20020101 TO 20030101]
Range queries are not limited to date fields or even numerical fields. You could also use range queries with
non-date fields:
title:{Aida TO Carmen}
This will find all documents whose titles are between Aida and Carmen, but not including Aida and Carmen.
The brackets around a query determine its inclusiveness.
Square brackets [ ] denote an inclusive range query that matches values including the upper and lower
bound.
Curly brackets { } denote an exclusive range query that matches values between the upper and lower
bounds, but excluding the upper and lower bounds themselves.
You can mix these types so one end of the range is inclusive and the other is exclusive. Here's an example:
count:{1 TO 10]
Boosting a Term with ^
Lucene/Solr provides the relevance level of matching documents based on the terms found. To boost a term use the
caret symbol ^ with a boost factor (a number) at the end of the term you are searching. The higher the boost factor,
the more relevant the term will be.
Boosting allows you to control the relevance of a document by boosting its term. For example, if you are searching
for
"jakarta apache" and you want the term "jakarta" to be more relevant, you can boost it by adding the ^ symbol along
with the boost factor immediately after the term. For example, you could type:
jakarta^4 apache
This will make documents with the term jakarta appear more relevant. You can also boost Phrase Terms as in the
example:
"jakarta apache"^4 "Apache Lucene"
By default, the boost factor is 1. Although the boost factor must be positive, it can be less than 1 (for example, it
could be 0.2).
Specifying Fields in a Query to the Standard Query Parser
Data indexed in Solr is organized in fields, which are defined in the Solr schema.xml file. Searches can take
advantage of fields to add precision to queries. For example, you can search for a term only in a specific field, such
as a title field.
The schema.xml file defines one field as a default field. If you do not specify a field in a query, Solr searches only
the default field. Alternatively, you can specify a different field or a combination of fields in a query.
To specify a field, type the field name followed by a colon ":" and then the term you are searching for within the field.
For example, suppose an index contains two fields, title and text, and that text is the default field. If you want to find a
document called "The Right Way" which contains the text "don't go this way," you could include either of the
following terms in your search query:
title:"The Right Way" AND text:go
title:"Do it right" AND go
Since text is the default field, the field indicator is not required; hence the second query above omits it.
The field is only valid for the term that it directly precedes, so the query title:Do it right will find only "Do" in
the title field. It will find "it" and "right" in the default field (in this case the text field).
Boolean Operators Supported by the Standard Query Parser
Boolean operators allow you to apply Boolean logic to queries, requiring the presence or absence of specific terms
or conditions in fields in order to match documents. The table below summarizes the Boolean operators supported
by the standard query parser.
Boolean Operator Alternative Symbol Description
AND && Requires both terms on either side of the Boolean operator to be present for a match.
NOT ! Requires that the following term not be present.
OR || Requires that either term (or both terms) be present for a match.
+ Requires that the following term be present.
- Prohibits the following term (that is, matches on fields or documents that do not include
that term). The - operator is functionally similar to the Boolean operator !. Because it's
used by popular search engines such as Google, it may be more familiar to some user
communities.
Boolean operators allow terms to be combined through logic operators. Lucene supports AND, "+", OR, NOT and "-"
as Boolean operators.
The OR operator is the default conjunction operator. This means that if there is no Boolean operator between two
terms, the OR operator is used. The OR operator links two terms and finds a matching document if either of the
terms exist in a document. This is equivalent to a union using sets. The symbol || can be used in place of the word
OR.
When specifying Boolean operators with keywords such as AND or NOT, the keywords must appear in all
uppercase.
The standard query parser supports all the Boolean operators listed in the table above. The DisMax query
parser supports only + and -.
In the schema.xml file, you can specify which symbols can take the place of Boolean operators such as OR. To
search for documents that contain either "jakarta apache" or just "jakarta," use the query:
"jakarta apache" jakarta
or
"jakarta apache" OR jakarta
The Boolean Operator +
The + symbol (also known as the "required" operator) requires that the term after the + symbol exist somewhere in a
field in at least one document in order for the query to return a match.
For example, to search for documents that must contain "jakarta" and that may or may not contain "lucene," use the
following query:
+jakarta lucene
The Boolean Operator AND (&&)
The AND operator matches documents where both terms exist anywhere in the text of a single document. This is
equivalent to an intersection using sets. The symbol && can be used in place of the word AND.
To search for documents that contain "jakarta apache" and "Apache Lucene," use either of the following queries:
"jakarta apache" AND "Apache Lucene"
"jakarta apache" && "Apache Lucene"
The Boolean Operator NOT (!)
The NOT operator excludes documents that contain the term after NOT. This is equivalent to a difference using
sets. The symbol ! can be used in place of the word NOT.
The following queries search for documents that contain the phrase "jakarta apache" but do not contain the phrase
"Apache Lucene":
"jakarta apache" NOT "Apache Lucene"
"jakarta apache" ! "Apache Lucene"
The Boolean Operator -
The - symbol or "prohibit" operator excludes documents that contain the term after the - symbol.
For example, to search for documents that contain "jakarta apache" but not "Apache Lucene," use the following
query:
"jakarta apache" -"Apache Lucene"
This operator is supported by both the standard query parser and the DisMax query parser.
Escaping Special Characters
Solr gives the following characters special meaning when they appear in a query:
+ - && || ! ( ) { } [ ] ^ " ~ * ? : /
To make Solr interpret any of these characters literally, rather than as a special character, precede the character with a
backslash character \. For example, to search for (1+1):2 without having Solr interpret the plus sign and parentheses
as special characters for formulating a sub-query with two terms, escape the characters by preceding each one with
a backslash:
\(1\+1\)\:2
Grouping Terms to Form Sub-Queries
Lucene/Solr supports using parentheses to group clauses to form sub-queries. This can be very useful if you want to
control the Boolean logic for a query.
The query below searches for either "jakarta" or "apache" and "website":
(jakarta OR apache) AND website
This adds precision to the query, requiring that the term "website" exist, along with either the term "jakarta" or
"apache."
Grouping Clauses within a Field
To apply two or more Boolean operators to a single field in a search, group the Boolean clauses within parentheses.
For example, the query below searches for a title field that contains both the word "return" and the phrase "pink
panther":
title:(+return +"pink panther")
Differences between Lucene Query Parser and the Solr Standard Query Parser
Solr's standard query parser differs from the Lucene Query Parser in the following ways:
A * may be used for either or both endpoints to specify an open-ended range query
field:[* TO 100] finds all field values less than or equal to 100
field:[100 TO *] finds all field values greater than or equal to 100
field:[* TO *] matches all documents with the field
Pure negative queries (all clauses prohibited) are allowed (only as a top-level clause)
-inStock:false finds all field values where inStock is not false
-field:[* TO *] finds all documents without a value for field
A hook into FunctionQuery syntax. You'll need to use quotes to encapsulate the function if it includes
parentheses, as shown in the second example below:
_val_:myfield
_val_:"recip(rord(myfield),1,2,3)"
Support for any type of query parser. Prior to Solr 4.1, the "magic" field _query_ needed to be used to nest
another query parser. However, with Solr 4.1, other query parsers can be used directly using the local
parameters syntax. For example:
{!geodist d=10 p=20.5,30.2}
Range queries ("[a TO z]"), prefix queries ("a*"), and wildcard queries ("a*b") are constant-scoring (all
matching documents get an equal score). The scoring factors TF, IDF, index boost, and "coord" are not used.
There is no limitation on the number of terms that match (as there was in past versions of Lucene).
Specifying Dates and Times
Queries against fields using the TrieDateField type (typically range queries) should use the appropriate date
syntax:
timestamp:[* TO NOW]
createdate:[1976-03-06T23:59:59.999Z TO *]
createdate:[1995-12-31T23:59:59.999Z TO 2007-03-06T00:00:00Z]
pubdate:[NOW-1YEAR/DAY TO NOW/DAY+1DAY]
createdate:[1976-03-06T23:59:59.999Z TO 1976-03-06T23:59:59.999Z+1YEAR]
createdate:[1976-03-06T23:59:59.999Z/YEAR TO 1976-03-06T23:59:59.999Z]
Related Topics
Local Parameters in Queries
Other Parsers
The DisMax Query Parser
The DisMax query parser is designed to process simple phrases (without complex syntax) entered by users and to
search for individual terms across several fields using different weighting (boosts) based on the significance of each
field. Additional options enable users to influence the score based on rules specific to each use case (independent
of user input).
In general, the DisMax query parser's interface is more like that of Google than the interface of the 'standard' Solr
request handler. This similarity makes DisMax the appropriate query parser for many consumer applications. It
accepts a simple syntax, and it rarely produces error messages.
The DisMax query parser supports an extremely simplified subset of the Lucene QueryParser syntax. As in Lucene,
quotes can be used to group phrases, and +/- can be used to denote mandatory and optional clauses. All other
Lucene query parser special characters (except AND and OR) are escaped to simplify the user experience. The
DisMax query parser takes responsibility for building a good query from the user's input using Boolean clauses
containing DisMax queries across fields and boosts specified by the user. It also lets the Solr administrator provide
additional boosting queries, boosting functions, and filtering queries to artificially affect the outcome of all searches.
These options can all be specified as default parameters for the handler in the solrconfig.xml file or overridden
in the Solr query URL.
Interested in the technical concept behind the DisMax name? DisMax stands for Maximum Disjunction. Here's a
definition of a Maximum Disjunction or "DisMax" query:
A query that generates the union of documents produced by its subqueries, and that scores each
document with the maximum score for that document as produced by any subquery, plus a tie
breaking increment for any additional matching subqueries.
Whether or not you remember this explanation, do remember that the DisMax request handler was primarily
designed to be easy to use and to accept almost any input without returning an error.
DisMax Parameters
In addition to the common request parameter, highlighting parameters, and simple facet parameters, the DisMax
query parser supports the parameters described below. Like the standard query parser, the DisMax query parser
allows default parameter values to be specified in , or overridden by query-time values in thesolrconfig.xml
request.
Parameter Description
q Defines the raw input strings for the query.
q.alt Calls the standard query parser and defines query input strings, when the q parameter is not used.
qf Query Fields: specifies the fields in the index on which to perform the query. If absent, defaults to df.
mm Minimum "Should" Match: specifies a minimum number of fields that must match in a query. If no
'mm' parameter is specified in the query, or as a default in , the effective value ofsolrconfig.xml
the parameter (either in the query, as a default in , or from theq.op solrconfig.xml
'defaultOperator' option in ) is used to influence the behavior. If is effectivelyschema.xml q.op
AND'ed, then mm=100%; if is OR'ed, then mm=1. Users who want to force the legacyq.op
behavior should set a default value for the 'mm' parameter in their file. Userssolrconfig.xml
should add this as a configured default for their request handlers. This parameter tolerates
miscellaneous white spaces in expressions (e.g., " 3 < -25% 10 < -3\n", " \n-25%\n ",
)." \n3\n "
pf Phrase Fields: boosts the score of documents in cases where all of the terms in the q parameter
appear in close proximity.
ps Phrase Slop: specifies the number of positions two terms can be apart in order to match the
specified phrase.
qs Query Phrase Slop: specifies the number of positions two terms can be apart in order to match the
specified phrase. Used specifically with the qf parameter.
tie Tie Breaker: specifies a float value (which should be something much less than 1) to use as
tiebreaker in DisMax queries.
bq Boost Query: specifies a factor by which a term or phrase should be "boosted" in importance when
considering a match.
bf Boost Functions: specifies functions to be applied to boosts. (See Function Queries for details about function queries.)
The sections below explain these parameters in detail.
The q Parameter
The q parameter defines the main "query" constituting the essence of the search. The parameter supports raw input
strings provided by users with no special escaping. The + and - characters are treated as "mandatory" and
"prohibited" modifiers for terms. Text wrapped in balanced quote characters (for example, "San Jose") is treated as
a phrase. Any query containing an odd number of quote characters is evaluated as if there were no quote characters
at all.
The q parameter does not support wildcard characters such as *.
The q.alt Parameter
If specified, the q.alt parameter defines a query (which by default will be parsed using standard query parsing
syntax) when the main q parameter is not specified or is blank. The q.alt parameter comes in handy when you
need something like a query to match all documents (don't forget &rows=0 for that one!) in order to get
collection-wise faceting counts.
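For instance, a request along these lines (the cat facet field is just an example from the sample data) returns collection-wide facet counts without returning any documents:
http://localhost:8983/solr/select?defType=dismax&q.alt=*:*&rows=0&facet=true&facet.field=cat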
The qf (Query Fields) Parameter
The qf parameter introduces a list of fields, each of which is assigned a boost factor to increase or decrease that
particular field's importance in the query. For example, the query below:
qf="fieldOne^2.3 fieldTwo fieldThree^0.4"
assigns fieldOne a boost of 2.3, leaves fieldTwo with the default boost (because no boost factor is specified),
and fieldThree a boost of 0.4. These boost factors make matches in fieldOne much more significant than
matches in fieldTwo, which in turn are much more significant than matches in fieldThree.
The mm (Minimum Should Match) Parameter
When processing queries, Lucene/Solr recognizes three types of clauses: mandatory, prohibited, and "optional"
(also known as "should" clauses). By default, all words or phrases specified in the q parameter are treated as
"optional" clauses unless they are preceded by a "+" or a "-". When dealing with these "optional" clauses, the mm
parameter makes it possible to say that a certain minimum number of those clauses must match. The DisMax query
parser offers great flexibility in how the minimum number can be specified.
The table below explains the various ways that mm values can be specified.
Syntax Example Description
Positive integer 3 Defines the minimum number of clauses that must match, regardless of how
many clauses there are in total.
Negative integer -2 Sets the minimum number of matching clauses to the total number of
optional clauses, minus this value.
Percentage 75% Sets the minimum number of matching clauses to this percentage of the
total number of optional clauses. The number computed from the
percentage is rounded down and used as the minimum.
Negative percentage -25% Indicates that this percent of the total number of optional clauses can be
missing. The number computed from the percentage is rounded down,
before being subtracted from the total to determine the minimum number.
An expression
beginning with a
positive integer
followed by a > or <
sign and another value
3<90% Defines a conditional expression indicating that if the number of optional
clauses is equal to (or less than) the integer, they are all required, but if it's
greater than the integer, the specification applies. In this example: if there
are 1 to 3 clauses they are all required, but for 4 or more clauses only 90%
are required.
Multiple conditional
expressions involving
> or < signs
2<-25%
9<-3
Defines multiple conditions, each one being valid only for numbers greater
than the one before it. In the example at left, if there are 1 or 2 clauses, then
both are required. If there are 3-9 clauses all but 25% are required. If there
are more then 9 clauses, all but three are required.
When specifying mm values, keep in mind the following:
When dealing with percentages, negative values can be used to get different behavior in edge cases. 75%
and -25% mean the same thing when dealing with 4 clauses, but when dealing with 5 clauses 75% means 3
are required, but -25% means 4 are required.
If the calculations based on the parameter arguments determine that no optional clauses are needed, the
usual rules about Boolean queries still apply at search time. (That is, a Boolean query containing no required
clauses must still match at least one optional clause).
No matter what number the calculation arrives at, Solr will never use a value greater than the number of
optional clauses, or a value less than 1. (In other words, no matter how low or how high the calculated result,
the minimum number of required matches will never be less than 1 or greater than the number of clauses.)
The default value of mm is 100% (meaning that all clauses must match).
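As a simple sketch, reusing the example terms from the DisMax examples later in this section, the following request treats all three words as optional clauses but requires at least two of them to match:
http://localhost:8983/solr/select/?defType=dismax&q=belkin+ipod+apple&mm=2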
The pf (Phrase Fields) Parameter
Once the list of matching documents has been identified using the fq and qf parameters, the pf parameter can be
used to "boost" the score of documents in cases where all of the terms in the q parameter appear in close proximity.
The format is the same as that used by the qf parameter: a list of fields and "boosts" to associate with each of them
when making phrase queries out of the entire q parameter.
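For example, a request along these lines (the field names are only illustrative) searches the name and features fields but boosts documents where the whole query appears as a phrase in the name field:
http://localhost:8983/solr/select/?defType=dismax&q=hard+drive&qf=name+features&pf=name^10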
The ps (Phrase Slop) Parameter
The ps parameter specifies the amount of "phrase slop" to apply to queries specified with the pf parameter. Phrase
slop is the number of positions one token needs to be moved in relation to another token in order to match a phrase
specified in a query.
The qs (Query Phrase Slop) Parameter
The qs parameter specifies the amount of slop permitted on phrase queries explicitly included in the user's query
string with the qf parameter. As explained above, slop refers to the number of positions one token needs to be
moved in relation to another token in order to match a phrase specified in a query.
The tie (Tie Breaker) Parameter
The tie parameter specifies a float value (which should be something much less than 1) to use as tiebreaker in
DisMax queries.
When a term from the user's input is tested against multiple fields, more than one field may match. If so, each field
will generate a different score based on how common that word is in that field (for each document relative to all
other documents). The tie parameter lets you control how much the final score of the query will be influenced by
the scores of the lower scoring fields compared to the highest scoring field.
A value of "0.0" makes the query a pure "disjunction max query": that is, only the maximum scoring subquery
contributes to the final score. A value of "1.0" makes the query a pure "disjunction sum query" where it doesn't
matter what the maximum scoring sub query is, because the final score will be the sum of the subquery scores.
Typically a low value, such as 0.1, is useful.
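For example, a request such as the following (field names and boosts are illustrative) lets the highest scoring field dominate while still allowing a lower scoring field to contribute a tenth of its score:
http://localhost:8983/solr/select/?defType=dismax&q=video&qf=name^2+features&tie=0.1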
The bq (Boost Query) Parameter
The bq parameter specifies an additional, optional, query clause that will be added to the user's main query to
influence the score. For example, if you wanted to add a relevancy boost for recent documents:
q=cheese
bq=date:[NOW/DAY-1YEAR TO NOW/DAY]
You can specify multiple bq parameters. If you want your query to be parsed as separate clauses with separate
boosts, use multiple bq parameters.
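For instance, a request might combine a recency boost with a category boost by repeating the parameter (the field names follow the sample data used elsewhere in this section):
q=cheese&bq=date:[NOW/DAY-1YEAR TO NOW/DAY]&bq=cat:electronics^5.0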
The bf (Boost Functions) Parameter
The bf parameter specifies functions (with optional boosts) that will be used to construct FunctionQueries which will
be added to the user's main query as optional clauses that will influence the score. Any function supported natively
by Solr can be used, along with a boost value. For example:
recip(rord(myfield),1,2,3)^1.5
Specifying functions with the bf parameter is essentially just shorthand for using the bq parameter combined with
the {!func} parser.
For example, if you want to show the most recent documents first, you could use either of the following:
bf=recip(rord(creationDate),1,1000,1000)
...or...
bq={!func}recip(rord(creationDate),1,1000,1000)
Examples of Queries Submitted to the DisMax Query Parser
Normal results for the word "video" using the StandardRequestHandler with the default search field:
http://localhost:8983/solr/select/?q=video&fl=name+score
The "dismax" handler is configured to search across the text, features, name, sku, id, manu, and cat fields all with
varying boosts designed to ensure that "better" matches appear first, specifically: documents which match on the
name and cat fields get higher scores.
http://localhost:8983/solr/select/?defType=dismax&q=video
Note that this instance is also configured with a default field list, which can be overridden in the URL.
http://localhost:8983/solr/select/?defType=dismax&q=video&fl=*,score
You can also override which fields are searched on and how much boost each field gets.
http://localhost:8983/solr/select/?defType=dismax&q=video&qf=features^20.0+text^0.3
You can boost results that have a field that matches a specific value.
http://localhost:8983/solr/select/?defType=dismax&q=video&bq=cat:electronics^5.0
Another instance of the handler is registered using qt="instock" and has slightly different configuration options,
notably: a filter for (you guessed it) inStock:true.
http://localhost:8983/solr/select/?defType=dismax&q=video&fl=name,score,inStock
http://localhost:8983/solr/select/?defType=dismax&q=video&qt=instock&fl=name,score,inStock
One of the other really cool features in this handler is robust support for specifying the
"BooleanQuery.minimumNumberShouldMatch" you want to be used based on how many terms are in your user's
query. This allows flexibility for typos and partial matches. For the dismax handler, one and two word queries
require that all of the optional clauses match, but for three to five word queries one missing word is allowed.
http://localhost:8983/solr/select/?defType=dismax&q=belkin+ipod
http://localhost:8983/solr/select/?defType=dismax&q=belkin+ipod+gibberish
http://localhost:8983/solr/select/?defType=dismax&q=belkin+ipod+apple
Just like the StandardRequestHandler, it supports the debugQuery option for viewing the parsed query and the
score explanations for each document.
http://localhost:8983/solr/select/?defType=dismax&q=belkin+ipod+gibberish&debugQuery=true
http://localhost:8983/solr/select/?defType=dismax&q=video+card&debugQuery=true
The Extended DisMax Query Parser
The Extended DisMax (eDisMax) query parser is an improved version of the . In addition toDisMax query parser
supporting all the DisMax query parser parameters, Extended Dismax:
supports the full Lucene query parser syntax.
supports queries such as AND, OR, NOT, -, and +.
treats "and" and "or" as "AND" and "OR" in Lucene syntax mode.
respects the 'magic field' names _val_ and _query_. These are not real fields in schema.xml, but if used
they help do special things (like a function query in the case of _val_ or a nested query in the case of
_query_). If _val_ is used in a term or phrase query, the value is parsed as a function.
includes improved smart partial escaping in the case of syntax errors; fielded queries, +/-, and phrase queries
are still supported in this mode.
improves proximity boosting by using word shingles; you do not need the query to match all words in the
document before proximity boosting is applied.
includes advanced stopword handling: stopwords are not required in the mandatory part of the query but are
still used in the proximity boosting part. If a query consists of all stopwords, such as "to be or not to be", then
all words are required.
includes an improved boost function: in Extended DisMax, the boost function is a multiplier rather than an
addend, improving your boost results; the additive boost functions of DisMax (bf and bq) are also supported.
supports pure negative nested queries: queries such as +foo (-foo) will match all documents.
lets you specify which fields the end user is allowed to query, and to disallow direct fielded searches.
Extended DisMax Parameters
In addition to all the DisMax parameters, Extended DisMax includes these query parameters:
The boost Parameter
A multivalued list of strings parsed as queries with scores multiplied by the score from the main query for all
matching documents. This parameter is shorthand for wrapping the query produced by eDisMax using the
BoostQParserPlugin.
The lowercaseOperators Parameter
A Boolean parameter indicating if lowercase "and" and "or" should be treated the same as operators "AND" and
"OR".
The ps Parameter
Default amount of slop on phrase queries built with pf, pf2, and/or pf3 fields (affects boosting).
The pf2 Parameter
A multivalued list of fields with optional weights, based on pairs of word shingles.
The ps2 Parameter
New with Solr 4, this parameter is similar to ps but sets the default slop factor for pf2 (affects boosting). If not
specified, ps is used.
The pf3 Parameter
A multivalued list of fields with optional weights, based on triplets of word shingles. Similar to pf, except that instead
of building a phrase per field out of all the words in the input, it builds a set of phrases for each field out of each
triplet of word shingles.
The ps3 Parameter
New with Solr 4. As with ps but sets the default slop factor for pf3. If not specified, ps will be used.
The stopwords Parameter
A Boolean parameter indicating if the StopFilterFactory configured in the query analyzer should be respected
when parsing the query: if it is false, then the StopFilterFactory in the query analyzer is ignored.
The uf Parameter
Specifies which schema fields the end user is allowed to explicitly query. This parameter supports wildcards. The
default is to allow all fields, equivalent to uf=*. To allow only the title field, use uf=title. To allow title and all fields
ending with _s, use uf=title,*_s. To allow all fields except title, use uf=*-title. To disallow all fielded
searches, use uf=-*.
Field aliasing using per-field qf overrides
Per-field overrides of the qf parameter may be specified to provide 1-to-many aliasing from field names specified in
the query string, to field names used in the underlying query. By default, no aliasing is used and field names
specified in the query string are treated as literal field names in the index.
Examples of Queries Submitted to the Extended DisMax Query Parser
Boost the result of the query term "hello" based on the document's popularity:
http://localhost:8983/solr/select/?defType=edismax&q=hello&pf=text&qf=text&boost=popularity
Search for iPods OR video:
http://localhost:8983/solr/select/?defType=edismax&q=ipod OR video
Search across multiple fields, specifying (via boosts) how important each field is relative each other:
http://localhost:8983/solr/select/?q=video&defType=edismax&qf=features^20.0+text^0.3
You can boost results that have a field that matches a specific value:
http://localhost:8983/solr/select/?q=video&defType=edismax&qf=features^20.0+text^0.3&bq=cat:electronics^5.0
Using the "mm" param, 1 and 2 word queries require that all of the optional clauses match, but for queries with three
or more clauses one missing clause is allowed:
http://localhost:8983/solr/select/?q=belkin+ipod&defType=edismax&mm=2
http://localhost:8983/solr/select/?q=belkin+ipod+gibberish&defType=edismax&mm=2
http://localhost:8983/solr/select/?q=belkin+ipod+apple&defType=edismax&mm=2
In the example below, we see a per-field override of the qf parameter being used to alias "name" in the query string
to either the "last_name" or "first_name" fields:
defType=edismax
q=sysadmin name:Mike
qf=title text last_name first_name
f.name.qf=last_name first_name
Using negative boost
Negative query boosts have been supported at the "Query" object level for a long time (resulting in negative scores
for matching documents). Now the QueryParsers have been updated to handle this too.
Using 'slop'
Dismax and Edismax can run queries against all query fields, and also run a query in the form of a phrase against
the phrase fields. (This will work only for boosting documents, not actually for matching.) However, that phrase
query can have a 'slop,' which is the distance between the terms of the query while still considering it a phrase
match. For example:
q=foo bar
qf=field1^5 field2^10
pf=field1^50 field2^20
defType=dismax
With these parameters, the Dismax Query Parser generates a query that looks something like this:
(+(field1:foo^5 OR field2:bar^10) AND (field1:bar^5 OR field2:bar^10))
But it also generates another query that will only be used for boosting results:
field1:"foo bar"^50 OR field2:"foo bar"^20
Thus, any document that has the terms "foo" and "bar" will match; however if some of those documents have both of
the terms as a phrase, it will score much higher because it's more relevant.
If you add the ps parameter (phrase slop), for example ps=10, the second query will instead be:
field1:"foo bar"~10^50 OR field2:"foo bar"~10^20
This means that if the terms "foo" and "bar" appear in the document with less than 10 terms between each other, the
phrase will match. For example the doc that says:
*Foo* term1 term2 term3 *bar*
will match the phrase query.
How does one use phrase slop? Usually it is configured in the request handler (in solrconfig.xml).
With query slop (qs) the concept is similar, but it applies to explicit phrase queries from the user. For example, if you
want to search for a name, you could enter:
q="Hans Anderson"
A document that contains "Hans Anderson" will match, but a document that contains the middle name "Christian" or
where the name is written with the last name first ("Anderson, Hans") won't. For those cases one could configure the
query slop qs, so that even if the user searches for an explicit phrase query, a slop is applied.
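A minimal sketch of such a request (the qf field and slop value here are just examples):
defType=edismax
q="Hans Anderson"
qf=name
qs=2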
Finally, edismax contains not only the phrase fields (pf) parameter, but also phrase fields 2 and 3 (pf2 and pf3). You
can use those fields for setting different fields or boosts. Each of those can use a different phrase slop.
Using the 'magic fields' _val_ and _query_
If the 'magic field' name _val_ is used in a term or phrase query, the value is parsed as a function.
The Solr Query Parser's use of _val_ and _query_ differs from the Lucene Query Parser in the following ways:
If the magic field name _val_ is used in a term or phrase query, the value is parsed as a function.
It provides a hook into FunctionQuery syntax. Quotes are necessary to encapsulate the function when it
includes parentheses. For example:
_val_:myfield
_val_:"recip(rord(myfield),1,2,3)"
The Solr Query Parser offers nested query support for any type of query parser (via QParserPlugin). Quotes
are often necessary to encapsulate the nested query if it contains reserved characters. For example:
_query_:"{!dismax qf=myfield}how now brown cow"
Although not technically a syntax difference, note that if you use the Solr TrieDateField type (or the deprecated
DateField type), any queries on those fields (typically range queries) should use either the Complete ISO 8601
Date syntax that field supports, or the DateMath Syntax to get relative dates. For example:
timestamp:[* TO NOW]
createdate:[1976-03-06T23:59:59.999Z TO *]
createdate:[1995-12-31T23:59:59.999Z TO 2007-03-06T00:00:00Z]
pubdate:[NOW-1YEAR/DAY TO NOW/DAY+1DAY]
createdate:[1976-03-06T23:59:59.999Z TO 1976-03-06T23:59:59.999Z+1YEAR]
createdate:[1976-03-06T23:59:59.999Z/YEAR TO 1976-03-06T23:59:59.999Z]
TO must be uppercase, or Solr will report a 'Range Group' error.
Function Queries
Function queries enable you to generate a relevancy score using the actual value of
one or more numeric fields. Function queries are supported by the DisMax, Extended
DisMax, and standard query parsers.
Function queries use functions. The functions can be a constant (numeric or string
literal), a field, another function, or a parameter substitution argument. You can use
these functions to modify the ranking of results for users. These could be used to
change the ranking of results based on a user's location, or some other calculation.
Function query topics covered in this section: Using Function Query, Available Functions, Example Function
Queries, Sort By Function, Related Topics
Using Function Query
Functions must be expressed as function calls (for example, sum(a,b) instead of simply a+b).
There are several ways of using function queries in a Solr query:
Via an explicit QParser that expects function arguments, such as func or frange. For example:
q={!func}div(popularity,price)&fq={!frange l=1000}customer_ratings
In a Sort expression. For example:
sort=div(popularity,price) desc, score desc
Add the results of functions as pseudo-fields to documents in query results. For instance, for:
&fl=sum(x, y),id,a,b,c,score
the output would be:
...
<str name="id">foo</str>
<float name="sum(x,y)">40</float>
<float name="score">0.343</float>
...
Use in a parameter that is explicitly for specifying functions, such as the eDisMax query parser's boost
parameter, or the DisMax query parser's bf (boost function) parameter. (Note that the bf parameter actually
takes a list of function queries separated by white space and each with an optional boost. Make sure you
eliminate any internal white space in single function queries when using bf). For example:
q=dismax&bf="ord(popularity)^0.5 recip(rord(price),1,1000,1000)^0.3"
Introduce a function query inline in the lucene QParser with the _val_ keyword. For example:
q=_val_:mynumericfield _val_:"recip(rord(myfield),1,2,3)"
Only functions with fast random access are recommended.
Available Functions
The table below summarizes the functions available for function queries.
Function Description Syntax Examples
abs Returns the absolute value of
the specified value or function.
abs(x)
abs(-5)
and Returns a value of true if and
only if all of its operands
evaluate to true.
and(not(exists(popularity)),exists(price)): returns true for any document which has a value in the
price field, but does not have a value in the popularity field
"constant" Specifies a floating point
constant.
1.5
def def is short for default. Returns the value of field "field", or if the field does not exist, returns the default
value specified (and yields the first value where exists()==true).
def(rating,5): This def() function returns the rating, or if no rating is specified in the doc, returns 5
def(myfield, 1.0): equivalent to if(exists(myfield),myfield,1.0)
div Divides one value or function
by another. div(x,y) divides x
by y.
div(1,y)
div(sum(x,100),max(y,1))
dist Return the distance between
two vectors (points) in an
n-dimensional space. Takes in
the power, plus two or more
ValueSource instances and
calculates the distances
between the two vectors. Each
ValueSource must be a
number. There must be an
even number of ValueSource
instances passed in and the
method assumes that the first
half represent the first vector
and the second half represent
the second vector.
dist(2, x, y, 0, 0): calculates the Euclidean distance between (0,0) and (x,y) for each document
dist(1, x, y, 0, 0): calculates the Manhattan (taxicab) distance between (0,0) and (x,y) for each document
dist(2, x,y,z,0,0,0): Euclidean distance between (0,0,0) and (x,y,z) for each document
dist(1,x,y,z,e,f,g): Manhattan distance between (x,y,z) and (e,f,g) where each letter is a field name
docfreq(field,val) Returns the number of
documents that contain the
term in the field. This is a
constant (the same value for
all documents in the index).
You can quote the term if it's
more complex, or do
parameter substitution for the
term value.
docfreq(text,'solr')
...&defType=func
&q=docfreq(text,$myterm)
&myterm=solr
exists Returns TRUE if any member
of the field exists.
exists(author): returns TRUE for any document that has a value in the "author" field.
exists(query(price:5.00)): returns TRUE if "price" matches "5.00".
field Returns the numeric field
value of an indexed (not
multi-valued) field with a
maximum of one value per
document. The field() function can be called using the
name of the field as a string, or for most conventional field
names simply use the field name by itself.
0 is returned for documents
without a value in the field.
myFloatFieldName
field("my complex float fieldName")
hsin The Haversine distance
calculates the distance
between two points on a
sphere when traveling along
the sphere. The values must
be in radians. hsin also takes
a Boolean argument to specify
whether the function should
convert its output to radians.
hsin(2, true, x, y, 0, 0)
idf Inverse document frequency;
a measure of whether the term
is common or rare across all
documents. Obtained by
dividing the total number of
documents by the number of
documents containing the
term, and then taking the
logarithm of that quotient. See
also tf.
idf(fieldName,'solr'): measures the inverse of the
frequency of the occurrence of the term 'solr' in fieldName.
if Enables conditional function
queries. In if(test,value1,value2):
test is or refers to a logical value or expression
that returns a logical value (TRUE or FALSE).
value1 is the value that is returned by the function if test yields TRUE.
value2 is the value that is returned by the function if test yields FALSE.
An expression can be any
function which outputs
boolean values, or even
functions returning numeric
values, in which case value 0
will be interpreted as false, or
strings, in which case empty
string is interpreted as false.
if(termfreq(cat,'electronics'),popularity,42):
This function checks each document to see if it
contains the term "electronics" in the cat field. If it does,
then the value of the popularity field is returned;
otherwise the value of 42 is returned.
linear Implements m*x+c where m
and c are constants and x is an
arbitrary function. This is
equivalent to sum(product(m,x),c), but slightly more
efficient as it is implemented
as a single function.
linear(x,m,c)
linear(x,2,4) returns 2*x+4
log Returns the log base 10 of the
specified function.
log(x)
log(sum(x,100))
map Maps any values of an input
function x that fall within min
and max inclusive to the
specified target. The
arguments min and max must
be constants. The arguments
target and default can be
constants or functions. If the
value of x does not fall
between min and max, then
either the value of x is
returned, or a default value is
returned if specified as a 5th
argument.
map(x,min,max,target)
map(x,0,0,1) - changes any values of 0 to 1. This can be
useful in handling default 0 values.
map(x,min,max,target,default)
map(x,0,100,1,-1) - changes any values between 0 and
100 to 1, and all other values to -1.
map(x,0,100,sum(x,599),docfreq(text,solr)) -
changes any values between 0 and 100 to x+599, and all
other values to frequency of the term 'solr' in the field text.
max Returns the max of another
function and a constant, which
are specified as arguments: max(x,c). The max function is
useful for "bottoming out"
another function at some
constant.
max(myfield,0)
maxdoc Returns the number of
documents in the index,
including those that are
marked as deleted but have
not yet been purged. This is a
constant (the same value for
all documents in the index).
maxdoc()
ms Returns milliseconds of
difference between its
arguments. Dates are relative
to the Unix or POSIX time
epoch, midnight, January 1,
1970 UTC. Arguments may be
the name of an indexed TrieDateField, or date math
based on a constant date or NOW.
ms(): Equivalent to ms(NOW), the
number of milliseconds since
the epoch.
ms(a): Returns the number
of milliseconds since the
epoch that the argument
represents.
ms(a,b): Returns the
number of milliseconds that b
occurs before a (that is, a - b)
ms(NOW/DAY)
ms(2000-01-01T00:00:00Z)
ms(mydatefield)
ms(NOW,mydatefield)
ms(mydatefield,2000-01-01T00:00:00Z)
ms(datefield1,datefield2)
norm(field) Returns the "norm" stored in
the index for the specified
field. This is the product of the
index time boost and the
length normalization factor,
according to the Similarity for
the field.
norm(fieldName)
not The logically negated value of
the wrapped function.
not(exists(author)): TRUE only when exists(author)
is false.
numdocs Returns the number of
documents in the index, not
including those that are
marked as deleted but have
not yet been purged. This is a
constant (the same value for
all documents in the index).
numdocs()
or A logical disjunction. or(value1,value2): TRUE if either value1 or value2
is true.
ord Returns the ordinal of the
indexed field value within the
indexed list of terms for that
field in Lucene index order
(lexicographically ordered by
unicode value), starting at 1.
In other words, for a given
field, all values are ordered
lexicographically; this function
then returns the offset of a
particular value in that
ordering. The field must have
a maximum of one value per
document (not multi-valued). 0
is returned for documents
without a value in the field.
See also rord below.
ord(myIndexedField)
Example: If there were only three values
("apple","banana","pear") for a particular field X, then ord(X)
would be 1 for documents containing "apple", 2 for
documents containing "banana", etc.
pow Raises the specified base to
the specified power. pow(x,y)
raises x to the power of y.
pow(x,y)
pow(x,log(y))
pow(x,0.5): the same as sqrt
product Returns the product of multiple
values or functions, which are
specified in a
comma-separated list. mul(...)
may also be used as an
alias for this function.
product(x,y,...)
product(x,2)
product(x,y)
mul(x,y)
ord() depends on
the position in an
index and can change
when other
documents are
inserted or deleted.
query Returns the score for the
given subquery, or the default
value for documents not
matching the query. Any type
of subquery is supported
through either parameter
de-referencing $otherparam
or direct specification of the
query string in the Local Parameters through the v key.
query(subquery, default)
q=product(popularity, query({!dismax v='solr rocks'})): returns the product of the popularity and the
score of the DisMax query.
q=product(popularity, query($qq))&qq={!dismax}solr rocks: equivalent to
the previous query, using parameter de-referencing.
q=product(popularity, query($qq,0.1))&qq={!dismax}solr rocks:
specifies a default score of 0.1 for documents that don't
match the DisMax query.
recip Performs a reciprocal function
with recip(myfield,m,a,b)
implementing a/(m*x+b)
where m,a,b are constants,
and x is any arbitrarily
complex function.
When a and b are equal, and
x>=0, this function has a
maximum value of 1 that
drops as x increases.
Increasing the value of a and
b together results in a
movement of the entire
function to a flatter part of the
curve. These properties can
make this an ideal function for
boosting more recent
documents when x is rord(datefield).
recip(myfield,m,a,b)
recip(rord(creationDate),1,1000,1000)
rord Returns the reverse ordering
of that returned by ord.
rord(myDateField)
scale Scales values of the function x
such that they fall between the
specified minTarget and
maxTarget inclusive. The
current implementation
traverses all of the function
values to obtain the min and
max, so it can pick the correct
scale.
The current implementation
cannot distinguish when
documents have been deleted
or documents that have no
value. It uses 0.0 values for
these cases. This means that
if values are normally all
greater than 0.0, one can still
end up with 0.0 as the min
value to map from. In these
cases, an appropriate map()
function could be used as a
workaround to change 0.0 to a
value in the real range, as
shown here:
scale(map(x,0,0,5),1,2)
scale(x,minTarget,maxTarget)
scale(x,1,2): scales the values of x such that all values
will be between 1 and 2 inclusive.
sqedist The Square Euclidean
distance calculates the 2-norm
(Euclidean distance) but does
not take the square root, thus
saving a fairly expensive
operation. It is often the case
that applications that care
about Euclidean distance do
not need the actual distance,
but instead can use the
square of the distance. There
must be an even number of
ValueSource instances
passed in and the method
assumes that the first half
represent the first vector and
the second half represent the
second vector.
sqedist(x_td, y_td, 0, 0)
sqrt Returns the square root of the
specified value or function.
sqrt(x) sqrt(100) sqrt(sum(x,100))
strdist Calculate the distance
between two strings. Uses the
Lucene spell checker StringDistance interface and
supports all of the
implementations available in
that package, plus allows
applications to plug in their
own via Solr's resource
loading capabilities. strdist
takes (string1, string2,
distance measure). Possible
values for distance measure
are:
jw: Jaro-Winkler
edit: Levenshtein or Edit
distance
ngram: The NGramDistance, if
specified, can optionally pass
in the ngram size too. Default
is 2.
FQN: Fully Qualified class
Name for an implementation
of the StringDistance
interface. Must have a no-arg
constructor.
strdist("SOLR",id,edit)
sub Returns x-y from sub(x,y). sub(myfield,myfield2)
sub(100,sqrt(myfield))
sum Returns the sum of multiple
values or functions, which are
specified in a
comma-separated list. add(...)
may be used as an alias
for this function
sum(x,y,...) sum(x,1)
sum(x,y)
sum(sqrt(x),log(y),z,0.5)
add(x,y)
sumtotaltermfreq Returns the sum of totaltermfreq
values for all terms in
the field in the entire index
(i.e., the number of indexed
tokens for that field). (Aliases
sumtotaltermfreq to sttf.)
If doc1:(fieldX:A B C) and doc2:(fieldX:A A A A):
docFreq(fieldX:A) = 2 (A appears in 2 docs)
freq(doc1, fieldX:A) = 4 (A appears 4 times in doc 2)
totalTermFreq(fieldX:A) = 5 (A appears 5 times
across all docs)
sumTotalTermFreq(fieldX) = 7 (in fieldX, there are 5
As, 1 B, 1 C)
termfreq Returns the number of times
the term appears in the field
for that document.
termfreq(text,'memory')
tf Term frequency; returns the
term frequency factor for the
given term, using the Similarity
for the field. The tf-idf value
increases proportionally to
the number of times a word
appears in the document, but
is offset by the frequency of
the word in the corpus,
which helps to control for the
fact that some words are
generally more common than
others. See also .idf
tf(text,'solr')
top Causes the function query
argument to derive its values
from the top-level
IndexReader containing all
parts of an index. For
example, the ordinal of a value
in a single segment will be
different from the ordinal of
that same value in the
complete index.
The ord() and rord() functions implicitly use top(), and
hence ord(foo) is equivalent
to top(ord(foo)).
totaltermfreq Returns the number of times
the term appears in the field in
the entire index. (Aliases totaltermfreq to ttf.)
ttf(text,'memory')
xor() Logical exclusive disjunction,
or one or the other but not
both.
xor(field1,field2) returns TRUE if either field1 or
field2 is true; FALSE if both are true.
Example Function Queries
To give you a better understanding of how function queries can be used in Solr, suppose an index stores the
dimensions in meters x,y,z of some hypothetical boxes with arbitrary names stored in field boxname. Suppose we
want to search for boxes matching name findbox but ranked according to volumes of boxes. The query parameters
would be:
q=boxname:findbox _val_:"product(x,y,z)"
This query will rank the results based on volumes. In order to get the computed volume, you will need to request the
score, which will contain the resultant volume:
&fl=*, score
Suppose that you also have a field storing the weight of the box as weight. To sort by the density of the box and
return the value of the density in score, you would submit the following query:
http://localhost:8983/solr/select/?q=boxname:findbox
_val_:"div(weight,product(x,y,z))"&fl=boxname x y z weight score
Sort By Function
You can sort your query results by the output of a function. For example, to sort results by distance, you could enter:
http://localhost:8983/solr/select?q=*:*&sort=dist(2, point1, point2) desc
Sort by function also supports pseudo-fields: fields can be generated dynamically and return results as though it were a
normal field in the index. For example,
&fl=id,sum(x, y),score
would return:
<str name="id">foo</str>
<float name="sum(x,y)">40</float>
<float name="score">0.343</float>
Related Topics
FunctionQuery
Local Parameters in Queries
Local parameters are arguments in a Solr request that are specific to a query parameter. Local parameters provide
a way to add meta-data to certain argument types such as query strings. (In Solr documentation, local parameters
are sometimes referred to as LocalParams.)
Local parameters are specified as prefixes to arguments. Take the following query argument, for example:
q=solr rocks
We can prefix this query string with local parameters to provide more information to the Standard Query Parser. For
example, we can change the default operator type to "AND" and the default field to "title":
q={!q.op=AND df=title}solr rocks
These local parameters would change the query to require a match on both "solr" and "rocks" while searching the
"title" field by default.
Basic Syntax of Local Parameters
To specify a local parameter, insert the following before the argument to be modified:
Begin with {!
Insert any number of key=value pairs separated by white space
End with } and immediately follow with the query argument
You may specify only one local parameters prefix per argument. Values in the key-value pairs may be quoted via
single or double quotes, and backslash escaping works within quoted strings.
Query Type Short Form
If a local parameter value appears without a name, it is given the implicit name of "type". This allows short-form
representation for the type of query parser to use when parsing a query string. Thus
q={!dismax qf=myfield}solr rocks
is equivalent to:
q={!type=dismax qf=myfield}solr rocks
Specifying the Parameter Value with the 'v' Key
A special key of v within local parameters is an alternate way to specify the value of that parameter.
q={!dismax qf=myfield}solr rocks
is equivalent to
q={!type=dismax qf=myfield v='solr rocks'}
Parameter Dereferencing
Parameter dereferencing or indirection lets you use the value of another argument rather than specifying it directly.
This can be used to simplify queries, decouple user input from query parameters, or decouple front-end GUI
parameters from defaults set in solrconfig.xml.
q={!dismax qf=myfield}solr rocks
is equivalent to:
q={!type=dismax qf=myfield v=$qq}&qq=solr rocks
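As an illustrative sketch (the handler name, the myfield field, and the qq parameter below are hypothetical, not part of the guide's example configuration), the default portion of the query can live in solrconfig.xml so a client only supplies the dereferenced parameter:
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="q">{!dismax qf=myfield v=$qq}</str>
  </lst>
</requestHandler>
A request then only needs to send qq=solr rocks, keeping the front-end parameter decoupled from the query type and qf settings.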
Other Parsers
In addition to the main query parsers discussed earlier, there are several other query parsers that can be used
instead of or in conjunction with the main parsers for specific purposes. This section details the other parsers, and
gives examples for how they might be used.
Many of these parsers are expressed the same way as Local Parameters in Queries.
Query parsers discussed in this section:
Block Join Query Parsers
Boost Query Parser
Collapsing Query Parser
Complex Phrase Query Parser
Field Query Parser
Function Query Parser
Function Range Query Parser
Join Query Parser
Lucene Query Parser
Max Score Query Parser
Nested Query Parser
Old Lucene Query Parser
Prefix Query Parser
Raw Query Parser
Re-Ranking Query Parser
Simple Query Parser
Spatial Filter Query Parser
Surround Query Parser
Switch Query Parser
Term Query Parser
Terms Query Parser
Block Join Query Parsers
There are two query parsers that support block joins. These parsers allow indexing and searching for relational
content that has been indexed as nested documents.
The example usage of the query parsers below assumes these two documents and each of their child documents
have been indexed:
<add>
<doc>
<field name="id">1</field>
<field name="title">Solr adds block join support</field>
<field name="content_type">parentDocument</field>
<doc>
<field name="id">2</field>
<field name="comments">SolrCloud supports it too!</field>
</doc>
</doc>
<doc>
<field name="id">3</field>
<field name="title">Lucene and Solr 4.5 is out</field>
<field name="content_type">parentDocument</field>
<doc>
<field name="id">4</field>
<field name="comments">Lots of new features</field>
</doc>
</doc>
</add>
Block Join Children Query Parser
This parser takes a query that matches some parent documents and returns their children. The syntax for this parser
is: q={!child of=<allParents>}<someParents>. The allParents parameter is a filter that matches only
parent documents; here you would define the field and value that you used to identify a document as a parent. The
someParents parameter identifies a query that will match some or all of the parent documents. The output is the
children.
Using the example documents above, we can construct a query such as q={!child
of="content_type:parentDocument"}title:lucene. We only get one document in response:
<result name="response" numFound="1" start="0">
<doc>
<str name="id">4</str>
<str name="comments">Lots of new features</str>
</doc>
</result>
Block Join Parent Query Parser
This parser takes a query that matches child documents and returns their parents. The syntax for this parser is
similar: q={!parent which=<allParents>}<someChildren>. Again, the allParents parameter
is a filter that matches only parent documents; here you would define the field and value that you used to identify
a document as a parent. The someChildren parameter is a query that matches some or all of the child documents.
Note that the query for someChildren should match only child documents or you may get an exception.
Again using the example documents above, we can construct a query such as q={!parent
which="content_type:parentDocument"}comments:SolrCloud. We get this document in response:
<result name="response" numFound="1" start="0">
<doc>
<str name="id">1</str>
<arr name="title"><str>Solr adds block join support</str></arr>
<arr name="content_type"><str>parentDocument</str></arr>
</doc>
</result>
Boost Query Parser
BoostQParser extends the QParserPlugin and creates a boosted query from the input value. The main value is
the query to be boosted. Parameter b is the function query to use as the boost. The query to be boosted may be of
any type.
Examples:
Creates a query "foo" which is boosted (scores are multiplied) by the function query :log(popularity)
{!boost b=log(popularity)}foo
Creates a query "foo" which is boosted by the date boosting function referenced in :ReciprocalFloatFunction
{!boost b=recip(ms(NOW,mydatefield),3.16e-11,1,1)}foo
Collapsing Query Parser
The CollapsingQParser is really a post filter that provides more performant field collapsing than Solr's standard
approach when the number of distinct groups in the result set is high. This parser collapses the result set to a single
document per group before it forwards the result set to the rest of the search components. So all downstream
components (faceting, highlighting, etc...) will work with the collapsed result set.
Details about using the CollapsingQParser can be found in the section Collapse and Expand Results.
Complex Phrase Query Parser
The ComplexPhraseQParser provides support for wildcards, ORs, etc., inside phrase queries using Lucene's
ComplexPhraseQueryParser. Under the covers, this query parser makes use of the Span group of queries, e.g.,
spanNear, spanOr, etc., and is subject to the same limitations as that family of parsers.
Parameter Description
inOrder Set to true to force phrase queries to match terms in the order specified. Default: true
df The default search field.
Examples:
{!complexphrase inOrder=true}name:"Jo* Smith"
{!complexphrase inOrder=false}name:"(john jon jonathan~) peters*"
A mix of ordered and unordered complex phrase queries:
+_query_:"{!complexphrase inOrder=true}manu:\"a* c*\"" +_query_:"{!complexphrase
inOrder=false df=name}\"bla* pla*\""
Limitations
Performance is sensitive to the number of unique terms that are associated with a pattern. For instance, searching
for "a*" will form a large OR clause (technically a SpanOr with many terms) for all of the terms in your index for the
indicated field that start with the single letter 'a'. It may be prudent to restrict wildcards to at least two or preferably
three letters as a prefix. Allowing very short prefixes may result in too many low-quality documents being returned.
MaxBooleanClauses
You may need to increase MaxBooleanClauses in solrconfig.xml as a result of the term expansion above:
<maxBooleanClauses>4096</maxBooleanClauses>
This property is described in more detail in the section Query Sizing and Warming.
Stopwords
It is recommended not to use stopword elimination with this query parser. Let's say we add up, to to the
stopwords.txt file for your collection, and index a document containing the text "Stores up to 15,000 songs, 25,000 photos, or
150 hours of video" in a field named "features".
While the query below does not use this parser:
q=features:"Stores up to 15,000"
the document is returned. The next query that does use the Complex Phrase Query Parser, as in this query:
q=features:"sto* up to 15*"&defType=complexphrase
does not return that document because SpanNearQuery has no good way to handle stopwords in a way analogous
to PhraseQuery. If you must remove stopwords for your use case, use a custom filter factory or perhaps a
customized synonyms filter that reduces given stopwords to some impossible token.
Field Query Parser
The FieldQParser extends the QParserPlugin and creates a field query from the input value, applying text
analysis and constructing a phrase query if appropriate. The parameter f is the field to be queried.
Example:
{!field f=myfield}Foo Bar
This example creates a phrase query with "foo" followed by "bar" (assuming the analyzer for myfield is a text field
with an analyzer that splits on whitespace and lowercases terms). This is generally equivalent to the Lucene query
parser expression myfield:"Foo Bar".
Function Query Parser
The FunctionQParser extends the QParserPlugin and creates a function query from the input value. This is
only one way to use function queries in Solr; for another, more integrated, approach, see the section on Function
Queries.
Example:
{!func}log(foo)
Function Range Query Parser
The FunctionRangeQParser extends the QParserPlugin and creates a range query over a function. This is
also referred to as frange, as seen in the examples below.
Other parameters:
Parameter Description
l The lower bound, optional
u The upper bound, optional
incl Include the lower bound: true/false, optional, default=true
incu Include the upper bound: true/false, optional, default=true
Examples:
{!frange l=1000 u=50000}myfield
fq={!frange l=0 u=2.2} sum(user_ranking,editor_ranking)
Both of these examples are restricting the results by a range of values found in a declared field or a function query.
In the second example, we're doing a sum calculation, and then specifying that only values between 0 and 2.2 should be
returned to the user.
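As a hedged illustration of the incl and incu parameters described above (the price field name is assumed, not taken from the guide), the following filter keeps documents whose price is greater than 0 and at most 100 by excluding the lower bound:
fq={!frange l=0 u=100 incl=false incu=true}price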
For more information about range queries over functions, see Yonik Seeley's introductory blog post Ranges over
, hosted at SearchHub.org.Functions in Solr 1.4
Join Query Parser
JoinQParser extends the QParserPlugin. It allows normalizing relationships between documents with a join
operation. This is different in concept from a join in a relational database because no information is being truly
joined. An appropriate SQL analogy would be an "inner query".
Examples:
Find all products containing the word "ipod", join them against manufacturer docs and return the list of
manufacturers:
{!join from=manu_id_s to=id}ipod
Find all manufacturer docs named "belkin", join them against product docs, and filter the list to only products with a
price less than $12:
q = {!join from=id to=manu_id_s}compName_s:Belkin
fq = price:[* TO 12]
For more information about join queries, see the Solr Wiki page on Joins. Erick Erickson has also written a blog post
about join performance called Solr and Joins, hosted by SearchHub.org.
Lucene Query Parser
The LuceneQParser extends the QParserPlugin by parsing Solr's variant on the Lucene QueryParser syntax.
This is effectively the same query parser that is used in Lucene. It uses the operator q.op, the default operator
("OR" or "AND"), and df, the default field name.
Example:
{!lucene q.op=AND df=text}myfield:foo +bar -baz
For more information about the syntax for the Lucene Query Parser, see the Classic QueryParser javadocs.
Max Score Query Parser
The MaxScoreQParser extends the LuceneQParser but returns the Max score from the clauses. It does this by
wrapping all SHOULD clauses in a DisjunctionMaxQuery with tie=1.0. Any MUST or PROHIBITED clauses are
passed through as-is. Non-boolean queries, e.g. NumericRange, fall through to the LuceneQParser parser
behavior.
Example:
{!maxscore tie=0.01}C OR (D AND E)
Nested Query Parser
The NestedParser extends the QParserPlugin and creates a nested query, with the ability for that query to
redefine its type via local parameters. This is useful in specifying defaults in configuration and letting clients
indirectly reference them.
Example:
{!query defType=func v=$q1}
If the q1 parameter is price, then the query would be a function query on the price field. If the q1 parameter is
{!lucene}inStock:true then a term query is created from the Lucene syntax string that matches documents with
inStock=true. These parameters would be defined in solrconfig.xml, in the defaults section:
<lst name="defaults"
<str name="q1">{!lucene}inStock:true</str>
</lst>
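With that default in place, a client can override q1 at request time; a sketch (the function shown is illustrative, not from the guide) might look like:
q={!query defType=func v=$q1}&q1=div(popularity,price)
If q1 is omitted, the configured default {!lucene}inStock:true is used instead.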
For more information about the possibilities of nested queries, see Yonik Seeley's blog post Nested Queries in Solr,
hosted by SearchHub.org.
Old Lucene Query Parser
OldLuceneQParser extends the QParserPlugin by parsing Solr's variant of Lucene's QueryParser syntax,
including the deprecated sort specification after the query.
Example:
{!lucenePlusSort} myfield:foo +bar -baz;price asc
Prefix Query Parser
PrefixQParser extends the QParserPlugin by creating a prefix query from the input value. Currently no
analysis or value transformation is done to create this prefix query. The parameter is f, the field. The string after the
prefix declaration is treated as a wildcard query.
Example:
{!prefix f=myfield}foo
This would be generally equivalent to the Lucene query parser expression myfield:foo*.
Raw Query Parser
RawQParser extends the QParserPlugin by creating a term query from the input value without any text analysis
or transformation. This is useful in debugging, or when raw terms are returned from the terms component (this is not
the default). The only parameter is f, which defines the field to search.
Example:
{!raw f=myfield}Foo Bar
This example constructs the query: TermQuery(Term("myfield","Foo Bar")).
For easy filter construction to drill down in faceting, the TermQParserPlugin is recommended. For full analysis on all
fields, including text fields, you may want to use the FieldQParserPlugin.
Re-Ranking Query Parser
The ReRankQParserPlugin is a special purpose parser for Re-Ranking the top results of a simple query using a
more complex ranking query.
Details about using the ReRankQParserPlugin can be found in the Other Parsers section.
Simple Query Parser
The Simple query parser in Solr is based on Lucene's SimpleQueryParser. This query parser is designed to allow
users to enter queries however they want, and it will do its best to interpret the query and return results.
This parser takes the following parameters:
Parameter Description
q.operator Enables specific operations for parsing. By default, all operations are enabled, and this can be used
to disable specific operations as needed. Passing an empty string with this parameter disables all
operations.
Operator Description Example
+ Specifies AND token1+token2
| Specifies OR token1|token2
- Specifies NOT -token3
" Creates a phrase "term1 term2"
* Specifies a prefix query term*
~N At the end of terms, specifies a fuzzy query term~1
~N At the end of phrases, specifies a NEAR query "term1
term2"~5
( ) Specifies precedence; tokens inside the parenthesis will be
analyzed first. Otherwise, normal order is left to right.
token1 +
(token2 |
token3)
If needed, operations can be escaped with the \ character.
q.op Defines an operator to use by default if none are defined by the user. By default, OR is defined; an
alternative option is AND.
qf A list of query fields and boosts to use when building the query.
df Defines the default field if none is defined in schema.xml, or overrides the default field if it is
already defined.
Any errors in syntax are ignored and the query parser will interpret as best it can. This can mean, however, odd
results in some cases.
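A hedged example request (field names and terms are illustrative, not from the guide) combining several of the operators above with the q.op and df parameters:
q=apache +(solr | lucene) -panda "search server"~2&defType=simple&q.op=AND&df=text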
Spatial Filter Query Parser
SpatialFilterQParser extends the QParserPlugin by creating a spatial Filter based on the type of spatial
point used. The field must implement SpatialQueryable. All units are in Kilometers.
This query parser takes the following parameters:
Parameter Description
sfield The field on which to filter. Required.
pt The point to use as a reference. Must match the dimension of the field. Required.
d The distance in km. Required.
The distance measure used currently depends on the FieldType. LatLonType defaults to using haversine,
PointType defaults to Euclidean (2-norm).
This example shows the syntax:
{!geofilt sfield=<location_field> pt=<lat,lon> d=<distance>}
Here are some examples with values configured:
fq={!geofilt sfield=store pt=10.312,-20.556 d=3.5}
fq={!geofilt sfield=store}&pt=10.312,-20&d=3.5
fq={!geofilt}&sfield=store&pt=10.312,-20&d=3.5
If using geofilt with LatLonType, it is capable of producing scores equal to the computed distance from the point
to the field, making it useful as a component of the main query or a boosting query.
There is more information about spatial searches available in the section Spatial Search.
Surround Query Parser
SurroundQParser extends the QParserPlugin. This provides support for the Surround query syntax, which
provides proximity search functionality. There are two operators: w creates an ordered span query and n creates an
unordered one. Both operators take a numeric value to indicate distance between two terms. The default is 1, and
the maximum is 99. Note that the query string is not analyzed in any way.
Example:
{!surround} 3w(foo, bar)
This example would find documents where the terms "foo" and "bar" were no more than 3 terms away from each
other (i.e., no more than 2 terms between them).
This query parser will also accept boolean operators (AND, OR, and NOT, in either upper- or lowercase), wildcards,
quoting for phrase searches, and boosting. The w and n operators can also be expressed in upper- or lowercase.
More information about Surround queries can be found at .http://wiki.apache.org/solr/SurroundQueryParser
Switch Query Parser
SwitchQParser is a QParserPlugin that acts like a "switch" or "case" statement.
The primary input string is trimmed and then prefixed with case. for use as a key to look up a "switch case" in the
parser's local params. If a matching local param is found the resulting param value will then be parsed as a
subquery, and returned as the parse result.
The case local param can optionally be specified as a switch case to match missing (or blank) input strings. The
default local param can optionally be specified as a default case to use if the input string does not match any
other switch case local params. If default is not specified, then any input which does not match a switch case local
param will result in a syntax error.
In the examples below, the result of each query is "XXX":
{!switch case.foo=XXX case.bar=zzz case.yak=qqq}foo
{!switch case.foo=qqq case.bar=XXX case.yak=zzz} bar // extra whitespace is trimmed
{!switch case.foo=qqq case.bar=zzz default=XXX}asdf // fallback to the default
{!switch case=XXX case.bar=zzz case.yak=qqq} // blank input uses 'case'
A practical usage of this QParserPlugin is in specifying appends fq params in the configuration of a
SearchHandler, to provide a fixed set of filter options for clients using custom parameter names. Using the example
configuration below, clients can optionally specify the custom parameters in_stock and shipping to override the
default filtering behavior, but are limited to the specific set of legal values (shipping=any|free, in_stock=yes|no|all).
<requestHandler name="/select" class="solr.SearchHandler">
<lst name="defaults">
<str name="in_stock">yes</str>
<str name="shipping">any</str>
</lst>
<lst name="appends">
<str name="fq">{!switch case.all='*:*'
case.yes='inStock:true'
case.no='inStock:false'
v=$in_stock}</str>
<str name="fq">{!switch case.any='*:*'
case.free='shipping_cost:0.0'
v=$shipping}</str>
</lst>
</requestHandler>
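With the configuration above, a client request might look like the following sketch; the appended filters then resolve to inStock:false and shipping_cost:0.0:
http://localhost:8983/solr/select?q=ipod&in_stock=no&shipping=free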
Term Query Parser
TermQParser extends the QParserPlugin by creating a single term query from the input value equivalent to
readableToIndexed(). This is useful for generating filter queries from the external human readable terms returned
by the faceting or terms components. The only parameter is f, for the field.
Example:
{!term f=weight}1.5
For text fields, no analysis is done since raw terms are already returned from the faceting and terms components.
To apply analysis to text fields as well, see the Field Query Parser, above.
If no analysis or transformation is desired for any type of field, see the Raw Query Parser, above.
Terms Query Parser
TermsQParser functions similarly to the Term Query Parser but takes in multiple values separated by commas and
returns documents matching any of the specified values. This can be useful for generating filter queries from the
external human readable terms returned by the faceting or terms components, and may be more efficient in some
cases than using the Standard Query Parser to generate a boolean query, since the default "method" implementation
avoids scoring.
This query parser takes the following parameters:
Parameter Description
f The field on which to search. Required.
separator Separator to use when parsing the input. If set to " " (a single blank space), will trim additional white
space from the input terms. Defaults to ",".
method The internal implementation requested for building the query: termsFilter, booleanQuery,
automaton, or docValuesTermsFilter. Defaults to "termsFilter".
Examples:
{!terms f=tags}software,apache,solr,lucene
{!terms f=categoryId method=booleanQuery separator=" "}8 6 7 5309
Faceting
As described in the section Overview of Searching in Solr, faceting is the
arrangement of search results into categories based on indexed terms.
Searchers are presented with the indexed terms, along with numerical counts of
how many matching documents were found for each term. Faceting makes it
easy for users to explore search results, narrowing in on exactly the results they
are looking for.
Topics covered in this section:
General Parameters
Field-Value Faceting Parameters
Range Faceting
Date Faceting Parameters
Pivot (Decision Tree) Faceting
Interval Faceting
Local Parameters for Faceting
Related Topics
General Parameters
The table below summarizes the general parameters for controlling faceting.
Parameter Description
facet If set to true, enables faceting.
facet.query Specifies a Lucene query to generate a facet count.
These parameters are described in the sections below.
The facet Parameter
If set to "true," this parameter enables facet counts in the query response. If set to "false" to a blank or missing
value, this parameter disables faceting. None of the other parameters listed below will have any effect unless this
parameter is set to "true." The default value is blank.
The facet.query Parameter
This parameter allows you to specify an arbitrary query in the Lucene default syntax to generate a facet count. By
default, Solr's faceting feature automatically determines the unique terms for a field and returns a count for each of
those terms. Using facet.query, you can override this default behavior and select exactly which terms or
expressions you would like to see counted. In a typical implementation of faceting, you will specify a number of
facet.query parameters. This parameter can be particularly useful for numeric-range-based facets or prefix-based
facets.
You can set the facet.query parameter multiple times to indicate that multiple queries should be used as
separate facet constraints.
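For example, a sketch of numeric-range-based facet counts over a hypothetical price field might look like:
q=*:*&facet=true&facet.query=price:[* TO 100]&facet.query=price:[100 TO 500]&facet.query=price:[500 TO *]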
To use facet queries in a syntax other than the default syntax, prefix the facet query with the name of the query
notation. For example, to use the hypothetical myfunc query parser, you could set the facet.query parameter
like so:
facet.query={!myfunc}name~fred
Field-Value Faceting Parameters
Several parameters can be used to trigger faceting based on the indexed terms in a field.
When using this parameter, it is important to remember that "term" is a very specific concept in Lucene: it relates to
the literal field/value pairs that are indexed after any analysis occurs. For text fields that include stemming,
lowercasing, or word splitting, the resulting terms may not be what you expect. If you want Solr to perform both
analysis (for searching) and faceting on the full literal strings, use the copyField directive in the schema.xml file
to create two versions of the field: one Text and one String. Make sure both are indexed="true". (For more
information about the copyField directive, see Documents, Fields, and Schema Design.)
The table below summarizes Solr's field value faceting parameters.
Parameter Description
facet.field Identifies a field to be treated as a facet.
facet.prefix Limits the terms used for faceting to those that begin with the specified prefix.
facet.sort Controls how faceted results are sorted.
facet.limit Controls how many constraints should be returned for each facet.
facet.offset Specifies an offset into the facet results at which to begin displaying facets.
facet.mincount Specifies the minimum counts required for a facet field to be included in the response.
facet.missing Controls whether Solr should compute a count of all matching results which have no
value for the field, in addition to the term-based constraints of a facet field.
facet.method Selects the algorithm or method Solr should use when faceting a field.
facet.enum.cache.minDF (Advanced) Specifies the minimum document frequency (the number of documents
matching a term) for which the filterCache should be used when determining the
constraint count for that term.
facet.overrequest.count (Advanced) A number of documents, beyond the effective facet.limit, to request
from each shard in a distributed search
facet.overrequest.ratio (Advanced) A multiplier of the effective facet.limit to request from each shard in a
distributed search
facet.threads (Advanced) Controls parallel execution of field faceting
These parameters are described in the sections below.
The facet.field Parameter
The facet.field parameter identifies a field that should be treated as a facet. It iterates over each Term in the
field and generates a facet count using that Term as the constraint. This parameter can be specified multiple times in
a query to select multiple facet fields.
If you do not set this parameter to at least one field in the schema, none of the other parameters described
in this section will have any effect.
The facet.prefix Parameter
The facet.prefix parameter limits the terms on which to facet to those starting with the given string prefix. This
does not limit the query in any way, only the facets that would be returned in response to the query.
This parameter can be specified on a per-field basis with the syntax of f.<fieldname>.facet.prefix.
The facet.sort Parameter
This parameter determines the ordering of the facet field constraints.
facet.sort
Setting
Results
count Sort the constraints by count (highest count first).
index Return the constraints sorted in their index order (lexicographic by indexed term). For terms in
the ASCII range, this will be alphabetically sorted.
The default is count if facet.limit is greater than 0; otherwise, the default is index. (The true/false values
for this parameter were deprecated in Solr 1.4.)
This parameter can be specified on a per-field basis with the syntax of f.<fieldname>.facet.sort.
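A hedged combined example (cat and manu come from Solr's sample data; the specific values are illustrative) showing per-field overrides of the field-value faceting parameters:
facet=true&facet.field=cat&facet.field=manu&f.cat.facet.limit=5&f.cat.facet.sort=count&f.manu.facet.sort=index&f.manu.facet.prefix=a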
The facet.limit Parameter
This parameter specifies the maximum number of constraint counts (essentially, the number of facets for a field that
are returned) that should be returned for the facet fields. A negative value means that Solr will return an unlimited
number of constraint counts.
The default value is 100.
This parameter can be specified on a per-field basis to apply a distinct limit to each field with the syntax of
f.<fieldname>.facet.limit.
The facet.offset Parameter
The facet.offset parameter indicates an offset into the list of constraints to allow paging.
The default value is 0.
This parameter can be specified on a per-field basis with the syntax of f.<fieldname>.facet.offset.
The facet.mincount Parameter
The facet.mincount parameter specifies the minimum counts required for a facet field to be included in the
response. If a field's counts are below the minimum, the field's facet is not returned.
The default value is 0.
This parameter can be specified on a per-field basis with the syntax of f.<fieldname>.facet.mincount.
The facet.missing Parameter
If set to true, this parameter indicates that, in addition to the Term-based constraints of a facet field, a count of all
results that match the query but which have no facet value for the field should be computed and returned in the
response.
The default value is false.
This parameter can be specified on a per-field basis with the syntax of f.<fieldname>.facet.missing.
The facet.method Parameter
The facet.method parameter selects the type of algorithm or method Solr should use when faceting a field.
Setting Results
enum Enumerates all terms in a field, calculating the set intersection of documents that match the term with
documents that match the query. This method is recommended for faceting multi-valued fields that
have only a few distinct values. The average number of values per document does not matter. For
example, faceting on a field with U.S. States such as would leadAlabama, Alaska, ... Wyoming
to fifty cached filters which would be used over and over again. The filterCache should be large
enough to hold all the cached filters.
fc Calculates facet counts by iterating over documents that match the query and summing the terms that
appear in each document. This is currently implemented using an UnInvertedField cache if the field
either is multi-valued or is tokenized (according to FieldType.isTokenized()). Each document is
looked up in the cache to see what terms/values it contains, and a tally is incremented for each value.
This method is excellent for situations where the number of indexed values for the field is high, but the
number of values per document is low. For multi-valued fields, a hybrid approach is used that uses term
filters from the filterCache for terms that match many documents. The letters fc stand for field
cache.
fcs Per-segment field faceting for single-valued string fields. Enable with facet.method=fcs and control
the number of threads used with the threads local parameter. This parameter allows faceting to be
faster in the presence of rapid index changes.
The default value is fc (except for fields using the BoolField field type) since it tends to use less memory and is
faster when a field has many unique terms in the index.
This parameter can be specified on a per-field basis with the syntax of f.<fieldname>.facet.method.
The facet.enum.cache.minDf Parameter
This parameter indicates the minimum document frequency (the number of documents matching a term) for which
the filterCache should be used when determining the constraint count for that term. This is only used with the
facet.method=enum method of faceting.
A value greater than zero decreases the filterCache's memory usage, but increases the time required for the query
to be processed. If you are faceting on a field with a very large number of terms, and you wish to decrease memory
usage, try setting this parameter to a value between 25 and 50, and run a few tests. Then, optimize the parameter
setting as necessary.
The default value is 0, causing the filterCache to be used for all terms in the field.
This parameter can be specified on a per-field basis with the syntax of
f.<fieldname>.facet.enum.cache.minDF.
Over-Request Parameters
In some situations, the accuracy in selecting the "top" constraints returned for a facet in a distributed Solr query can
be improved by "Over Requesting" the number of desired constraints (ie: ) from each of the individualfacet.limit
Shards. In these situations, each shard is by default asked for the top " "10 + (1.5 * facet.limit)
constraints.
In some situations, depending on how your docs are partitioned across your shards, and what facet.limit value
you used, you may find it advantageous to increase or decrease the amount of over-requesting Solr does. This can
be achieved by setting the facet.overrequest.count (defaults to 10) and facet.overrequest.ratio (defaults
to 1.5) parameters.
The facet.threads Parameter
This param will cause loading the underlying fields used in faceting to be executed in parallel with the number of
threads specified. Specify as facet.threads=N where N is the maximum number of threads used. Omitting this
parameter or specifying the thread count as 0 will not spawn any threads, and only the main request thread will be
used. Specifying a negative number of threads will create up to Integer.MAX_VALUE threads.
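For example, a sketch (field names assumed from Solr's sample schema) that loads two facet fields in parallel with two threads:
facet=true&facet.field=cat&facet.field=manu_exact&facet.threads=2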
Range Faceting
You can use Range Faceting on any date field or any numeric field that supports range queries. This is particularly
useful for stitching together a series of range queries (as facet by query) for things like prices. As of Solr 3.1, Range
Faceting is preferred over Date Faceting (described below).
Parameter Description
facet.range Specifies the field to facet by range.
facet.range.start Specifies the start of the facet range.
facet.range.end Specifies the end of the facet range.
facet.range.gap Specifies the span of the range as a value to be added to the lower bound.
facet.range.hardend A boolean parameter that specifies how Solr handles a range gap that cannot be evenly
divided between the range start and end values. If true, the last range constraint will have
the facet.range.end value as an upper bound. If false, the last range will have the
smallest possible upper bound greater than facet.range.end such that the range is the
exact width of the specified range gap. The default value for this parameter is false.
facet.range.include Specifies inclusion and exclusion preferences for the upper and lower bounds of the range.
See the facet.range.include topic for more detailed information.
facet.range.other Specifies counts for Solr to compute in addition to the counts for each facet range
constraint.
The facet.range Parameter
The facet.range parameter defines the field for which Solr should create range facets. For example:
facet.range=price&facet.range=age
The facet.range.start Parameter
The facet.range.start parameter specifies the lower bound of the ranges. You can specify this parameter on a
per field basis with the syntax of f.<fieldname>.facet.range.start. For example:
f.price.facet.range.start=0.0&f.age.facet.range.start=10
The facet.range.end Parameter
The facet.range.end specifies the upper bound of the ranges. You can specify this parameter on a per field basis
with the syntax of f.<fieldname>.facet.range.end. For example:
f.price.facet.range.end=1000.0&f.age.facet.range.end=99
The facet.range.gap Parameter
The span of each range expressed as a value to be added to the lower bound. For date fields, this should be
expressed using the DateMathParser syntax (such as facet.range.gap=%2B1DAY ... '+1DAY'). You can
specify this parameter on a per-field basis with the syntax of f.<fieldname>.facet.range.gap. For example:
f.price.facet.range.gap=100&f.age.facet.range.gap=10
Gaps can also be variable width by passing in a comma separated list of the gap size to be used. The last gap
specified will be used to fill out all remaining gaps if the number of gaps given does not go evenly into the range.
Variable width gaps are useful, for example, in spatial applications where one might want to facet by distance into
three buckets: walking (0-5KM), driving (5-100KM), or other (100KM+). For example:
facet.date.gap=1,2,3,10
This creates 4+ buckets of size, 1, 2, 3 and then 0 or more buckets of 10 days each, depending on the start and end
values.
The facet.range.hardend Parameter
The facet.range.hardend parameter is a Boolean parameter that specifies how Solr should handle cases
where the facet.range.gap does not divide evenly between facet.range.start and facet.range.end. If
true, the last range constraint will have the facet.range.end value as an upper bound. If false, the last range will
have the smallest possible upper bound greater than facet.range.end such that the range is the exact width of
the specified range gap. The default value for this parameter is false.
This parameter can be specified on a per field basis with the syntax f.<fieldname>.facet.range.hardend.
The facet.range.include Parameter
By default, the ranges used to compute range faceting between facet.range.start and facet.range.end are
inclusive of their lower bounds and exclusive of the upper bounds. The "before" range defined with the
facet.range.other parameter is exclusive and the "after" range is inclusive. This default, equivalent to "lower"
below, will not result in double counting at the boundaries. You can use the facet.range.include parameter to
modify this behavior using the following options:
Option Description
lower All gap-based ranges include their lower bound.
upper All gap-based ranges include their upper bound.
edge The first and last gap ranges include their edge bounds (lower for the first one, upper for the last one)
even if the corresponding upper/lower option is not specified.
outer The "before" and "after" ranges will be inclusive of their bounds, even if the first or last ranges already
include those boundaries.
all Includes all options: lower, upper, edge, outer.
You can specify this parameter on a per field basis with the syntax of f.<fieldname>.facet.range.include,
and you can specify it multiple times to indicate multiple choices.
The facet.range.other Parameter
The facet.range.other parameter specifies that in addition to the counts for each range constraint between
facet.range.start and facet.range.end, counts should also be computed for these options:
Option Description
before All records with field values lower than the lower bound of the first range.
after All records with field values greater than the upper bound of the last range.
between All records with field values between the start and end bounds of all ranges.
none Do not compute any counts.
all Compute counts for before, between, and after.
This parameter can be specified on a per field basis with the syntax of f.<fieldname>.facet.range.other. In
addition to the all option, this parameter can be specified multiple times to indicate multiple choices, but none will
override all other options.
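Putting these parameters together, a hedged example of a complete range facet request over a hypothetical price field might be:
q=*:*&facet=true&facet.range=price&facet.range.start=0.0&facet.range.end=1000.0&facet.range.gap=100&facet.range.other=after&facet.range.include=edge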
To ensure you avoid double-counting, do not choose both lower and upper, do not choose outer, and do
not choose all.
Date Ranges & Time Zones
Range faceting on date fields is a common situation where the TZ parameter can be useful to ensure that
the "facet counts per day" or "facet counts per month" are based on a meaningful definition of when a given
day/month "starts" relative to a particular TimeZone.
For more information, see the examples in the Working with Dates section.
Date Faceting Parameters
Date faceting using the type specific facet.date parameters has been deprecated since Solr 3.1. Existing users
are encouraged to switch to using the more general Range Faceting, which provides the same features for date
fields, but can also work with any numeric field.
The response format is slightly different, but the request parameters are virtually identical.
Pivot (Decision Tree) Faceting
Pivoting is a summarization tool that lets you automatically sort, count, total or average data stored in a table. It
displays the results in a second table showing the summarized data. Pivot faceting lets you create a summary table
of the results from a query across numerous documents. With Solr 4, pivot faceting supports nested facet queries,
not just facet fields.
Another way to look at it is that the query produces a Decision Tree, in that Solr tells you "for facet A, the
constraints/counts are X/N, Y/M, etc. If you were to constrain A by X, then the constraint counts for B would be S/P,
T/Q, etc.". In other words, it tells you in advance what the "next" set of facet results would be for a field if you apply a
constraint from the current facet results.
facet.pivot
The facet.pivot parameter defines the fields to use for the pivot. Multiple facet.pivot values will create
multiple "facet_pivot" sections in the response. Separate each list of fields with a comma.
facet.pivot.mincount
The facet.pivot.mincount parameter defines the minimum number of documents that need to match in order
for the facet to be included in results. The default is 1.
For example, we can use Solr's example data set to make a query like this:
http://localhost:8983/solr/select?q=*:*&facet.pivot=cat,popularity,inStock
&facet.pivot=popularity,cat&facet=true&facet.field=cat&facet.limit=5
&rows=0&wt=json&indent=true&facet.pivot.mincount=2
This query will return the data below, with the pivot faceting results found in the section "facet_pivot":
"facet_counts":{
"facet_queries":{},
"facet_fields":{
"cat":[
"electronics",14,
"currency",4,
"memory",3,
"connector",2,
"graphics card",2]},
"facet_dates":{},
"facet_ranges":{},
"facet_pivot":{
"cat,popularity,inStock":[{
"field":"cat",
"value":"electronics",
"count":14,
"pivot":[{
"field":"popularity",
"value":6,
"count":5,
"pivot":[{
"field":"inStock",
"value":true,
"count":5}]},
...
Additional Pivot Parameters
Although facet.pivot.mincount deviates in name from the facet.mincount parameter used by field faceting,
many other field faceting parameters described above can also be used with pivot faceting:
facet.limit
facet.offset
facet.sort
facet.overrequest.count
facet.overrequest.ratio
Interval Faceting
Another supported form of faceting is “Interval Faceting”. This sounds similar to “Range Faceting”, but the
functionality is really closer to doing “Facet Queries” with range queries. Interval Faceting allows you to set variable
intervals and count the number of documents that have values within those intervals in the specified field. In order
to use Interval Faceting on a field, it is required that the field has "docValues" enabled. Even though the
same functionality can be achieved by using a facet query with range queries, the implementation of these two
methods is very different and will provide different performance depending on the context. If you are concerned
about the performance of your searches you should test with both options. Interval Faceting tends to be better with
multiple intervals for the same fields, while facet queries tend to be better in environments where cache is more
effective (static indexes for example).
Name What it does
facet.interval Specifies the field to facet by interval
facet.interval.set Sets the intervals for the field
The facet.interval parameter
This parameter indicates the field where interval faceting must be applied. It can be used multiple times in the same
request to indicate multiple fields. All the interval fields must have docValues="true" in the schema.
facet.interval=price&facet.interval=size
The facet.interval.set parameter
This parameter is used to set the intervals for the field. It can be specified multiple times to indicate multiple
intervals. This parameter is global, which means that it will be used for all fields indicated with facet.interval unless
there is an override for a specific field. To override this parameter on a specific field you can use
f.<fieldname>.facet.interval.set, for example:
f.price.facet.interval.set=[0,10]&f.price.facet.interval.set=(10,100]
Interval Syntax
Intervals must begin with either '(' or '[', be followed by the start value, then a comma ',', the end value, and finally ')'
or ']'.
For example:
(1,10) -> will include values greater than 1 and lower than 10
[1,10) -> will include values greater or equal to 1 and lower than 10
[1,10] -> will include values greater or equal to 1 and lower or equal to 10
The start and end values can't be empty. If the interval needs to be unbounded, the special character '*' can be
used for both the start and the end limit. When using '*', '(' and '[' are treated as equal, as are ')' and ']'. [*,*] will
include all documents with a value in the field. The interval limits may be strings; there is no need to add quotes. All
the text until the comma will be treated as the start limit, and the text after that will be the end limit, for example:
[Buenos Aires,New York]. Keep in mind that a string-like comparison will be done to match documents in string
intervals (case-sensitive). The comparator can't be changed. Commas, brackets and square brackets can be
escaped by using '\' in front of them. Whitespace before and after the values will be omitted. The start limit can't be
greater than the end limit. Equal limits are allowed, which lets you count specific values, like [A,A], [B,B] and [C,Z].
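For example, assuming a numeric price field with docValues enabled (as in Solr's example schema), a request like
the following sketch would count documents falling into three price intervals:
http://localhost:8983/solr/select?q=*:*&rows=0&facet=true
&facet.interval=price
&f.price.facet.interval.set=[0,10]
&f.price.facet.interval.set=(10,100]
&f.price.facet.interval.set=(100,*]
Similar counts could also be produced with facet queries such as facet.query=price:[0 TO 10], with the
performance trade-offs described above.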
Local Parameters for Faceting
The LocalParams syntax allows overriding global settings. It can also provide a method of adding metadata to other
parameter values, much like XML attributes.
Tagging and Excluding Filters
You can tag specific filters and exclude those filters when faceting. This is useful when doing multi-select faceting.
Consider the following example query with faceting:
q=mainquery&fq=status:public&fq=doctype:pdf&facet=true&facet.field=doctype
Because everything is already constrained by the filter doctype:pdf, the facet.field=doctype facet
command is currently redundant and will return 0 counts for everything except doctype:pdf.
To implement a multi-select facet for doctype, a GUI may want to still display the other doctype values and their
associated counts, as if the doctype:pdf constraint had not yet been applied. For example:
=== Document Type ===
[ ] Word (42)
[x] PDF (96)
[ ] Excel (11)
[ ] HTML (63)
To return counts for doctype values that are currently not selected, tag filters that directly constrain doctype, and
exclude those filters when faceting on doctype.
q=mainquery&fq=status:public&fq={!tag=dt}doctype:pdf&facet=true&facet.field={!ex=dt}d
octype
Filter exclusion is supported for all types of facets. Both the tag and ex local parameters may specify multiple
values by separating them with commas.
Changing the Output Key
To change the output key for a faceting command, specify a new name with the key local parameter. For example:
facet.field={!ex=dt key=mylabel}doctype
The parameter setting above causes the field facet results for the "doctype" field to be returned using the key
"mylabel" rather than "doctype" in the response. This can be helpful when faceting on the same field multiple times
with different exclusions.
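For example, a sketch of faceting on doctype twice, once with the dt filter excluded and once with it applied (the key
names here are arbitrary):
q=mainquery&fq=status:public&fq={!tag=dt}doctype:pdf&facet=true
&facet.field={!ex=dt key=doctype_all}doctype
&facet.field={!key=doctype_selected}doctype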
Related Topics
SimpleFacetParameters from the Solr Wiki.
Highlighting
Highlighting in Solr allows fragments of documents that match the user's query to be included with the query
response. The fragments are included in a special section of the response (the highlighting section), and the
client uses the formatting clues also included to determine how to present the snippets to users.
Solr provides a collection of highlighting utilities which allow a great deal of control over the fields fragments are
taken from, the size of fragments, and how they are formatted. The highlighting utilities can be called by various
Request Handlers and can be used with the DisMax, Extended DisMax, or standard query parsers.
There are three highlighting implementations available:
Standard Highlighter: The Standard Highlighter is the swiss-army knife of the highlighters. It has the most
sophisticated and fine-grained query representation of the three highlighters. For example, this highlighter is
capable of providing precise matches even for advanced query parsers such as the surround parser. It does
not require any special data structures such as termVectors, although it will use them if they are present. If
they are not, this highlighter will re-analyze the document on-the-fly to highlight it. This highlighter is a good
choice for a wide variety of search use-cases.
FastVector Highlighter: The FastVector Highlighter requires term vector options (termVectors,
termPositions, and termOffsets) on the field, and is optimized with that in mind. It tends to work better for
more languages than the Standard Highlighter, because it supports Unicode breakiterators. On the other hand, its
query-representation is less advanced than the Standard Highlighter: for example it will not work well with the
surround parser. This highlighter is a good choice for large documents and highlighting text in a variety of
languages.
Postings Highlighter: The Postings Highlighter requires storeOffsetsWithPositions to be configured
on the field. This is a much more compact and efficient structure than term vectors, but is not appropriate for
huge numbers of query terms (e.g. wildcard queries). Like the FastVector Highlighter, it supports Unicode
algorithms for dividing up the document. On the other hand, it has the most coarse query-representation: it
focuses on summary quality and ignores the structure of the query completely, ranking passages based
solely on query terms and statistics. This highlighter is a good choice for classic full-text keyword search.
Configuring Highlighting
The configuration for highlighting, whichever implementation is chosen, is first to configure a search component and
then reference the component in one or more request handlers.
The exact parameters for the search component vary depending on the implementation, but there is a robust
example in the default solrconfig.xml that ships with Solr out of the box. This example shows how to configure
both the Standard Highlighter and the FastVector Highlighter (see the Postings Highlighter section for details on
how to configure that implementation).
Standard Highlighter
The standard highlighter doesn't require any special indexing parameters on the fields to highlight. However, you can
optionally turn on termVectors, termPositions, and termOffsets for each field to be highlighted. This will
avoid having to run documents through the analysis chain at query time and should make highlighting faster.
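As a minimal sketch, a request using the standard highlighter against Solr's example data might look like this (the
parameters used are described in the table below):
http://localhost:8983/solr/select?q=features:cache&hl=true&hl.fl=features&hl.snippets=2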
Standard Highlighting Parameters
The table below describes Solr's parameters for the Standard highlighter. These parameters can be defined in the
highlight search component, as defaults for the specific request handler, or passed to the request handler with the
query.
hl (default: blank, no highlighting): When set to true, enables highlighted snippets to be generated in the query
response. If set to false or to a blank or missing value, disables highlighting.
hl.q (default: blank): Specifies an overriding query term for highlighting. If hl.q is specified, the highlighter will use
that term rather than the main query term.
hl.qparser (default: blank): Specifies a qparser to use for the hl.q query. If blank, will use the defType of the
overall query.
hl.fl (default: blank): Specifies a list of fields to highlight. Accepts a comma- or space-delimited list of fields for
which Solr should generate highlighted snippets. If left blank, highlights the defaultSearchField (or the field specified
in the df parameter if used) for the StandardRequestHandler. For the DisMaxRequestHandler, the qf fields are used
as defaults. A '*' can be used to match field globs, such as 'text_*' or even '*' to highlight on all fields where
highlighting is possible. When using '*', consider adding hl.requireFieldMatch=true.
hl.snippets (default: 1): Specifies the maximum number of highlighted snippets to generate per field. It is possible
for any number of snippets from zero to this value to be generated. This parameter accepts per-field overrides.
hl.fragsize (default: 100): Specifies the size, in characters, of fragments to consider for highlighting. 0 indicates
that no fragmenting should be considered and the whole field value should be used. This parameter accepts
per-field overrides.
hl.mergeContiguous (default: false): Instructs Solr to collapse contiguous fragments into a single fragment. A
value of true indicates contiguous fragments will be collapsed into a single fragment. This parameter accepts
per-field overrides. The default value, false, is also the backward-compatible setting.
hl.requireFieldMatch (default: false): If set to true, highlights terms only if they appear in the specified field. If
false, terms are highlighted in all requested fields regardless of which field matched the query.
hl.maxAnalyzedChars (default: 51200): Specifies the number of characters into a document that Solr should look
for suitable snippets.
hl.maxMultiValuedToExamine (default: integer.MAX_VALUE): Specifies the maximum number of entries in a
multi-valued field to examine before stopping. This can potentially return zero results if the limit is reached before
any matches are found. If used with hl.maxMultiValuedToMatch, whichever limit is reached first will determine
when to stop looking.
hl.maxMultiValuedToMatch (default: integer.MAX_VALUE): Specifies the maximum number of matches in a
multi-valued field that are found before stopping. If hl.maxMultiValuedToExamine is also defined, whichever
limit is reached first will determine when to stop looking.
hl.alternateField (default: blank): Specifies a field to be used as a backup default summary if Solr cannot
generate a snippet (i.e., because no terms match). This parameter accepts per-field overrides.
hl.maxAlternateFieldLength (default: unlimited): Specifies the maximum number of characters of the field to
return. Any value less than or equal to 0 means the field's length is unlimited. This parameter is only used in
conjunction with the hl.alternateField parameter.
hl.formatter (default: simple): Selects a formatter for the highlighted output. Currently the only legal value is
simple, which surrounds a highlighted term with a customizable pre- and post-text snippet. This parameter accepts
per-field overrides.
hl.simple.pre, hl.simple.post (default: <em> and </em>): Specifies the text that should appear before
(hl.simple.pre) and after (hl.simple.post) a highlighted term, when using the simple formatter. This parameter
accepts per-field overrides.
hl.fragmenter (default: gap): Specifies a text snippet generator for highlighted text. The standard fragmenter is
gap, which creates fixed-sized fragments with gaps for multi-valued fields. Another option is regex, which tries to
create fragments that resemble a specified regular expression. This parameter accepts per-field overrides.
hl.usePhraseHighlighter (default: true): If set to true, Solr will use the Lucene SpanScorer class to highlight
phrase terms only when they appear within the query phrase in the document.
hl.highlightMultiTerm (default: true): If set to true, Solr will highlight phrase terms that appear in multi-term
queries.
hl.regex.slop (default: 0.6): When using the regex fragmenter (hl.fragmenter=regex), this parameter defines
the factor by which the fragmenter can stray from the ideal fragment size (given by hl.fragsize) to accommodate
a regular expression. For instance, a slop of 0.2 with hl.fragsize=100 should yield fragments between 80 and
120 characters in length. It is usually good to provide a slightly smaller hl.fragsize value when using the regex
fragmenter.
hl.regex.pattern (default: blank): Specifies the regular expression for fragmenting. This could be used to extract
sentences.
hl.regex.maxAnalyzedChars (default: 10000): Instructs Solr to analyze only this many characters from a field
when using the regex fragmenter (after which, the fragmenter produces fixed-sized fragments). Applying a
complicated regex to a huge field is computationally expensive.
hl.preserveMulti (default: false): If true, multi-valued fields will return all values in the order they were saved in
the index. If false, only values that match the highlight request will be returned.
Related Content
HighlightingParameters from the Solr wiki
Highlighting javadocs
FastVector Highlighter
The FastVectorHighlighter is a TermVector-based highlighter that offers higher performance than the standard
highlighter in many cases. To use the FastVectorHighlighter, set the hl.useFastVectorHighlighter parameter
to true.
You must also turn on termVectors, termPositions, and termOffsets for each field that will be highlighted.
Lastly, you should use a boundary scanner to prevent the FastVectorHighlighter from truncating your terms. In most
cases, using the breakIterator boundary scanner will give you excellent results. See the section Using
Boundary Scanners with the Fast Vector Highlighter for more details about boundary scanners.
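For example, a field definition with the required term vector options, and a request that enables this highlighter,
might look like the following sketch (the field name and type are illustrative):
<field name="content" type="text_general" indexed="true" stored="true"
termVectors="true" termPositions="true" termOffsets="true"/>
http://localhost:8983/solr/select?q=content:apache&hl=true&hl.fl=content
&hl.useFastVectorHighlighter=true&hl.boundaryScanner=breakIterator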
FastVector Highlighter Parameters
The table below describes Solr's parameters for this highlighter, many of which overlap with the standard highlighter.
These parameters can be defined in the highlight search component, as defaults for the specific request handler, or
passed to the request handler with the query.
hl (default: blank, no highlighting): When set to true, enables highlighted snippets to be generated in the query
response. A false or blank value disables highlighting.
hl.useFastVectorHighlighter (default: false): When set to true, enables the FastVector Highlighter.
hl.q (default: blank): Specifies an overriding query term for highlighting. If hl.q is specified, the highlighter will use
that term rather than the main query term.
hl.fl (default: blank): Specifies a list of fields to highlight. Accepts a comma- or space-delimited list of fields for
which Solr should generate highlighted snippets. If left blank, highlights the defaultSearchField (or the field specified
in the df parameter if used) for the StandardRequestHandler. For the DisMaxRequestHandler, the qf fields are used
as defaults. A '*' can be used to match field globs, such as 'text_*' or even '*' to highlight on all fields where
highlighting is possible. When using '*', consider adding hl.requireFieldMatch=true.
hl.snippets (default: 1): Specifies the maximum number of highlighted snippets to generate per field. It is possible
for any number of snippets from zero to this value to be generated. This parameter accepts per-field overrides.
hl.fragsize (default: 100): Specifies the size, in characters, of fragments to consider for highlighting. 0 indicates
that no fragmenting should be considered and the whole field value should be used. This parameter accepts
per-field overrides.
hl.requireFieldMatch (default: false): If set to true, highlights terms only if they appear in the specified field. If
false, terms are highlighted in all requested fields regardless of which field matched the query.
hl.maxMultiValuedToExamine (default: integer.MAX_VALUE): Specifies the maximum number of entries in a
multi-valued field to examine before stopping. This can potentially return zero results if the limit is reached before
any matches are found. If used with hl.maxMultiValuedToMatch, whichever limit is reached first will determine
when to stop looking.
hl.maxMultiValuedToMatch (default: integer.MAX_VALUE): Specifies the maximum number of matches in a
multi-valued field that are found before stopping. If hl.maxMultiValuedToExamine is also defined, whichever
limit is reached first will determine when to stop looking.
hl.alternateField (default: blank): Specifies a field to be used as a backup default summary if Solr cannot
generate a snippet (i.e., because no terms match). This parameter accepts per-field overrides.
hl.maxAlternateFieldLength (default: unlimited): Specifies the maximum number of characters of the field to
return. Any value less than or equal to 0 means the field's length is unlimited. This parameter is only used in
conjunction with the hl.alternateField parameter.
hl.tag.pre, hl.tag.post (default: <em> and </em>): Specifies the text that should appear before (hl.tag.pre) and
after (hl.tag.post) a highlighted term. This parameter accepts per-field overrides.
hl.phraseLimit (default: integer.MAX_VALUE): To improve the performance of the FastVectorHighlighter, you can
set a limit on the number (int) of phrases to be analyzed for highlighting.
hl.usePhraseHighlighter (default: true): If set to true, Solr will use the Lucene SpanScorer class to highlight
phrase terms only when they appear within the query phrase in the document.
hl.preserveMulti (default: false): If true, multi-valued fields will return all values in the order they were saved in
the index. If false, the default, only values that match the highlight request will be returned.
hl.fragListBuilder (default: weighted): The snippet fragmenting algorithm. The weighted fragListBuilder uses
IDF-weights to order fragments. Other options are single, which returns the entire field contents as one snippet, or
simple. You can select a fragListBuilder with this parameter, or modify an existing implementation in
solrconfig.xml to be the default by adding "default=true".
hl.fragmentsBuilder (default: default): The fragments builder is responsible for formatting the fragments, which
uses <em> and </em> markup (if hl.tag.pre and hl.tag.post are not defined). Another pre-configured choice is
colored, which is an example of how to use the fragments builder to insert HTML into the snippets for colored
highlights if you choose. You can also implement your own if you'd like. You can select a fragments builder with this
parameter, or modify an existing implementation in solrconfig.xml to be the default by adding "default=true".
Using Boundary Scanners with the Fast Vector Highlighter
The Fast Vector Highlighter will occasionally truncate highlighted words. To prevent this, implement a boundary
scanner in solrconfig.xml, then use the hl.boundaryScanner parameter to specify the boundary scanner for
highlighting.
Solr supports two boundary scanners: breakIterator and simple.
The breakIterator Boundary Scanner
The breakIterator boundary scanner offers excellent performance right out of the box by taking locale and
boundary type into account. In most cases you will want to use the breakIterator boundary scanner. To
implement the breakIterator boundary scanner, add this code to the highlighting section of your
solrconfig.xml file, adjusting the type, language, and country values as appropriate to your application:
<boundaryScanner name="breakIterator"
class="solr.highlight.BreakIteratorBoundaryScanner">
<lst name="defaults">
<str name="hl.bs.type">WORD</str>
<str name="hl.bs.language">en</str>
<str name="hl.bs.country">US</str>
</lst>
</boundaryScanner>
Possible values for the hl.bs.type parameter are WORD, LINE, SENTENCE, and CHARACTER.
The simple Boundary Scanner
The simple boundary scanner scans term boundaries for a specified maximum character value (hl.bs.maxScan)
and for common delimiters such as punctuation marks (hl.bs.chars). The simple boundary scanner may be
useful for some custom highlighting scenarios. To implement the simple boundary scanner, add this code to the
highlighting section of your solrconfig.xml file, adjusting the values as appropriate to your application:
<boundaryScanner name="simple" class="solr.highlight.SimpleBoundaryScanner"
default="true">
<lst name="defaults">
<str name="hl.bs.maxScan">10</str>
<str name="hl.bs.chars">.,!?\t\n</str>
</lst>
</boundaryScanner>
Related Content
HighlightingParameters from the Solr wiki
Highlighting javadocs
Postings Highlighter
PostingsHighlighter focuses on good document summaries and efficiency, but is less flexible than the other
highlighters. It uses significantly less disk space and provides a performant approach if queries have a low number
of terms relative to the number of results per page. However, the drawbacks are that it is not a query matching
debugger (it focuses on fast highlighting for full-text search) and it does not allow broken analysis chains.
To use this highlighter, you must turn on storeOffsetsWithPositions for the field. There is no need to turn on
termVectors, termPositions, or termOffsets in fields since this highlighter does not make use of term
vectors.
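For example, a field intended for use with this highlighter might be declared like this in schema.xml (the field name
and type here are illustrative):
<field name="content" type="text_general" indexed="true" stored="true"
storeOffsetsWithPositions="true"/>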
Configuring Postings Highlighter
The configuration for the Postings Highlighter is done in solrconfig.xml.
First, define the search component:
<searchComponent class="solr.HighlightComponent" name="highlight">
<highlighting class="org.apache.solr.highlight.PostingsSolrHighlighter"/>
</searchComponent>
Note that in this example, we have named the search component "highlight". If you started with a default
solrconfig.xml file, then you already have a component with that name. You should either replace the default
with this example, or rename the search component that is already there so there is no confusion about which
search component implementation Solr should use.
Then in the request handler, you can define the defaults, as in this example:
<requestHandler name="standard" class="solr.StandardRequestHandler">
<lst name="defaults">
<int name="hl.snippets">1</int>
<str name="hl.tag.pre">&lt;em&gt;</str>
<str name="hl.tag.post">&lt;/em&gt;</str>
<str name="hl.tag.ellipsis">... </str>
<bool name="hl.defaultSummary">true</bool>
<str name="hl.encoder">simple</str>
<float name="hl.score.k1">1.2</float>
<float name="hl.score.b">0.75</float>
<float name="hl.score.pivot">87</float>
<str name="hl.bs.language"></str>
<str name="hl.bs.country"></str>
<str name="hl.bs.variant"></str>
<str name="hl.bs.type">SENTENCE</str>
<int name="hl.maxAnalyzedChars">10000</int>
</lst>
</requestHandler>
This example shows all of the defaults for each parameter. If you intend to keep all of the defaults, you would not
need to add anything to the request handler and could override the default values at query time as needed.
Postings Highlighter Parameters
The table below describes Solr's parameters for this highlighter. These parameters can be set as defaults (as in the
examples), or the default values can be changed in the request handler or at query time. Most of the parameters can
be specified per-field (exceptions noted below).
hl (default: blank, no highlighting): When set to true, enables highlighted snippets to be generated in the query
response. If set to false or to a blank or missing value, disables highlighting.
hl.q (default: blank): Specifies an overriding query term for highlighting. If hl.q is specified, the highlighter will use
that term rather than the main query term.
hl.fl (default: blank): Specifies a list of fields to highlight. Accepts a comma- or space-delimited list of fields for
which Solr should generate highlighted snippets. If left blank, highlights the defaultSearchField (or the field specified
in the df parameter if used) for the StandardRequestHandler. For the DisMaxRequestHandler, the qf fields are used
as defaults. A '*' can be used to match field globs, such as 'text_*' or even '*' to highlight on all fields where
highlighting is possible. When using '*', consider adding hl.requireFieldMatch=true.
hl.snippets (default: 1): Specifies the maximum number of highlighted snippets to generate per field. It is possible
for any number of snippets from zero to this value to be generated. This parameter accepts per-field overrides.
hl.tag.pre (default: <em>): Specifies the text that should appear before a highlighted term.
hl.tag.post (default: </em>): Specifies the text that should appear after a highlighted term.
hl.tag.ellipsis (default: "... "): Specifies the text that should join two unconnected passages in the resulting
snippet.
hl.maxAnalyzedChars (default: 10000): Specifies the number of characters into a document that Solr should look
for suitable snippets. This parameter does not accept per-field overrides.
hl.multiValuedSeparatorChar (default: " ", a space): Specifies the logical separator between multi-valued fields.
hl.defaultSummary (default: true): If true, a field should have a default summary if highlighting finds no matching
passages.
hl.encoder (default: simple): Defines the encoding for the resulting snippet. The value simple applies no escaping,
while html will escape HTML characters in the text.
hl.score.k1 (default: 1.2): Specifies the BM25 term frequency normalization parameter 'k1'. For example, it can be
set to "0" to rank passages solely based on the number of query terms that match.
hl.score.b (default: 0.75): Specifies the BM25 length normalization parameter 'b'. For example, it can be set to "0"
to ignore the length of passages entirely when ranking.
hl.score.pivot (default: 87): Specifies the BM25 average passage length in characters.
hl.bs.language (default: blank): Specifies the breakiterator language for dividing the document into passages.
hl.bs.country (default: blank): Specifies the breakiterator country for dividing the document into passages.
hl.bs.variant (default: blank): Specifies the breakiterator variant for dividing the document into passages.
hl.bs.type (default: SENTENCE): Specifies the breakiterator type for dividing the document into passages. Can be
SENTENCE, WORD, CHARACTER, LINE, or WHOLE.
Related Content
PostingsHighlighter from the Solr wiki
PostingsSolrHighlighter javadoc
Spell Checking
The SpellCheck component is designed to provide inline query
suggestions based on other, similar, terms. The basis for these
suggestions can be terms in a field in Solr, externally created text files,
or fields in other Lucene indexes.
Topics covered in this section:
Configuring the SpellCheckComponent
Spell Check Parameters
Distributed SpellCheck
Configuring the SpellCheckComponent
Define Spell Check in solrconfig.xml
The first step is to specify the source of terms in solrconfig.xml. There are several approaches to spell
checking in Solr, discussed below.
IndexBasedSpellChecker
The IndexBasedSpellChecker uses a Solr index as the basis for a parallel index used for spell checking. It
requires defining a field as the basis for the index terms; a common practice is to copy terms from some fields (such
as title, body, etc.) to another field created for spell checking. Here is a simple example of configuring
solrconfig.xml with the IndexBasedSpellChecker:
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
<lst name="spellchecker">
<str name="classname">solr.IndexBasedSpellChecker</str>
<str name="spellcheckIndexDir">./spellchecker</str>
<str name="field">content</str>
<str name="buildOnCommit">true</str>
</lst>
</searchComponent>
The first element defines the searchComponent to use the solr.SpellCheckComponent. The classname is
the specific implementation of the SpellCheckComponent, in this case solr.IndexBasedSpellChecker.
Defining the classname is optional; if not defined, it will default to IndexBasedSpellChecker.
The spellcheckIndexDir defines the location of the directory that holds the spellcheck index, while the field
defines the source field (defined in schema.xml) for spell check terms. When choosing a field for the spellcheck
index, it's best to avoid a heavily processed field to get more accurate results. If the field has many word variations
from processing synonyms and/or stemming, the dictionary will be created with those variations in addition to more
valid spelling data.
Finally, buildOnCommit defines whether to build the spell check index at every commit (that is, every time new
documents are added to the index). It is optional, and can be omitted if you would rather set it to false.
DirectSolrSpellChecker
The DirectSolrSpellChecker uses terms from the Solr index without building a parallel index like the
IndexBasedSpellChecker. It is considered experimental and still in development, but is being used widely. This
spell checker has the benefit of not having to be built regularly, meaning that the terms are always up-to-date with
terms in the index. Here is how this might be configured in solrconfig.xml:
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
<lst name="spellchecker">
<str name="name">default</str>
<str name="field">name</str>
<str name="classname">solr.DirectSolrSpellChecker</str>
<str name="distanceMeasure">internal</str>
<float name="accuracy">0.5</float>
<int name="maxEdits">2</int>
<int name="minPrefix">1</int>
<int name="maxInspections">5</int>
<int name="minQueryLength">4</int>
<float name="maxQueryFrequency">0.01</float>
<float name="thresholdTokenFrequency">.01</float>
</lst>
</searchComponent>
When choosing a field to query for this spell checker, you want one which has relatively little analysis performed
on it (particularly analysis such as stemming). Note that you need to specify a field to use for the suggestions, so
like the IndexBasedSpellChecker, you may want to copy data from fields like title, body, etc., to a field
dedicated to providing spelling suggestions.
Many of the parameters relate to how this spell checker should query the index for term suggestions. The
distanceMeasure defines the metric to use during the spell check query. The value "internal" uses the default
Levenshtein metric, which is the same metric used with the other spell checker implementations.
Because this spell checker is querying the main index, you may want to limit how often it queries the index to be
sure to avoid any performance conflicts with user queries. The accuracy setting defines the threshold for a valid
suggestion, while maxEdits defines the number of changes to the term to allow. Since most spelling mistakes are
only 1 letter off, setting this to 1 will reduce the number of possible suggestions (the default, however, is 2); the
value can only be 1 or 2. minPrefix defines the minimum number of characters the terms should share. Setting
this to 1 means that the spelling suggestions will all start with the same letter, for example.
The maxInspections parameter defines the maximum number of possible matches to review before returning
results; the default is 5. minQueryLength defines how many characters must be in the query before suggestions
are provided; the default is 4. maxQueryFrequency sets the maximum threshold for the number of documents a
term must appear in before being considered as a suggestion. This can be a percentage (such as .01, or 1%) or an
absolute value (such as 4). A lower threshold is better for small indexes. Finally, thresholdTokenFrequency
sets the minimum number of documents a term must appear in, and can also be expressed as a percentage or an
absolute value.
FileBasedSpellChecker
The FileBasedSpellChecker uses an external file as a spelling dictionary. This can be useful if using Solr as a
spelling server, or if spelling suggestions don't need to be based on actual terms in the index. In
solrconfig.xml, you would define the searchComponent like so:
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
<lst name="spellchecker">
<str name="classname">solr.FileBasedSpellChecker</str>
<str name="name">file</str>
<str name="sourceLocation">spellings.txt</str>
<str name="characterEncoding">UTF-8</str>
<str name="spellcheckIndexDir">./spellcheckerFile</str>
</lst>
</searchComponent>
The differences here are the use of sourceLocation to define the location of the file of terms and the use of
characterEncoding to define the encoding of the terms file.
WordBreakSolrSpellChecker
WordBreakSolrSpellChecker offers suggestions by combining adjacent query terms and/or breaking terms into
multiple words. It is a SpellCheckComponent enhancement, leveraging Lucene's WordBreakSpellChecker. It
can detect spelling errors resulting from misplaced whitespace without the use of shingle-based dictionaries and
provides collation support for word-break errors, including cases where the user has a mix of single-word spelling
errors and word-break errors in the same query. It also provides shard support.
Here is how it might be configured in solrconfig.xml:
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
<lst name="spellchecker">
<str name="name">wordbreak</str>
<str name="classname">solr.WordBreakSolrSpellChecker</str>
<str name="field">lowerfilt</str>
<str name="combineWords">true</str>
<str name="breakWords">true</str>
<int name="maxChanges">10</int>
</lst>
</searchComponent>
Some of the parameters will be familiar from the discussion of the other spell checkers, such as name,
classname, and field. New for this spell checker is combineWords, which defines whether words should be
combined in a dictionary search (default is true); breakWords, which defines if words should be broken during a
dictionary search (default is true); and maxChanges, an integer which defines how many times the spell checker
should check collation possibilities against the index (default is 10).
The spellchecker can be configured with a traditional checker (i.e., DirectSolrSpellChecker). The results are
combined and collations can contain a mix of corrections from both spellcheckers.
In the previous example, name is used to name this specific definition of the spellchecker. Multiple definitions can
co-exist in a single solrconfig.xml, and the name helps to differentiate them when they are defined in the
schema.xml. If only defining one spellchecker, no name is required.
Add It to a Request Handler
Queries will be sent to a RequestHandler. If every request should generate a suggestion, then you would add the
following to the requestHandler that you are using:
<str name="spellcheck">true</str>
One of the possible parameters is the spellcheck.dictionary to use, and multiples can be defined. With
multiple dictionaries, all specified dictionaries are consulted and results are interleaved. Collations are created with
combinations from the different spellcheckers, with care taken that multiple overlapping corrections do not occur in
the same collation.
Here is an example with multiple dictionaries:
<requestHandler name="spellCheckWithWordbreak"
class="org.apache.solr.handler.component.SearchHandler">
<lst name="defaults">
<str name="spellcheck.dictionary">default</str>
<str name="spellcheck.dictionary">wordbreak</str>
<str name="spellcheck.count">20</str>
</lst>
<arr name="last-components">
<str>spellcheck</str>
</arr>
</requestHandler>
Until Solr v4.7, the only way to provide auto-complete (or search term suggestions based on user query
input) was to extend the SpellCheckComponent. There is now a preferred approach using the
SuggestComponent, described in the section Suggester.
If you are still using the SpellCheckComponent-based approach, we recommend changing to the
SuggestComponent instead. However, the snippet below shows how to configure solrconfig.xml to use
the SpellCheckComponent.
<searchComponent class="solr.SpellCheckComponent" name="suggest">
<lst name="spellchecker">
<str name="name">suggest</str>
<str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
<str
name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
<str name="field">name</str> <!-- the indexed field to derive suggestions
from -->
<float name="threshold">0.005</float>
<str name="buildOnCommit">true</str>
<!--
<str name="sourceLocation">american-english</str>
-->
</lst>
</searchComponent>
See the Suggester section for the available options for the lookupImpl property. Note also that using the
SpellCheckComponent for auto-suggest does not support the DirectSolrSpellChecker; an index of terms
must always be created in order to serve suggestions.
Spell Check Parameters
The SpellCheck component accepts the parameters described below. All of these parameters can be
overridden by specifying spellcheck.collateParam.xx where xx is the parameter you are overriding.
spellcheck: Turns on or off SpellCheck suggestions for the request. If true, then spelling suggestions will be
generated.
spellcheck.q or q: Selects the query to be spellchecked.
spellcheck.build: Instructs Solr to build a dictionary for use in spellchecking.
spellcheck.collate: Causes Solr to build a new query based on the best suggestion for each term in the submitted
query.
spellcheck.maxCollations: This parameter specifies the maximum number of collations to return.
spellcheck.maxCollationTries: This parameter specifies the number of collation possibilities for Solr to try before
giving up.
spellcheck.maxCollationEvaluations: This parameter specifies the maximum number of word correction
combinations to rank and evaluate prior to deciding which collation candidates to test against the index.
spellcheck.collateExtendedResult: If true, returns an expanded response detailing the collations found. If
spellcheck.collate is false, this parameter will be ignored.
spellcheck.collateMaxCollectDocs: The maximum number of documents to collect when testing potential
collations.
spellcheck.count: Specifies the maximum number of spelling suggestions to be returned.
spellcheck.dictionary: Specifies the dictionary that should be used for spellchecking.
spellcheck.extendedResults: Causes Solr to return additional information about spellcheck results, such as the
frequency of each original term in the index (origFreq) as well as the frequency of each suggestion in the index
(frequency). Note that this result format differs from the non-extended one as the returned suggestion for a word is
actually an array of lists, where each list holds the suggested term and its frequency.
spellcheck.onlyMorePopular: Limits spellcheck responses to queries that are more popular than the original
query.
spellcheck.maxResultsForSuggest: The maximum number of hits the request can return in order to both
generate spelling suggestions and set the "correctlySpelled" element to "false".
spellcheck.alternativeTermCount: The count of suggestions to return for each query term existing in the index
and/or dictionary.
spellcheck.reload: Reloads the spellchecker.
spellcheck.accuracy: Specifies an accuracy value to help decide whether a result is worthwhile.
spellcheck.<DICT_NAME>.key: Specifies a key/value pair for the implementation handling a given dictionary.
The spellcheck Parameter
This parameter turns on SpellCheck suggestions for the request. If true, then spelling suggestions will be generated.
The spellcheck.q or q Parameter
This parameter specifies the query to spellcheck. If spellcheck.q is defined, then it is used; otherwise the original
input query is used. The spellcheck.q parameter is intended to be the original query, minus any extra markup
like field names, boosts, and so on. If the q parameter is specified, then the SpellingQueryConverter class is
used to parse it into tokens; otherwise the WhitespaceTokenizer is used. The choice of which one to use is up to
the application. Essentially, if you have a spelling "ready" version in your application, then it is probably better to use
spellcheck.q. Otherwise, if you just want Solr to do the job, use the q parameter.
Note that the SpellingQueryConverter class does not deal properly with non-ASCII characters. In this case, you
have either to use spellcheck.q, or implement your own QueryConverter.
The spellcheck.build Parameter
If set to true, this parameter creates the dictionary that the SolrSpellChecker will use for spell-checking. In a typical
search application, you will need to build the dictionary before using the SolrSpellChecker. However, it's not always
necessary to build a dictionary first. For example, you can configure the spellchecker to use a dictionary that already
exists.
The dictionary will take some time to build, so this parameter should not be sent with every request.
The spellcheck.reload Parameter
If set to true, this parameter reloads the spellchecker. The results depend on the implementation of
SolrSpellChecker.reload(). In a typical implementation, reloading the spellchecker means reloading the
dictionary.
The spellcheck.count Parameter
This parameter specifies the maximum number of suggestions that the spellchecker should return for a term. If this
parameter isn't set, the value defaults to 1. If the parameter is set but not assigned a number, the value defaults to 5.
If the parameter is set to a positive integer, that number becomes the maximum number of suggestions returned by
the spellchecker.
The spellcheck.onlyMorePopular Parameter
If true, Solr will return suggestions that result in more hits for the query than the existing query. Note that this will
return more popular suggestions even when the given query term is present in the index and considered "correct".
The spellcheck.maxResultsForSuggest Parameter
For example, if this is set to 5 and the user's query returns 5 or fewer results, the spellchecker will report
"correctlySpelled=false" and also offer suggestions (and collations if requested). Setting this greater than zero is
useful for creating "did-you-mean?" suggestions for queries that return a low number of hits.
The spellcheck.alternativeTermCount Parameter
Specifies the number of suggestions to return for each query term existing in the index and/or dictionary.
Presumably, users will want fewer suggestions for words with docFrequency>0. Also, setting this value turns on
context-sensitive spell suggestions.
The spellcheck.extendedResults Parameter
This parameter causes Solr to include additional information about the suggestion, such as the frequency in the
index.
The spellcheck.collate Parameter
If true, this parameter directs Solr to take the best suggestion for each token (if one exists) and construct a new
query from the suggestions. For example, if the input query was "jawa class lording" and the best suggestion for
"jawa" was "java" and "lording" was "loading", then the resulting collation would be "java class loading".
The spellcheck.collate parameter only returns collations that are guaranteed to result in hits if re-queried, even when
applying original fq parameters. This is especially helpful when there is more than one correction per query.
Note that this only returns a query to be used; it does not actually run the suggested query.
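A sketch of a request that asks for collations, reusing the query terms from the example above and the
spellCheckCompRH handler shown later in this section, might look like this:
http://localhost:8983/solr/spellCheckCompRH?q=jawa+class+lording&spellcheck=true
&spellcheck.collate=true&spellcheck.maxCollations=3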
The spellcheck.maxCollations Parameter
The maximum number of collations to return. The default is 1. This parameter is ignored if spellcheck.collate
is false.
The spellcheck.maxCollationTries Parameter
This parameter specifies the number of collation possibilities for Solr to try before giving up. Lower values ensure
better performance. Higher values may be necessary to find a collation that can return results. The default value is
0, which maintains backwards-compatible (Solr 1.4) behavior (do not check collations). This parameter is ignored if
spellcheck.collate is false.
The spellcheck.maxCollationEvaluations Parameter
This parameter specifies the maximum number of word correction combinations to rank and evaluate prior to
deciding which collation candidates to test against the index. This is a performance safety-net in case a user enters
a query with many misspelled words. The default is 10,000 combinations, which should work well in most situations.
The spellcheck.collateExtendedResult Parameter
If true, this parameter returns an expanded response format detailing the collations Solr found. The default value is
false, and this parameter is ignored if spellcheck.collate is false.
The spellcheck.collateMaxCollectDocs Parameter
This parameter specifies the maximum number of documents that should be collected when testing potential
collations against the index. A value of 0 indicates that all documents should be collected, resulting in exact
hit-counts.
Otherwise an estimation is provided as a performance optimization in cases where exact hit-counts are unnecessary
– the higher the value specified, the more precise the estimation.
The default value for this parameter is 0, but when spellcheck.collateExtendedResults is false, the
optimization is always used as if a 1 had been specified.
The spellcheck.dictionary Parameter
This parameter causes Solr to use the dictionary named in the parameter's argument. The default setting is
"default". This parameter can be used to invoke a specific spellchecker on a per-request basis.
The spellcheck.accuracy Parameter
Specifies an accuracy value to be used by the spell checking implementation to decide whether a result is
worthwhile or not. The value is a float between 0 and 1. Defaults to Float.MIN_VALUE.
The spellcheck.<DICT_NAME>.key Parameter
Specifies a key/value pair for the implementation handling a given dictionary. The value that is passed through is
just key=value (spellcheck.<DICT_NAME>. is stripped off).
For example, given a dictionary called foo, spellcheck.foo.myKey=myValue would result in myKey=myValue
being passed through to the implementation handling the dictionary foo.
Example
This example shows the results of a simple query that defines a query using the spellcheck.q parameter. The
query also includes a spellcheck.build=true parameter, which needs to be called only once in order to build
the index. spellcheck.build should not be specified with each request.
http://localhost:8983/solr/spellCheckCompRH?q=*:*&spellcheck.q=hell%20ultrashar&spell
check=true&spellcheck.build=true
Results:
<lst name="spellcheck">
<lst name="suggestions">
<lst name="hell">
<int name="numFound">1</int>
<int name="startOffset">0</int>
<int name="endOffset">4</int>
<arr name="suggestion">
<str>dell</str>
</arr>
</lst>
<lst name="ultrashar">
<int name="numFound">1</int>
<int name="startOffset">5</int>
<int name="endOffset">14</int>
<arr name="suggestion">
<str>ultrasharp</str>
</arr>
</lst>
</lst>
</lst>
Distributed SpellCheck
The SpellCheckComponent also supports spellchecking on distributed indexes. If you are using the
SpellCheckComponent on a request handler other than "/select", you must provide the following two parameters:
shards: Specifies the shards in your distributed indexing configuration. For more information about distributed
indexing, see Distributed Search with Index Sharding.
shards.qt: Specifies the request handler Solr uses for requests to shards. This parameter is not required for the
/select request handler.
For example: http://localhost:8983/solr/select?q=*:*&spellcheck=true&spellcheck.build=tr
ue&spellcheck.q=toyata&qt=spell&shards.qt=spell&shards=solr-shard1:8983/solr,solr-sha
rd2:8983/solr
In case of a distributed request to the SpellCheckComponent, the shards are requested for at least five suggestions
even if the spellcheck.count parameter value is less than five. Once the suggestions are collected, they are
ranked by the configured distance measure (Levenshtein Distance by default) and then by aggregate frequency.
Query Re-Ranking
Query Re-Ranking allows you to run a simple query (A) for matching documents and then re-rank the top N
documents using the scores from a more complex query (B). Since the more costly ranking from query B is only
applied to the top N documents, it will have less impact on performance than just using the complex query B by
itself – the trade-off is that documents which score very low using the simple query A may not be considered during
the re-ranking phase, even if they would score very highly using query B.
Specifying A Ranking Query
A Ranking query can be specified using the "rq" request parameter. The "rq" parameter must specify a query string
that, when parsed, produces a RankQuery. This could also be done with a custom QParserPlugin you have
written as a plugin, but most users can just use the "rerank" parser provided with Solr.
The "rerank" parser wraps a query specified by a local parameter, along with additional parameters indicating
how many documents should be re-ranked and how the final scores should be computed:
reRankQuery (mandatory): The query string for your complex ranking query; in most cases a variable will be used
to refer to another request parameter.
reRankDocs (default: 200): The number of top N documents from the original query that should be re-ranked. This
number will be treated as a minimum, and may be increased internally automatically in order to rank enough
documents to satisfy the query (i.e., start+rows).
reRankWeight (default: 2.0): A multiplicative factor that will be applied to the score from the reRankQuery for each
of the top matching documents, before that score is added to the original score.
In the example below, the top 1000 documents matching the query "greetings" will be re-ranked using the query "(hi
hello hey hiya)". The resulting scores for each of those 1000 documents will be 3 times their score from the "(hi hello
hey hiya)" query, plus the score from the original "greetings" query:
q=greetings&rq={!rerank reRankQuery=$rqq reRankDocs=1000
reRankWeight=3}&rqq=(hi+hello+hey+hiya)
If a document matches the original query, but does not match the re-ranking query, the document's original score
will remain.
Result Paging With Query Re-Ranking
Query Re-Ranking is designed to be used with result pages that fall within the re-ranked document window
(reRankDocs). After paging beyond reRankDocs, the Re-Ranking parameters should be dropped from the query.
For example if reRankDocs is 1000 and the "start" parameter is 1001, then Re-Ranking should not be applied.
For example, in the query below the page displayed falls within the reRankDocs, so Re-Ranking is applied.
q=greetings&start=201&rows=20&rq={!rerank reRankQuery=$rqq reRankDocs=1000
reRankWeight=3}&rqq=(hi+hello+hey+hiya)
After paging beyond reRankDocs the Re-Ranking should be removed from the query:
q=greetings&start=1001&rows=20
Combining Ranking Queries With Other Solr Features
The "rq" parameter and the re-ranking feature in general work well with other Solr features. For example, it can be
used in conjunction with the collapse parser to re-rank the group heads after they've been collapsed. It also
preserves the order of documents elevated by the elevation component. And it even has its own custom explain, so
you can see how the re-ranking scores were derived when looking at debug information.
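As a sketch, re-ranking can be combined with a collapsing filter query like this (the group_id field here is
hypothetical; any field used for collapsing would work):
q=greetings&fq={!collapse field=group_id}&rq={!rerank reRankQuery=$rqq reRankDocs=1000
reRankWeight=3}&rqq=(hi+hello+hey+hiya)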
Transforming Result Documents
Document Transformers can be used to modify the information returned about each document in the results of a
query.
Using Document Transformers
When executing a request, a document transformer can be used by including it in the fl parameter using square
brackets, for example:
fl=id,name,score,[shard]
Some transformers allow, or require, local parameters which can be specified as key value pairs inside the brackets:
fl=id,name,score,[explain style=nl]
As with regular fields, you can change the key used when a Transformer adds a field to a document via a prefix:
fl=id,name,score,my_val_a:[value v=42 t=int],my_val_b:[value v=7 t=float]
The sections below discuss exactly what these various transformers do.
Available Transformers
[value] - ValueAugmenterFactory
Modifies every document to include the exact same value, as if it were a stored field in every document:
q=*:*&fl=id,greeting:[value v='hello']
The above query would produce results like the following:
<result name="response" numFound="32" start="0">
  <doc>
    <str name="id">1</str>
    <str name="greeting">hello</str>
  </doc>
...
By default, values are returned as a String, but a "t" parameter can be specified using a value of int, float, double, or
date to force a specific return type:
q=*:*&fl=id,my_number:[value v=42 t=int],my_string:[value v=42]
In addition to using these request parameters, you can configure additional named instances of
ValueAugmenterFactory, or override the default behavior of the existing [value] transformer in your
solrconfig.xml file:
<transformer name="mytrans2"
class="org.apache.solr.response.transform.ValueAugmenterFactory" >
<int name="value">5</int>
</transformer>
<transformer name="value"
class="org.apache.solr.response.transform.ValueAugmenterFactory" >
<double name="defaultValue">5</double>
</transformer>
The "value" option forces an explicit value to always be used, while the "defaultValue" option provides a default
that can still be overridden using the "v" and "t" local parameters.
[explain] - ExplainAugmenterFactory
Augments each document with an inline explanation of its score, exactly like the information available about each
document in the debug section:
q=features:cache&wt=json&fl=id,[explain style=nl]
Supported values for "style" are "text", "html", and "nl", which returns the information as structured data:
"response":{"numFound":2,"start":0,"docs":[
{
"id":"6H500F0",
"[explain]":{
"match":true,
"value":1.052226,
"description":"weight(features:cache in 2) [DefaultSimilarity], result of:",
"details":[{
...
A default style can be configured by specifying an "args" parameter in your configuration:
<transformer name="explain"
class="org.apache.solr.response.transform.ExplainAugmenterFactory" >
<str name="args">nl</str>
</transformer>
[child] - ChildDocTransformerFactory
This transformer returns all descendant documents of each parent document matching your query, in a flat list
nested inside the matching parent document. This is useful when you have indexed nested child documents and
want to retrieve the child documents for the relevant parent documents for any type of search query.
fl=id,[child parentFilter=doc_type:book childFilter=doc_type:chapter limit=100]
Note that this transformer can be used even though the query itself is not a Block Join query.
When using this transformer, the parentFilter parameter must be specified, and works the same as in all Block
Join Queries. Additional optional parameters are:
childFilter - query to filter which child documents should be included; this can be particularly useful when
you have multiple levels of hierarchical documents (default: all children)
limit - the maximum number of child documents to be returned per parent document (default: 10)
[shard] - ShardAugmenterFactory
This transformer adds information about what shard each individual document came from in a distributed request.
ShardAugmenterFactory does not support any request parameters, or configuration options.
[docid] - DocIdAugmenterFactory
This transformer adds the internal Lucene document id to each document – this is primarily only useful for
debugging purposes.
DocIdAugmenterFactory does not support any request parameters, or configuration options.
[elevated] and [excluded]
These transformers are available only when using the Query Elevation Component.
[elevated] annotates each document to indicate if it was elevated or not.
[excluded] annotates each document to indicate if it would have been excluded - this is only supported if
you also use the markExcludes parameter.
fl=id,[elevated],[excluded]&excludeIds=GB18030TEST&elevateIds=6H500F0&markExcludes=tru
e
"response":{"numFound":32,"start":0,"docs":[
{
"id":"6H500F0",
"[elevated]":true,
"[excluded]":false},
{
"id":"GB18030TEST",
"[elevated]":false,
"[excluded]":true},
{
"id":"SP2514N",
"[elevated]":false,
"[excluded]":false},
...
Suggester
The SuggestComponent in Solr provides users with automatic suggestions
for query terms. You can use this to implement a powerful auto-suggest
feature in your search application.
Solr has long had autosuggest functionality, but Solr 4.7 introduced a
new approach based on a dedicated SuggestComponent. This approach
utilizes Lucene's Suggester implementation and supports all of the lookup
implementations available in Lucene.
The main features of this Suggester are:
Lookup implementation pluggability
Term dictionary pluggability, giving you the flexibility to choose the
dictionary implementation
Distributed support
The solrconfig.xml found in Solr's example directory has the new
Suggester implementation configured already. If you would like to switch to
this approach from a legacy system, you will need to modify your current
implementation to add a search component and a request handler, as
described below (for more on search components, see the section
RequestHandlers and SearchComponents in SolrConfig).
Covered in this section:
Configuring Suggester in solrconfig.xml
Adding the Suggest Search Component
Adding the Suggest Request Handler
Example Usages
Get Suggestions with Weights
Suggestions in a Distributed System
Multiple Dictionaries
Configuring Suggester in solrconfig.xml
The example solrconfig.xml found in ./example/solr/collection1/conf has a Suggester
searchComponent and a /suggest requestHandler already configured. You can use that as the basis for your
configuration, or create it from scratch, as detailed below.
Adding the Suggest Search Component
The first step is to add a search component to solrconfig.xml to extend the SpellChecker. Here is some
sample code that could be used.
<searchComponent name="suggest" class="solr.SuggestComponent">
<lst name="suggester">
<str name="name">mySuggester</str>
<str name="lookupImpl">FuzzyLookupFactory</str>
<str name="dictionaryImpl">DocumentDictionaryFactory</str>
<str name="field">cat</str>
<str name="weightField">price</str>
<str name="suggestAnalyzerFieldType">string</str>
</lst>
</searchComponent>
Suggester Search Component Parameters
The Suggester search component takes several configuration parameters. The choice of the lookup implementation
(lookupImpl, how terms are found in the suggestion dictionary) and the dictionary implementation (dictionaryImpl,
how terms are stored in the suggestion dictionary) will dictate some of the parameters required. Below are the
main parameters that can be used no matter what lookup or dictionary implementation is used. In the following
sections additional parameters are provided for each implementation.
Parameter Description
searchComponent name Arbitrary name for the search component.
name A symbolic name for this suggester. You can refer to this name in the URL parameters and
in the SearchHandler configuration. It is possible to have multiples of these.
lookupImpl Lookup implementation. There are several possible implementations, described below in the
section Lookup Implementations.
dictionaryImpl The dictionary implementation to use. There are several possible implementations,
described below in the section Dictionary Implementations.
field A field from the index to use as the basis of suggestion terms. If sourceLocation is empty
(meaning any dictionary implementation other than FileDictionaryFactory) then terms from
this field in the index will be used.
To be used as the basis for a suggestion, the field must be indexed. You may want to use
copyField rules to create a special 'suggest' field comprised of terms from other fields in
documents. In any event, you likely want a minimal amount of analysis on the field, so an
additional option is to create a field type in your schema that only uses basic tokenizers or
filters. One option for such a field type is shown here:
<fieldType class="solr.TextField" name="textSpell"
positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
However, this minimal analysis is not required if you want more analysis to occur on terms. If
using the AnalyzingLookupFactory as your lookupImpl, you have the option of
defining the field type rules to use for index and query time analysis.
sourceLocation The path to the dictionary file if using the FileDictionaryFactory. If this value is empty then
the main index will be used as a source of terms and weights.
storeDir The location to store the dictionary file.
buildOnCommit or buildOnOptimize If true then the lookup data structure will be rebuilt after commit. If false, the default, then the
lookup data will be built only when requested by URL parameter suggest.build=true.
Use buildOnCommit to rebuild the dictionary with every commit, or buildOnOptimize to
build the dictionary only when the index is optimized.
Lookup Implementations
The lookupImpl parameter defines the algorithms used to look up terms in the suggest index. There are several
possible implementations to choose from, and some require additional parameters to be configured.
AnalyzingLookupFactory
A lookup that first analyzes the incoming text and adds the analyzed form to a weighted FST, and then does the
same thing at lookup time.
This implementation uses the following additional properties:
suggestAnalyzerFieldType: The field type to use for the query-time and build-time term suggestion analysis.
exactMatchFirst: If true, the default, exact suggestions are returned first, even if they are prefixes of other
strings in the FST that have larger weights.
preserveSep: If true, the default, then a separator between tokens is preserved. This means that suggestions
are sensitive to tokenization (e.g., baseball is different from base ball).
preservePositionIncrements: If true, the suggester will preserve position increments. This means that when
token filters leave gaps (for example, when StopFilter matches a stopword), those positions will be respected
when building the suggester. The default is false.
FuzzyLookupFactory
This is a suggester which is an extension of the AnalyzingSuggester but is fuzzy in nature. The similarity is
measured by the Levenshtein algorithm.
This implementation uses the following additional properties:
exactMatchFirst: If true, the default, exact suggestions are returned first, even if they are prefixes of other
strings in the FST that have larger weights.
preserveSep: If true, the default, then a separator between tokens is preserved. This means that suggestions
are sensitive to tokenization (e.g., baseball is different from base ball).
maxSurfaceFormsPerAnalyzedForm: Maximum number of surface forms to keep for a single analyzed form.
When there are too many surface forms we discard the lowest weighted ones.
maxGraphExpansions: When building the FST ("index-time"), we add each path through the tokenstream
graph as an individual entry. This places an upper-bound on how many expansions will be added for a single
suggestion. The default is -1 which means there is no limit.
preservePositionIncrements: If true, the suggester will preserve position increments. This means that when
token filters leave gaps (for example, when StopFilter matches a stopword), those positions will be respected
when building the suggester. The default is false.
maxEdits: The maximum number of string edits allowed. The system's hard limit is 2. The default is 1.
transpositions: If true, the default, transpositions should be treated as a primitive edit operation.
nonFuzzyPrefix: The length of the common non fuzzy prefix match which must match a suggestion. The
default is 1.
minFuzzyLength: The minimum length of query before which any string edits will be allowed. The default is 3.
unicodeAware: If true, maxEdits, minFuzzyLength, transpositions and nonFuzzyPrefix parameters will be
measured in unicode code points (actual letters) instead of bytes. The default is false.
AnalyzingInfixSuggesterFactory
Analyzes the input text and then suggests matches based on prefix matches to any tokens in the indexed text. This
uses a Lucene index for its dictionary.
This implementation uses the following additional properties.
indexPath: When using AnalyzingInfixSuggester you can provide your own path where the index will get built.
The default is analyzingInfixSuggesterIndexDir and will be created in your collection's data directory.
minPrefixChars: Minimum number of leading characters before PrefixQuery is used (default is 4). Prefixes
shorter than this are indexed as character ngrams (increasing index size but making lookups faster).
BlendedInfixLookupFactory
An extension of the AnalyzingInfixSuggester which provides additional functionality to weight prefix matches across
the matched documents. You can tell it to score higher if a hit is closer to the start of the suggestion or vice versa.
This implementation uses the following additional properties:
blenderType: used to calculate weight coefficient using the position of the first matching word.
linear: weightFieldValue*(1 - 0.10*position): Matches to the start will be given a higher score (Default)
reciprocal: weightFieldValue/(1+position): Matches to the end will be given a higher score.
numFactor: The factor to multiply the number of searched elements from which results will be pruned. Default
is 10.
indexPath: When using BlendedInfixSuggester you can provide your own path where the index will get built.
The default directory name is blendedInfixSuggesterIndexDir and will be created in your collection's data
directory.
minPrefixChars: Minimum number of leading characters before PrefixQuery is used (default 4). Prefixes
shorter than this are indexed as character ngrams (increasing index size but making lookups faster).
FreeTextSuggesterFactory
This suggester looks at the last tokens plus the prefix of whatever final token the user is typing, if present, to predict the most
likely next token. The number of previous tokens that need to be considered can also be specified. This suggester
would only be used as a fallback, when the primary suggester fails to find any suggestions.
This implementation uses only one additional parameter:
ngrams: The maximum number of tokens used to make the dictionary. The default value is 2.
Increasing this means more than the previous 2 tokens will be taken into consideration when
making the suggestions.
FSTLookupFactory
An automaton-based lookup. This implementation is slower to build, but provides the lowest memory cost. We
recommend using this implementation unless you need more sophisticated matching results, in which case you
should use the Jaspell implementation.
This implementation uses the following additional properties:
exactMatchFirst: If true, the default, exact suggestions are returned first, even if they are prefixes of other
strings in the FST that have larger weights.
weightBuckets: The number of separate buckets for weights which the suggester will use while building its
dictionary.
TSTLookupFactory
A simple compact ternary trie based lookup.
WFSTLookupFactory
A weighted automaton representation which is an alternative to FSTLookup for more fine-grained ranking.
WFSTLookup does not use buckets, but instead a shortest path algorithm. Note that it expects weights to be whole
numbers. If weight is missing it's assumed to be 1.0. Weights affect the sorting of matching suggestions when
spellcheck.onlyMorePopular=true is selected: weights are treated as a "popularity" score, with higher weights
preferred over suggestions with lower weights.
JaspellLookupFactory
A more complex lookup based on a ternary trie from the JaSpell project. Use this implementation if you need more
sophisticated matching results.
Dictionary Implementations
The dictionary implementations define how terms are stored. There are several options, and multiple dictionaries
can be used in a single request if necessary.
DocumentDictionaryFactory
A dictionary with terms, weights, and an optional payload taken from the index.
This dictionary implementation takes the following parameters in addition to parameters described for the Suggester
generally and for the lookup implementation:
weightField: A field that is stored or a numeric DocValue field. This field is required.
payloadField: The payload field should be a field that is stored. This field is optional.
DocumentExpressionDictionaryFactory
This dictionary implementation is the same as the DocumentDictionaryFactory but allows users to specify an
arbitrary expression into the 'weightExpression' tag.
This dictionary implementation takes the following parameters in addition to parameters described for the Suggester
generally and for the lookup implementation:
weightField: A field that is stored or a numeric DocValue field. This field is required.
payloadField: The payload field should be a field that is stored. This field is optional.
weightExpression: An arbitrary expression used for scoring the suggestions. The fields used must be numeric
fields.
HighFrequencyDictionaryFactory
This dictionary implementation allows adding a threshold to prune out less frequent terms in cases where very
common terms may overwhelm other terms.
This dictionary implementation takes one parameter in addition to parameters described for the Suggester generally
and for the lookup implementation:
threshold: A value between zero and one representing the minimum fraction of the total documents where a
term should appear in order to be added to the lookup dictionary.
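As a sketch, such a dictionary could be configured as shown below; the suggester name, the threshold value, and the reuse of the cat field from the earlier example are illustrative assumptions, not prescribed values:
<lst name="suggester">
  <str name="name">highFreqSuggester</str>
  <str name="lookupImpl">FuzzyLookupFactory</str>
  <str name="dictionaryImpl">HighFrequencyDictionaryFactory</str>
  <str name="field">cat</str>
  <float name="threshold">0.01</float>
  <str name="suggestAnalyzerFieldType">string</str>
</lst>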
FileDictionaryFactory
This dictionary implementation allows using an external file that contains suggest entries. Weights and payloads can
also be used.
If using a dictionary file, it should be a plain text file in UTF-8 encoding. Blank lines and lines that start with a '#' are
ignored. You can use both single terms and phrases in the dictionary file. If adding weights or payloads, those
should be separated from terms using the delimiter defined with the fieldDelimiter property (the default is '\t',
the tab character).
This dictionary implementation takes one parameter in addition to parameters described for the Suggester generally
and for the lookup implementation:
fieldDelimiter: Specify the delimiter to be used separating the entries, weights and payloads. The default is
tab ('\t').
# This is a sample dictionary file.
acquire
accidentally\t2.0
accommodate\t3.0
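A hypothetical suggester definition using such a file might look like the following; the suggester name and the suggest_dictionary.txt path are illustrative:
<lst name="suggester">
  <str name="name">fileSuggester</str>
  <str name="lookupImpl">FuzzyLookupFactory</str>
  <str name="dictionaryImpl">FileDictionaryFactory</str>
  <str name="sourceLocation">suggest_dictionary.txt</str>
</lst>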
Multiple Dictionaries
It is possible to include multiple dictionaryImpl definitions in a single SuggestComponent definition.
To do this, simply define separate suggesters, as in this example:
<searchComponent name="suggest" class="solr.SuggestComponent">
<lst name="suggester">
<str name="name">suggester1</str>
<str name="lookupImpl">FuzzyLookupFactory</str>
<str name="dictionaryImpl">DocumentDictionaryFactory</str>
<str name="field">cat</str>
<str name="weightField">price</str>
<str name="suggestAnalyzerFieldType">string</str>
</lst>
<lst name="suggester">
<str name="name">suggester2 </str>
<str name="dictionaryImpl">DocumentExpressionDictionaryFactory</str>
<str name="lookupImpl">FuzzyLookupFactory</str>
<str name="field">product_name</str>
<str name="weightExpression">((price * 2) + ln(popularity))</str>
<str name="sortField">weight</str>
<str name="sortField">price</str>
<str name="storeDir">suggest_fuzzy_doc_expr_dict</str>
<str name="suggestAnalyzerFieldType">text_en</str>
</lst>
</searchComponent>
When using these Suggesters in a query, you would define multiple 'suggest.dictionary' parameters in the request,
referring to the names given for each Suggester in the search component definition. The response will include the
terms in sections for each Suggester. See the Examples section below for an example request and response.
Adding the Suggest Request Handler
After adding the search component, a request handler must be added to solrconfig.xml. This request handler
works the same as any other request handler, and allows you to configure default parameters for serving suggestion
requests. The request handler definition must incorporate the "suggest" search component defined previously.
<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
<lst name="defaults">
<str name="suggest">true</str>
<str name="suggest.count">10</str>
</lst>
<arr name="components">
<str>suggest</str>
</arr>
</requestHandler>
Suggest Request Handler Parameters
The following parameters allow you to set defaults for the Suggest request handler:
Parameter Description
suggest=true This parameter should always be true, because we always want to run the Suggester for
queries submitted to this handler.
suggest.dictionary The name of the dictionary component configured in the search component. This is a
mandatory parameter. It can be set in the request handler, or sent as a parameter at query
time.
suggest.q The query to use for suggestion lookups.
suggest.count Specifies the number of suggestions for Solr to return.
suggest.build If true, it will build the suggester index. This is likely useful only for initial requests; you would
probably not want to build the dictionary on every request, particularly in a production
system. If you would like to keep your dictionary up to date, you should use the buildOnCommit
or buildOnOptimize parameter for the search component.
suggest.reload If true, it will reload the suggester index.
buildAll If true, it will build all suggester indexes.
reloadAll If true, it will reload all suggester indexes.
These properties can also be overridden at query time, or not set in the request handler at all and always sent at
query time.
Example Usages
Get Suggestions with Weights
This is the basic suggestion using a single dictionary and a single Solr core.
Example query:
http://localhost:8983/solr/suggest?suggest=true&suggest.build=true&suggest.dictionary=
suggest1&suggest.q=elec&wt=json
In this example, we've simply requested the string 'elec' with the suggest.q parameter and requested that the
suggestion dictionary be built with suggest.build (note, however, that you would likely not want to build the index on
every query - instead you should use buildOnCommit or buildOnOptimize if you have regularly changing
documents).
Example response:
{
"responseHeader": {
"status": 0,
"QTime": 35
},
"command": "build",
"suggest": {
"suggest1": {
"elec": {
"numFound": 3,
"suggestions": [
{
"term": "electronics and computer1",
"weight": 2199,
"payload": ""
},
{
"term": "electronics",
"weight": 649,
"payload": ""
},
{
"term": "electronics and stuff2",
"weight": 279,
"payload": ""
}
]
}
}
}
}
Suggestions in a Distributed System
It is possible to get suggestions in SolrCloud mode, using the shards.qt parameter:
http://localhost:8983/solr/suggest?suggest.dictionary=suggester2&suggest=true&suggest.
build=true&suggest.q=elec&shards=localhost:8983/solr,localhost:7574/solr&shards.qt=/su
ggest
Multiple Dictionaries
If you have defined multiple dictionaries, you can use them in queries.
Example query:
http://localhost:8983/solr/suggest?suggest=true&suggest.dictionary=suggester1&suggest.
dictionary=suggester2&suggest.q=elec
In this example we have sent the string 'elec' as the suggest.q parameter and named two suggest.dictionary
definitions to be used.
Example response:
{
responseHeader: {
status: 0,
QTime: 3
},
suggest: {
suggester1: {
e: {
numFound: 1,
suggestions: [
{
term: "electronics and computer1",
weight: 100,
payload: ""
}
]
}
},
suggester2: {
e: {
numFound: 1,
suggestions: [
{
term: "electronics and computer1",
weight: 10,
payload: ""
}
]
}
}
}
}
MoreLikeThis
The MoreLikeThis search component enables users to query for
documents similar to a document in their result list. It does this by
using terms from the original document to find similar documents in
the index.
There are three ways to use MoreLikeThis. The first, and most
common, is to use it as a request handler. In this case, you would
send text to the MoreLikeThis request handler as needed (as in when
a user clicked on a "similar documents" link). The second is to use it
as a search component. This is less desirable since it performs the
MoreLikeThis analysis on every document returned. This may slow
search results. The final approach is to use it as a request handler but
with externally supplied text. This case, also referred to as the
MoreLikeThisHandler, will supply information about similar
documents in the index based on the text of the input document.
Covered in this section:
How MoreLikeThis Works
Common Parameters for MoreLikeThis
Parameters for the MoreLikeThisComponent
Parameters for the MoreLikeThisHandler
Related Topics
How MoreLikeThis Works
MoreLikeThis constructs a Lucene query based on terms in a document. It does this by pulling terms from the
defined list of fields (see the mlt.fl parameter, below). For best results, the fields should have stored term vectors
in schema.xml. For example:
<field name="cat" ... termVectors="true" />
If term vectors are not stored, MoreLikeThis will generate terms from stored fields. A uniqueKey must also be
stored in order for MoreLikeThis to work properly.
The next phase filters terms from the original document using thresholds defined with the MoreLikeThis parameters.
Finally, a query is run with these terms, and any other query parameters that have been defined (see the mlt.qf
parameter, below), and a new document set is returned.
Common Parameters for MoreLikeThis
The table below summarizes the MoreLikeThis parameters supported by Lucene/Solr. These parameters can be
used with any of the three possible MoreLikeThis approaches.
Parameter Description
mlt.fl Specifies the fields to use for similarity. If possible, these should have stored termVectors.
mlt.mintf Specifies the Minimum Term Frequency, the frequency below which terms will be ignored in the
source document.
mlt.mindf Specifies the Minimum Document Frequency, the frequency at which words will be ignored which do
not occur in at least this many documents.
mlt.maxdf Specifies the Maximum Document Frequency, the frequency at which words will be ignored which
occur in more than this many documents. New in Solr 4.1
mlt.minwl Sets the minimum word length below which words will be ignored.
mlt.maxwl Sets the maximum word length above which words will be ignored.
mlt.maxqt Sets the maximum number of query terms that will be included in any generated query.
mlt.maxntp Sets the maximum number of tokens to parse in each example document field that is not stored with
TermVector support.
mlt.boost Specifies if the query will be boosted by the interesting term relevance. It can be either "true" or
"false".
mlt.qf Query fields and their boosts using the same format as that used by the DisMaxRequestHandler.
These fields must also be specified in mlt.fl.
Parameters for the MoreLikeThisComponent.
In Solr 4.1, MoreLikeThis supports distributed search.
Using MoreLikeThis as a search component returns similar documents for each document in the response set. In
addition to the common parameters, these additional options are available:
Parameter Description
mlt If set to true, activates the MoreLikeThis component and enables Solr to return MoreLikeThis
results.
mlt.count Specifies the number of similar documents to be returned for each result. The default value is 5.
Parameters for the MoreLikeThisHandler
The table below summarizes parameters accessible through the MoreLikeThisHandler. It supports faceting,
paging, and filtering using common query parameters, but does not work well with alternate query parsers.
Parameter Description
mlt.match.include Specifies whether or not the response should include the matched document. If set to
false, the response will look like a normal select response.
mlt.match.offset Specifies an offset into the main query search results to locate the document on which the
MoreLikeThis query should operate. By default, the query operates on the first result for
the q parameter.
mlt.interestingTerms Controls how the MoreLikeThis component presents the "interesting" terms (the top
TF/IDF terms) for the query. Supports three settings. The setting list lists the terms. The
setting none lists no terms. The setting details lists the terms along with the boost value
used for each term. Unless mlt.boost=true, all terms will have boost=1.0.
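For example, assuming a MoreLikeThisHandler has been registered at /mlt in solrconfig.xml (the handler path, document id, and field choices here are illustrative, drawn from the example data), a request might look like:
http://localhost:8983/solr/mlt?q=id:SP2514N&mlt.fl=manu,cat&mlt.mintf=1&mlt.mindf=1&fl=id,name&wt=json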
Related Topics
RequestHandlers and SearchComponents in SolrConfig
Pagination of Results
Basic Pagination
In most search application usage, the "top" matching results (sorted by score, or some other criteria) are then
displayed to some human user. In many applications the UI for these sorted results are displayed to the user in
"pages" containing a fixed number of matching results, and users don't typically look at results past the first few
pages worth of results.
In Solr, this basic paginated searching is supported using the start and rows parameters, and performance of this
common behaviour can be tuned by utilizing the queryResultCache and adjusting the queryResultWindowSize
configuration options based on your expected page sizes.
Basic Pagination Examples
The easiest way to think about simple pagination is to simply multiply the page number you want (treating the "first"
page number as "0") by the number of rows per page, as in the following pseudo-code:
function fetch_solr_page($page_number, $rows_per_page) {
  $start = $page_number * $rows_per_page
  $params = [ q => $some_query, rows => $rows_per_page, start => $start ]
  return fetch_solr($params)
}
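For example, with 10 rows per page, the third page (page number 2 when counting from zero) works out to start = 2 * 10 = 20; against the example server such a request might look like:
http://localhost:8983/solr/select?q=*:*&rows=10&start=20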
How Basic Pagination is Affected by Index Updates
The start param specified in a request to Solr indicates an absolute "offset" in the complete sorted list of matches
that the client wants Solr to use as the beginning of the current "page". If an index modification (such as adding or
removing documents) which affects the sequence of ordered documents matching a query occurs in between two
requests from a client for subsequent pages of results, then it is possible that these modifications can result in the
same document being returned on multiple pages, or documents being "skipped" as the result set shrinks or grows.
For example: consider an index containing 26 documents like so:
id name
1 A
2 B
...
26 Z
Followed by the following requests & index modifications interleaved:
A client requests q=*:*&rows=5&start=0&sort=name asc
Documents with the ids 1-5 will be returned to the client
Document id 3 is deleted
The client requests "page #2" using q=*:*&rows=5&start=5&sort=name asc
Documents 7-11 will be returned
Document 6 has been skipped, since it is now the 5th document in the sorted set of all matching
results, and would be returned on a new request for "page #1"
3 new documents are now added with the ids 90, 91, and 92; all three documents have a name of A
The client requests "page #3" using q=*:*&rows=5&start=10&sort=name asc
Documents 9-13 will be returned
Documents 9, 10, and 11 have now been returned on both page #2 and page #3 since they moved
farther back in the list of sorted results
In typical situations these impacts from index changes on paginated searching don't significantly affect user
experience -- either because they happen extremely infrequently in fairly static collections, or because the users
recognize that the collection of data is constantly evolving and expect to see documents shift up and down in the
result sets.
Performance Problems with "Deep Paging"
In some situations, the results of a Solr search are not destined for a simple paginated user interface. When you
wish to fetch a very large number of sorted results from Solr to feed into an external system, using very large values
for the start or rows parameters can be very inefficient. Pagination using start and rows not only requires Solr
to compute (and sort) in memory all of the matching documents that should be fetched for the current page, but also
all of the documents that would have appeared on previous pages. So while a request for start=0&rows=1000000
may be obviously inefficient because it requires Solr to maintain & sort in memory a set of 1 million documents,
likewise a request for start=999000&rows=1000 is equally inefficient for the same reasons. Solr can't compute
which matching document is the 999001st result in sorted order, without first determining what the first 999000
matching sorted results are.
Fetching A Large Number of Sorted Results: Cursors
As an alternative to increasing the "start" parameter to request subsequent pages of sorted results, Solr supports
using a "Cursor" to scan through results. Cursors in Solr are a logical concept, that doesn't involve caching any
state information on the server. Instead the sort values of the last document returned to the client are used to
compute a "mark" representing a logical point in the ordered space of sort values. That "mark" can be specified in
the parameters of subsequent requests to tell Solr where to continue.
Using Cursors
To use a cursor with Solr, specify a cursorMark parameter with the value of "*". You can think of this being
analogous to start=0 as a way to tell Solr "start at the beginning of my sorted results", except that it also informs
Solr that you want to use a Cursor. So in addition to returning the top N sorted results (where you can control N
using the rows parameter) the Solr response will also include an encoded String named nextCursorMark. You
then take the nextCursorMark String value from the response, and pass it back to Solr as the cursorMark
parameter for your next request. You can repeat this process until you've fetched as many docs as you want, or until
the nextCursorMark returned matches the cursorMark you've already specified -- indicating that there are no
more results.
Constraints when using Cursors
There are a few important constraints to be aware of when using the cursorMark parameter in a Solr request:
cursorMark and start are mutually exclusive parameters
Your requests must either not include a start parameter, or it must be specified with a value of "0".
sort clauses must include the uniqueKey field (either "asc" or "desc")
If id is your uniqueKey field, then sort params like id asc and name asc, id desc would both
work fine, but name asc by itself would not
Cursor mark values are computed based on the sort values of each document in the result, which means multiple
documents with identical sort values will produce identical Cursor mark values if one of them is the last document on
a page of results. In that situation, the subsequent request using that cursorMark would not know which of the
documents with the identical mark values should be skipped. Requiring that the uniqueKey field be used as a clause
in the sort criteria guarantees that a deterministic ordering will be returned, and that every cursorMark value will
identify a unique point in the sequence of documents.
Cursor Examples
Fetch All Docs
The pseudo-code shown here shows the basic logic involved in fetching all documents matching a query using a
cursor:
// when fetching all docs, you might as well use a simple id sort
// unless you really need the docs to come back in a specific order
$params = [ q => $some_query, sort => 'id asc', rows => $r, cursorMark => '*' ]
$done = false
while (not $done) {
$results = fetch_solr($params)
// do something with $results
if ($params[cursorMark] == $results[nextCursorMark]) {
$done = true
}
$params[cursorMark] = $results[nextCursorMark]
}
Using SolrJ, this pseudo-code would be:
SolrQuery q = (new SolrQuery(some_query)).setRows(r).setSort(SortClause.asc("id"));
String cursorMark = CursorMarkParams.CURSOR_MARK_START;
boolean done = false;
while (! done) {
q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
QueryResponse rsp = solrServer.query(q);
String nextCursorMark = rsp.getNextCursorMark();
doCustomProcessingOfResults(rsp);
if (cursorMark.equals(nextCursorMark)) {
done = true;
}
cursorMark = nextCursorMark;
}
If you wanted to do this by hand using curl, the sequence of requests would look something like this:
$ curl '...&rows=10&sort=id+asc&cursorMark=*'
{
"response":{"numFound":32,"start":0,"docs":[
// ... 10 docs here ...
]},
"nextCursorMark":"AoEjR0JQ"}
$ curl '...&rows=10&sort=id+asc&cursorMark=AoEjR0JQ'
{
"response":{"numFound":32,"start":0,"docs":[
// ... 10 more docs here ...
]},
"nextCursorMark":"AoEpVkRCREIxQTE2"}
$ curl '...&rows=10&sort=id+asc&cursorMark=AoEpVkRCREIxQTE2'
{
"response":{"numFound":32,"start":0,"docs":[
// ... 10 more docs here ...
]},
"nextCursorMark":"AoEmbWF4dG9y"}
$ curl '...&rows=10&sort=id+asc&cursorMark=AoEmbWF4dG9y'
{
"response":{"numFound":32,"start":0,"docs":[
// ... 2 docs here because we've reached the end.
]},
"nextCursorMark":"AoEpdmlld3Nvbmlj"}
$ curl '...&rows=10&sort=id+asc&cursorMark=AoEpdmlld3Nvbmlj'
{
"response":{"numFound":32,"start":0,"docs":[
// no more docs here, and note that the nextCursorMark
// matches the cursorMark param we used
]},
"nextCursorMark":"AoEpdmlld3Nvbmlj"}
Fetch first N docs, Based on Post Processing
Since the cursor is stateless from Solr's perspective, your client code can stop fetching additional results as soon as
you have decided you have enough information:
while (! done) {
q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
QueryResponse rsp = solrServer.query(q);
String nextCursorMark = rsp.getNextCursorMark();
boolean hadEnough = doCustomProcessingOfResults(rsp);
if (hadEnough || cursorMark.equals(nextCursorMark)) {
done = true;
}
cursorMark = nextCursorMark;
}
How cursors are Affected by Index Updates
Unlike basic pagination, Cursor pagination does not rely on using an absolute "offset" into the complete sorted list
of matching documents. Instead, the cursorMark specified in a request encapsulates information about the relative
position of the last document returned, based on the absolute sort values of that document. This means that the
impact of index modifications is much smaller when using a cursor compared to basic pagination.
Consider the same example index described when discussing basic pagination:
id name
1 A
2 B
...
26 Z
A client requests q=*:*&rows=5&start=0&sort=name asc, id asc&cursorMark=*
Documents with the ids 1-5 will be returned to the client in order
Document id 3 is deleted
The client requests 5 more documents using the nextCursorMark from the previous response
Documents 6-10 will be returned -- the deletion of a document that's already been returned doesn't
affect the relative position of the cursor
3 new documents are now added with the ids 90, 91, and 92; all three documents have a name of A
The client requests 5 more documents using the nextCursorMark from the previous response
Documents 11-15 will be returned -- the addition of new documents with sort values already past
does not affect the relative position of the cursor
Document id 1 is updated to change its 'name' to Q
Document id 17 is updated to change its 'name' to A
The client requests 5 more documents using the nextCursorMark from the previous response
The resulting documents are 16,1,18,19,20 in that order
Because the sort value of document 1 changed so that it is after the cursor position, the document is
returned to the client twice
Because the sort value of document 17 changed so that it is before the cursor position, the document
has been "skipped" and will not be returned to the client as the cursor continues to progress
In a nutshell: When fetching all results matching a query using cursorMark, the only way index modifications can
result in a document being skipped, or returned twice, is if the sort value of the document changes.
"Tailing" a Cursor
Because Cursor requests are stateless, and the cursorMark values encapsulate the absolute sort values of the
last document returned from a search, it's possible to "continue" fetching additional results from a cursor that has
already reached its end -- if new documents are added (or existing documents are updated) to the end of the
results. You can think of this as similar to using "tail -f" in Unix.
One way to ensure that a document will never be returned more than once is to use the uniqueKey field as
the primary (and therefore: only significant) sort criteria.
In this situation, you will be guaranteed that each document is only returned once, no matter how it may
be modified during the use of the cursor.
The most common example of how this can be useful is when you have a "timestamp" field recording when a
document has been added/updated in your index. Client applications can continuously poll a cursor using
sort=timestamp asc, id asc for documents matching a query, and always be notified when a document is added or
updated matching the request criteria. Another common example is when you have uniqueKey values that always
increase as new documents are created, and you can continuously poll a cursor using sort=id asc to be notified
about new documents.
The pseudo-code for tailing a cursor is only a slight modification of our earlier example for processing all docs
matching a query:
while (true) {
$doneForNow = false
while (not $doneForNow) {
$results = fetch_solr($params)
// do something with $results
if ($params[cursorMark] == $results[nextCursorMark]) {
$doneForNow = true
}
$params[cursorMark] = $results[nextCursorMark]
}
sleep($some_configured_delay)
}
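The same tailing loop can be sketched in SolrJ by adapting the earlier example; the timestamp field, the secondary id sort, and the someConfiguredDelayMillis value are illustrative assumptions, not part of the example schema:
SolrQuery q = (new SolrQuery(some_query)).setRows(r)
    .setSort(SortClause.asc("timestamp")).addSort(SortClause.asc("id"));
String cursorMark = CursorMarkParams.CURSOR_MARK_START;
while (true) {
  boolean doneForNow = false;
  while (! doneForNow) {
    q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
    QueryResponse rsp = solrServer.query(q);
    String nextCursorMark = rsp.getNextCursorMark();
    doCustomProcessingOfResults(rsp);
    if (cursorMark.equals(nextCursorMark)) {
      // no new documents since the last poll; pause before polling again
      doneForNow = true;
    }
    cursorMark = nextCursorMark;
  }
  Thread.sleep(someConfiguredDelayMillis);
}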
Result Grouping
Result Grouping groups documents with a common field value into groups and returns the top documents for each
group. For example, if you searched for "DVD" on an electronic retailer's e-commerce site, you might be returned
three categories such as "TV and Video," "Movies," and "Computers," with three results per category. In this case,
the query term "DVD" appeared in all three categories, so Solr groups them together in order to increase relevancy
for the user.
Result Grouping is separate from Faceting. Though it is conceptually similar, faceting returns all relevant results and
allows the user to refine the results based on the facet category. For example, if you searched for "shoes" on a
footwear retailer's e-commerce site, you would be returned all results for that query term, along with selectable
facets such as "size," "color," "brand," and so on.
However, with Solr 4 you can also group facets. Grouped faceting works with the first group.field parameter,
and other group.field parameters are ignored.
Grouped faceting supports facet.field and facet.range but currently doesn't support date and pivot faceting.
Grouped faceting differs from non-grouped facets, where (sum of all facets) == (total of products with that property), as
shown in the following example:
Object 1
name: Phaser 4620a
ppm: 62
product_range: 6
Object 2
name: Phaser 4620i
ppm: 65
product_range: 6
Object 3
name: ML6512
ppm: 62
product_range: 7
If you ask Solr to group these documents by "product_range", then the total amount of groups is 2, but the facets for
ppm are 2 for 62 and 1 for 65.
Request Parameters
Result Grouping takes the following request parameters. Any number of these request parameters can be included
in a single request:
Parameter Type Description
group Boolean If true, query results will be grouped.
group.field string The name of the field by which to group results. The field must be
single-valued, and either be indexed or a field type that has a value
source and works in a function query, such as ExternalFileField. It
must also be a string-based field, such as StrField or TextField
group.func query Group based on the unique values of a function query. Supported since
Solr 4.0.
group.query query Return a single group of documents that match the given query.
rows integer The number of groups to return. The default value is 10.
start integer Specifies an initial offset for the list of groups.
group.limit integer Specifies the number of results to return for each group. The default
value is 1.
group.offset integer Specifies an initial offset for the document list of each group.
sort sortspec Specifies how Solr sorts the groups relative to each other. For example,
sort=popularity desc will cause the groups to be sorted according
to the highest popularity document in each group. The default value is
score desc.
group.sort sortspec Specifies how Solr sorts documents within a single group. The default
value is score desc.
group.format grouped/simple If this parameter is set to simple, the grouped documents are
presented in a single flat list, and the start and rows parameters
affect the numbers of documents instead of groups.
group.main Boolean If true, the result of the first field grouping command is used as the main
result list in the response, using .group.format=simple
group.ngroups Boolean If true, Solr includes the number of groups that have matched the query
in the results. The default value is false.
group.truncate Boolean If true, facet counts are based on the most relevant document of each
group matching the query. The default value is false.
group.facet Boolean Determines whether to compute grouped facets for the field facets
specified in facet.field parameters. Grouped facets are computed based
on the first specified group. As with normal field faceting, fields shouldn't
be tokenized (otherwise counts are computed for each token). Grouped
faceting supports single and multivalued fields. Default is false. New with
Solr 4.
group.cache.percent integer
between 0 and
100
Setting this parameter to a number greater than 0 enables caching for
result grouping. Result Grouping executes two searches; this option
caches the second search. The default value is 0. Testing has shown
that group caching only improves search time with Boolean, wildcard,
and fuzzy queries. For simple queries like term or "match all" queries,
group caching degrades performance.
Any number of group commands (group.field, group.func, group.query) may be specified in a single
request.
Grouping is also supported for distributed searches. Currently group.func is the only parameter that doesn't
support distributed searches.
Examples
All of the following examples work with the data provided in the Solr Example directory.
Grouping Results by Field
In this example, we will group results based on the manu_exact field, which specifies the manufacturer of the items
in the sample dataset.
http://localhost:8983/solr/select?wt=json&indent=true&fl=id,name&q=solr+memory&group=
true&group.field=manu_exact
{
...
"grouped":{
"manu_exact":{
"matches":6,
"groups":[{
"groupValue":"Apache Software Foundation",
"doclist":{"numFound":1,"start":0,"docs":[
{
"id":"SOLR1000",
"name":"Solr, the Enterprise Search Server"}]
}},
{
"groupValue":"Corsair Microsystems Inc.",
"doclist":{"numFound":2,"start":0,"docs":[
{
"id":"VS1GB400C3",
"name":"CORSAIR ValueSelect 1GB 184-Pin DDR SDRAM Unbuffered DDR 400 (PC
3200) System Memory - Retail"}]
}},
{
"groupValue":"A-DATA Technology Inc.",
"doclist":{"numFound":1,"start":0,"docs":[
{
"id":"VDBDB1A16",
"name":"A-DATA V-Series 1GB 184-Pin DDR SDRAM Unbuffered DDR 400 (PC
3200) System Memory - OEM"}]
}},
{
"groupValue":"Canon Inc.",
"doclist":{"numFound":1,"start":0,"docs":[
{
"id":"0579B002",
"name":"Canon PIXMA MP500 All-In-One Photo Printer"}]
}},
{
"groupValue":"ASUS Computer Inc.",
"doclist":{"numFound":1,"start":0,"docs":[
{
"id":"EN7800GTX/2DHTV/256M",
"name":"ASUS Extreme N7800GTX/2DHTV (256 MB)"}]
}
}
]
}
}
The response indicates that there are six total matches for our query. For each unique value of group.field, Solr
returns a docList with the top scoring document. The docList also includes the total number of matches in that
group as the numFound value. The groups are sorted by the score of the top document within each group.
We can run the same query with the request parameter group.main=true. This will format the results as a single
flat document list. This flat format does not include as much information as the normal result grouping query results,
but it may be easier for existing Solr clients to parse.
http://localhost:8983/solr/select?wt=json&indent=true&fl=id,name,manufacturer&q=solr+memory&group=true&group.field=manu_exact&group.main=true
{
"responseHeader":{
"status":0,
"QTime":1,
"params":{
"fl":"id,name,manufacturer",
"indent":"true",
"q":"solr memory",
"group.field":"manu_exact",
"group.main":"true",
"group":"true",
"wt":"json"}},
"grouped":{},
"response":{"numFound":6,"start":0,"docs":[
{
"id":"SOLR1000",
"name":"Solr, the Enterprise Search Server"},
{
"id":"VS1GB400C3",
"name":"CORSAIR ValueSelect 1GB 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200)
System Memory - Retail"},
{
"id":"VDBDB1A16",
"name":"A-DATA V-Series 1GB 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200)
System Memory - OEM"},
{
"id":"0579B002",
"name":"Canon PIXMA MP500 All-In-One Photo Printer"},
{
"id":"EN7800GTX/2DHTV/256M",
"name":"ASUS Extreme N7800GTX/2DHTV (256 MB)"}]
}
}
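Other request parameters can be combined with field grouping in the same way. For example, a hypothetical variation of the field-grouping request above that returns up to three documents per manufacturer and also reports the total number of groups (using group.limit and group.ngroups) might be:
http://localhost:8983/solr/select?wt=json&indent=true&fl=id,name&q=solr+memory&group=true&group.field=manu_exact&group.limit=3&group.ngroups=true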
Grouping by Query
In this example, we will use the group.query parameter to find the top three results for "memory" in two different
price ranges: 0.00 to 99.99, and over 100.
http://localhost:8983/solr/select?wt=json&indent=true&fl=name,price&q=memory&group=tr
ue&group.query=price:[0+TO+99.99]&group.query=price:[100+TO+*]&group.limit=3
{
"responseHeader":{
"status":0,
"QTime":42,
"params":{
"fl":"name,price",
"indent":"true",
"q":"memory",
"group.limit":"3",
"group.query":["price:[0 TO 99.99]",
"price:[100 TO *]"],
"group":"true",
"wt":"json"}},
"grouped":{
"price:[0 TO 99.99]":{
"matches":5,
"doclist":{"numFound":1,"start":0,"docs":[
{
"name":"CORSAIR ValueSelect 1GB 184-Pin DDR SDRAM Unbuffered DDR 400 (PC
3200) System Memory - Retail",
"price":74.99}]
}},
"price:[100 TO *]":{
"matches":5,
"doclist":{"numFound":3,"start":0,"docs":[
{
"name":"CORSAIR XMS 2GB (2 x 1GB) 184-Pin DDR SDRAM Unbuffered DDR 400
(PC 3200) Dual Channel
Kit System Memory - Retail",
"price":185.0},
{
"name":"Canon PIXMA MP500 All-In-One Photo Printer",
"price":179.99},
{
"name":"ASUS Extreme N7800GTX/2DHTV (256 MB)",
"price":479.95}]
}
}
}
}
In this case, Solr found five matches for "memory," but only returns four results grouped by price. This is because
one result for "memory" did not have a price assigned to it.
Distributed Result Grouping
Solr also supports result grouping on distributed indexes. If you are using result grouping on the "/select" request
handler, you must provide the shards parameter described here. If you are using result grouping on a request
handler other than "/select", you must also provide the shards.qt parameter:
Parameter Description
shards Specifies the shards in your distributed indexing configuration. For more information about
distributed indexing, see Distributed Search with Index Sharding
shards.qt Specifies the request handler Solr uses for requests to shards. This parameter is not required for the
/select request handler.
For example: http://localhost:8983/solr/select?wt=json&indent=true&fl=id,name,manufactur
er&q=solr+memory&group=true&group.field=manu_exact&group.main=true&shards=solr-shard1
:8983/solr,solr-shard2:8983/solr
Collapse and Expand Results
The collapsing query parser and the expand component combine to form an approach to grouping documents for
field collapsing in search results.
Collapsing Query Parser
The CollapsingQParser is really a post filter that provides more performant field collapsing than Solr's standard
approach when the number of distinct groups in the result set is high. This parser collapses the result set to a single
document per group before it forwards the result set to the rest of the search components. So all downstream
components (faceting, highlighting, etc...) will work with the collapsed result set.
Collapse based on the highest scoring document:
fq={!collapse field=<field_name>}
Collapse based on the minimum value of a numeric field:
fq={!collapse field=<field_name> min=<field_name>}
Collapse based on the maximum value of a numeric field:
fq={!collapse field=<field_name> max=<field_name>}
Collapse based on the min/max value of a function. The cscore() function can be used with the
CollapsingQParserPlugin to return the score of the current document being collapsed.
fq={!collapse field=<field_name> max=sum(cscore(),field(A))}
Collapse with a null policy:
fq={!collapse field=<field_name> nullPolicy=<nullPolicy>}
There are three null policies:
ignore: removes documents with a null value in the collapse field. This is the default.
expand: treats each document with a null value in the collapse field as a separate group.
collapse: collapses all documents with a null value into a single group using either highest score, or
minimum/maximum.
The CollapsingQParserPlugin fully supports the QueryElevationComponent.
Expand Component
The ExpandComponent can be used to expand the groups that were collapsed by the CollapsingQParserPlugin.
Example usage with the CollapsingQParserPlugin:
q=foo&fq={!collapse field=ISBN}
In the query above, the CollapsingQParserPlugin will collapse the search results on the ISBN field. The main search
results will contain the highest ranking document from each book.
The ExpandComponent can now be used to expand the results so you can see the documents grouped by ISBN.
For example:
q=foo&fq={!collapse field=ISBN}&expand=true
The “expand=true” parameter turns on the ExpandComponent. The ExpandComponent adds a new section to the
search output labeled “expanded”.
Inside the expanded section there is a map with each group head pointing to the expanded documents that are
within the group. As applications iterate the main collapsed result set, they can access the expanded map to retrieve
the expanded groups.
The ExpandComponent has the following parameters:
Parameter Description Default
expand.sort Orders the documents within the expanded groups. score desc
expand.rows The number of rows to display in each group. 5
expand.q Overrides the main q parameter, determines which documents to include in the main group. main q
expand.fq Overrides main fq's, determines which documents to include in the main group. main fq's
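For example, a hypothetical request that collapses on the ISBN field and expands each group to its five highest-priced documents (the field names and values here are illustrative) might be:
q=foo&fq={!collapse field=ISBN}&expand=true&expand.rows=5&expand.sort=price desc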
Result Clustering
The clustering (or cluster analysis) plugin attempts to automatically discover
groups of related search hits (documents) and assign human-readable labels to
these groups. By default in Solr, the clustering algorithm is applied to the search
result of each single query—this is called on-line clustering. While Solr
contains an extension for full-index clustering (off-line clustering), this section
will focus on discussing on-line clustering only.
Clusters discovered for a given query can be perceived as dynamic facets. This
is beneficial when regular faceting is difficult (field values are not known in
advance) or when the queries are exploratory in nature. Take a look at the
Carrot2 project's demo page to see an example of search results clustering in action
(the groups in the visualization have been discovered automatically in search
results to the right, there is no external information involved).
The query issued to the system was Solr. It seems clear that faceting could not
yield a similar set of groups, although the goals of both techniques are
similar—to let the user explore the set of search results and either rephrase the
query or narrow the focus to a subset of current documents. Clustering is also
similar to Result Grouping in that it can help to look deeper into search results,
beyond the top few hits.
Topics covered in this section:
Preliminary Concepts
Quick Start Example
Installation
Configuration
Tweaking Algorithm Settings
Performance Considerations
Additional Resources
Preliminary Concepts
Each document passed to the clustering component is composed of several logical parts:
a unique identifier,
origin URL,
the title,
the main content,
a language code of the title and content.
The identifier part is mandatory, everything else is optional but at least one of the text fields (title or content) will be
required to make the clustering process reasonable. It is important to remember that logical document parts must be
mapped to a particular schema and its fields. The content (text) for clustering can be sourced from either a stored
text field or context-filtered using a highlighter; all these options are explained below in the configuration section.
A clustering algorithm is the actual logic (implementation) that discovers relationships among the documents in the
search result and forms human-readable cluster labels. Depending on the choice of the algorithm the clusters may
(and probably will) vary. Solr comes with several algorithms implemented in the open source Carrot2 project;
commercial alternatives also exist.
Quick Start Example
Assuming an unpacked, unmodified distribution of Solr, issue the following commands in the console window:
cd example
java -Dsolr.clustering.enabled=true -jar start.jar
This command uses the same configuration and index as the main Solr example, but it additionally enables the
clustering component contrib and a dedicated search handler configured to use it.
In a different console window, add some documents using the post tool (unless you have done so already):
cd example/exampledocs
java -jar post.jar *.xml
You can now try out the clustering handler by opening the following URL in a browser:
http://localhost:8983/solr/clustering?q=*:*&rows=100
The output XML should include search hits and an array of automatically discovered clusters at the end, resembling
the output shown here:
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">299</int>
</lst>
<result name="response" numFound="32" start="0" maxScore="1.0">
<doc>
<str name="id">GB18030TEST</str>
<str name="name">Test with some GB18030 encoded characters</str>
<arr name="features">
<str>No accents here</str>
<str></str>
<str>This is a feature (translated)</str>
<str></str>
<str>This document is very shiny (translated)</str>
</arr>
<float name="price">0.0</float>
<str name="price_c">0,USD</str>
<bool name="inStock">true</bool>
<long name="_version_">1448955395025403904</long>
<float name="score">1.0</float>
</doc>
<!-- more search hits, omitted -->
</result>
<arr name="clusters">
<lst>
<arr name="labels">
<str>DDR</str>
</arr>
<double name="score">3.9599865057283354</double>
<arr name="docs">
<str>TWINX2048-3200PRO</str>
<str>VS1GB400C3</str>
<str>VDBDB1A16</str>
</arr>
</lst>
<lst>
<arr name="labels">
<str>iPod</str>
</arr>
<double name="score">11.959228467119022</double>
<arr name="docs">
<str>F8V7067-APL-KIT</str>
<str>IW-02</str>
<str>MA147LL/A</str>
</arr>
</lst>
<!-- More clusters here, omitted. -->
<lst>
<arr name="labels">
<str>Other Topics</str>
</arr>
<double name="score">0.0</double>
<bool name="other-topics">true</bool>
<arr name="docs">
<str>adata</str>
<str>apple</str>
<str>asus</str>
<str>ati</str>
<!-- other unassigned document IDs here -->
</arr>
</lst>
</arr>
</response>
There were a few clusters discovered for this query (*:*), separating search hits into various categories: DDR,
iPod, Hard Drive, etc. Each cluster has a label and score that indicates the "goodness" of the cluster. The score is
algorithm-specific and is meaningful only in relation to the scores of other clusters in the same set. In other words, if
cluster A has a higher score than cluster B, cluster A should be of better quality (have a better label and/or more
coherent document set). Each cluster has an array of identifiers of documents belonging to it. These identifiers
correspond to the uniqueKey field declared in the schema.
Depending on the quality of input documents, some clusters may not make much sense. Some documents may be
left out and not be clustered at all; these will be assigned to the synthetic Other Topics group, marked with the
other-topics property set to true (see the XML dump above for an example). The score of the other topics group is
zero.
Installation
The clustering contrib extension requires dist/solr-clustering-*.jar and all JARs under
contrib/clustering/lib.
Configuration
Declaration of the Search Component and Request Handler
The clustering extension is a search component and must be declared in solrconfig.xml. Such a component can
then be appended to a request handler as the last component in the chain (because it requires search results which
must be previously fetched by the search component).
An example configuration could look as shown below.
Include the required contrib JARs. Note paths are relative to the Solr core so they may need adjustments to
your configuration.
<lib dir="../../../contrib/clustering/lib/" regex=".*\.jar" />
<lib dir="../../../dist/" regex="solr-clustering-\d.*\.jar" />
Declaration of the search component. Each component can also declare multiple clustering pipelines
("engines"), which can be selected at runtime.
<searchComponent name="clustering" class="solr.clustering.ClusteringComponent">
<!-- Lingo clustering algorithm -->
<lst name="engine">
<str name="name">lingo</str>
<str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
</lst>
<!-- An example definition for the STC clustering algorithm. -->
<lst name="engine">
<str name="name">stc</str>
<str
name="carrot.algorithm">org.carrot2.clustering.stc.STCClusteringAlgorithm</str>
</lst>
</searchComponent>
3. A request handler to which we append the clustering component declared above.
<requestHandler name="/clustering"
class="solr.SearchHandler">
<lst name="defaults">
<bool name="clustering">true</bool>
<bool name="clustering.results">true</bool>
<!-- Logical field to physical field mapping. -->
<str name="carrot.url">id</str>
<str name="carrot.title">doctitle</str>
<str name="carrot.snippet">content</str>
<!-- Configure any other request handler parameters. We will cluster the
top 100 search results so bump up the 'rows' parameter. -->
<str name="rows">100</str>
<str name="fl">*,score</str>
</lst>
<!-- Append clustering at the end of the list of search components. -->
<arr name="last-components">
<str>clustering</str>
</arr>
</requestHandler>
Configuration Parameters of the Clustering Component
The table below summarizes the parameters of each clustering engine or of the entire clustering component (depending on where they are declared).
Parameter Description
clustering When true, the clustering component is enabled.
clustering.engine Declares which clustering engine to use. If not present, the first declared engine becomes the default one.
clustering.results When true, the component will perform clustering of search results (this should be enabled).
clustering.collection When true, the component will perform clustering of the whole document index (this section does not cover full-index clustering).
At the engine declaration level, the following parameters are supported.
Parameter Description
carrot.algorithm The algorithm class.
carrot.resourcesDir Algorithm-specific resources and configuration files (stop words, other lexical resources, default settings). By default points to conf/clustering/carrot2/.
carrot.outputSubClusters If true and the algorithm supports hierarchical clustering, sub-clusters will also be emitted.
carrot.numDescriptions Maximum number of per-cluster labels to return (if the algorithm assigns more than one label to a cluster).
The carrot.algorithm parameter should contain a fully qualified class name of an algorithm supported by the Carrot2 framework. Currently, the following algorithms are available:
org.carrot2.clustering.lingo.LingoClusteringAlgorithm (open source)
org.carrot2.clustering.stc.STCClusteringAlgorithm (open source)
org.carrot2.clustering.kmeans.BisectingKMeansClusteringAlgorithm (open source)
com.carrotsearch.lingo3g.Lingo3GClusteringAlgorithm (commercial)
For a comparison of characteristics of these algorithms see the following links:
http://doc.carrot2.org/#section.advanced-topics.fine-tuning.choosing-algorithm
http://project.carrot2.org/algorithms.html
http://carrotsearch.com/lingo3g-comparison.html
The question of which algorithm to choose depends on the amount of traffic (STC is faster than Lingo, but arguably produces less intuitive clusters; Lingo3G is the fastest algorithm but is not free or open source), the expected result (Lingo3G provides hierarchical clusters, Lingo and STC provide flat clusters), and the input data (each algorithm will cluster the input slightly differently). There is no single answer as to which algorithm is "the best".
Contextual and Full Field Clustering
The clustering engine can apply clustering to the full content of (stored) fields or it can run an internal highlighter
pass to extract context-snippets before clustering. Highlighting is recommended when the logical snippet field
contains a lot of content (this would affect clustering performance). Highlighting can also increase the quality of
clustering because the content passed to the algorithm will be more focused around the query (it will be
query-specific context). The following parameters control the internal highlighter.
Parameter Description
carrot.produceSummary When true, the clustering component will run a highlighter pass on the content of the logical fields pointed to by carrot.title and carrot.snippet. Otherwise the full content of those fields will be clustered.
carrot.fragSize The size, in characters, of the snippets (aka fragments) created by the highlighter. If not specified, the default highlighting fragsize (hl.fragsize) will be used.
carrot.summarySnippets The number of summary snippets to generate for clustering. If not specified, the default highlighting snippet count (hl.snippets) will be used.
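For example, these parameters could be added to the defaults of the /clustering request handler shown earlier; the values below are only illustrative, not recommendations:
<lst name="defaults">
  <!-- Cluster query-context snippets rather than full field content. -->
  <bool name="carrot.produceSummary">true</bool>
  <!-- Illustrative values; omit them to fall back to hl.fragsize / hl.snippets. -->
  <int name="carrot.fragSize">150</int>
  <int name="carrot.summarySnippets">1</int>
</lst>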
Logical to Document Field Mapping
As already mentioned in Preliminary Concepts, the clustering component clusters "documents" consisting of logical parts that need to be mapped onto the physical schema of the data stored in Solr. The field mapping attributes provide a connection between fields and logical document parts. Note that the content of the title and snippet fields must be stored so that it can be retrieved at search time.
Parameter Description
carrot.title The field (alternatively a comma- or space-separated list of fields) that should be mapped to the logical document's title. The clustering algorithms typically give more weight to the content of the title field compared to the content (snippet). For best results, the field should contain concise, noise-free content. If there is no clear title in your data, you can leave this parameter blank.
carrot.snippet The field (alternatively a comma- or space-separated list of fields) that should be mapped to the logical document's main content. If this mapping points to very large content fields, the performance of clustering may drop significantly. An alternative then is to use query-context snippets for clustering instead of full field content. See the description of the carrot.produceSummary parameter for details.
carrot.url The field that should be mapped to the logical document's content URL. Leave blank if not required.
Clustering Multilingual Content
The field mapping specification can include a carrot.lang parameter, which defines the field that stores the ISO 639-1 code of the language in which the title and content of the document are written. This information can be stored in the index based on a priori knowledge of the documents' source, or by a language detection filter applied at indexing time. All algorithms inside the Carrot2 framework will accept ISO codes of languages defined in the LanguageCode enum.
The language hint makes it easier for clustering algorithms to separate documents from different languages on input
and to pick the right language resources for clustering. If you do have multi-lingual query results (or query results in
a language different than English), it is strongly advised to map the language field appropriately.
Parameter Description
carrot.lang The field that stores the ISO 639-1 code of the language of the document's text fields.
carrot.lcmap A mapping of arbitrary strings into ISO 639 two-letter codes used by carrot.lang. The syntax of this parameter is the same as langid.map.lcmap, for example: langid.map.lcmap=japanese:ja polish:pl english:en
The default language can also be set using Carrot2-specific algorithm attributes (in this case the MultilingualClustering.defaultLanguage attribute).
Tweaking Algorithm Settings
The algorithms that come with Solr use their default settings, which may be inadequate for some data sets. All algorithms have lexical resources and settings (stop words, stemmers, parameters) that may require tweaking to get better clusters (and cluster labels). For Carrot2-based algorithms it is probably best to refer to a dedicated tuning application called Carrot2 Workbench. From this application one can export a set of algorithm attributes as an XML file, which can then be placed under the location pointed to by carrot.resourcesDir.
Providing Defaults
The default attributes for all engines (algorithms) declared in the clustering component are placed under carrot.resourcesDir, with an expected file name of engineName-attributes.xml. So for an engine named lingo and the default value of carrot.resourcesDir, the attributes would be read from a file in conf/clustering/carrot2/lingo-attributes.xml.
An example XML file changing the default language of documents to Polish is shown below.
<attribute-sets default="attributes">
<attribute-set id="attributes">
<value-set>
<label>attributes</label>
<attribute key="MultilingualClustering.defaultLanguage">
<value type="org.carrot2.core.LanguageCode" value="POLISH"/>
</attribute>
</value-set>
</attribute-set>
</attribute-sets>
Tweaking at Query-Time
The clustering component and Carrot2 clustering algorithms can accept query-time attribute overrides. Note that
certain things (for example lexical resources) can only be initialized once (at startup, via the XML configuration files).
An example query that changes the LingoClusteringAlgorithm.desiredClusterCountBase parameter for the Lingo algorithm: http://localhost:8983/solr/clustering?q=*:*&rows=100&LingoClusteringAlgorithm.desiredClusterCountBase=20
Performance Considerations
Dynamic clustering of search results comes with two major performance penalties:
Increased cost of fetching a larger-than-usual number of search results (50, 100 or more documents),
Additional computational cost of the clustering itself.
For simple queries, the clustering time will usually dominate the fetch time. If the document content is very long the
retrieval of stored content can become a bottleneck. The performance impact of clustering can be lowered in several
ways:
feed less content to the clustering algorithm by enabling the carrot.produceSummary attribute,
perform clustering on selected fields (titles only) to make the input smaller,
use a faster algorithm (STC instead of Lingo, Lingo3G instead of STC),
tune the performance attributes related directly to a specific algorithm.
Some of these techniques are described in the Apache SOLR and Carrot2 integration strategies document, available at http://carrot2.github.io/solr-integration-strategies. The topic of improving performance is also covered in the Carrot2 manual at http://doc.carrot2.org/#section.advanced-topics.fine-tuning.performance.
Additional Resources
The following resources provide additional information about the clustering component in Solr and its potential
applications.
Apache Solr and Carrot2 integration strategies: http://carrot2.github.io/solr-integration-strategies
Apache Solr Wiki (covers previous Solr versions, may be inaccurate): http://carrot2.github.io/solr-integration-strategies
Clustering and Visualization of Solr search results (video from Berlin BuzzWords conference, 2011): http://vimeo.com/26616444
Spatial Search
Solr supports location data for use in spatial/geospatial searches. Using spatial search, you can:
Index points or other shapes
Filter search results by a bounding box or circle or by other shapes
Sort or boost scoring by distance between points, or relative area between rectangles
Index and search multi-value time or other numeric durations
With Solr 4, there are three field types for spatial search: LatLonType (or its non-geodetic twin PointType),
SpatialRecursivePrefixTreeFieldType (RPT for short), and BBoxField. LatLonType was the first spatial field,
introduced in Solr 3; the others have been added since. RPT offers more features than LatLonType and fast filter
performance, although LatLonType is still more appropriate when efficient distance sorting/boosting is desired. They
can both be used simultaneously for what each does best – LatLonType for sorting/boosting, RPT for filtering.
BBoxField is for indexing bounding boxes, querying by a box, specifying a search predicate
(Intersects,IsWithin,Contains,DisjointTo), and a relevancy sort/boost like overlapRatio or simply the area.
For more information on Solr spatial search, see http://wiki.apache.org/solr/SpatialSearch.
Indexing and Configuration
For indexing geodetic points (latitude and longitude), supply the pair of numbers as a string with a comma
separating them in latitude then longitude order. For non-geodetic points, the order is x,y for PointType, and for RPT
you must use a space instead of a comma, or use WKT.
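For example, assuming a geodetic field named store and a non-geodetic RPT field named pt_xy (both field names are illustrative), indexed point values might look like this:
<doc>
  <field name="id">EXAMPLE-1</field>
  <!-- Geodetic point: latitude,longitude -->
  <field name="store">45.15,-93.85</field>
  <!-- Non-geodetic RPT point: "x y" (space-separated), or WKT such as POINT(10 20) -->
  <field name="pt_xy">10 20</field>
</doc>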
See the SpatialRecursivePrefixTreeFieldType section below for RPT configuration specifics.
Spatial Filters
The following parameters are used for spatial search:
Parameter Description
d the radial distance, in kilometers (always; even for RPT field with units=degrees)
pt the center point using the format "lat,lon" if latitude & longitude. Otherwise, "x,y" for PointType or "x
y" for RPT field types.
sfield a spatial indexed field
geofilt
The geofilt filter allows you to retrieve results based on the geospatial distance (AKA the "great circle distance") from a given point. Another way of looking at it is that it creates a circular shape filter. For example, to find all documents within five kilometers of a given lat/lon point, you could enter &q=*:*&fq={!geofilt sfield=store}&pt=45.15,-93.85&d=5. This filter returns all results within a circle of the given radius around the initial point:
bbox
The bbox filter is very similar to geofilt except it uses the bounding box of the calculated circle. See the blue box in the diagram below. It takes the same parameters as geofilt. Here's a sample query: &q=*:*&fq={!bbox sfield=store}&pt=45.15,-93.85&d=5. The rectangular shape is faster to compute and so it's sometimes used as an alternative to geofilt when it's acceptable to return points outside of the radius. However, if the ideal goal is a circle but you want it to run faster, then instead consider using the RPT field and try a large "distErrPct" value like 0.1 (10% radius). This will return results outside the radius, but it will do so somewhat uniformly around the shape.
Filtering by an arbitrary rectangle
Sometimes the spatial search requirement calls for finding everything in a rectangular area, such as the area covered by a map the user is looking at. For this case, geofilt and bbox won't cut it. This is somewhat of a trick, but you can use Solr's range query syntax for this by supplying the lower-left corner as the start of the range and the upper-right corner as the end of the range. Here's an example: &q=*:*&fq=store:[45,-94 TO 46,-93].
LatLonType does not support rectangles that cross the dateline, but RPT does. If you are using RPT with non-geospatial coordinates (geo="false") then you must quote the points due to the space, e.g. "x y".
Optimization: Solr Post Filtering
Most likely, the fastest spatial filters will come from simply using the RPT field type. However, sometimes it may be faster to use LatLonType with Solr post filtering, in circumstances when both the spatial query isn't worth caching and there aren't many documents that match the non-spatial filters (e.g. keyword queries and other filters). To use Solr post filtering with LatLonType, use the bbox or geofilt query parsers in a filter query but specify cache=false and cost=100 (or greater) as local-params. Here's a short example:
&q=...mykeywords...&fq=...someotherfilters...&fq={!geofilt cache=false
cost=100}&sfield=store&pt=45.15,-93.85&d=5
Distance Function Queries
There are four distance function queries: geodist, see below, usually the most appropriate; dist, to calculate the p-norm distance between multi-dimensional vectors; hsin, to calculate the distance between two points on a sphere; and sqedist, to calculate the squared Euclidean distance between two points. For more information about these function queries, see the section on Function Queries.
geodist
When a bounding box includes a pole, the bounding box ends up being a "bounding bowl" (a spherical cap) that includes all values north of the lowest latitude of the circle if it touches the north pole (or south of the highest latitude if it touches the south pole).
geodist is a distance function that takes three optional parameters: (sfield,latitude,longitude). You can use the geodist function to sort results by distance or score returned results.
For example, to sort your results by ascending distance, enter ...&q=*:*&fq={!geofilt}&sfield=store&pt=45.15,-93.85&d=50&sort=geodist asc.
To return the distance as the document score, enter ...&q={!func}geodist()&sfield=store&pt=45.15,-93.85&sort=score+asc.
More Examples
Here are a few more useful examples of what you can do with spatial search in Solr.
Use as a Sub-Query to Expand Search Results
Here we will query for results in Jacksonville, Florida, or within 50 kilometers of 45.15,-93.85 (near Buffalo,
Minnesota):
&q=*:*&fq=(state:"FL" AND city:"Jacksonville") OR
{!geofilt}&sfield=store&pt=45.15,-93.85&d=50&sort=geodist()+asc
Facet by Distance
To facet by distance, you can use the Frange query parser:
&q=*:*&sfield=store&pt=45.15,-93.85&facet.query={!frange l=0
u=5}geodist()&facet.query={!frange l=5.001 u=3000}geodist()
There are other ways to do it too, like using a {!geofilt} in each facet.query.
Boost Nearest Results
Using the DisMax or Extended DisMax query parsers, you can combine spatial search with the boost function to boost the nearest results:
&q.alt=*:*&fq={!geofilt}&sfield=store&pt=45.15,-93.85&d=50&bf=recip(geodist(),2,200,2
0)&sort=score desc
SpatialRecursivePrefixTreeFieldType (abbreviated as RPT)
Solr 4's new spatial field offers several new features and improvements over LatLonType:
Query by polygons and other complex shapes, in addition to circles & rectangles
Multi-valued indexed fields
Ability to index non-point shapes (e.g. polygons) as well as point shapes
Rectangles with user-specified corners that can cross the dateline
Multi-value distance sort and score boosting (warning: non-optimized)
Well-Known-Text (WKT) shape syntax (required for specifying polygons & other complex shapes)
RPT incorporates the basic features of LatLonType and PointType, such as lat-lon bounding boxes and circles. In
fact you can (and should) use geofilt, bbox, geodist, and a range-query with it (which wasn't so when RPT was first
introduced in Solr 4.0).
Schema configuration
To use RPT, the field type must be registered and configured in schema.xml. There are many options for this field type.
Setting Description
name The name of the field type.
class This should be solr.SpatialRecursivePrefixTreeFieldType. But be aware that the Lucene spatial module includes some other so-called "spatial strategies" other than RPT, notably TermQueryPT*, BBox, PointVector*, and SerializedDV. Solr requires a field type to parallel these in order to use them. The asterisked ones have them.
spatialContextFactory If polygons or linestrings are required, then the JTS Topology Suite is needed to implement them. It's a JAR file that you need to put on Solr's classpath (but not via the standard solrconfig.xml mechanisms). If you intend to use those shapes, set this attribute to com.spatial4j.core.context.jts.JtsSpatialContextFactory. Furthermore, the context factory has its own options which are directly configurable on the Solr field type here; follow the link to the Javadocs, and remember to look at the superclass's options in SpatialContextFactory as well. One option in particular you should most likely enable is autoIndex (i.e. use PreparedGeometry), as it's been shown to be a major performance boost for polygons. Further details about specifying polygons to index or query are at Solr's Wiki linked below.
units This is required, and currently can only be "degrees". It doesn't apply to geofilt, bbox, or
geodist (which all use kilometers); it applies to maxDistErr and if you configure the query
itself to return the distance.
distErrPct Defines the default precision of non-point shapes (both index & query), as a fraction
between 0.0 (fully precise) to 0.5. The closer this number is to zero, the more accurate
the shape will be. However, more precise indexed shapes use more disk space and take
longer to index. Bigger distErrPct values will make queries faster but less accurate.
maxDistErr Defines the highest level of detail required for indexed data. If left blank, the default is
one meter – just a bit less than 0.000009 degrees. This setting is used internally to
compute an appropriate maxLevels (see below).
geo If true, the default, latitude and longitude coordinates will be used and the mathematical model will generally be a sphere. If false, the coordinates will be generic X & Y on a 2D plane using Euclidean/Cartesian geometry.
worldBounds Defines the valid numerical ranges for x and y, in the format of ENVELOPE(minX, maxX, maxY, minY). If geo="true", the standard lat-lon world boundaries are assumed. If geo=false, you should define your boundaries.
distCalculator Defines the distance calculation algorithm. If geo=true, "haversine" is the default. If geo=false, "cartesian" will be the default. Other possible values are "lawOfCosines", "vincentySphere" and "cartesian^2".
prefixTree Defines the spatial grid implementation. Since a PrefixTree (such as RecursivePrefixTree) maps the world as a grid, each grid cell is decomposed to another set of grid cells at the next level. If geo=true, the default prefix tree is "geohash"; otherwise it's "quad". Geohash has 32 children at each level, quad has 4. Geohash cannot be used for geo=false as it's strictly geospatial.
maxLevels Sets the maximum grid depth for indexed data. Instead of specifying this directly, it's usually more intuitive to compute an appropriate maxLevels by specifying maxDistErr.
<fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType"
spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
autoIndex="true"
distErrPct="0.025"
maxDistErr="0.000009"
units="degrees" />
Once the field type has been defined, define a field that uses it.
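For example, a field declaration might look like this (the field name is illustrative):
<field name="geo_rpt" type="location_rpt" indexed="true" stored="true" multiValued="true" />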
Because RPT has more advanced features, some of which are new and experimental, please review the Solr Wiki at http://wiki.apache.org/solr/SolrAdaptersForLuceneSpatial4 for more information about using this field type.
BBoxField
The BBoxField field type indexes a single rectangle (bounding box) per document field and supports searching via a bounding box. It supports most spatial search predicates, and it has enhanced relevancy modes based on the overlap or area between the search rectangle and the indexed rectangle. It's particularly useful for its relevancy modes. To configure it in the schema, use a configuration like this:
<field name="bbox" type="bbox" />
<fieldType name="bbox" class="solr.BBoxField"
geo="true" units="degrees" numberType="_bbox_coord" />
<fieldType name="_bbox_coord" class="solr.TrieDoubleField" precisionStep="8"
docValues="true" stored="false"/>
BBoxField is actually based on 4 instances of another field type referred to by numberType. It also uses a boolean to flag a dateline cross. Assuming you want to use the relevancy feature, docValues is required. Some of the attributes are in common with the RPT field, like geo, units, worldBounds, and spatialContextFactory, because they share some of the same spatial infrastructure.
To index a box, add a field value to a bbox field that's a string in the WKT/CQL ENVELOPE syntax. Example: ENVELOPE(-10, 20, 15, 10), which is in minX, maxX, maxY, minY order. The parameter ordering is unintuitive, but that's what the spec calls for.
To search, you can use the {!bbox} query parser, or the range syntax e.g. [10,-10 TO 15,20], or the ENVELOPE syntax wrapped in parentheses with a leading search predicate. The latter is the only way to choose a predicate other than Intersects. For example:
&q={!field f=bbox}Contains(ENVELOPE(-10, 20, 15, 10))
Now to sort the results by one of the relevancy modes, use it like this:
&q={!field f=bbox score=overlapRatio}Intersects(ENVELOPE(-10, 20, 15, 10))
The score local-param can be one of overlapRatio, area, and area2D. area scores by the document area using surface-of-a-sphere (assuming geo=true) math, while area2D uses simple width * height. overlapRatio computes a [0-1] ranged score based on how much overlap exists relative to the document's area and the query area. The javadocs of BBoxOverlapRatioValueSource have more info on the formula, if you're really curious. There is an additional parameter queryTargetProportion that allows you to weight the query side of the formula relative to the index (target) side of the formula. You can also use &debug=results to see useful score computation info.
The Terms Component
The Terms Component provides access to the indexed terms in a field and the number of documents that match
each term. This can be useful for building an auto-suggest feature or any other feature that operates at the term
level instead of the search or document level. Retrieving terms in index order is very fast since the implementation
directly uses Lucene's TermEnum to iterate over the term dictionary.
In a sense, this search component provides fast field-faceting over the whole index, not restricted by the base query
or any filters. The document frequencies returned are the number of documents that match the term, including any
documents that have been marked for deletion but not yet removed from the index.
Configuring the Terms Component
By default, the Terms Component is already configured in solrconfig.xml for each collection.
Defining the Terms Component
Defining the Terms search component is straightforward: simply give it a name and use the class solr.TermsComponent.
<searchComponent name="terms" class="solr.TermsComponent"/>
This makes the component available for use, but by itself it will not be usable until included with a request handler.
Using the Terms Component in a Request Handler
The /terms request handler is also defined in solrconfig.xml by default.
<requestHandler name="/terms" class="solr.SearchHandler" startup="lazy">
<lst name="defaults">
<bool name="terms">true</bool>
<bool name="distrib">false</bool>
</lst>
<arr name="components">
<str>terms</str>
</arr>
</requestHandler>
Note that the defaults for this request handler set the parameter "terms" to true, which allows terms to be returned on request. The parameter "distrib" is set to false, which allows this handler to be used only on a single Solr core. To finish out the configuration, the Terms Component is included as an available component to this request handler.
You could add this component to another handler if you wanted to, and pass "terms=true" in the HTTP request in
order to get terms back. If it is only defined in a separate handler, you must use that handler when querying in order
to get terms and not regular documents as results.
Terms Component Parameters
The parameters below allow you to control what terms are returned. You can also add any of these to the request handler if you'd like to set them permanently. Or, you can add them to the query request. These parameters are:
Parameter Required Default Description
terms No false If set to true, enables the Terms Component. By default, the Terms
Component is off.
Example: terms=true
terms.fl Yes null Specifies the field from which to retrieve terms.
Example: terms.fl=title
terms.limit No 10 Specifies the maximum number of terms to return. The default is 10. If the limit is set to a number less than 0, then no maximum limit is enforced. Although this is not required, either this parameter or terms.upper must be defined.
Example: terms.limit=20
terms.lower No empty
string
Specifies the term at which to start. If not specified, the empty string is
used, causing Solr to start at the beginning of the field.
Example: terms.lower=orange
terms.lower.incl No true If set to true, includes the lower-bound term (specified with terms.lower) in the result set.
Example: terms.lower.incl=false
terms.mincount No null Specifies the minimum document frequency to return in order for a term
to be included in a query response. Results are inclusive of the
mincount (that is, >= mincount).
Example: terms.mincount=5
terms.maxcount No null Specifies the maximum document frequency a term must have in order
to be included in a query response. The default setting is -1, which sets
no upper bound. Results are inclusive of the maxcount (that is, <=
maxcount).
Example: terms.maxcount=25
terms.prefix No null Restricts matches to terms that begin with the specified string.
Example: terms.prefix=inter
terms.raw No false If set to true, returns the raw characters of the indexed term, regardless
of whether it is human-readable. For instance, the indexed form of
numeric numbers is not human-readable.
Example: terms.raw=true
terms.regex No null Restricts matches to terms that match the regular expression.
Example: terms.regex=*pedist
terms.regex.flag No null Defines a Java regex flag to use when evaluating the regular expression defined with terms.regex. See http://docs.oracle.com/javase/tutorial/essential/regex/pattern.html for details of each flag. Valid options are:
case_insensitive
comments
multiline
literal
dotall
unicode_case
canon_eq
unix_lines
Example: terms.regex.flag=case_insensitive
terms.sort No count Defines how to sort the terms returned. Valid options are count, which sorts by the term frequency, with the highest term frequency first, or index, which sorts in index order.
Example: terms.sort=index
terms.upper No null Specifies the term to stop at. Although this parameter is not required, either this parameter or terms.limit must be defined.
Example: terms.upper=plum
terms.upper.incl No false If set to true, the upper bound term is included in the result set. The
default is false.
Example: terms.upper.incl=true
The output is a list of the terms and their document frequency values. See below for examples.
Examples
The following examples use the sample Solr configuration located in the <Solr>/example directory and the sample documents in the exampledocs directory.
Get Top 10 Terms
This query requests the first ten terms in the name field: http://localhost:8983/solr/terms?terms.fl=name
Results:
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">2</int>
</lst>
<lst name="terms">
<lst name="name">
<int name="one">5</int>
<int name="184">3</int>
<int name="1gb">3</int>
<int name="3200">3</int>
<int name="400">3</int>
<int name="ddr">3</int>
<int name="gb">3</int>
<int name="ipod">3</int>
<int name="memory">3</int>
<int name="pc">3</int>
</lst>
</lst>
</response>
Get First 10 Terms Starting with Letter 'a'
This query requests the first ten terms in the name field, in index order (instead of the top 10 results by document
count): http://localhost:8983/solr/terms?terms.fl=name&terms.lower=a&terms.sort=index
Results:
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
</lst>
<lst name="terms">
<lst name="name">
<int name="a">1</int>
<int name="all">1</int>
<int name="apple">1</int>
<int name="asus">1</int>
<int name="ata">1</int>
<int name="ati">1</int>
<int name="belkin">1</int>
<int name="black">1</int>
<int name="british">1</int>
<int name="cable">1</int>
</lst>
</lst>
</response>
Using the Terms Component for an Auto-Suggest Feature
If the Suggester doesn't suit your needs, you can use the Terms component in Solr to build a similar feature for your own search application. Simply submit a query specifying whatever characters the user has typed so far as a prefix. For example, if the user has typed "at", the search engine's interface would submit the following query:
http://localhost:8983/solr/terms?terms.fl=name&terms.prefix=at
Result:
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
</lst>
<lst name="terms">
<lst name="name">
<int name="ata">1</int>
<int name="ati">1</int>
</lst>
</lst>
</response>
You can use the omitHeader=true parameter to omit the response header from the query response, as in this example, which also returns the response in JSON format: http://localhost:8983/solr/terms?terms.fl=name&terms.prefix=at&indent=true&wt=json&omitHeader=true
Result:
{
"terms": {
"name": [
"ata",
1,
"ati",
1
]
}
}
Distributed Search Support
The TermsComponent also supports distributed indexes. For the /terms request handler, you must provide the following two parameters:
Parameter Description
shards Specifies the shards in your distributed indexing configuration. For more information about distributed indexing, see Distributed Search with Index Sharding.
shards.qt Specifies the request handler Solr uses for requests to shards.
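A distributed request might look like this (the shard addresses are illustrative):
http://localhost:8983/solr/terms?terms.fl=name&shards=localhost:8983/solr,localhost:7574/solr&shards.qt=/terms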
More Resources
TermsComponent wiki page
TermsComponent javadoc
The Term Vector Component
The TermVectorComponent is a search component designed to return additional information about documents
matching your search.
For each document in the response, the TermVectorComponent can return the term vector, the term frequency, inverse document frequency, position, and offset information.
Configuration
The TermVectorComponent is not enabled implicitly in Solr; it must be explicitly configured in your solrconfig.xml file.
To enable this component, you need to configure it using a searchComponent element:
<searchComponent name="tvComponent"
class="org.apache.solr.handler.component.TermVectorComponent"/>
A request handler must then be configured to use this component name. In this example, the component is associated with a special request handler named /tvrh, which enables term vectors by default using the tv=true parameter; but you can associate it with any request handler:
<requestHandler name="/tvrh" class="org.apache.solr.handler.component.SearchHandler">
<lst name="defaults">
<bool name="tv">true</bool>
</lst>
<arr name="last-components">
<str>tvComponent</str>
</arr>
</requestHandler>
Once your handler is defined, you may use it to fetch term vectors for any fields configured with the termVectors attribute in your schema.xml, for example:
<field name="includes"
type="text"
indexed="true"
stored="true"
multiValued="true"
termVectors="true"
termPositions="true"
termOffsets="true" />
Invoking the Term Vector Component
The example below shows an invocation of this component using the above configuration:
http://localhost:8983/solr/collection1/tvrh?q=*%3A*&start=0&rows=10&fl=id,includes
...
<lst name="termVectors">
<lst name="warnings">
<arr name="noTermVectors">
<str>id</str>
</arr>
</lst>
<lst name="doc-5">
<str name="uniqueKey">MA147LL/A</str>
<lst name="includes">
<lst name="cabl"/>
<lst name="earbud"/>
<lst name="headphon"/>
<lst name="usb"/>
</lst>
</lst>
<str name="uniqueKeyFieldName">id</str>
<lst name="doc-9">
<str name="uniqueKey">3007WFP</str>
<lst name="includes">
<lst name="cabl"/>
<lst name="usb"/>
</lst>
</lst>
<str name="uniqueKeyFieldName">id</str>
<lst name="doc-12">
<str name="uniqueKey">9885A004</str>
<lst name="includes">
<lst name="32"/>
<lst name="av"/>
<lst name="batteri"/>
<lst name="cabl"/>
<lst name="card"/>
<lst name="mb"/>
<lst name="sd"/>
<lst name="usb"/>
</lst>
</lst>
<str name="uniqueKeyFieldName">id</str>
</lst>
Request Parameters
The example below shows the available request parameters for this component:
http://localhost:8983/solr/collection1/tvrh?q=*%3A*&version=2.2&start=0&rows=10&indent=on&qt=tvrh&tv=true&tv.tf=true&tv.df=true&tv.positions&tv.offsets=true
Parameter Description Type
tv Should the component run or not boolean
tv.docIds Returns term vectors for the specified list of Lucene document IDs (not the Solr Unique Key). comma-separated integers
tv.fl Returns term vectors for the specified list of fields. If not specified, the fl parameter is used. comma-separated list of field names
tv.all A shortcut that invokes all the boolean parameters listed below. boolean
tv.df Returns the Document Frequency (DF) of the term in the collection. This can be
computationally expensive.
boolean
tv.offsets Returns offset information for each term in the document. boolean
tv.positions Returns position information. boolean
tv.tf Returns document term frequency info per term in the document. boolean
tv.tf_idf Calculates TF*IDF for each term. Requires the parameters tv.tf and tv.df to be "true". This can be computationally expensive. (The results are not shown in example output.) boolean
To learn more about TermVector component output, see the Wiki page: http://wiki.apache.org/solr/TermVectorComponentExampleOptions
For schema requirements, see the Wiki page: http://wiki.apache.org/solr/FieldOptionsByUseCase
SolrJ and the Term Vector Component
Neither the SolrQuery class nor the QueryResponse class offers specific method calls to set Term Vector Component parameters or get the "termVectors" output. However, there is a patch for it: SOLR-949.
The Stats Component
The Stats component returns simple statistics for numeric, string, and date fields within the document set.
Stats Component Parameters
The Stats Component accepts the following parameters:
Parameter Description
stats If true, then invokes the Stats component.
stats.field Specifies a field for which statistics should be generated. This parameter may be invoked
multiple times in a query in order to request statistics on multiple fields. (See the example
below.)
stats.facet Returns sub-results for values within the specified facet.
stats.calcdistinct If true, distinct values will be calculated and returned as "countDistinct" and "distinctValues" in the response. This calculation may be expensive for some fields, so it is false by default. If you'd only like to return distinct values for specific fields, you can also specify f.<field>.stats.calcdistinct, replacing <field> with your field name, to limit the distinct value calculation to the required field.
Statistics Returned
The table below describes the statistics returned by the Stats component.
Name Description
min The minimum value in the field.
max The maximum value in the field.
sum The sum of all values in the field.
count The number of non-null values in the field.
missing The number of null values in the field.
sumOfSquares Sum of all values squared (useful for stddev).
mean The average: (v1 + v2 + ... + vN)/N
stddev Standard deviation, measuring how widely spread the values in the data set are.
distinctValues Displays the distinct values in a field.
countDistinct The number of distinct values in a field.
Example
The query below, which includes calculating distinct values, would produce results like the ones shown below.
http://localhost:8983/solr/select?q=*:*&stats=true&stats.field=price&stats.field=popularity&stats.calcdistinct=true&rows=0&indent=true
<lst name="stats">
<lst name="stats_fields">
<lst name="price">
<double name="min">0.0</double>
<double name="max">2199.0</double>
<long name="count">16</long>
<long name="missing">16</long>
<arr name="distinctValues">
<float>0.0</float>
<float>11.5</float>
<float>19.95</float>
<float>74.99</float>
<float>92.0</float>
<float>179.99</float>
<float>185.0</float>
<float>279.95</float>
<float>329.95</float>
<float>350.0</float>
<float>399.0</float>
<float>479.95</float>
<float>649.99</float>
<float>2199.0</float>
</arr>
<long name="countDistinct">14</long>
<double name="sum">5251.270030975342</double>
<double name="sumOfSquares">6038619.175900028</double>
<double name="mean">328.20437693595886</double>
<double name="stddev">536.3536996709846</double>
<lst name="facets" />
</lst>
<lst name="popularity">
<double name="min">0.0</double>
<double name="max">10.0</double>
<long name="count">15</long>
<long name="missing">17</long>
<arr name="distinctValues">
<int>0</int>
<int>1</int>
<int>5</int>
<int>6</int>
<int>7</int>
<int>10</int>
</arr>
<long name="countDistinct">6</long>
<double name="sum">85.0</double>
<double name="sumOfSquares">603.0</double>
<double name="mean">5.666666666666667</double>
<double name="stddev">2.943920288775949</double>
<lst name="facets" />
</lst>
</lst>
</lst>
Here is a similar request with faceting requested for the field inStock, using the parameter &stats.facet=inStock. In this example, we have not requested distinct values to be calculated.
http://localhost:8983/solr/select?q=*:*&stats=true&stats.field=price&stats.field=popularity&stats.facet=inStock&rows=0&indent=true
<lst name="stats">
<lst name="stats_fields">
<lst name="price">
<double name="min">0.0</double>
<double name="max">2199.0</double>
<long name="count">16</long>
<long name="missing">16</long>
<double name="sum">5251.270030975342</double>
<double name="sumOfSquares">6038619.175900028</double>
<double name="mean">328.20437693595886</double>
<double name="stddev">536.3536996709846</double>
<lst name="facets">
<lst name="inStock">
<lst>
<double name="min">Infinity</double>
<double name="max">-Infinity</double>
<long name="count">0</long>
<long name="missing">11</long>
<double name="sum">0.0</double>
<double name="sumOfSquares">0.0</double>
<double name="mean">NaN</double>
<double name="stddev">0.0</double>
<lst name="facets" />
</lst>
<lst name="false">
<double name="min">11.5</double>
<double name="max">649.989990234375</double>
<long name="count">4</long>
<long name="missing">0</long>
<double name="sum">1161.3900032043457</double>
<double name="sumOfSquares">653369.2541528536</double>
<double name="mean">290.3475008010864</double>
<double name="stddev">324.63444532124953</double>
<lst name="facets" />
</lst>
<lst name="true">
<double name="min">0.0</double>
<double name="max">2199.0</double>
<long name="count">12</long>
<long name="missing">5</long>
<double name="sum">4089.880027770996</double>
<double name="sumOfSquares">5385249.921747174</double>
<double name="mean">340.823335647583</double>
<double name="stddev">602.3683083752779</double>
<lst name="facets" />
</lst>
</lst>
</lst>
</lst>
<lst name="popularity">
<double name="min">0.0</double>
<double name="max">10.0</double>
<long name="count">15</long>
<long name="missing">17</long>
<double name="sum">85.0</double>
<double name="sumOfSquares">603.0</double>
<double name="mean">5.666666666666667</double>
<double name="stddev">2.943920288775949</double>
<lst name="facets">
<lst name="inStock">
<lst>
<double name="min">Infinity</double>
<double name="max">-Infinity</double>
<long name="count">0</long>
<long name="missing">11</long>
<double name="sum">0.0</double>
<double name="sumOfSquares">0.0</double>
<double name="mean">NaN</double>
<double name="stddev">0.0</double>
<lst name="facets" />
</lst>
<lst name="false">
<double name="min">1.0</double>
<double name="max">7.0</double>
<long name="count">4</long>
<long name="missing">0</long>
<double name="sum">16.0</double>
<double name="sumOfSquares">100.0</double>
<double name="mean">4.0</double>
<double name="stddev">3.4641016151377544</double>
<lst name="facets" />
</lst>
<lst name="true">
<double name="min">0.0</double>
<double name="max">10.0</double>
<long name="count">11</long>
<long name="missing">6</long>
<double name="sum">69.0</double>
<double name="sumOfSquares">503.0</double>
<double name="mean">6.2727272727272725</double>
<double name="stddev">2.6491851234260353</double>
<lst name="facets" />
</lst>
</lst>
</lst>
</lst>
</lst>
</lst>
Local Parameters
Similar to the Facet Component, the stats.field parameter supports local parameters for:
Tagging & Excluding Filters: stats.field={!ex=filterA}price
Changing the Output Key: stats.field={!key=my_price_stats}price
Example
Here we compute stats for the price field - once including the filter on the inStock field, and once excluding it:
http://localhost:8983/solr/select?q=*:*&fq={!tag=stock_check}inStock:true&stats=true&stats.field={!ex=stock_check+key=instock_prices}price&stats.field={!key=all_prices}price&rows=0&indent=true
<lst name="stats">
<lst name="stats_fields">
<lst name="instock_prices">
<double name="min">0.0</double>
<double name="max">2199.0</double>
<long name="count">16</long>
<long name="missing">16</long>
<double name="sum">5251.270030975342</double>
<double name="sumOfSquares">6038619.175900028</double>
<double name="mean">328.20437693595886</double>
<double name="stddev">536.3536996709846</double>
<lst name="facets"/>
</lst>
<lst name="all_prices">
<double name="min">0.0</double>
<double name="max">2199.0</double>
<long name="count">12</long>
<long name="missing">5</long>
<double name="sum">4089.880027770996</double>
<double name="sumOfSquares">5385249.921747174</double>
<double name="mean">340.823335647583</double>
<double name="stddev">602.3683083752779</double>
<lst name="facets"/>
</lst>
</lst>
</lst>
The Stats Component and Faceting
The facet field can be selectively applied. That is, if you want stats on fields "A" and "B", you can facet A on "X" and B on "Y" using the parameters:
&stats.field=A&f.A.stats.facet=X&stats.field=B&f.B.stats.facet=Y
Multi-valued fields and facets may be slow.
All facet results are returned, so be careful what fields you ask for.
Multi-valued fields rely on UnInvertedField.java for implementation. This is like a FieldCache, so be aware of your memory footprint.
The Query Elevation Component
The Query Elevation Component lets you configure the top results for a given query regardless of the normal Lucene scoring. This is sometimes called "sponsored search," "editorial boosting," or "best bets." This component matches the user query text to a configured map of top results. The text can be any string or non-string IDs, as long as it's indexed. Although this component will work with any QueryParser, it makes the most sense to use it with DisMax or eDisMax.
The Query Elevation Component is supported by distributed searching.
Configuring the Query Elevation Component
You can configure the Query Elevation Component in the solrconfig.xml file. The default configuration looks like this:
<searchComponent name="elevator" class="solr.QueryElevationComponent" >
<!-- pick a fieldType to analyze queries -->
<str name="queryFieldType">string</str>
<str name="config-file">elevate.xml</str>
</searchComponent>
<requestHandler name="/elevate" class="solr.SearchHandler" startup="lazy">
<lst name="defaults">
<str name="echoParams">explicit</str>
</lst>
<arr name="last-components">
<str>elevator</str>
</arr>
</requestHandler>
Optionally, in the Query Elevation Component configuration you can also specify the following to distinguish editorial
results from "normal" results:
<str name="editorialMarkerFieldName">foo</str>
The Query Elevation Search Component takes the following arguments:
Argument Description
queryFieldType Specifies which fieldType should be used to analyze the incoming text. For example, it may be appropriate to use a fieldType with a LowerCaseFilter.
config-file Path to the file that defines query elevation. This file must exist in <instanceDir>/conf/<config-file> or <dataDir>/<config-file>. If the file exists in the /conf/ directory it will be loaded once at startup. If it exists in the data directory, it will be reloaded for each IndexReader.
forceElevation By default, this component respects the requested sort parameter: if the request asks to sort by date, it will order the results by date. If forceElevation=true (the default is false), results will first return the boosted docs, then order by date.
elevate.xml
Elevated query results are configured in an external XML file specified in the config-file argument. An elevate.xml file might look like this:
<elevate>
<query text="AAA">
<doc id="A" />
<doc id="B" />
</query>
<query text="ipod">
<doc id="A" />
<!-- you can optionally exclude documents from a query result -->
<doc id="B" exclude="true" />
</query>
</elevate>
In this example, the query "AAA" would first return documents A and B, then whatever normally appears for the
same query. For the query "ipod", it would first return A, and would make sure that B is not in the result set.
Using the Query Elevation Component
The enableElevation Parameter
For debugging it may be useful to see results with and without the elevated docs. To hide the elevated results, use enableElevation=false:
http://localhost:8983/solr/elevate?q=YYYY&debugQuery=true&enableElevation=true
http://localhost:8983/solr/elevate?q=YYYY&debugQuery=true&enableElevation=false
The forceElevation Parameter
You can force elevation during runtime by adding forceElevation=true to the query URL:
http://localhost:8983/solr/elevate?q=YYYY&debugQuery=true&enableElevation=true&forceElevation=true
The exclusive Parameter
You can force Solr to return only the results specified in the elevation file by adding exclusive=true to the URL:
http://localhost:8983/solr/elevate?q=YYYY&debugQuery=true&exclusive=true
Document Transformers and the markExcludes Parameter
The [elevated] Document Transformer can be used to annotate each document with information about whether or not it was elevated:
http://localhost:8983/solr/elevate?q=YYYY&fl=id,[elevated]
Likewise, it can be helpful when troubleshooting to see all matching documents, including documents that the elevation configuration would normally exclude. This is possible by using the markExcludes=true parameter, and then using the [excluded] transformer:
http://localhost:8983/solr/elevate?q=YYYY&markExcludes=true&fl=id,[elevated],[excluded]
The elevateIds and excludeIds Parameters
When the elevation component is in use, the pre-configured list of elevations for a query can be overridden at
request time to use the unique keys specified in these request parameters.
For example, in the request below documents A and B will be elevated, and document C will be excluded --
regardless of what elevations or exclusions are configured for the query YYYY in elevate.xml:
http://localhost:8983/solr/elevate?q=YYYY&excludeIds=C&elevateIds=A,B
If either one of these parameters is specified at request time, the entire elevation configuration for the query is ignored.
For example, in the request below documents A and B will be elevated, and no documents will be excluded –
regardless of what elevations or exclusions are configured for the query YYYY in elevate.xml:
http://localhost:8983/solr/elevate?q=YYYY&elevateIds=A,B
The fq Parameter
Query elevation respects the standard filter query (fq) parameter. That is, if the query contains the fq parameter, all results will be within that filter even if elevate.xml adds other documents to the result set.
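For example, the following request elevates documents for the query YYYY but only returns those that also match the filter (inStock is just an illustrative field from the example data):
http://localhost:8983/solr/elevate?q=YYYY&fq=inStock:true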
Response Writers
A Response Writer generates the formatted response of a search. Solr supports a variety of Response Writers to
ensure that query responses can be parsed by the appropriate language or application.
The wt parameter selects the Response Writer to be used. The table below lists the most common settings for the wt parameter.
wt Parameter Setting Response Writer Selected
csv CSVResponseWriter
json JSONResponseWriter
php PHPResponseWriter
phps PHPSerializedResponseWriter
python PythonResponseWriter
ruby RubyResponseWriter
velocity VelocityResponseWriter
xml XMLResponseWriter
xslt XSLTResponseWriter
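For example, the following two requests return the same results rendered as JSON and CSV respectively (illustrative requests against the example configuration):
http://localhost:8983/solr/select?q=ipod&wt=json
http://localhost:8983/solr/select?q=ipod&wt=csv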
The Standard XML Response Writer
The XML Response Writer is the most general purpose and reusable Response Writer currently included with Solr.
It is the format used in most discussions and documentation about the response of Solr queries.
Note that the XSLT Response Writer can be used to convert the XML produced by this writer to other vocabularies
or text-based formats.
The behavior of the XML Response Writer can be driven by the following query parameters.
The version Parameter
The version parameter determines the XML protocol used in the response. Clients are strongly encouraged to always specify the protocol version, so as to ensure that the format of the response they receive does not change unexpectedly when the Solr server is upgraded.
XML Version Notes Comments
2.0 An <arr> tag was used for multiValued fields only if there was more than one value. Not supported in Solr 4.
2.1 An <arr> tag is used for multiValued fields even if there is only one value. Not supported in Solr 4.
2.2 The format of the responseHeader changed to use the same <lst> structure as the rest of the response. Supported in Solr 4.
The default value is the latest supported.
The stylesheet Parameter
The stylesheet parameter can be used to direct Solr to include a <?xml-stylesheet type="text/xsl" href="..."?> declaration in the XML response it returns.
The default behavior is not to return any stylesheet declaration at all.
Use of the stylesheet parameter is discouraged, as there is currently no way to specify external stylesheets, and no stylesheets are provided in the Solr distributions. This is a legacy parameter, which may be developed further in a future release.
The indent Parameter
If the indent parameter is used, and has a non-blank value, then Solr will make some attempts at indenting its XML response to make it more readable by humans.
The default behavior is not to indent.
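For example, a request that pins the protocol version and asks for indented XML output might look like this (an illustrative request):
http://localhost:8983/solr/select?q=ipod&wt=xml&version=2.2&indent=true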
The XSLT Response Writer
The XSLT Response Writer applies an XML stylesheet to output. It can be used for tasks such as formatting results
for an RSS feed.
tr Parameter
The XSLT Response Writer accepts one parameter: the tr parameter, which identifies the XML transformation to use. The transformation must be found in the Solr conf/xslt directory.
The Content-Type of the response is set according to the <xsl:output> statement in the XSLT transform, for example: <xsl:output media-type="text/html"/>
Configuration
The example below, from the default solrconfig.xml file, shows how the XSLT Response Writer is configured.
<!--
Changes to XSLT transforms are taken into account
every xsltCacheLifetimeSeconds at most.
-->
<queryResponseWriter name="xslt"
class="org.apache.solr.request.XSLTResponseWriter">
<int name="xsltCacheLifetimeSeconds">5</int>
</queryResponseWriter>
A value of 5 for xsltCacheLifetimeSeconds is good for development, to see XSLT changes quickly. For production you probably want a much higher value.
JSON Response Writer
A very commonly used Response Writer is the JsonResponseWriter, which formats output in JavaScript Object Notation (JSON), a lightweight data interchange format specified in RFC 4627. Setting the wt parameter to json invokes this Response Writer.
With Solr 4, the JsonResponseWriter has been changed:
The default mime type for the writer is now application/json.
The example solrconfig.xml has been updated to explicitly use this parameter to set the type to text/plain:
<queryResponseWriter name="json" class="solr.JSONResponseWriter">
<!-- For the purposes of the tutorial, JSON response are written as
plain text so that it's easy to read in *any* browser.
If you are building applications that consume JSON, just remove
this override to get the default "application/json" mime type.
-->
<str name="content-type">text/plain</str>
</queryResponseWriter>
Python Response Writer
Solr has an optional Python response format that extends its JSON output in the following ways to allow the
response to be safely evaluated by the python interpreter:
true and false changed to True and False
Python unicode strings are used where needed
ASCII output (with unicode escapes) is used for less error-prone interoperability
newlines are escaped
null changed to None
PHP Response Writer and PHP Serialized Response Writer
Solr has a PHP response format that outputs an array (as PHP code) which can be evaluated. Setting the wt parameter to php invokes the PHP Response Writer.
Example usage:
$code = file_get_contents('http://localhost:8983/solr/select?q=iPod&wt=php');
eval('$result = ' . $code . ';');  // single quotes so $result is not interpolated before eval
print_r($result);
Solr also includes a PHP Serialized Response Writer that formats output in a serialized array. Setting the wt parameter to phps invokes the PHP Serialized Response Writer.
Example usage:
$serializedResult = file_get_contents('http://localhost:8983/solr/select?q=iPod&wt=phps');
$result = unserialize($serializedResult);
print_r($result);
Before you use either the PHP or Serialized PHP Response Writer, you may first need to un-comment these two lines in solrconfig.xml:
<queryResponseWriter name="php" class="org.apache.solr.request.PHPResponseWriter"/>
<queryResponseWriter name="phps"
class="org.apache.solr.request.PHPSerializedResponseWriter"/>
Ruby Response Writer
Solr has an optional Ruby response format that extends its JSON output in the following ways to allow the response
to be safely evaluated by Ruby's interpreter:
Ruby's single quoted strings are used to prevent possible string exploits.
\ and ' are the only two characters escaped.
Unicode escapes are not used. Data is written as raw UTF-8.
nil used for null.
=> is used as the key/value separator in maps.
Here is a simple example of how one may query Solr using the Ruby response format:
require 'net/http'
h = Net::HTTP.new('localhost', 8983)
hresp, data = h.get('/solr/select?q=iPod&wt=ruby', nil)
rsp = eval(data)
puts 'number of matches = ' + rsp['response']['numFound'].to_s
#print out the name field for each returned document
rsp['response']['docs'].each { |doc| puts 'name field = ' + doc['name'] }
CSV Response Writer
The CSV response writer returns a list of documents in comma-separated values (CSV) format. Other information
that would normally be included in a response, such as facet information, is excluded.
The CSV response writer supports multi-valued fields, and the output of this CSV format is compatible with Solr's CSV update format. As of Solr 4.3, it can also support pseudo-fields.
CSV Parameters
These parameters specify the CSV format that will be returned. You can accept the default values or specify your
own.
Parameter Default Value
csv.encapsulator "
csv.escape None
csv.separator ,
csv.header Defaults to true. If false, Solr does not print the column headers
csv.newline \n
csv.null Defaults to a zero length string. Use this parameter when a document has no value for a
particular field.
Multi-Valued Field CSV Parameters
These parameters specify how multi-valued fields are encoded. Per-field overrides for these values can be done using f.<fieldname>.csv.separator=|.
Parameter Default Value
csv.mv.encapsulator None
csv.mv.escape \
csv.mv.separator Defaults to the csv.separator value
Example
http://localhost:8983/solr/select?q=ipod&fl=id,cat,name,popularity,price,score&wt=csv
returns:
id,cat,name,popularity,price,score
IW-02,"electronics,connector",iPod & iPod Mini USB 2.0 Cable,1,11.5,0.98867977
F8V7067-APL-KIT,"electronics,connector",Belkin Mobile Power Cord for iPod w/
Dock,1,19.95,0.6523595
MA147LL/A,"electronics,music",Apple 60 GB iPod with Video Playback
Black,10,399.0,0.2446348
Velocity Response Writer
The VelocityResponseWriter (also known as Solritas) is an optional plugin available in the contrib/velocity directory. It is used to power the Velocity Search UI in the example configuration.
Its jar and dependencies must be added (via <lib> or solr/home lib inclusion), and must be registered in
solrconfig.xml like this:
<queryResponseWriter name="velocity" class="solr.VelocityResponseWriter"/>
For more information about the Velocity Response Writer, see https://wiki.apache.org/solr/VelocityResponseWriter.
Binary Response Writer
Solr also includes a Response Writer that outputs binary format for use with a Java client. See Client APIs for more details.
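In the example configuration this writer is typically registered under the name javabin, so a request such as the following (illustrative query) returns the binary format; the response is not human-readable and is intended for consumption by a Java client such as SolrJ:
http://localhost:8983/solr/select?q=iPod&wt=javabin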
Near Real Time Searching
Near Real Time (NRT) search means that documents are available for search almost immediately after being
indexed: additions and updates to documents are seen in 'near' real time. Solr 4 no longer blocks updates while a
commit is in progress. Nor does it wait for background merges to complete before opening a new search of indexes
and returning.
With NRT, you can modify a commit command to be a soft commit, which avoids parts of a standard commit that can be costly. You will still want to do standard commits to ensure that documents are in stable storage, but soft commits let you see a very near real time view of the index in the meantime. However, pay special attention to cache and autowarm settings as they can have a significant impact on NRT performance.
Commits and Optimizing
A commit operation makes index changes visible to new search requests. A hard commit uses the transaction log to get the id of the latest document changes, and also calls fsync on the index files to ensure they have been flushed to stable storage and no data loss will result from a power failure.
A soft commit is much faster since it only makes index changes visible and does not fsync index files or write a new index descriptor. If the JVM crashes or there is a loss of power, changes that occurred after the last hard commit will be lost. Search collections that have NRT requirements (that want index changes to be quickly visible to searches) will want to soft commit often but hard commit less frequently. A softCommit may be "less expensive" in terms of time, but not free, since it can slow throughput.
An optimize is like a hard commit except that it forces all of the index segments to be merged into a single segment first. Depending on the use, this operation should be performed infrequently (e.g., nightly), if at all, since it involves reading and re-writing the entire index. Segments are normally merged over time anyway (as determined by the merge policy), and optimize just forces these merges to occur immediately.
Soft commit takes two parameters: maxDocs and maxTime.
Parameter Description
maxDocs Integer. Defines the number of documents to queue before pushing them to the index. It works in conjunction with the update_handler_autosoftcommit_max_time parameter in that if either limit is reached, the documents will be pushed to the index.
maxTime The number of milliseconds to wait before pushing documents to the index. It works in conjunction with the update_handler_autosoftcommit_max_docs parameter in that if either limit is reached, the documents will be pushed to the index.
Use maxDocs and maxTime judiciously to fine-tune your commit strategies.
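For example, a sketch of an autoSoftCommit configuration using both parameters together might look like the following (the values shown are only illustrative):
<autoSoftCommit>
<maxDocs>10000</maxDocs>
<maxTime>1000</maxTime>
</autoSoftCommit>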
AutoCommits
An autocommit also uses the parameters maxDocs and maxTime. However it's useful in many strategies to use both a hard autocommit and autosoftcommit to achieve more flexible commits.
A common configuration is to do a hard autocommit every 1-10 minutes and an autosoftcommit every second. With this configuration, new documents will show up within about a second of being added, and if the power goes out, soft commits are lost unless a hard commit has been done.
For example:
<autoSoftCommit>
<maxTime>1000</maxTime>
</autoSoftCommit>
It's better to use maxTime rather than maxDocs to modify an autoSoftCommit, especially when indexing a large number of documents through the commit operation. It's also better to turn off autoSoftCommit for bulk indexing.
Optional Attributes for commit and optimize
Parameter Valid Attributes Description
waitSearcher true, false Block until a new searcher is opened and registered as the main query searcher, making the changes visible. Default is true.
softCommit true, false Perform a soft commit. This will refresh the view of the index faster, but without guarantees that the document is stably stored. Default is false.
expungeDeletes true, false Valid for commit only. This parameter purges deleted data from segments. The default is false.
maxSegments = N integer Valid for optimize only. Optimize down to at most this number of segments. The default is 1.
Example of commit and optimize with optional attributes:
<commit waitSearcher="false"/>
<commit waitSearcher="false" expungeDeletes="true"/>
<optimize waitSearcher="false"/>
Passing commit and commitWithin parameters as part of the URL
Update handlers can also get commit-related parameters as part of the update URL. This example adds a small test document and causes an explicit commit to happen immediately afterwards:
http://localhost:8983/solr/update?stream.body=<add><doc>
<field name="id">testdoc</field></doc></add>&commit=true
Alternately, you may want to use this:
http://localhost:8983/solr/update?stream.body=<optimize/>
This example causes the index to be optimized down to at most 10 segments, but won't wait around until it's done (waitFlush=false):
curl 'http://localhost:8983/solr/update?optimize=true&maxSegments=10&waitFlush=false'
This example adds a small test document with a commitWithin instruction that tells Solr to make sure the document is committed no later than 10 seconds later (this method is generally preferred over explicit commits):
curl http://localhost:8983/solr/update?commitWithin=10000
-H "Content-Type: text/xml" --data-binary
'<add><doc><field name="id">testdoc</field></doc></add>'
Changing default commitWithin Behavior
The commitWithin settings allow forcing document commits to happen in a defined time period. This is used most frequently with Near Real Time Searching, and for that reason the default is to perform a soft commit. This does not, however, replicate new documents to slave servers in a master/slave environment. If that's a requirement for your implementation, you can force a hard commit by adding a parameter, as in this example:
<commitWithin>
<softCommit>false</softCommit>
</commitWithin>
With this configuration, when you call commitWithin as part of your update message, it will automatically perform a hard commit every time.
RealTime Get
For index updates to be visible (searchable), some kind of commit must reopen a searcher to a new point-in-time view of the index. The realtime get feature allows retrieval (by unique-key) of the latest version of any document without the associated cost of reopening a searcher. This is primarily useful when using Solr as a NoSQL data store and not just a search index.
Realtime Get currently relies on the update log feature, which is enabled by default. The update log is configured in solrconfig.xml, in a section like:
<updateLog>
<str name="dir">${solr.ulog.dir:}</str>
</updateLog>
The latest example solrconfig.xml should also have a request handler named /get already defined, like the following:
<requestHandler name="/get" class="solr.RealTimeGetHandler">
<lst name="defaults">
<str name="omitHeader">true</str>
<str name="wt">json</str>
<str name="indent">true</str>
</lst>
</requestHandler>
Start (or restart) the Solr server, and then index a document:
curl 'http://localhost:8983/solr/update/json?commitWithin=10000000'
-H 'Content-type:application/json' -d '[{"id":"mydoc","title":"realtime-get
test!"}]'
If you do a normal search, this document should not be found:
http://localhost:8983/solr/select?q=id:mydoc
...
"response":
{"numFound":0,"start":0,"docs":[]}
However, if you use the realtime get handler exposed at /get, you should be able to retrieve that document:
http://localhost:8983/solr/get?id=mydoc
...
{"doc":{"id":"mydoc","title":"realtime-get test!"]}}
You can also specify multiple documents at once via the ids parameter and a comma separated list of ids, or by using multiple id parameters. If you specify multiple ids, or use the ids parameter, the response will mimic a normal query response to make it easier for existing clients to parse. Since you've only indexed one document, the following equivalent examples just repeat the same id.
http://localhost:8983/solr/get?ids=mydoc,mydoc
http://localhost:8983/solr/get?id=mydoc&id=mydoc
...
{"response":
{"numFound":2,"start":0,"docs":
[ { "id":"mydoc", "title":["realtime-get test!"]},
{ "id":"mydoc", "title":["realtime-get test!"]}]
}
}
Do NOT disable the realtime get handler at /get if you are using SolrCloud; otherwise, any leader election will cause a full sync in ALL replicas for the shard in question. Similarly, a replica recovery will also always fetch the complete index from the leader because a partial sync will not be possible in the absence of this handler.
Exporting Result Sets
Starting with Solr 4.10, it's possible to allow users to export fully sorted result sets. This functionality is specifically designed to handle scenarios that involve sorting and exporting millions of records. It uses a stream sorting technique that begins to send records within milliseconds and continues to stream results until the entire result set has been sorted and exported.
The cases where this functionality may be useful include: session analysis, distributed merge joins, time series roll-ups, aggregations on high cardinality fields, fully distributed field collapsing, and sort based stats.
Field Requirements
All the fields being sorted and exported must have docValues set to true. For more information, see the section on DocValues.
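As a sketch, a sortable string field with docValues enabled could be declared in schema.xml like this (the field name is illustrative):
<field name="fieldA" type="string" indexed="true" stored="true" docValues="true"/>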
You can choose between the different docValues formats to trade off memory usage and performance. The fastest
is likely to be the “Direct” doc values format as it is uncompressed and fully in-memory. The initial tests were
performed with the default Lucene410 docValues format and the “Direct” doc values format.
Defining the /export Request Handler
To export the full sorted result set you use the new /export request handler.
This request handler is included in the example solrconfig.xml, and if you use that as the basis for your own new Solr implementation you already have it configured. If, however, you would like to add it to your existing solrconfig.xml, you can add a section like this:
<requestHandler name="/export" class="solr.SearchHandler">
<lst name="invariants">
<str name="rq">{!xport}</str>
<str name="wt">xsort</str>
<str name="distrib">false</str>
</lst>
<arr name="components">
<str>query</str>
</arr>
</requestHandler>
Note that this request handler's properties are defined as "invariants", which means they cannot be overridden by
other properties passed at another time (such as at query time).
Requesting Results Export
Once the /export request handler is defined, you can use it to make requests to export the result set of a query.
All queries must include sort and fl parameters, or the query will return an error. Filter queries are also supported. Results are always returned in JSON format.
The basic syntax is as follows:
http://localhost:8983/solr/export?q=my-query&sort=fieldA desc,fieldB
desc&fl=fieldA,fieldB,fieldC
Specifying the Sort Criteria
The sort property defines how documents will be sorted in the exported result set. Results can be sorted by any field that has a field type of int, long, float, double, or string. The sort fields must be single valued fields.
Up to four sort fields can be specified per request, with the 'asc' or 'desc' properties.
Specifying the Field List
The fl property defines the fields that will be exported with the result set. Any of the field types that can be sorted (i.e., int, long, float, double, string) can be used in the field list. The fields can be single or multi-valued. However, returning scores and wildcards is not supported at this time.
Distributed Support
The initial release treats all queries as non-distributed requests. So the client is responsible for making the calls to each Solr instance and merging the results.
Using SolrJ’s CloudSolrServer as a model, developers could build clients that automatically send requests to all the
shards in a collection (or multiple collections) and then merge the sorted sets any way they wish.
The Well-Configured Solr Instance
This section tells you how to fine-tune your Solr instance for optimum performance. This section covers the following
topics:
Configuring solrconfig.xml: Describes how to work with the main configuration file for Solr, solrconfig.xml, covering the major sections of the file.
Solr Cores and solr.xml: Describes how to work with solr.xml and core.properties to configure your Solr core, or multiple Solr cores within a single instance.
Solr Plugins: Introduces Solr plugins with pointers to more information.
JVM Settings: Gives some guidance on best practices for working with Java Virtual Machines.
Configuring solrconfig.xml
The solrconfig.xml file is the configuration file with the most parameters affecting Solr itself. While configuring Solr, you'll work with solrconfig.xml often. The file comprises a series of XML statements that set configuration values. In solrconfig.xml, you configure important features such as:
request handlers
listeners (processes that "listen" for particular query-related events; listeners can be used to trigger the
execution of special code, such as invoking some common queries to warm-up caches)
the Request Dispatcher for managing HTTP communications
the Admin Web interface
parameters related to replication and duplication (these parameters are covered in detail in Legacy Scaling and Distribution)
The solrconfig.xml file is found in the solr/conf/ directory. The example file is well-commented, and includes information on best practices for most installations.
We've covered the options in the following sections:
DataDir and DirectoryFactory in SolrConfig
Lib Directives in SolrConfig
Managed Schema Definition in SolrConfig
IndexConfig in SolrConfig
UpdateHandlers in SolrConfig
Query Settings in SolrConfig
RequestDispatcher in SolrConfig
RequestHandlers and SearchComponents in SolrConfig
Substituting Properties in Solr Config Files
Solr supports variable substitution of property values in config files, which allows runtime specification of various configuration options in solrconfig.xml. The syntax is ${propertyname[:option default value]}. This allows defining a default that can be overridden when Solr is launched. If a default value is not specified, then the property must be specified at runtime or the configuration file will generate an error when parsed.
The focus of this section is generally on configuring a single Solr instance, but for those interested in scaling a Solr implementation in a cluster environment, see also the section SolrCloud. There are also options to scale through sharding or replication, described in the section Legacy Scaling and Distribution.
There are multiple methods for specifying properties that can be used in configuration files.
JVM System Properties
Any JVM system properties, usually specified using the -D flag when starting the JVM, can be used as variables in any XML configuration file in Solr.
For example, in the example solrconfig.xml, you will see this value which defines the locking type to use:
<lockType>${solr.lock.type:native}</lockType>
This means the lock type defaults to "native", but when starting Solr's example application, you could override it by launching the JVM with:
java -Dsolr.lock.type=simple -jar start.jar
solrcore.properties
If the configuration directory for a Solr core contains a file named solrcore.properties, that file can contain any arbitrary user defined property names and values using the Java standard properties file format, and those properties can be used as variables in the XML configuration files for that Solr core.
For example, the following solrcore.properties file could be created in the solr/collection1/conf directory of the Solr example configuration, to specify the lockType used.
#conf/solrcore.properties
lock.type=simple
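With that property in place, a reference in that core's solrconfig.xml such as the following (a sketch reusing the lockType element shown earlier; the fallback after the colon is illustrative) would resolve to simple:
<lockType>${lock.type:native}</lockType>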
User defined properties from core.properties
If you are using the newer core discovery style solr.xml, such that each Solr core has a core.properties file, then any user defined properties may be specified in that file, and those properties will be available for substitution when parsing XML configuration files for that Solr core.
For example, consider the following file:core.properties
#core.properties
name=collection2
my.custom.prop=edismax
the my.custom.prop property can be used as a variable, like so...
The path and name of the solrcore.properties file can be overridden using the properties property in core.properties.
<requestHandler name="/select">
<lst name="defaults">
<str name="defType">${my.custom.prop}</str>
</lst>
</requestHandler>
User defined properties from the Legacy solr.xml Format
Similar to the core.properties option above, user defined properties may be specified in the legacy solr.xml format. Please see the "User Defined Properties in solr.xml" section of the Legacy solr.xml Configuration documentation for more details.
Implicit Core Properties
Several attributes of a Solr core are available as "implicit" properties that can be used in variable substitution, independent of where or how the underlying value is initialized. For example: regardless of whether the name for a particular Solr core is explicitly configured in core.properties or inferred from the name of the instance directory, the implicit property solr.core.name is available for use as a variable in that core's configuration file...
<requestHandler name="/select">
<lst name="defaults">
<str name="collection_name">${solr.core.name}</str>
</lst>
</requestHandler>
All implicit properties use the solr.core. name prefix, and reflect the runtime value of the equivalent core.properties property:
solr.core.name
solr.core.config
solr.core.schema
solr.core.dataDir
solr.core.transient
solr.core.loadOnStartup
More Information
The Solr Wiki has a comprehensive page on solrconfig.xml, at http://wiki.apache.org/solr/SolrConfigXml.
6 Sins of solrconfig.xml modifications from solr.pl.
DataDir and DirectoryFactory in SolrConfig
Specifying a Location for Index Data with the dataDir Parameter
By default, Solr stores its index data in a directory called /data under the Solr home. If you would like to specify a different directory for storing index data, use the <dataDir> parameter in the solrconfig.xml file. You can specify another directory either with a full pathname or a pathname relative to the current working directory of the servlet container. For example:
<dataDir>/var/data/solr/</dataDir>
If you are using replication to replicate the Solr index (as described in Legacy Scaling and Distribution), then the <dataDir> directory should correspond to the index directory used in the replication configuration.
Specifying the DirectoryFactory For Your Index
The default solr.StandardDirectoryFactory is filesystem based, and tries to pick the best implementation for the current JVM and platform. You can force a particular implementation by specifying solr.MMapDirectoryFactory, solr.NIOFSDirectoryFactory, or solr.SimpleFSDirectoryFactory.
<directoryFactory name="DirectoryFactory"
class="${solr.directoryFactory:solr.StandardDirectoryFactory}"/>
The solr.RAMDirectoryFactory is memory based, not persistent, and does not work with replication. Use this DirectoryFactory to store your index in RAM.
<directoryFactory class="org.apache.solr.core.RAMDirectoryFactory"/>
Lib Directives in SolrConfig
Solr allows loading plugins by defining <lib/> directives in solrconfig.xml.
The plugins are loaded in the order they appear in solrconfig.xml. If there are dependencies, list the lowest level dependency jar first.
Regular expressions can be used to provide control over loading jars with dependencies on other jars in the same directory. All directories are resolved as relative to the Solr instanceDir.
<lib dir="../../../contrib/extraction/lib" regex=".*\.jar" />
<lib dir="../../../dist/" regex="solr-cell-\d.*\.jar" />
<lib dir="../../../contrib/clustering/lib/" regex=".*\.jar" />
<lib dir="../../../dist/" regex="solr-clustering-\d.*\.jar" />
<lib dir="../../../contrib/langid/lib/" regex=".*\.jar" />
<lib dir="../../../dist/" regex="solr-langid-\d.*\.jar" />
<lib dir="../../../contrib/velocity/lib" regex=".*\.jar" />
<lib dir="../../../dist/" regex="solr-velocity-\d.*\.jar" />
Managed Schema Definition in SolrConfig
The Schema API enables schema modifications through a REST interface. (Read-only access to all schema elements is also supported.)
There are challenges with allowing programmatic access to a configuration file that is also open to manual edits:
system-generated and manual edits may overlap and the system-generated edits may remove comments or other
customizations that are critical for the organization to understand why fields, field types, etc., are defined the way
they are. You may want to version the file with source control, or limit manual edits altogether.
solrconfig.xml allows the Solr schema to be defined as a "managed index schema": schema modification is only possible through the Schema API.
From the example solrconfig.xml:
<!-- To enable dynamic schema REST APIs, use the following for <schemaFactory>:

  <schemaFactory class="ManagedIndexSchemaFactory">
    <bool name="mutable">true</bool>
    <str name="managedSchemaResourceName">managed-schema</str>
  </schemaFactory>

  When ManagedIndexSchemaFactory is specified, Solr will load the schema from
  the resource named in 'managedSchemaResourceName', rather than from schema.xml.
  Note that the managed schema resource CANNOT be named schema.xml. If the managed
  schema does not exist, Solr will create it after reading schema.xml, then rename
  'schema.xml' to 'schema.xml.bak'.

  Do NOT hand edit the managed schema - external modifications will be ignored
  and overwritten as a result of schema modification REST API calls.

  When ManagedIndexSchemaFactory is specified with mutable = true, schema
  modification REST API calls will be allowed; otherwise, error responses will be
  sent back for these requests.
-->
<schemaFactory class="ClassicIndexSchemaFactory"/>
In the example above, solrconfig.xml is actually configured to use the ClassicIndexSchemaFactory, which treats the schema.xml file the same as it always has been treated: it can be edited manually. This setting disallows Schema API methods that modify the schema.
In the commented-out sample, however, you can see configuration for the managed schema. In order for schema modifications to be possible via the Schema API, the ManagedIndexSchemaFactory will need to be used. The mutable parameter must also be set to true. The managedSchemaResourceName, which defaults to "managed-schema", may also be defined, and can be anything other than "schema.xml". Once Solr is restarted, the existing schema.xml file is renamed to schema.xml.bak and the contents are written to a file with the name defined as the managedSchemaResourceName. If you look at the resulting file, you'll see this at the top of the page:
<!-- Solr managed schema - automatically generated - DO NOT EDIT -->
Note that the Schemaless Mode example at example/example-schemaless/ uses the ManagedIndexSchemaFactory to allow automatic schema field additions based on document updates' field values.
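As a quick sketch of the read-only access mentioned above (assuming the default collection1 core from the example configuration), the currently defined fields can be listed with:
curl http://localhost:8983/solr/collection1/schema/fields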
IndexConfig in SolrConfig
The <indexConfig> section of solrconfig.xml defines low-level behavior of the Lucene index writers. By default, the settings are commented out in the sample solrconfig.xml included with Solr, which means the defaults are used. In most cases, the defaults are fine.
<indexConfig>
...
</indexConfig>
Parameters covered in this section:
Sizing Index Segments
Merging Index Segments
Index Locks
Other Indexing Settings
Sizing Index Segments
ramBufferSizeMB
Once accumulated document updates exceed this much memory space (defined in megabytes), then the pending updates are flushed. This can also create new segments or trigger a merge. Using this setting is generally preferable to maxBufferedDocs. If both maxBufferedDocs and ramBufferSizeMB are set in solrconfig.xml, then a flush will occur when either limit is reached. The default is 100Mb (raised from 32Mb for Solr 4.1).
<ramBufferSizeMB>100</ramBufferSizeMB>
maxBufferedDocs
Sets the number of document updates to buffer in memory before they are flushed as a new segment. This may also trigger a merge. The default Solr configuration sets to flush by RAM usage (ramBufferSizeMB).
<maxBufferedDocs>1000</maxBufferedDocs>
maxIndexingThreads
The maximum number of simultaneous threads used to index documents. Once this threshold is reached, additional
threads will wait for the others to finish. The default is 8. This parameter is new for Solr 4.1.
<maxIndexingThreads>8</maxIndexingThreads>
Prior to Solr 4, many of these settings were contained in sections called mainIndex and indexDefaults. In Solr 4, those sections are deprecated and removed. Any settings that used to be in those sections now belong in <indexConfig>.
useCompoundFile
Setting <useCompoundFile> to true combines the various files of a segment into a single file, although the default is false. On systems where the number of open files allowed per process is limited, setting this to false may avoid hitting that limit (the open files limit might also be tunable for your OS with the Linux/Unix ulimit command, or something similar for other operating systems). In some cases, other internal factors may set a segment to "compound=false", even if this setting is explicitly set to true, so the compounding of the files in a segment may not always happen.
Updating a compound index may incur a minor performance hit for various reasons, depending on the runtime
environment. For example, filesystem buffers are typically associated with open file descriptors, which may limit the
total cache space available to each index.
This setting may also affect how much data needs to be transferred during index replication operations.
The default is false.
<useCompoundFile>false</useCompoundFile>
Merging Index Segments
mergeFactor
The mergeFactor controls how many segments a Lucene index is allowed to have before it is coalesced into one segment. When an update is made to an index, it is added to the most recently opened segment. When that segment fills up (see maxBufferedDocs and ramBufferSizeMB above), a new segment is created and subsequent updates are placed there.
If creating a new segment would cause the number of lowest-level segments to exceed the mergeFactor value, then all those segments are merged together to form a single large segment. Thus, if the merge factor is ten, each merge results in the creation of a single segment that is roughly ten times larger than each of its ten constituents. When there are mergeFactor settings for these larger segments, then they in turn are merged into an even larger single segment. This process can continue indefinitely.
Choosing the best merge factor is generally a trade-off of indexing speed vs. searching speed. Having fewer segments in the index generally accelerates searches, because there are fewer places to look. It can also result in fewer physical files on disk. But to keep the number of segments low, merges will occur more often, which can add load to the system and slow down updates to the index.
Conversely, keeping more segments can accelerate indexing, because merges happen less often, making an update less likely to trigger a merge. But searches become more computationally expensive and will likely be slower, because search terms must be looked up in more index segments. Faster index updates also mean shorter commit turnaround times, which means more timely search results.
The default value in the example solrconfig.xml is 10, which is a reasonable starting point.
<mergeFactor>10</mergeFactor>
mergePolicy
Defines how merging segments is done. The default in Solr is TieredMergePolicy, which merges segments of approximately equal size, subject to an allowed number of segments per tier. Other policies available are the LogByteSizeMergePolicy and LogDocMergePolicy. For more information on these policies, please see the MergePolicy javadocs.
<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
<int name="maxMergeAtOnce">10</int>
<int name="segmentsPerTier">10</int>
</mergePolicy>
mergeScheduler
The merge scheduler controls how merges are performed. The default ConcurrentMergeScheduler performs merges in the background using separate threads. The alternative, SerialMergeScheduler, does not perform merges with separate threads.
<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
mergedSegmentWarmer
When using Solr for Near Real Time Searching, a merged segment warmer can be configured to warm the reader on the newly merged segment, before the merge commits. This is not required for near real-time search, but will reduce search latency on opening a new near real-time reader after a merge completes.
<mergedSegmentWarmer class="org.apache.lucene.index.SimpleMergedSegmentWarmer"/>
checkIntegrityAtMerge
If set to true, any actions that result in merging segments will first trigger an integrity check using checksums stored in the index segments (if available). If the checksums are not correct, the merge will fail and throw an Exception. (The default is false for backwards compatibility.)
<checkIntegrityAtMerge>true</checkIntegrityAtMerge>
Index Locks
lockType
The LockFactory options specify its implementation.
lockType=single uses SingleInstanceLockFactory, and is for a read-only index or when there is no possibility of another process trying to modify the index.
lockType=native uses NativeFSLockFactory to specify native OS file locking. Do not use when multiple Solr web applications in the same JVM are attempting to share a single index.
lockType=simple uses SimpleFSLockFactory to specify a plain file for locking.
native is the default for Solr 3.6 and later versions; otherwise simple is the default.
For more information on the nuances of each LockFactory, see http://wiki.apache.org/lucene-java/AvailableLockFactories.
<lockType>native</lockType>
unlockOnStartup
If true, any write or commit locks that have been held will be unlocked on system startup. This defeats the locking mechanism that allows multiple processes to safely access a Lucene index. The default is false, and changing this should only be done with care. This parameter is not used if the lockType is "none" or "single".
<unlockOnStartup>false</unlockOnStartup>
writeLockTimeout
The maximum time to wait for a write lock on an IndexWriter. The default is 1000, expressed in milliseconds.
<writeLockTimeout>1000</writeLockTimeout>
Other Indexing Settings
There are a few other parameters that may be important to configure for your implementation. These settings affect
how or when updates are made to an index.
Setting Description
termIndexInterval Controls how often terms are loaded into memory. The default is 128.
reopenReaders Controls if IndexReaders will be re-opened, instead of closed and then opened, which is often less efficient. The default is true.
deletionPolicy Controls how commits are retained in case of rollback. The default is SolrDeletionPolicy, which has sub-parameters for the maximum number of commits to keep (maxCommitsToKeep), the maximum number of optimized commits to keep (maxOptimizedCommitsToKeep), and the maximum age of any commit to keep (maxCommitAge), which supports DateMathParser syntax.
infoStream The InfoStream setting instructs the underlying Lucene classes to write detailed debug information from the indexing process as Solr log messages.
<termIndexInterval>128</termIndexInterval>
<reopenReaders>true</reopenReaders>
<deletionPolicy class="solr.SolrDeletionPolicy">
<str name="maxCommitsToKeep">1</str>
<str name="maxOptimizedCommitsToKeep">0</str>
<str name="maxCommitAge">1DAY</str>
</deletionPolicy>
<infoStream>false</infoStream>
UpdateHandlers in SolrConfig
The settings in this section are configured in the <updateHandler> element in solrconfig.xml and may affect the performance of index updates. These settings affect how updates are done internally. <updateHandler> configurations do not affect the higher level configuration of RequestHandlers that process client update requests.
<updateHandler class="solr.DirectUpdateHandler2">
...
</updateHandler>
The maxFieldLength parameter was removed in Solr 4. If restricting the length of fields is important to you, you can get similar behavior with the LimitTokenCountFilterFactory, which can be defined for the fields you'd like to limit. For example, <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="10000"/> would limit the field to 10,000 tokens.
Topics covered in this section:
Commits
commit and softCommit
autoCommit
commitWithin
maxPendingDeletes
Event Listeners
Transaction Log
Commits
Data sent to Solr is not searchable until it has been committed to the index. The reason for this is that in some cases commits can be slow and they should be done in isolation from other possible commit requests to avoid overwriting data. So, it's preferable to provide control over when data is committed. Several options are available to control the timing of commits.
commit and softCommit
With Solr 4, commit is generally used only as a boolean flag sent with a client update request. The command commit=true would perform a commit as soon as the data is finished loading to Solr.
You can also set the softCommit=true flag to do a 'soft' commit, meaning that Solr will commit your changes quickly but not guarantee that documents are in stable storage. This is an implementation of Near Real Time storage, a feature that boosts document visibility, since you don't have to wait for background merges and storage (to ZooKeeper, if using SolrCloud) to finish before moving on to something else. A full commit means that, if a server crashes, Solr will know exactly where your data was stored; a soft commit means that the data is stored, but the location information isn't yet stored. The tradeoff is that a soft commit gives you faster visibility because it's not waiting for background merges to finish.
For more information about Near Real Time operations, see Near Real Time Searching.
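For example, a soft commit can be requested as part of an update in the same way as a standard commit; this sketch (the document contents are illustrative) adds a document and soft commits it:
curl 'http://localhost:8983/solr/update?softCommit=true' -H 'Content-Type: text/xml'
--data-binary '<add><doc><field name="id">testdoc</field></doc></add>'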
autoCommit
These settings control how often pending updates will be automatically pushed to the index. An alternative to autoCommit is to use commitWithin, which can be defined when making the update request to Solr (i.e., when pushing documents), or in an update RequestHandler.
Setting Description
maxDocs The number of updates that have occurred since the last commit.
maxTime The number of milliseconds since the oldest uncommitted update.
openSearcher Whether to open a new searcher when performing a commit. If this is false, the default, the commit will flush recent index changes to stable storage, but does not cause a new searcher to be opened to make those changes visible.
If either of these maxDocs or maxTime limits are reached, Solr automatically performs a commit operation. If the autoCommit tag is missing, then only explicit commits will update the index. The decision whether to use auto-commit or not depends on the needs of your application.
Determining the best auto-commit settings is a tradeoff between performance and accuracy. Settings that cause
frequent updates will improve the accuracy of searches because new content will be searchable more quickly, but
performance may suffer because of the frequent updates. Less frequent updates may improve performance but it
will take longer for updates to show up in queries.
<autoCommit>
<maxDocs>10000</maxDocs>
<maxTime>1000</maxTime>
<openSearcher>false</openSearcher>
</autoCommit>
You can also specify 'soft' autoCommits in the same way that you can specify 'soft' commits, except that instead of using autoCommit you set the autoSoftCommit tag.
<autoSoftCommit>
<maxTime>1000</maxTime>
</autoSoftCommit>
commitWithin
The commitWithin settings allow forcing document commits to happen in a defined time period. This is used most frequently with Near Real Time Searching, and for that reason the default is to perform a soft commit. This does not, however, replicate new documents to slave servers in a master/slave environment. If that's a requirement for your implementation, you can force a hard commit by adding a parameter, as in this example:
<commitWithin>
<softCommit>false</softCommit>
</commitWithin>
With this configuration, when you call commitWithin as part of your update message, it will automatically perform a hard commit every time.
maxPendingDeletes
This value sets a limit on the number of deletions that Solr will buffer during document deletion. This can affect how
much memory is used during indexing.
<maxPendingDeletes>100000</maxPendingDeletes>
Event Listeners
The UpdateHandler section is also where update-related event listeners can be configured. These can be triggered
to occur after a commit or optimize event, or after only an optimize event.
The listener available is the RunExecutableListener, which runs an external executable with the defined set of instructions. The available commands are:
Setting Description
event If postCommit, the RunExecutableListener will be run after every commit or optimize. If postOptimize, the RunExecutableListener will be run after every optimize only.
exe The name of the executable to run. It should include the path to the file, relative to Solr home.
dir The directory to use as the working directory. The default is ".".
wait Forces the calling thread to wait until the executable returns a response. The default is true.
args Any arguments to pass to the program. The default is none.
env Any environment variables to set. The default is none.
Below is the example from solrconfig.xml, which shows an example of script-based replication described at http://wiki.apache.org/solr/CollectionDistribution:
<listener event="postCommit" class="solr.RunExecutableListener">
<str name="exe">solr/bin/snapshooter</str>
<str name="dir">.</str>
<bool name="wait">true</bool>
<arr name="args"> <str>arg1</str> <str>arg2</str> </arr>
<arr name="env"> <str>MYVAR=val1</str> </arr>
</listener>
Transaction Log
As described in the section RealTime Get, a transaction log is required for that feature. It is configured in the updateHandler section of solrconfig.xml.
Realtime Get currently relies on the update log feature, which is enabled by default. The update log is configured in solrconfig.xml, in a section like:
<updateLog>
<str name="dir">${solr.ulog.dir:}</str>
</updateLog>
Query Settings in SolrConfig
The settings in this section affect the way that Solr will process and respond to queries. These settings are all configured in child elements of the <query> element in solrconfig.xml.
<query>
...
</query>
Topics covered in this section:
Caches
Query Sizing and Warming
Query-Related Listeners
Caches
Solr caches are associated with a specific instance of an Index Searcher, a specific view of an index that doesn't
change during the lifetime of that searcher. As long as that Index Searcher is being used, any items in its cache will
be valid and available for reuse. Caching in Solr differs from caching in many other applications in that cached Solr objects do not expire after a time interval; instead, they remain valid for the lifetime of the Index Searcher.
When a new searcher is opened, the current searcher continues servicing requests while the new one auto-warms
its cache. The new searcher uses the current searcher's cache to pre-populate its own. When the new searcher is
ready, it is registered as the current searcher and begins handling all new search requests. The old searcher will be
closed once it has finished servicing all its requests.
In Solr, there are three cache implementations: solr.search.LRUCache, solr.search.FastLRUCache, and solr.search.LFUCache.
The acronym LRU stands for Least Recently Used. When an LRU cache fills up, the entry with the oldest
last-accessed timestamp is evicted to make room for the new entry. The net effect is that entries that are accessed
frequently tend to stay in the cache, while those that are not accessed frequently tend to drop out and will be
re-fetched from the index if needed again.
The FastLRUCache, which was introduced in Solr 1.4, is designed to be lock-free, so it is well suited for caches which are hit several times in a request.
Both LRUCache and FastLRUCache use an auto-warm count that supports both integers and percentages which get evaluated relative to the current size of the cache when warming happens.
The LFUCache refers to the Least Frequently Used cache. This works in a way similar to the LRU cache, except that when the cache fills up, the entry that has been used the least is evicted.
The Statistics page in the Solr Admin UI will display information about the performance of all the active caches. This
information can help you fine-tune the sizes of the various caches appropriately for your particular application. When
a Searcher terminates, a summary of its cache usage is also written to the log.
Each cache has settings to define its initial size (initialSize), maximum size (size) and the number of items to use during warming (autowarmCount). The LRU and FastLRU cache implementations can take a percentage instead of an absolute value for autowarmCount.
Details of each cache are described below.
filterCache
This cache is used by SolrIndexSearcher for filters (DocSets) for unordered sets of all documents that match a query. The numeric attributes control the number of entries in the cache.
Solr uses the filterCache to cache results of queries that use the fq search parameter. Subsequent queries using the same parameter setting result in cache hits and rapid returns of results. See Searching for a detailed discussion of the fq parameter.
Solr also uses this cache for faceting when the configuration parameter facet.method is set to fc. For a discussion of faceting, see Searching.
<filterCache class="solr.LRUCache"
size="512"
initialSize="512"
autowarmCount="128"/>
queryResultCache
This cache holds the results of previous searches: ordered lists of document IDs (DocList) based on a query, a sort, and the range of documents requested.
<queryResultCache class="solr.LRUCache"
size="512"
initialSize="512"
autowarmCount="128"/>
documentCache
This cache holds Lucene Document objects (the stored fields for each document). Since Lucene internal document IDs are transient, this cache is not auto-warmed. The size for the documentCache should always be greater than max_results times the max_concurrent_queries, to ensure that Solr does not need to refetch a document during a request. The more fields you store in your documents, the higher the memory usage of this cache will be.
<documentCache class="solr.LRUCache"
size="512"
initialSize="512"
autowarmCount="0"/>
User Defined Caches
You can also define named caches for your own application code to use. You can locate and use your cache object by name by calling the SolrIndexSearcher methods getCache(), cacheLookup() and cacheInsert().
<cache name="myUserCache" class="solr.LRUCache"
size="4096"
initialSize="1024"
autowarmCount="1024"
regenerator="org.mycompany.mypackage.MyRegenerator" />
If you want auto-warming of your cache, include a regenerator attribute with the fully qualified name of a class that implements solr.search.CacheRegenerator. In Solr 4.5, you can also use the NoOpRegenerator, which simply repopulates the cache with old items. Define it with the regenerator parameter as "regenerator=solr.NoOpRegenerator".
Query Sizing and Warming
maxBooleanClauses
This sets the maximum number of clauses allowed in a boolean query. This can affect range or prefix queries that
expand to a query with a large number of boolean terms. If this limit is exceeded, an exception is thrown.
<maxBooleanClauses>1024</maxBooleanClauses>
This option modifies a global property that affects all Solr cores. If multiple solrconfig.xml files disagree on this property, the value at any point in time will be based on the last Solr core that was initialized.
enableLazyFieldLoading
If this parameter is set to true, then fields that are not directly requested will be loaded lazily as needed. This can boost performance if the most common queries only need a small subset of fields, especially if infrequently accessed fields are large in size.
<enableLazyFieldLoading>true</enableLazyFieldLoading>
useFilterForSortedQuery
This parameter configures Solr to use a filter to satisfy a search. If the requested sort does not include "score", the filterCache will be checked for a filter matching the query. For most situations, this is only useful if the same search is requested often with different sort options and none of them ever use "score".
<useFilterForSortedQuery>true</useFilterForSortedQuery>
queryResultWindowSize
Used with the queryResultCache, this will cache a superset of the requested number of document IDs. For example, if a search in response to a particular query requests documents 10 through 19, and queryResultWindowSize is 50, documents 0 through 49 will be cached.
<queryResultWindowSize>20</queryResultWindowSize>
queryResultMaxDocsCached
This parameter sets the maximum number of documents to cache for any entry in the queryResultCache.
<queryResultMaxDocsCached>200</queryResultMaxDocsCached>
useColdSearcher
This setting controls whether search requests for which there is not a currently registered searcher should wait for a
new searcher to warm up (false) or proceed immediately (true). When set to "false", requests will block until the
searcher has warmed its caches.
<useColdSearcher>false</useColdSearcher>
maxWarmingSearchers
This parameter sets the maximum number of searchers that may be warming up in the background at any given
time. Exceeding this limit will raise an error. For read-only slaves, a value of two is reasonable. Masters should
probably be set a little higher.
<maxWarmingSearchers>2</maxWarmingSearchers>
Query-Related Listeners
As described in the section on Caches, new Index Searchers are cached. It's possible to use the triggers for listeners to perform query-related tasks. The most common use of this is to define queries to further "warm" the Index Searchers while they are starting. One benefit of this approach is that field caches are pre-populated for faster sorting.
Good query selection is key with this type of listener. It's best to choose your most common and/or heaviest queries and include not just the keywords used, but any other parameters such as sorting or filtering requests.
There are two types of events that can trigger a listener. A firstSearcher event occurs when a new searcher is being prepared but there is no current registered searcher to handle requests or to gain auto-warming data from (i.e., on Solr startup). A newSearcher event is fired whenever a new searcher is being prepared and there is a current searcher handling requests.
The listener is always instantiated with the class solr.QuerySenderListener, and followed by a NamedList array. These examples are included with solrconfig.xml:
<listener event="newSearcher" class="solr.QuerySenderListener">
<arr name="queries">
<!--
<lst><str name="q">solr</str><str name="sort">price asc</str></lst>
<lst><str name="q">rocks</str><str name="sort">weight asc</str></lst>
-->
</arr>
</listener>
<listener event="firstSearcher" class="solr.QuerySenderListener">
<arr name="queries">
<lst><str name="q">static firstSearcher warming in solrconfig.xml</str></lst>
</arr>
</listener>
The above code sample is the default in solrconfig.xml, and a key best practice is to modify these defaults before taking your application to production. While the sample queries are commented out in the section for the "newSearcher", the example is not commented out for the "firstSearcher" event. There is no point in auto-warming your Index Searcher with the query string "static firstSearcher warming in solrconfig.xml" if that is not relevant to your search application.
RequestDispatcher in SolrConfig
The requestDispatcher element of solrconfig.xml controls the way the Solr servlet's RequestDispatcher implementation responds to HTTP requests. Included are parameters for defining if it should handle /select urls (for Solr 1.1 compatibility), if it will support remote streaming, the maximum size of file uploads and how it will respond to HTTP cache headers in requests.
Topics in this section:
handleSelect Element
requestParsers Element
httpCaching Element
handleSelect Element
The first configurable item is the handleSelect attribute on the <requestDispatcher> element itself. This attribute can be set to one of two values, either "true" or "false". It governs how Solr responds to requests such as /select?qt=XXX. The default value "false" will ignore requests to /select if a requestHandler is not explicitly registered with the name /select. A value of "true" will route query requests to the parser defined with the qt value.
handleSelect is for legacy back-compatibility; those new to Solr do not need to change anything about the way this is configured by default.
In recent versions of Solr, a /select requestHandler is defined by default, so a value of "false" will work fine. See the section RequestHandlers and SearchComponents in SolrConfig for more information.
<requestDispatcher handleSelect="true" >
...
</requestDispatcher>
requestParsers Element
The <requestParsers> sub-element controls values related to parsing requests. This is an empty XML element that doesn't have any content, only attributes.
The enableRemoteStreaming attribute controls whether remote streaming of content is allowed. If set to false, streaming will not be allowed. Setting it to true (the default) lets you specify the location of content to be streamed using stream.file or stream.url parameters.
If you enable remote streaming, be sure that you have authentication enabled. Otherwise, someone could potentially
gain access to your content by accessing arbitrary URLs. It's also a good idea to place Solr behind a firewall to
prevent it being accessed from untrusted clients.
The multipartUploadLimitInKB attribute sets an upper limit in kilobytes on the size of a document that may be submitted in a multi-part HTTP POST request. The value specified is multiplied by 1024 to determine the size in bytes.
The formdataUploadLimitInKB attribute sets a limit in kilobytes on the size of form data (application/x-www-form-urlencoded) submitted in a HTTP POST request, which can be used to pass request parameters that will not fit in a URL.
The addHttpRequestToContext attribute can be used to indicate that the original HttpServletRequest object should be included in the context map of the SolrQueryRequest using the key httpRequest. This HttpServletRequest is not used by any Solr components, but may be useful when developing custom plugins.
<requestParsers enableRemoteStreaming="true"
multipartUploadLimitInKB="2048000"
formdataUploadLimitInKB="2048"
addHttpRequestToContext="false" />
httpCaching Element
The <httpCaching> element controls HTTP cache control headers. Do not confuse these settings with Solr's internal cache configuration. This element controls caching of HTTP responses as defined by the W3C HTTP specifications.
This element allows for three attributes and one sub-element. The attributes of the <httpCaching> element control whether a 304 response to a GET request is allowed, and if so, what sort of response it should be. When an HTTP client application issues a GET, it may optionally specify that a 304 response is acceptable if the resource has not been modified since the last time it was fetched.
Parameter Description
never304 If present with the value true, then a GET request will never respond with a 304 code, even if the requested resource has not been modified. When this attribute is set to true, the next two attributes are ignored. Setting this to true is handy for development, as the 304 response can be confusing when tinkering with Solr responses through a web browser or other client that supports cache headers.
lastModFrom This attribute may be set to either openTime (the default) or dirLastMod. The value openTime indicates that last modification times, as compared to the If-Modified-Since header sent by the client, should be calculated relative to the time the Searcher started. Use dirLastMod if you want times to exactly correspond to when the index was last updated on disk.
etagSeed The value of this attribute is sent as the value of the ETag header. Changing this value can be helpful to force clients to re-fetch content even when the indexes have not changed; for example, when you've made some changes to the configuration.
<httpCaching never304="false"
lastModFrom="openTime"
etagSeed="Solr">
<cacheControl>max-age=30, public</cacheControl>
</httpCaching>
cacheControl Element
In addition to these attributes, <httpCaching> accepts one child element: <cacheControl>. The content of this element will be sent as the value of the Cache-Control header on HTTP responses. This header is used to modify the default caching behavior of the requesting client. The possible values for the Cache-Control header are defined by the HTTP 1.1 specification in Section 14.9.
Setting the max-age field controls how long a client may re-use a cached response before requesting it again from the server. This time interval should be set according to how often you update your index and whether or not it is acceptable for your application to use content that is somewhat out of date. Setting must-revalidate will tell the client to validate with the server that its cached copy is still good before re-using it. This will ensure that the most timely result is used, while avoiding a second fetch of the content if it isn't needed, at the cost of a request to the server to do the check.
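As a minimal sketch, a <cacheControl> value that combines max-age with must-revalidate might look like the following (the 60-second interval is an arbitrary illustration, not a recommended value):

<httpCaching never304="false"
             lastModFrom="openTime"
             etagSeed="Solr">
  <cacheControl>max-age=60, must-revalidate, public</cacheControl>
</httpCaching>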
RequestHandlers and SearchComponents in SolrConfig
After the <query> section, request handlers and search components are configured. These are often referred to as "requestHandler" and "searchComponent", which is how they are defined in solrconfig.xml.
A request handler processes requests coming to Solr. These might be query requests or index update requests. You will likely need several of these defined, depending on how you want Solr to handle the various requests you will make.
A search component is a feature of search, such as highlighting or faceting. The search component is defined in solrconfig.xml separate from the request handlers, and then registered with a request handler as needed.
Topics covered in this section:
Request Handlers
SearchHandlers
UpdateRequestHandlers
ShardHandlers
Other Request Handlers
Search Components
Default Components
First-Components and Last-Components
Other Useful Components
Related Topics
Request Handlers
Every request handler is defined with a name and a class. The name of the request handler is referenced with the
request to Solr. For example, a request to http://localhost:8983/solr/collection1 is the default address for Solr, which will likely bring up the Solr Admin UI. However, if you add "/select" to the end, you can make a query:
http://localhost:8983/solr/collection1/select?q=solr
This query will be processed by the request handler with the name "/select". We've only used the "q" parameter here, which includes our query term, a simple keyword of "solr". If the request handler has more parameters defined, those will be used with any query we send to this request handler unless they are overridden by the client (or user) in the query itself.
If you have another request handler defined, you would send your request with that name - for example, "/update" is
a request handler that handles index updates like sending new documents to the index.
SearchHandlers
The primary request handler defined with Solr by default is the "SearchHandler", which handles search queries. The request handler is defined, and then a list of defaults for the handler are defined with a defaults list.
For example, in the default solrconfig.xml, the first request handler defined looks like this:
<requestHandler name="/select" class="solr.SearchHandler">
<lst name="defaults">
<str name="echoParams">explicit</str>
<int name="rows">10</int>
<str name="df">text</str>
</lst>
</requestHandler>
This example sets the rows parameter, which defines how many search results to return, to "10". The default field to search is the "text" field, set with the df parameter. The echoParams parameter defines that the parameters defined in the query should be returned when debug information is returned. Note also that the way the defaults are defined in the list varies if the parameter is a string, an integer, or another type.
All of the parameters described in the section on searching can be defined as defaults for any of the SearchHandlers.
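For example, a client can override the rows default for a single request simply by passing the parameter in the query string (the value 20 here is only an illustration):

http://localhost:8983/solr/collection1/select?q=solr&rows=20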
Besides defaults, there are other options for the SearchHandler, which are:
appends: This allows definition of parameters that are added to the user query. These might be filter queries, or other query rules that should be added to each query. There is no mechanism in Solr to allow a client to override these additions, so you should be absolutely sure you always want these parameters applied to queries.
<lst name="appends">
<str name="fq">inStock:true</str>
</lst>
In this example, the filter query "inStock:true" will always be added to every query.
invariants: This allows definition of parameters that cannot be overridden by a client. The values defined in an invariants section will always be used regardless of the values specified by the user, by the client, in defaults, or in appends.
<lst name="invariants">
<str name="facet.field">cat</str>
<str name="facet.field">manu_exact</str>
<str name="facet.query">price:[* TO 500]</str>
<str name="facet.query">price:[500 TO *]</str>
</lst>
In this example, facet fields have been defined which limit the facets that will be returned by Solr. If the client requests facets, the facets defined with a configuration like this are the only facets they will see.
The final section of a request handler definition is components, which defines a list of search components that can be used with a request handler. They are only registered with the request handler. How to define a search component is discussed further on in the section on Search Components. The components element can only be used with a request handler that is a SearchHandler, as in the sketch below.
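As a rough sketch (the handler name and the particular component list here are illustrative, not taken from the default configuration), registering a components list with a SearchHandler might look like this:

<requestHandler name="/mysearch" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
  </lst>
  <arr name="components">
    <str>query</str>
    <str>facet</str>
    <str>highlight</str>
    <str>debug</str>
  </arr>
</requestHandler>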
The solrconfig.xml file includes many other examples of SearchHandlers that can be used or modified as needed.
UpdateRequestHandlers
The UpdateRequestHandlers are request handlers which process updates to the index.
In this guide, we've covered these handlers in detail in the section Uploading Data with Index Handlers.
ShardHandlers
It is possible to configure a request handler to search across shards of a cluster, used with distributed search. More information about distributed search and how to configure the shardHandler is in the section Distributed Search with Index Sharding.
Other Request Handlers
There are other request handlers defined in solrconfig.xml, covered in other sections of this guide:
RealTime Get
Index Replication
Ping
Search Components
The search components define the logic that is used by the SearchHandler to perform queries for users.
Default Components
There are several default search components that work with all SearchHandlers without any additional configuration. If no components are defined, these are used by default.
Component Name Class Name More Information
query solr.QueryComponent Described in the section Query Syntax and Parsing.
facet solr.FacetComponent Described in the section Faceting.
mlt solr.MoreLikeThisComponent Described in the section MoreLikeThis.
highlight solr.HighlightComponent Described in the section Highlighting.
stats solr.StatsComponent Described in the section The Stats Component.
debug solr.DebugComponent Described in the section on Common Query Parameters.
If you register a new search component with one of these default names, the newly defined component will be used
instead of the default.
First-Components and Last-Components
It's possible to define some components as being used before (with first-components) or after (with last-components) other named components. This would be useful if custom search components have been configured to process data before the regular components are used. This is used when registering the components with the request handler.
<arr name="first-components">
<str>mycomponent</str>
</arr>
<arr name="components">
<str>query</str>
<str>facet</str>
<str>mlt</str>
<str>highlight</str>
<str>spellcheck</str>
<str>stats</str>
<str>debug</str>
</arr>
Other Useful Components
Many of the other useful components are described in sections of this Guide for the features they support. These
are:
SpellCheckComponent, described in the section Spell Checking.
TermVectorComponent, described in the section The Term Vector Component.
QueryElevationComponent, described in the section The Query Elevation Component.
TermsComponent, described in the section The Terms Component.
Related Topics
SolrRequestHandler from the Solr Wiki.
SearchHandler from the Solr Wiki.
SearchComponent from the Solr Wiki.
Solr Cores and solr.xml
solr.xml has evolved from configuring one Solr core to supporting multiple Solr cores and eventually to defining parameters for SolrCloud. Particularly with the advent of SolrCloud, cleanly defining and maintaining high-level configuration parameters and individual Solr cores in solr.xml has become more difficult, so an alternative is being adopted.
Starting in Solr 4.3, Solr will maintain two distinct formats for solr.xml, the legacy and discovery modes. The former is the format we have become accustomed to, in which all of the cores one wishes to define in a Solr instance are defined in solr.xml in <cores><core/>...<core/></cores> tags. This format will continue to be supported through the entire 4.x code line.
As of Solr 5.0 this form of solr.xml will no longer be supported. Instead Solr will support core discovery. In brief, core discovery still defines some configuration parameters in solr.xml, but no cores are defined in this file. Instead, the solr home directory is recursively walked until a core.properties file is encountered. This file is presumed to be at the root of a core, and many of the options that were placed in the <core> tag in legacy Solr are now defined here as simple properties, i.e. a file with entries, one to a line, like 'name=core1', 'schema=myschema.xml' and so on.
In Solr 4.x, the presence of a <solr><cores> node determines whether Solr uses legacy or discovery mode. There are checks at initialization time: if one tries to mix legacy and discovery tags in solr.xml, Solr will refuse to initialize and errors will be logged.
The following links are to pages that define these options in more detail, giving the acceptable parameters for the
legacy and discovery modes.
Format of solr.xml: The new discovery mode for solr.xml, including the acceptable parameters in both the solr.xml file and the corresponding core.properties files.
Legacy solr.xml Configuration: The legacy mode for solr.xml and the acceptable parameters.
Moving to the New solr.xml Format: How to migrate from legacy to discovery solr.xml configurations.
CoreAdmin API: Tools and commands for core administration, which is common to both legacy and
discovery modes.
Format of solr.xml
You can find solr.xml in your Solr Home directory. The default discovery solr.xml file looks like this:
The new "core discovery mode" structure for solr.xml will become mandatory as of Solr 5.0; see Format of solr.xml.
<solr>
<solrcloud>
<str name="host">${host:}</str>
<int name="hostPort">${jetty.port:8983}</int>
<str name="hostContext">${hostContext:solr}</str>
<int name="zkClientTimeout">${zkClientTimeout:15000}</int>
<bool name="genericCoreNodeNames">${genericCoreNodeNames:true}</bool>
</solrcloud>
<shardHandlerFactory name="shardHandlerFactory"
class="HttpShardHandlerFactory">
<int name="socketTimeout">${socketTimeout:0}</int>
<int name="connTimeout">${connTimeout:0}</int>
</shardHandlerFactory>
</solr>
As you can see, the discovery Solr configuration is "SolrCloud friendly". However, the presence of the <solrcloud> element does not mean that the Solr instance is running in SolrCloud mode. Unless the -DzkHost or -DzkRun startup options are specified, this section is ignored.
Using Multiple SolrCores
It is possible to segment Solr into multiple cores, each with its own configuration and indices. Cores may be
dedicated to a single application or to very different ones, but all are administered through a common administration
interface. You can create new Solr cores on the fly, shut down cores, even replace one running core with another, all
without ever stopping or restarting your servlet container.
Solr cores are configured by placing a file named core.properties in a sub-directory under solr.home. There are no a-priori limits to the depth of the tree, nor are there limits to the number of cores that can be defined. Cores may be anywhere in the tree with the exception that cores may not be defined under an existing core. That is, the following is not allowed:
./cores/core1/core.properties
./cores/core1/coremore/core5/core.properties
In this example, the enumeration will stop at "core1".
The following is legal:
./cores/somecores/core1/core.properties
./cores/somecores/core2/core.properties
./cores/othercores/core3/core.properties
./cores/extracores/deepertree/core4/core.properties
A minimal core.properties file looks like this:
name=collection1
This is very different from the legacy solr.xml <core> tag. In fact, your core.properties file can be empty.
Say the core.properties file is located in ./cores/core1 (relative to solr_home). In that case, the core
name is assumed to be "core1". The instanceDir will be the folder containing core.properties (i.e., ./cores/core1). The dataDir will be ./cores/core1/data, etc.
Solr.xml Parameters
The <solr> Element
There are no attributes that you can specify in the <solr> tag, which is the root element of solr.xml. The tables below list the child nodes of each XML element in solr.xml.
Node Description
adminHandler If used, this attribute should be set to the FQN (Fully qualified name) of a class that
inherits from CoreAdminHandler. For example,
adminHandler="com.myorg.MyAdminHandler" would configure the custom admin
handler (MyAdminHandler) to handle admin requests. If this attribute isn't set, Solr
uses the default admin handler, org.apache.solr.handler.admin.CoreAdminHandler.
For more information on this parameter, see the Solr Wiki at http://wiki.apache.org/solr/CoreAdmin#cores.
collectionsHandler As above, for custom CollectionsHandler implementations
infoHandler As above, for custom InfoHandler implementations
coreLoadThreads Specifies the number of threads that will be assigned to load cores in parallel
coreRootDirectory The root of the core discovery tree, defaults to SOLR_HOME
managementPath no-op at present.
sharedLib Specifies the path to a common library directory that will be shared across all cores.
Any JAR files in this directory will be added to the search path for Solr plugins. This
path is relative to the top-level container's Solr Home.
shareSchema This attribute, when set to true, ensures that the multiple cores pointing to the same
schema.xml will be referring to the same IndexSchema Object. Sharing the
IndexSchema Object makes loading the core faster. If you use this feature, make sure
that no core-specific property is used in your schema.xml.
transientCacheSize Defines how many cores with transient=true that can be loaded before swapping the
least recently used core for a new core.
configSetBaseDir The directory under which configsets for solr cores can be found. Defaults to
SOLR_HOME/configsets
The <solrcloud> element
You can run Solr without configuring any cores.
The persistent attribute is no longer supported in solr.xml. The properties in solr.xml are immutable, and any changes to individual cores are persisted in the individual core.properties files.
This element defines several parameters that relate to SolrCloud. This section is ignored unless the Solr instance is started with either -DzkRun or -DzkHost.
Node Description
distribUpdateConnTimeout Used to set the underlying "connTimeout" for intra-cluster updates.
distribUpdateSoTimeout Used to set the underlying "socketTimeout" for intra-cluster updates.
host The hostname Solr uses to access cores.
hostContext The servlet context path.
hostPort The port Solr uses to access cores. In the default solr.xml file, this is set to ${jetty.port:}, which will use the Solr port defined in Jetty.
leaderVoteWait When SolrCloud is starting up, how long each Solr node will wait for all
known replicas for that shard to be found before assuming that any nodes
that haven't reported are down.
leaderConflictResolveWait When trying to elect a leader for a shard, this property sets the maximum
time a replica will wait to see conflicting state information to be resolved;
temporary conflicts in state information can occur when doing rolling restarts,
especially when the node hosting the Overseer is restarted. Typically, the
default value of 180000 (millis) is sufficient for conflicts to be resolved; you
may need to increase this value if you have hundreds or thousands of small
collections in SolrCloud.
zkClientTimeout A timeout for connection to a ZooKeeper server. It is used with SolrCloud.
zkHost In SolrCloud mode, the URL of the ZooKeeper host that Solr should use for
cluster state information.
genericCoreNodeNames If TRUE, node names are not based on the address of the node, but on a generic name that identifies the core. When a different machine takes over serving that core, things will be much easier to understand.
The <logging> element
Node Description
class The class to use for logging. The corresponding JAR file must be available to Solr, perhaps through a <lib> directive in solrconfig.xml.
enabled true/false - whether to enable logging or not.
The <logging><watcher> element
Node Description
size The number of log events that are buffered.
threshold The logging level above which your particular logging implementation will record. For example
when using log4j one might specify DEBUG, WARN, INFO, etc.
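As an illustrative sketch only (the level and size values are arbitrary, and the typed-element syntax is assumed to match the other sections of the discovery-format solr.xml), a logging configuration might look like this:

<solr>
  <!-- other sections such as <solrcloud> omitted -->
  <logging>
    <str name="enabled">true</str>
    <watcher>
      <int name="size">50</int>
      <str name="threshold">WARN</str>
    </watcher>
  </logging>
</solr>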
The <shardHandlerFactory> element
Custom shard handlers can be defined in solr.xml if you wish to create a custom shard handler.
<shardHandlerFactory name="ShardHandlerFactory" class="qualified.class.name">
However, since this is a custom shard handler, sub-elements are specific to the implementation.
Substituting JVM System Properties in solr.xml
Solr supports variable substitution of JVM system property values in solr.xml, which allows runtime specification of various configuration options. The syntax is ${propertyname[:option default value]}. This allows defining a default that can be overridden when Solr is launched. If a default value is not specified, then the property must be specified at runtime or the solr.xml file will generate an error when parsed.
Any JVM System properties, usually specified using the -D flag when starting the JVM, can be used as variables in the solr.xml file.
For example: In the solr.xml file shown below, starting Solr using java -DsocketTimeout=1000 -jar start.jar will cause the socketTimeout option of the HttpShardHandlerFactory to be overridden using a value of 1000ms, instead of the default property value of "0"; however the connTimeout option will continue to use the default property value of "0".
<solr>
<shardHandlerFactory name="shardHandlerFactory"
class="HttpShardHandlerFactory">
<int name="socketTimeout">${socketTimeout:0}</int>
<int name="connTimeout">${connTimeout:0}</int>
</shardHandlerFactory>
</solr>
Individual core.properties Files
Core discovery replaces the individual <core> tags in solr.xml with a core.properties file located on disk. The presence of the core.properties file defines the instanceDir for that core. The core.properties file is a simple Java Properties file where each line is just a key=value pair, e.g., name=core1. Notice that no quotes are required.
The minimal core.properties file is an empty file, in which case all of the properties are defaulted appropriately.
Java properties files allow the hash ("#") or bang ("!") characters to specify comment-to-end-of-line. This table
defines the recognized properties:
key Description
name The name of the SolrCore. You'll use this name to reference the SolrCore when running
commands with the CoreAdminHandler.
config The configuration file name for a given core. The default is solrconfig.xml.
schema The schema file name for a given core. The default is schema.xml.
dataDir Core's data directory as a path relative to the instanceDir; data by default.
configSet If set, the name of the configset to use to configure the core (see Config Sets).
properties The name of the properties file for this core. The value can be an absolute pathname or a path relative to the value of instanceDir.
transient If true, the core can be unloaded if Solr reaches the transientCacheSize. The default if not specified is false. Cores are unloaded in order of least recently used first. Setting to true is not recommended in SolrCloud mode.
loadOnStartup If true, the default if it is not specified, the core will be loaded when Solr starts. Setting to false is not recommended in SolrCloud mode.
coreNodeName Added in Solr 4.2, this attribute allows naming a core. The name can then be used later if you need to replace a machine with a new one. By assigning the new machine the same coreNodeName as the old core, it will take over for the old SolrCore.
ulogDir The absolute or relative directory for the update log for this core (SolrCloud)
shard The shard to assign this core to (SolrCloud)
collection The name of the collection this core is part of (SolrCloud)
roles Future param for SolrCloud or a way for users to mark nodes for their own use.
Additional "user defined" properties may be specified for use as variables in .parsing core configuration files
Legacy solr.xml Configuration
Use solr.xml to configure your Solr core (a logical index and associated configuration files), or to configure multiple cores. You can find solr.xml in your Solr Home directory. The default solr.xml file looks like this:
<solr persistent="true">
<cores adminPath="/admin/cores" defaultCoreName="collection1" host="${host:}"
hostPort="${jetty.port:}" hostContext="${hostContext:}"
zkClientTimeout="${zkClientTimeout:15000}">
<core name="collection1" instanceDir="collection1" />
</cores>
</solr>
For more information about core configuration and solr.xml, see http://wiki.apache.org/solr/CoreAdmin.
Using Multiple SolrCores
It is possible to segment Solr into multiple cores, each with its own configuration and indices. Cores may be
dedicated to a single application or to very different ones, but all are administered through a common administration
interface. You can create new Solr cores on the fly, shut down cores, even replace one running core with another, all
without ever stopping or restarting your servlet container.
Solr cores are configured by placing a file named solr.xml in your solr.home directory. A typical solr.xml looks like this:
<solr persistent="false">
<cores adminPath="/admin/cores" host="${host:}" hostPort="${jetty.port:}">
<core name="core0" instanceDir="core0" />
<core name="core1" instanceDir="core1" />
</cores>
</solr>
This sets up two Solr cores, named "core0" and "core1", and names the directories (relative to the Solr installation
path) which will store the configuration and data sub-directories.
Solr.xml Parameters
The <solr> Element
There are several attributes that you can specify on <solr>, which is the root element of solr.xml.
Attribute Description
coreLoadThreads Specifies the number of threads that will be assigned to load cores in parallel
persistent Indicates that changes made through the API or admin UI should be saved back to this solr.xml. If not true, any runtime changes will be lost on the next Solr restart. The servlet container running Solr must have sufficient permissions to replace solr.xml (file delete and create), or errors will result. Any comments in solr.xml are not preserved when the file is updated. The default is true.
sharedLib Specifies the path to a common library directory that will be shared across all cores. Any
JAR files in this directory will be added to the search path for Solr plugins. This path is
relative to the top-level container's Solr Home.
zkHost In SolrCloud mode, the URL of the ZooKeeper host that Solr should use for cluster state
information.
The <cores> Element
The <cores> element, which contains definitions for each Solr core, is a child of <solr> and accepts several attributes of its own.
Attribute Description
You can run Solr without configuring any cores.
If you set the persistent attribute to true, be sure that the Web server has permission to replace the file. If the permissions are set incorrectly, the server will generate 500 errors and throw IOExceptions. Also, note that any comments in the solr.xml file will be lost when the file is overwritten.
adminPath This is the relative URL path to access the SolrCore administration pages. For example, a value of /admin/cores means that you can access the CoreAdminHandler with a URL that looks like this: http://localhost:8983/solr/admin/cores. If this attribute is not present, then SolrCore administration will not be possible.
host The hostname Solr uses to access cores.
hostPort The port Solr uses to access cores. In the default solr.xml file, this is set to ${jetty.port:}, which will use the Solr port defined in Jetty.
hostContext The servlet context path.
zkClientTimeout A timeout for connection to a ZooKeeper server. It is used with SolrCloud.
distribUpdateConnTimeout Used to set the underlying "connTimeout" for intra-cluster updates.
distribUpdateSoTimeout Used to set the underlying "socketTimeout" for intra-cluster updates
leaderVoteWait When SolrCloud is starting up, how long each Solr node will wait for all known replicas for that shard to be found before assuming that any nodes that haven't reported are down.
genericCoreNodeNames If TRUE, node names are not based on the address of the node, but on a generic name that identifies the core. When a different machine takes over serving that core, things will be much easier to understand.
managementPath no-op at present.
defaultCoreName The name of a core that will be used for requests that do not specify a core.
transientCacheSize Defines how many cores with transient=true can be loaded before swapping the least recently used core for a new core.
shareSchema This attribute, when set to true, ensures that the multiple cores pointing to the same schema.xml will be referring to the same IndexSchema Object. Sharing the IndexSchema Object makes loading the core faster. If you use this feature, make sure that no core-specific property is used in your schema.xml.
adminHandler If used, this attribute should be set to the FQN (Fully qualified name) of a class that inherits from CoreAdminHandler. For example, adminHandler="com.myorg.MyAdminHandler" would configure the custom admin handler (MyAdminHandler) to handle admin requests. If this attribute isn't set, Solr uses the default admin handler, org.apache.solr.handler.admin.CoreAdminHandler. For more information on this parameter, see the Solr Wiki at http://wiki.apache.org/solr/CoreAdmin#cores.
The <logging> Element
There is at most one <logging> element for a Solr installation that defines various attributes for logging.
Attribute Description
class The class to use for logging. The corresponding JAR file must be available to solr, perhaps through a
<lib> directive in solrconfig.xml.
enabled true/false - whether to enable logging or not.
In addition, the <logging> element may have a <watcher> child element which may have the following attributes:
size The number of log events that are buffered.
threshold The logging level above which your particular logging implementation will record. For example
when using log4j one might specify DEBUG or WARN or INFO etc.
The <core> Element
There is one <core> element for each SolrCore you define. They are children of the <cores> element and each one accepts the following attributes.
Attribute Description
name The name of the SolrCore. You'll use this name to reference the SolrCore when running
commands with the CoreAdminHandler.
instanceDir This relative path defines the Solr Home for the core.
config The configuration file name for a given core. The default is solrconfig.xml.
schema The schema file name for a given core. The default is schema.xml.
dataDir The core's data directory as a path relative to the instanceDir; data by default.
properties The name of the properties file for this core. The value can be an absolute pathname or a path relative to the value of instanceDir.
transient If true, the core can be unloaded if Solr reaches the transientCacheSize. The default if not specified is false. Cores are unloaded in order of least recently used first.
loadOnStartup If true, the default if it is not specified, the core will be loaded when Solr starts.
coreNodeName Added in Solr 4.2, this attribute allows naming a core. The name can then be used later if you need to replace a machine with a new one. By assigning the new machine the same coreNodeName as the old core, it will take over for the old SolrCore.
ulogDir The absolute or relative directory for the update log for this core (SolrCloud)
shard The shard to assign this core to (SolrCloud)
collection The name of the collection this core is part of (SolrCloud)
roles Future param for SolrCloud or a way for users to mark nodes for their own use.
Substituting JVM System Properties in solr.xml
Solr supports variable substitution of JVM system property values in solr.xml, which allows runtime specification of various configuration options. The syntax is ${propertyname[:option default value]}. This allows defining a default that can be overridden when Solr is launched. If a default value is not specified, then the property
must be specified at runtime or the solr.xml file will generate an error when parsed.
Any JVM System properties, usually specified using the -D flag when starting the JVM, can be used as variables in the solr.xml file.
For example: In the solr.xml file shown below, starting Solr using java -Dmy.logging=true -jar start.jar will cause the enabled option of the log watcher to be overridden using a value of true, instead of the default property value of "false"; however the threshold option will continue to use the default property value of "INFO".
<solr persistent="true">
<logging enabled="${my.logging:false}">
<watcher size="100" threshold="${my.logging.level:INFO}" />
</logging>
<cores adminPath="/admin/cores">
<core name="collection1" instanceDir="collection1" />
</cores>
</solr>
User Defined Properties in solr.xml
You can define custom properties in solr.xml that you may then reference in solrconfig.xml and schema.xml. Properties are name/value pairs. The scope of a property depends on which element it occurs within.
If a property is declared under <solr> but outside a <core> element, then it will have container scope and will be visible to all cores. In the example below, productname is such a property.
If a property declaration occurs within a <core> element, then its scope is limited to that core and it will not be visible to other cores. A property at core scope will override one of the same name declared at container scope.
<solr persistent="true" sharedLib="lib">
<property name="productname" value="Acme Online"/>
<cores adminPath="/admin/cores">
<core name="core0" instanceDir="core0">
<property name="dataDir" value="/data/core0"/></core>
<core name="core1" instanceDir="core1"/>
</cores>
</solr>
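Such a property could then be referenced with the ${...} substitution syntax. The snippet below is only a rough illustration (the parameter name is hypothetical, not part of the default configuration); it also shows a fallback value after the colon in case the property is undefined:

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- resolves to "Acme Online" for cores that see the container-scope property -->
    <str name="product">${productname:Unknown Product}</str>
  </lst>
</requestHandler>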
Moving to the New solr.xml Format
Migration from old-style solr.xml to core discovery is very straightforward. First, modify the solr.xml file from the legacy format to the discovery format.
In general there is a direct analog from the legacy format to the new format, except there is no <cores> element nor are there any <core> elements in discovery-based Solr.
Startup
In Solr 4.4 and on, the presence of a <cores> child element of the <solr> element in the solr.xml file signals a legacy version of solr.xml, and cores are expected to be defined as they have been historically. Depending on whether a <cores> element is discovered, solr.xml is parsed as either a legacy or discovery file, and errors are thrown in the log if legacy and discovery modes are mixed in solr.xml.
Moving <core> definitions
To migrate to discovery-based solr.xml, remove all of the <core> elements and the enclosing <cores> element from solr.xml. See the pages linked above for examples of migrating other attributes. Then, in the instanceDir for each core create a core.properties file. This file can be empty if all defaults are acceptable. In particular, the instanceDir is assumed to be the directory in which the core.properties file is discovered. The data directory will be in a directory called "data" directly below. If the file is completely empty, the name of the core is assumed to be the name of the folder in which the core.properties file was discovered.
As mentioned elsewhere, the tree structure that the cores are in is arbitrary, with the exception that the directories containing the core.properties files must share a common root, but that root may be many levels up the tree. Note that a root for the cores that is not a child of SOLR_HOME is supported through properties in solr.xml. However, only one root is possible; there is no provision presently for specifying multiple roots.
The only restriction on the tree structure is that cores may not be children of other cores; enumeration stops descending down the tree when the first core.properties file is discovered. Siblings of the directory in which the core.properties file is discovered are still walked, only stopping recursing down the sibling when a core.properties file is found.
Example
Here's an example of what a legacy solr.xml file might look like and the equivalent discovery-based solr.xml and core.properties files:
<solr persistent="${solr.xml.persist:false}">
<cores adminPath="/admin/cores" defaultCoreName="collection1" host="127.0.0.1"
hostPort="${hostPort:8983}"
hostContext="${hostContext:solr}"
zkClientTimeout="${solr.zkclienttimeout:30000}" shareSchema="${shareSchema:false}"
genericCoreNodeNames="${genericCoreNodeNames:true}">
<core name="core1" instanceDir="core1" shard="${shard:}"
collection="${collection:core1}" config="${solrconfig:solrconfig.xml}"
schema="${schema:schema.xml}" coreNodeName="${coreNodeName:}"/>
<core name="core2" instanceDir="core2" />
<shardHandlerFactory name="shardHandlerFactory" class="HttpShardHandlerFactory">
<int name="socketTimeout">${socketTimeout:120000}</int>
<int name="connTimeout">${connTimeout:15000}</int>
</shardHandlerFactory>
</cores>
</solr>
The new-style solr.xml might look like what is below. Note that adminPath and defaultCoreName are not supported in discovery-based solr.xml.
<solr>
<solrcloud>
<str name="host">127.0.0.1</str>
<int name="hostPort">${hostPort:8983}</int>
<str name="hostContext">${hostContext:solr}</str>
<int name="zkClientTimeout">${solr.zkclienttimeout:30000}</int>
<str name="shareSchema">${shareSchema:false}</str>
<bool name="genericCoreNodeNames">${genericCoreNodeNames:true}</bool>
</solrcloud>
<shardHandlerFactory name="shardHandlerFactory" class="HttpShardHandlerFactory">
<int name="socketTimeout">${socketTimeout:120000}</int>
<int name="connTimeout">${connTimeout:15000}</int>
</shardHandlerFactory>
</solr>
In each of "core1" and "core2" directories, there would be a file that might look like these. Notecore.properties
that note that instanceDir is not supported, it is assumed to be the directory in which core.properties is found.
core1:
name=core1
shard=${shard:}
collection=${collection:core1}
config=${solrconfig:solrconfig.xml}
schema=${schema:schema.xml}
coreNodeName=${coreNodeName:}
core2:
name=core2
In fact, the core2 core.properties file could even be empty and the name would default to the directory in which the core.properties file was found.
CoreAdmin API
The CoreAdminHandler is a special SolrRequestHandler that is used to manage Solr cores. Unlike normal
SolrRequestHandlers, the CoreAdminHandler is not attached to a single core. Instead, it manages all the cores
running in a single Solr instance. Only one CoreAdminHandler exists for each top-level Solr instance.
To use the CoreAdminHandler, make sure that the adminPath attribute is defined on the <cores> element; otherwise you will not be able to make HTTP requests to perform Solr core administration.
The action to perform is named by the HTTP request parameter "action", with arguments for a specific action being
provided as additional parameters.
All action names are uppercase, and are defined in depth in the sections below.
STATUS
CREATE
RELOAD
RENAME
SWAP
UNLOAD
MERGEINDEXES
SPLIT
REQUESTSTATUS
STATUS
The STATUS action returns the status of all running Solr cores, or status for only the named core.
http://localhost:8983/solr/admin/cores?action=STATUS&core=core0
Input
Query Parameters
Parameter Type Required Default Description
core string No The name of a core, as listed in the "name" attribute of a <core> element in solr.xml.
indexInfo boolean No true If false, information about the index will not be returned with a core STATUS request. In Solr implementations with a large number of cores (i.e., more than hundreds), retrieving the index information for each core can take a lot of time and isn't always required.
CREATE
The CREATE action creates a new core and registers it. If persistence is enabled (persistent="true" on the <solr> element), the updated configuration for this new core will be saved in solr.xml. If a Solr core with the given name already exists, it will continue to handle requests while the new core is initializing. When the new core is ready, it will take new requests and the old core will be unloaded.
http://localhost:8983/solr/admin/cores?action=CREATE&name=coreX&instanceDir=path/to/di
r&config=config_file_name.xml&schema=schema_file_name.xml&dataDir=data
Input
Query Parameters
Parameter Type Required Default Description
name string Yes N/A The name of the new core. Same as "name" on the <core> element.
instanceDir string Yes N/A The directory where files for this SolrCore should be stored. Same as instanceDir on the <core> element.
config string No Name of the config file (solrconfig.xml) relative to instanceDir.
schema string No Name of the schema file (schema.xml) relative to instanceDir.
dataDir string No Name of the data directory relative to instanceDir.
configSet string No Name of the configset to use for this core (see Config Sets).
collection string No The name of the collection to which this core belongs. The default is the name of the core. collection.<param>=<value> causes a property of <param>=<value> to be set if a new collection is being created. Use collection.configName=<configname> to point to the configuration for a new collection.
shard string No The shard id this core represents. Normally you want to be
auto-assigned a shard id.
property.name=value string No Sets the core property name to value. See core.properties file contents.
async string No Request ID to track this action which will be processed
asynchronously
Use collection.configName=<configname> to point to the config for a new collection.
Example
http://localhost:8983/solr/admin/cores?action=CREATE&name=mycore&collection=collectio
n1&shard=shard2
RELOAD
The RELOAD action loads a new core from the configuration of an existing, registered Solr core. While the new core is initializing, the existing one will continue to handle requests. When the new Solr core is ready, it takes over and the old core is unloaded.
This is useful when you've made changes to a Solr core's configuration on disk, such as adding new field definitions.
Calling the RELOAD action lets you apply the new configuration without having to restart the Web container.
However, the Core Container does not persist the SolrCloud solr.xml parameters, such as solr/@zkHost and solr/cores/@hostPort, which are ignored.
http://localhost:8983/solr/admin/cores?action=RELOAD&core=core0
As of Solr 4.0, RELOAD performs "live" reloads of SolrCore, reusing some existing objects. Some configuration options, such as the DataDir location and IndexWriter-related settings in solrconfig.xml, cannot be changed and made active with a simple RELOAD action.
Input
Query Parameters
Parameter Type Required Default Description
core string Yes N/A The name of the core, as listed in the "name" attribute of a <core> element in solr.xml.
RENAME
The RENAME action changes the name of a Solr core.
http://localhost:8983/solr/admin/cores?action=RENAME&core=core0&other=core5
Input
Query Parameters
Parameter Type Required Default Description
core string Yes The name of the Solr core to be renamed.
other string Yes The new name for the Solr core. If the persistent attribute of <solr> is true, the new name will be written to solr.xml as the name attribute of the <core> element.
async string No Request ID to track this action which will be processed
asynchronously
SWAP
SWAP atomically swaps the names used to access two existing Solr cores. This can be used to swap new content
into production. The prior core remains available and can be swapped back, if necessary. Each core will be known
by the name of the other, after the swap.
http://localhost:8983/solr/admin/cores?action=SWAP&core=core1&other=core0
Input
Query Parameters
Parameter Type Required Default Description
core string Yes The name of one of the cores to be swapped.
other string Yes The name of one of the cores to be swapped.
async string No Request ID to track this action which will be processed
asynchronously
UNLOAD
The UNLOAD action removes a core from Solr. Active requests will continue to be processed, but no new requests will be sent to the named core. If a core is registered under more than one name, only the given name is removed.
http://localhost:8983/solr/admin/cores?action=UNLOAD&core=core0
The UNLOAD action requires a parameter (core) identifying the core to be removed. If the persistent attribute of <solr> is set to true, the <core> element with this name attribute will be removed from solr.xml.
Input
Do not use SWAP with a SolrCloud node. It is not supported and can result in the core being unusable.
Unloading all cores in a SolrCloud collection causes the removal of that collection's metadata from
ZooKeeper.
Query Parameters
Parameter Type Required Default Description
core string Yes The name of one of the cores to be removed.
deleteIndex boolean No false If true, will remove the index when unloading the core.
deleteDataDir boolean No false If true, removes the data directory and all sub-directories.
deleteInstanceDir boolean No false If true, removes everything related to the core, including the
index directory, configuration files and other related files.
async string No Request ID to track this action which will be processed
asynchronously
MERGEINDEXES
The MERGEINDEXES action merges one or more indexes to another index. The indexes must have completed commits, and should be locked against writes until the merge is complete or the resulting merged index may become corrupted. The target core index must already exist and have a compatible schema with the one or more indexes that will be merged to it. Another commit on the target core should also be performed after the merge is complete.
http://localhost:8983/solr/admin/cores?action=MERGEINDEXES&core=core0&indexDir=/o
pt/solr/core1/data/index&indexDir=/opt/solr/core2/data/index
In this example, we use the indexDir parameter to define the index locations of the source cores. The core parameter defines the target index. A benefit of this approach is that we can merge any Lucene-based index that may not be associated with a Solr core.
Alternatively, we can instead use a srcCore parameter, as in this example:
http://localhost:8983/solr/admin/cores?action=mergeindexes&core=core0&srcCore=cor
e1&srcCore=core2
This approach allows us to define cores that may not have an index path that is on the same physical server as the
target core. However, we can only use Solr cores as the source indexes. Another benefit of this approach is that we
don't have as high a risk for corruption if writes occur in parallel with the source index.
We can make this call run asynchronously by specifying the async parameter and passing a request-id. This id can then be used to check the status of the already submitted task using the REQUESTSTATUS API.
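For instance, an asynchronous merge request might look like the following (the request-id value 1001 is arbitrary):

http://localhost:8983/solr/admin/cores?action=MERGEINDEXES&core=core0&srcCore=core1&srcCore=core2&async=1001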
Input
Query Parameters
Parameter Type Required Default Description
core string Yes The name of the target core/index.
indexDir string Multi-valued, directories that would be merged.
srcCore string Multi-valued, source cores that would be merged.
async string Request ID to track this action which will be processed
asynchronously
SPLIT
The SPLIT action splits an index into two or more indexes. The index being split can continue to handle requests. The split pieces can be placed into a specified directory on the server's filesystem or they can be merged into running Solr cores.
The SPLIT action supports five parameters, which are described in the table below.
Input
Query Parameters
Parameter Type Required Default Description
core string Yes The name of the core to be split.
path string Multi-valued, the directory path in which a piece of the index will be
written.
targetCore string Multi-valued, the target Solr core to which a piece of the index will be
merged
ranges string No A comma-separated list of hash ranges in hexadecimal format
split.key string No The key to be used for splitting the index
async string No Request ID to track this action which will be processed
asynchronously
Examples
The core index will be split into as many pieces as the number of path or targetCore parameters.
Usage with two targetCore parameters:
http://localhost:8983/solr/admin/cores?action=SPLIT&core=core0&targetCore=core1&t
argetCore=core2
Here the core index will be split into two pieces and merged into the two targetCore indexes.
Usage with two path parameters:
http://localhost:8983/solr/admin/cores?action=SPLIT&core=core0&path=/path/to/inde
x/1&path=/path/to/index/2
The core index will be split into two pieces and written into the two directory paths specified.
Usage with the split.key parameter:
Either the path or the targetCore parameter must be specified but not both. The ranges and split.key parameters are optional and only one of the two should be specified, if at all required.
http://localhost:8983/solr/admin/cores?action=SPLIT&core=core0&targetCore=core1&s
plit.key=A!
Here all documents having the same route key as the split.key, i.e. 'A!', will be split from the core index and written to the targetCore.
Usage with ranges parameter:
http://localhost:8983/solr/admin/cores?action=SPLIT&core=core0&targetCore=core1&t
argetCore=core2&targetCore=core3&ranges=0-1f4,1f5-3e8,3e9-5dc
This example uses the ranges parameter with hash ranges 0-500, 501-1000 and 1001-1500
specified in hexadecimal. Here the index will be split into three pieces with each targetCore receiving
documents matching the hash ranges specified i.e. core1 will get documents with hash range 0-500,
core2 will receive documents with hash range 501-1000 and finally, core3 will receive documents
with hash range 1001-1500. At least one hash range must be specified. Please note that using a
single hash range equal to a route key's hash range is NOT equivalent to using the split.key parameter because multiple route keys can hash to the same range.
The targetCore must already exist and must have a compatible schema with the core index. A commit is automatically called on the core index before it is split.
This command is used as part of the SPLITSHARD command, but it can be used for non-cloud Solr cores as well. When used against a non-cloud core without the split.key parameter, this action will split the source index and distribute its documents alternately so that each split piece contains an equal number of documents. If the split.key parameter is specified then only documents having the same route key will be split from the source index.
REQUESTSTATUS
Request the status of an already submitted asynchronous CoreAdmin API call.
Input
Query Parameters
Parameter Type Required Default Description
requestid string Yes The user defined request-id for the Asynchronous request.
The call below will return the status of an already submitted Asynchronous CoreAdmin call.
http://localhost:8983/solr/admin/cores?action=REQUESTSTATUS&requestid=1
Config Sets
On a multicore Solr instance, you may find that you want to share configuration between a number of different cores.
You can achieve this using named configsets, which are essentially shared configuration directories stored under a
configurable configset base directory.
To create a configset, simply add a new directory under the configset base directory. The configset will be identified by the name of this directory. Then copy the config directory you want to share into it. The structure should look something like this:
/<configSetBaseDir>
/configset1
/conf
/schema.xml
/solrconfig.xml
/configset2
/conf
/schema.xml
/solrconfig.xml
The default base directory is SOLR_HOME/configsets, and it can be configured in solr.xml.
To create a new core using a configset, pass configSet as one of the core properties. For example, via the core admin API:
http://<solr>/cores?action=CREATE&name=mycore&instanceDir=path/to/instance&configSet=
configset2
Solr Plugins
Solr allows you to load custom code to perform a variety of tasks within Solr, from custom Request Handlers to
process your searches, to custom Analyzers and Token Filters for your text field. You can even load custom Field
Types. These pieces of custom code are called plugins.
Not everyone will need to create plugins for their Solr instances - what's provided is usually enough for most
applications. However, if there's something that you need, you may want to review the Solr Wiki documentation on
plugins at SolrPlugins.
JVM Settings
Configuring your JVM can be a complex topic. A full discussion is beyond the scope of this document. Luckily, most
modern JVMs are quite good at making the best use of available resources with default settings. The following
sections contain a few tips that may be helpful when the defaults are not optimal for your situation.
For more general information about improving Solr performance, see https://wiki.apache.org/solr/SolrPerformanceFactors.
Choosing Memory Heap Settings
The most important JVM configuration settings are those that determine the amount of memory it is allowed to
allocate. There are two primary command-line options that set memory limits for the JVM. These are -Xms, which sets the initial size of the JVM's memory heap, and -Xmx, which sets the maximum size to which the heap is allowed to grow.
If your Solr application requires more heap space than you specify with the -Xms option, the heap will grow
automatically. It's quite reasonable to not specify an initial size and let the heap grow as needed. The only downside
is a somewhat slower startup time since the application will take longer to initialize. Setting the initial heap size
higher than the default may avoid a series of heap expansions, which often results in objects being shuffled around
within the heap, as the application spins up.
The maximum heap size, set with -Xmx, is more critical. If the memory heap grows to this size, object creation may begin to fail and throw OutOfMemoryException. Setting this limit too low can cause spurious errors in your application, but setting it too high can be detrimental as well.
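For example, both limits can be passed on the command line when starting the bundled Solr example; the values below are only an illustration and should be tuned to your own index size and hardware:

java -Xms512m -Xmx2g -jar start.jar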
It doesn't always cause an error when the heap reaches the maximum size. Before an error is raised, the JVM will
first try to reclaim any available space that already exists in the heap. Only if all garbage collection attempts fail will
your application see an exception. As long as the maximum is big enough, your app will run without error, but it may
run more slowly if forced garbage collection kicks in frequently.
The larger the heap the longer it takes to do garbage collection. This can mean minor, random pauses or, in
extreme cases, "freeze the world" pauses of a minute or more. As a practical matter, this can become a serious
problem for heap sizes that exceed about two gigabytes, even if far more physical memory is available. On robust
hardware, you may get better results running multiple JVMs, rather than just one with a large memory heap. Some
specialized JVM implementations may have customized garbage collection algorithms that do better with large
heaps. Also, Java 7 is expected to have a redesigned GC that should handle very large heaps efficiently. Consult
your JVM vendor's documentation.
When setting the maximum heap size, be careful not to let the JVM consume all available physical memory. If the
JVM process space grows too large, the operating system will start swapping it, which will severely impact
performance. In addition, the operating system uses memory space not allocated to processes for file system cache
and other purposes. This is especially important for I/O-intensive applications, like Lucene/Solr. The larger your
indexes, the more you will benefit from filesystem caching by the OS. It may require some experimentation to
determine the optimal tradeoff between heap space for the JVM and memory space for the OS to use.
On systems with many CPUs/cores, it can also be beneficial to tune the layout of the heap and/or the behavior of
the garbage collector. Adjusting the relative sizes of the generational pools in the heap can affect how often GC
sweeps occur and whether they run concurrently. Configuring the various settings of how the garbage collector
should behave can greatly reduce the overall performance impact when it does run. There is a lot of good
information on this topic available on Sun's website. A good place to start is here: Oracle's Java HotSpot Garbage Collection.
Use the Server HotSpot VM
If you are using Sun's JVM, add the -server command-line option when you start Solr. This tells the JVM that it should optimize for a long running, server process. If the Java runtime on your system is a JRE, rather than a full JDK distribution (including javac and other development tools), then it is possible that it may not support the -server JVM option. Test this by running java -help and look for -server as an available option in the displayed usage message.
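Combined with the heap settings discussed above, a startup command for the bundled example might therefore look like this (the values are illustrative only):

java -server -Xms512m -Xmx2g -jar start.jar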
Checking JVM Settings
A great way to see what JVM settings your server is using, along with other useful information, is to use the admin RequestHandler, solr/admin/system. This request handler will display a wealth of server statistics and settings.
You can also use any of the tools that are compatible with the Java Management Extensions (JMX). See the section Using JMX with Solr in Managing Solr for more information.
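With the bundled Jetty example running on its default port, this handler can be reached at a URL like the following (the host and port are assumptions based on the default example setup):

http://localhost:8983/solr/admin/system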
Managing Solr
This section describes how to run Solr and how to look at Solr when it is running. It contains the following sections:
Running Solr on Jetty: Describes how to run Solr in the Jetty web application container. The Solr example included
in this distribution runs in a Jetty web application container.
Running Solr on Tomcat: Describes how to run Solr in the Tomcat web application container.
Configuring Logging: Describes how to configure logging for Solr.
Enabling SSL: Describes how to configure single-node Solr and SolrCloud to encrypt internal and external
communication using SSL.
Backing Up: Describes backup strategies for your Solr indexes.
Using JMX with Solr: Describes how to use Java Management Extensions with Solr.
Managed Resources: Describes the REST APIs for dealing with resources that various Solr plugins may expose.
Running Solr on HDFS: How to use HDFS to store your Solr indexes and transaction logs.
For information on running Solr in a variety of Java application containers, see the basic installation instructions on the Solr wiki.
Running Solr on Tomcat
Solr comes with an example schema and scripts for running on Jetty. The next section describes some of the details of how things work "under the hood," and covers running multiple Solr instances and deploying Solr using the Tomcat application manager.
For more information about running Solr on Tomcat, see the basic installation instructions and the Solr Tomcat page on the Solr wiki.
How Solr Works with Tomcat
The two basic steps for running Solr in any Web application container are as follows:
1. Make the Solr classes available to the container. In many cases, the Solr Web application archive (WAR) file can be placed into a special directory of the application container. In the case of Tomcat, you need to place the Solr WAR file in Tomcat's webapps directory. If you installed Tomcat with Solr, take a look in tomcat/webapps: you'll see the solr.war file is already there.
2. Point Solr to the Solr home directory that contains conf/solrconfig.xml and conf/schema.xml. There are a few ways to get this done. One of the best is to define the solr.solr.home Java system property. With Tomcat, the best way to do this is via a shell environment variable, JAVA_OPTS. Tomcat puts the value of this variable on the command line upon startup. Here is an example:
export JAVA_OPTS="-Dsolr.solr.home=/Users/jonathan/Desktop/solr"
Port 8983 is the default Solr listening port. If you are using Tomcat and wish to change this port, edit the file tomcat/conf/server.xml in the Solr distribution. You'll find the port in this part of the file:
<Connector port="8983" protocol="HTTP/1.1" connectionTimeout="20000"
redirectPort="8443" />
Modify the port number as desired and restart Tomcat if it is already running. Note that modifying the port number will leave some of the samples and help file links pointing to the default port; it is out of the scope of this reference guide to provide full details of how to change all of the examples and other resources to the new port.
Running Multiple Solr Instances
The standard way to deploy multiple Solr index instances in a single Web application is to use the multicore API
described in Solr Cores and solr.xml.
An alternative approach, which provides more code isolation, uses Tomcat context fragments. A context fragment is a file that contains a single <Context> element and any subelements required for your application. The file omits all other XML elements.
Each context fragment specifies where to find the Solr WAR and the path to the solr home directory. The name of the context fragment file determines the URL used to access that instance of Solr. For example, a context fragment named harvey.xml would deploy Solr to be accessed at http://localhost:8983/harvey.
In Tomcat's conf/Catalina/localhost directory, store one context fragment per instance of Solr. If the conf/Catalina/localhost directory doesn't exist, go ahead and create it.
Using Tomcat context fragments, you could run multiple instances of Solr on the same server, each with its own schema and configuration. For full details and examples of context fragments, take a look at the Solr Wiki: http://wiki.apache.org/solr/SolrTomcat.
Here are examples of context fragments which would set up two Solr instances, each with its own solr.home directory:
harvey.xml (http://localhost:8983/harvey, using /some/path/solr1home)
rupert.xml (http://localhost:8983/rupert, using /some/path/solr2home)
<Context docBase="/some/path/solr.war" debug="0" crossContext="true" >
<Environment name="solr/home" type="java.lang.String" value="/some/path/solr1home"
override="true" />
</Context>
<Context docBase="/some/path/solr.war" debug="0" crossContext="true" >
<Environment name="solr/home" type="java.lang.String" value="/some/path/solr2home"
override="true" />
</Context>
Deploying Solr with the Tomcat Manager
If your instance of Tomcat is running the Tomcat Web Application Manager, you can use its browser interface to
deploy Solr.
Just as before, you have to tell Solr where to find the solr home directory. You can do this by setting JAVA_OPTS
before starting Tomcat.
Once Tomcat is running, navigate to the Web application manager, probably available at a URL like this:
http://localhost:8983/manager/html
You will see the main screen of the manager.
To add Solr, scroll down to the Deploy section, specifically WAR file to deploy. Click Browse... and find the Solr WAR file, usually something like dist/solr-4.x.y.war within your Solr installation. Click Deploy. Tomcat will load the WAR file and start running it. Click the link in the application path column of the manager to see Solr. You
won't see much, just a welcome screen, but it contains a link for the Admin Console.
Tomcat's manager screen, in its application list, has links so you can stop, start, reload, or undeploy the Solr
application.
Running Solr on Jetty
Solr comes with an example schema and scripts for running on Jetty, along with a working installation, in the /example directory. It is stripped of all unnecessary features and its config has had some minor tuning so it's optimized for Solr. It is recommended that you use the provided Jetty server for optimal performance. For more information about the Jetty example installation, see the Solr Tutorial and the basic installation instructions.
For detailed information about running Solr on a stand-alone Jetty, see http://wiki.apache.org/solr/SolrJetty.
Change the Solr Listening Port
Port 8983 is the default port for Solr. If you are using Jetty and wish to change the port number, edit the file example/etc/jetty.xml in the Solr distribution. You'll find the port in this part of the file:
<New class="org.eclipse.jetty.server.nio.SelectChannelConnector">
<Set name="host"><SystemProperty name="jetty.host" /></Set>
<Set name="port"><SystemProperty name="jetty.port" default="8983"/></Set>
<Set name="maxIdleTime">50000</Set>
<Set name="Acceptors">2</Set>
<Set name="statsOn">false</Set>
<Set name="confidentialPort">8443</Set>
<Set name="lowResourcesConnections">5000</Set>
<Set name="lowResourcesMaxIdleTime">5000</Set>
</New>
Modify the port number as desired and restart Jetty if it is already running. Note that modifying the port number will leave some of the samples and help file links pointing to the wrong port; it is out of the scope of this reference guide to provide full details of how to change all of the examples and other resources to the new port.
Configuring Logging
Prior to version 4.3, Solr used the SLF4J Logging API (http://www.slf4j.org). To improve flexibility in logging with containers other than Jetty, in Solr 4.3 the default behavior changed and the SLF4J jars were removed from Solr's .war file. This allows changing or upgrading the logging mechanism as needed.
For further information about Solr logging, see SolrLogging.
Temporary Logging Settings
You can control the amount of logging output in Solr by using the Admin Web interface. Select the LOGGING link. Note that this page only lets you change settings in the running system and is not saved for the next run. (For more information about the Admin Web interface, see Using the Solr Administration User Interface.)
The Logging screen.
In addition to the logging options described below, there is a way to configure which request parameters (such as parameters sent as part of queries) are logged with an additional request parameter called logParamsList. See the section on Common Query Parameters for more information.
This part of the Admin Web interface allows you to set the logging level for many different log categories.
Fortunately, any categories that are unset will have the logging level of their parent. This makes it possible to change many categories at once by adjusting the logging level of their parent.
When you select Level, you see the following menu:
The Log Level Menu.
Directories are shown with their current logging levels. The Log Level Menu floats over these. To set a log level for a
particular directory, select it and click the appropriate log level button.
Log levels settings are as follows:
Level Result
FINEST Reports everything.
FINE Reports everything but the least important messages.
CONFIG Reports configuration errors.
INFO Reports everything but normal status.
WARNING Reports all warnings.
SEVERE Reports only the most severe warnings.
OFF Turns off logging.
UNSET Removes the previous log setting.
Multiple settings at one time are allowed.
Permanent Logging Settings
Making permanent changes to the JDK Logging API configuration is a matter of creating or editing a properties file.
Tomcat Logging Settings
Tomcat offers a choice between settings for all applications or settings specifically for the Solr application.
With Solr 4.3, you will need to copy the SLF4J .jar files from the example/lib/ext directory to the main lib directory of Tomcat (this may be as simple as tomcat/lib). Then you can copy the log4j.properties file from example/resources to a location on the classpath - the same location as the .jar files is probably OK in most cases. Then you can edit the properties as needed to set the log destination.
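As a rough sketch, assuming Tomcat is installed at /usr/local/tomcat (a hypothetical path) and you are in the root of the Solr distribution, the copy steps might look like:
cp example/lib/ext/*.jar /usr/local/tomcat/lib/
cp example/resources/log4j.properties /usr/local/tomcat/lib/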
See the documentation for the SLF4J Logging API for more information:
http://slf4j.org/docs.html
http://docs.oracle.com/javase/7/docs/technotes/guides/logging/index.html
Jetty Logging Settings
To change settings for the SLF4J Logging API in Jetty, you need to create a settings file and tell Jetty where to find
it.
Begin by creating a file jetty/logging.properties or modifying the one found in example/etc.
To tell Jetty how to find the file, edit jetty.xml and add the following property information:
<Configure id="Server" class="org.mortbay.jetty.Server">
<Call class="java.lang.System" name="setProperty">
<Arg>java.util.logging.config.file</Arg>
<Arg>logging.properties</Arg>
</Call>
</Configure>
The next time you launch Jetty, it will use the settings in the file.
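Alternatively (a minimal sketch, not the jetty.xml approach described above), you could set the same system property directly on the command line when starting the example server, assuming your settings file is at etc/logging.properties:
java -Djava.util.logging.config.file=etc/logging.properties -jar start.jar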
Enabling SSL
Both SolrCloud and single-node Solr can encrypt communications to and from clients, and in SolrCloud between
nodes, with SSL. This section describes enabling SSL with the example Jetty server using a self-signed certificate.
For background on SSL certificates and keys, see http://www.tldp.org/HOWTO/SSL-Certificates-HOWTO/.
Basic SSL Setup
Generate a self-signed certificate and a key
Convert the certificate and key to PEM format for use with cURL
Configure Jetty
Run Single Node Solr using SSL
SolrCloud
Configure ZooKeeper
Run SolrCloud with SSL
Example Client Actions
Create a SolrCloud collection using cURL
Retrieve SolrCloud cluster status using cURL
Index documents using post.jar
Query using cURL
Index a document using CloudSolrServer
Basic SSL Setup
Generate a self-signed certificate and a key
To generate a self-signed certificate and a single key that will be used to authenticate both the server and the client, we'll use the JDK keytool command and create a separate keystore. This keystore will also be used as a truststore below. It's possible to use the keystore that comes with the JDK for these purposes, and to use a separate truststore, but those options aren't covered here.
Run the commands below in the example/etc/ directory in the binary Solr distribution.
The keytool option "-ext SAN=..." allows you to specify all the DNS names and/or IP addresses that will be allowed during hostname verification (but see below for how to skip hostname verification between Solr nodes so that you don't have to specify all hosts here). In addition to localhost and 127.0.0.1, this example includes a LAN IP address 192.168.1.3 for the machine the Solr nodes will be running on:
keytool -genkeypair -alias solr-ssl -keyalg RSA -keysize 2048 -keypass secret
-storepass secret -validity 9999 -keystore solr-ssl.keystore.jks -ext
SAN=DNS:localhost,IP:192.168.1.3,IP:127.0.0.1 -dname "CN=localhost, OU=Organizational
Unit, O=Organization, L=Location, ST=State, C=Country"
The above command will create a keystore file named solr-ssl.keystore.jks in the current directory.
Convert the certificate and key to PEM format for use with cURL
cURL isn't capable of using JKS formatted keystores, so the JKS keystore needs to be converted to PEM format,
which cURL understands.
First convert the JKS keystore into PKCS12 format using keytool:
keytool -importkeystore -srckeystore solr-ssl.keystore.jks -destkeystore
solr-ssl.keystore.p12 -srcstoretype jks -deststoretype pkcs12
Next convert the PKCS12 format keystore, including both the certificate and the key, into PEM format using the openssl command:
openssl pkcs12 -in solr-ssl.keystore.p12 -out solr-ssl.pem
Configure Jetty
The example directory in the Solr binary distribution contains a Jetty server configured to run Solr in non-SSL mode
out of the box. The configuration changes below will allow Jetty to communicate using SSL with the keystore
prepared above.
First, comment out the non-SSL SelectChannelConnector block in example/etc/jetty.xml using <!-- before and --> afterward:
<!--
<Call name="addConnector">
<Arg>
<New class="org.eclipse.jetty.server.nio.SelectChannelConnector">
<Set name="host"><SystemProperty name="jetty.host" /></Set>
<Set name="port"><SystemProperty name="jetty.port" default="8983"/></Set>
<Set name="maxIdleTime">50000</Set>
<Set name="Acceptors">2</Set>
<Set name="statsOn">false</Set>
<Set name="confidentialPort">8443</Set>
<Set name="lowResourcesConnections">5000</Set>
<Set name="lowResourcesMaxIdleTime">5000</Set>
</New>
</Arg>
</Call>
-->
Next, uncomment the SslSelectChannelConnector block by removing the <!-- before and --> afterward, and change the keyStore value to point to the JKS keystore created above - the result should look like this:
<Call name="addConnector">
<Arg>
<New class="org.eclipse.jetty.server.ssl.SslSelectChannelConnector">
<Arg>
<New class="org.eclipse.jetty.http.ssl.SslContextFactory">
<Set name="keyStore"><SystemProperty name="jetty.home"
default="."/>/etc/solr-ssl.keystore.jks</Set>
<Set name="keyStorePassword">secret</Set>
<Set name="needClientAuth"><SystemProperty name="jetty.ssl.clientAuth"
default="false"/></Set>
</New>
</Arg>
<Set name="port"><SystemProperty name="jetty.ssl.port" default="8984"/></Set>
<Set name="maxIdleTime">30000</Set>
</New>
</Arg>
</Call>
Run Single Node Solr using SSL
The command below, run from the example/ directory in the binary Solr distribution, will start Solr on port 8984. By default clients will not be required to authenticate:
java -jar start.jar
Alternatively, to require clients to authenticate, you can set the jetty.ssl.clientAuth system property to true (the default is false):
java -Djetty.ssl.clientAuth=true -jar start.jar
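To verify that the HTTPS connector is working, you can, for example, run a query over SSL from the example/etc/ directory where the PEM file created above lives; drop the -E option if client authentication is not enabled:
curl -E solr-ssl.pem:secret --cacert solr-ssl.pem "https://localhost:8984/solr/collection1/select?q=*:*"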
SolrCloud
This section describes how to run a two-node SolrCloud cluster with no initial collections and a single-node external ZooKeeper. The commands below assume you have already created the keystore described above.
Configure ZooKeeper
Before you start any SolrCloud nodes, you must configure your solr cluster properties in ZooKeeper, so that
Solr nodes know to communicate via SSL.
This section assumes you have created and started a single-node external ZooKeeper on port 2181 on localhost; see Setting Up an External ZooKeeper Ensemble.
ZooKeeper does not support encrypted communication with clients like Solr. There are several related JIRA tickets where SSL support is being planned/worked on: ZOOKEEPER-235, ZOOKEEPER-236, ZOOKEEPER-733, and ZOOKEEPER-1000.
The urlScheme cluster-wide property needs to be set to https before any Solr node starts up. The example below uses the zkcli.sh client that comes with the binary Solr distribution to do this, from the example/ directory:
scripts/cloud-scripts/zkcli.sh -zkhost localhost:2181 -cmd put /clusterprops.json
'{"urlScheme":"https"}'
Run SolrCloud with SSL
Copy the example/ directory
Create two copies of the example/ directory and remove the collection1/ directories, from the root directory of the binary Solr distribution:
cp -r example node1
rm -rf node1/solr/collection1
cp -r example node2
rm -rf node2/solr/collection1
Start the first Solr node
Next, start the first Solr node on port 8984 and bootstrap a configset we'll call "myconf" (taken from the example/solr/collection1/conf/ directory):
cd node1
java -DzkHost=localhost:2181 -Djetty.port=8984 -Djetty.ssl.port=8984
-Dbootstrap_confdir=../example/solr/collection1/conf -Dcollection.configName=myconf
-Djavax.net.ssl.keyStore=etc/solr-ssl.keystore.jks
-Djavax.net.ssl.keyStorePassword=secret
-Djavax.net.ssl.trustStore=etc/solr-ssl.keystore.jks
-Djavax.net.ssl.trustStorePassword=secret -jar start.jar
Alternatively, if you created your SSL key without all DNS names/IP addresses on which Solr nodes will run, you can tell Solr to skip hostname verification for inter-Solr-node communications by setting the solr.ssl.checkPeerName system property to false:
cd node1
java -Dsolr.ssl.checkPeerName=false -DzkHost=localhost:2181 -Djetty.port=8984
-Djetty.ssl.port=8984 -Dbootstrap_confdir=../example/solr/collection1/conf
-Dcollection.configName=myconf -Djavax.net.ssl.keyStore=etc/solr-ssl.keystore.jks
-Djavax.net.ssl.keyStorePassword=secret
-Djavax.net.ssl.trustStore=etc/solr-ssl.keystore.jks
-Djavax.net.ssl.trustStorePassword=secret -jar start.jar
Start the second Solr node
Finally, start the second Solr node on port 7574 - again, to skip hostname verification, add -Dsolr.ssl.checkPeerName=false (not shown here):
cd node2
java -DzkHost=localhost:2181 -Djetty.port=7574 -Djetty.ssl.port=7574
-Djavax.net.ssl.keyStore=etc/solr-ssl.keystore.jks
-Djavax.net.ssl.keyStorePassword="secret"
-Djavax.net.ssl.trustStore=etc/solr-ssl.keystore.jks
-Djavax.net.ssl.trustStorePassword="secret" -jar start.jar
Note that both the jetty.port and jetty.ssl.port system properties are required when starting SolrCloud using SSL.
Example Client Actions
Create a SolrCloud collection using cURL
Create a 2-shard, rf=1 collection named mycollection, from a directory containing the PEM formatted certificate and key created above (e.g. example/etc/). This command will perform client authentication using the same key as the Solr nodes; if you have not enabled client authentication (system property -Djetty.ssl.clientAuth=true), then you can remove the -E solr-ssl.pem:secret option:
curl -E solr-ssl.pem:secret --cacert solr-ssl.pem
"https://localhost:8984/solr/admin/collections?action=CREATE&name=mycollection&numShar
ds=2&replicationFactor=1&maxShardsPerNode=1&collection.configName=myconf"
This should return an XML-formatted response showing successful collection creation.
Retrieve SolrCloud cluster status using cURL
To get the resulting cluster status (again, if you have not enabled client authentication, remove the -E solr-ssl.pem:secret option):
curl -E solr-ssl.pem:secret --cacert solr-ssl.pem
"https://localhost:8984/solr/admin/collections?action=CLUSTERSTATUS&wt=json&indent=on"
cURL on OS X Mavericks has degraded SSL support. For more information and workarounds to allow 1-way SSL, see http://curl.haxx.se/mail/archive-2013-10/0036.html.
You should get a response that looks like this:
{
"responseHeader":{
"status":0,
"QTime":2041},
"cluster":{
"collections":{
"mycollection":{
"shards":{
"shard1":{
"range":"80000000-ffffffff",
"state":"active",
"replicas":{"core_node1":{
"state":"active",
"base_url":"https://127.0.0.1:8984/solr",
"core":"mycollection_shard1_replica1",
"node_name":"127.0.0.1:8984_solr",
"leader":"true"}}},
"shard2":{
"range":"0-7fffffff",
"state":"active",
"replicas":{"core_node2":{
"state":"active",
"base_url":"https://127.0.0.1:7574/solr",
"core":"mycollection_shard2_replica1",
"node_name":"127.0.0.1:7574_solr",
"leader":"true"}}}},
"maxShardsPerNode":"1",
"router":{"name":"compositeId"},
"replicationFactor":"1"}},
"properties":{"urlScheme":"https"}}}
Index documents using post.jar
Use post.jar to index some example documents to the SolrCloud collection created above:
cd example/exampledocs
java -Djavax.net.ssl.keyStorePassword=secret
-Djavax.net.ssl.keyStore=../etc/solr-ssl.keystore.jks
-Djavax.net.ssl.trustStore=../etc/solr-ssl.keystore.jks
-Djavax.net.ssl.trustStorePassword=secret
-Durl=https://localhost:8984/solr/mycollection/update -jar post.jar *.xml
Query using cURL
Use cURL to query the SolrCloud collection created above, from a directory containing the PEM formatted certificate and key created above (e.g. example/etc/). If you have not enabled client authentication (system property -Djetty.ssl.clientAuth=true), then you can remove the -E solr-ssl.pem:secret option:
curl -E solr-ssl.pem:secret --cacert solr-ssl.pem
"https://localhost:8984/solr/mycollection/select?q=*:*&wt=json&indent=on"
Index a document using CloudSolrServer
From a java client using SolrJ, index a document. In the code below, the javax.net.ssl.* system properties are set programmatically, but you could instead specify them on the java command line, as in the post.jar example above:
System.setProperty("javax.net.ssl.keyStore", "/path/to/solr-ssl.keystore.jks");
System.setProperty("javax.net.ssl.keyStorePassword", "secret");
System.setProperty("javax.net.ssl.trustStore", "/path/to/solr-ssl.keystore.jks");
System.setProperty("javax.net.ssl.trustStorePassword", "secret");
String zkHost = "127.0.0.1:2181";
CloudSolrServer server = new CloudSolrServer(zkHost);
server.setDefaultCollection("mycollection");
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "1234");
doc.addField("name", "A lovely summer holiday");
server.add(doc);
server.commit();
Backing Up
If you are worried about data loss, and of course you should be, you need a way to back up your Solr indexes so that you can recover quickly in case of catastrophic failure.
Making Backups with the Solr Replication Handler
The easiest way to make back-ups in Solr is to take advantage of the Replication Handler, which is described in detail in Index Replication. The Replication Handler's primary purpose is to replicate an index on slave servers for load-balancing, but the Replication Handler can be used to make a back-up copy of a server's index, even if no slave servers are in operation.
Once you have configured the Replication Handler in solrconfig.xml, you can trigger a back-up with an HTTP command like this:
http://master_host/solr/replication?command=backup
For details on configuring the Replication Handler, see Legacy Scaling and Distribution.
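For example, against the example server on port 8983, a backup of the default core's index could be triggered with:
curl "http://localhost:8983/solr/replication?command=backup"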
Using JMX with Solr
Java Management Extensions (JMX) is a technology that makes it possible for complex systems to be controlled by
tools without the systems and tools having any previous knowledge of each other. In essence, it is a standard
interface by which complex systems can be viewed and manipulated.
Solr, like any other good citizen of the Java universe, can be controlled via a JMX interface. You can enable JMX support by adding lines to solrconfig.xml. You can use a JMX client, like jconsole, to connect with Solr. Check out the http://wiki.apache.org/solr/SolrJmx Wiki page for more information. You may also find the following overview of JMX to be useful: http://docs.oracle.com/javase/7/docs/technotes/guides/management/agent.html.
Configuring JMX
JMX configuration is provided in solrconfig.xml. Please see the JMX Technology Home Page for more details.
A rootName attribute can be used when configuring <jmx /> in solrconfig.xml. If this attribute is set, Solr uses it as the root name for all the MBeans that Solr exposes via JMX. The default name is "solr" followed by the core name.
Configuring an Existing MBeanServer
The command:
<jmx />
enables JMX support in Solr if and only if an existing MBeanServer is found. Use this if you want to configure JMX
with JVM parameters. Remove this to disable exposing Solr configuration and statistics to JMX. If this is specified,
Solr will try to list all available MBeanServers and use the first one to register MBeans.
Configuring an Existing MBeanServer with agentId
The command:
<jmx agentId="myMBeanServer" />
enables JMX support in Solr if and only if an existing MBeanServer is found matching the given agentId. If multiple
servers are found, the first one is used. If none is found, an exception is raised and depending on the configuration,
Solr may refuse to start.
Configuring a New MBeanServer
The command:
<jmx serviceUrl="service:jmx:rmi:///jndi/rmi://localhost:9999/solrjmx" />
creates a new MBeanServer exposed for remote monitoring at the specific service URL. If the JMXConnectorServer
can't be started (probably because the serviceUrl is bad), an exception is thrown.
Example
Using the example Jetty setup provided with the Solr installation, JMX support works like this in jconsole:
1. Run "ant example" to build the example war file.
2. Go to the example folder in the Solr installation and run the following command:
java -Dcom.sun.management.jmxremote -jar start.jar
3. Start jconsole (provided with the Sun JDK in the bin directory).
4. Connect to the "start.jar" shown in the list of local processes.
5. Switch to the "MBeans" tab. You should be able to see "solr" listed there.
Configuring a Remote Connection to Solr JMX
If you want to connect to Solr remotely, you need to pass in some extra parameters, documented here:
http://docs.oracle.com/javase/7/docs/technotes/guides/management/agent.html
If you are not able to connect from a remote machine, you may also need to specify the hostname of the Solr host by adding the java.rmi.server.hostname system property as well.
Making JMX connections into machines running behind NATs (e.g. Amazon's EC2 service) is not a simple task. The java.rmi.server.hostname system property may help, but running jconsole on the server itself and using a remote desktop is often the simplest solution. See http://web.archive.org/web/20130525022506/http://jmsbrdy.com/monitoring-java-applications-running-on-ec2-i.
Enabling/disabling JMX and securing access to MBeanServers is left up to the user by specifying appropriate JVM parameters and configuration. Please explore the JMX Technology Home Page for more details.
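As an illustrative sketch only (the port and hostname here are placeholders, and disabling SSL and authentication for the JMX connector is appropriate only for testing), the remote JMX parameters discussed above could be supplied when starting the example server like this:
java -Dcom.sun.management.jmxremote \
     -Dcom.sun.management.jmxremote.port=18983 \
     -Dcom.sun.management.jmxremote.ssl=false \
     -Dcom.sun.management.jmxremote.authenticate=false \
     -Djava.rmi.server.hostname=solr1.example.com \
     -jar start.jar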
Managed Resources
Managed resources expose a REST API endpoint for performing Create-Read-Update-Delete (CRUD) operations
on a Solr object. Any long-lived Solr object that has configuration settings and/or data is a good candidate to be a
managed resource. Managed resources complement other programmatically manageable components in Solr, such
as the RESTful schema API to add fields to a managed schema. Consider a Web-based UI that offers
Solr-as-a-Service where users need to configure a set of stop words and synonym mappings as part of an initial
setup process for their search application. This type of use case can easily be supported using the Managed Stop
Filter & Managed Synonym Filter Factories provided by Solr, via the Managed resources REST API. Users can also
write their own custom plugins that leverage the same internal hooks to make additional resources REST managed.
Overview
Let's begin learning about managed resources by looking at a couple of examples provided by Solr for managing
stop words and synonyms using a REST API. After reading this section, you'll be ready to dig into the details of how
managed resources are implemented in Solr so you can start building your own implementation.
Stop words
To begin, you need to define a field type that uses the ManagedStopFilterFactory, such as:
<fieldType name="managed_en" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ManagedStopFilterFactory"
managed="english" />
</analyzer>
</fieldType>
There are two important things to notice about this field type definition. First, the filter implementation class is solr.ManagedStopFilterFactory. This is a special implementation of the StopFilterFactory that uses a set of stop words that are managed from a REST API. Second, the managed="english" attribute gives a name to the set of managed stop words, in this case indicating the stop words are for English text.
The REST endpoint for managing the English stop words in the example collection is: /solr/collection1/schema/analysis/stopwords/english
The example resource path should be mostly self-explanatory. It should be noted that the ManagedStopFilterFactory implementation determines the /schema/analysis/stopwords part of the path, which makes sense because this is an analysis component defined by the schema. It follows that a field type that uses the following filter:
<filter class="solr.ManagedStopFilterFactory"
managed="french" />
would resolve to path: /solr/collection1/schema/analysis/stopwords/french
So now let’s see this API in action, starting with a simple GET request:
curl "http://localhost:8983/solr/collection1/schema/analysis/stopwords/english"
Assuming you sent this request to the example server, the response body is a JSON document:
{
"responseHeader":{
"status":0,
"QTime":1
},
"wordSet":{
"initArgs":{"ignoreCase":true},
"initializedOn":"2014-03-28T20:53:53.058Z",
"managedList":[
"a",
"an",
"and",
"are",
... ]
}
}
The collection1 core in the example server ships with a built-in set of managed stop words; see example/solr/collection1/conf/_schema_analysis_stopwords_english.json. However, you should only interact with this file using the API and not edit it directly.
One thing that should stand out to you in this response is that it contains a managedList of words as well as initArgs. This is an important concept in this framework: managed resources typically have configuration and data. For stop words, the only configuration parameter is a boolean that determines whether to ignore the case of tokens during stop word filtering (ignoreCase=true|false). The data is a list of words, which is represented as a JSON array named managedList in the response.
Now, let’s add a new word to the English stop word list using an HTTP PUT:
curl -X PUT -H 'Content-type:application/json' --data-binary '["foo"]'
"http://localhost:8983/solr/collection1/schema/analysis/stopwords/english"
Here we’re using cURL to PUT a JSON list containing a single word “foo” to the managed English stop words set.
Solr will return 200 if the request was successful. You can also put multiple words in a single PUT request.
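For instance, to add several words in one request (the words here are placeholders):
curl -X PUT -H 'Content-type:application/json' --data-binary '["foo","bar","baz"]' "http://localhost:8983/solr/collection1/schema/analysis/stopwords/english"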
You can test to see if a specific word exists by sending a GET request for that word as a child resource of the set,
such as:
curl "http://localhost:8983/solr/collection1/schema/analysis/stopwords/english/foo"
This request will return a status code of 200 if the child resource (foo) exists, or 404 if it does not exist in the managed list.
To delete a stop word, you would do:
curl -X DELETE
"http://localhost:8983/solr/collection1/schema/analysis/stopwords/english/foo"
Note: PUT/POST is used to add terms to an existing list instead of replacing the list entirely. This is because it is more common to add a term to an existing list than it is to replace a list altogether, so the API favors the more common approach of incrementally adding terms, especially since deleting individual terms is also supported.
Synonyms
For the most part, the API for managing synonyms behaves similarly to the API for stop words, except instead of working with a list of words, it uses a map, where the value for each entry in the map is a set of synonyms for a term. As with stop words, the example server ships with a minimal set of English synonym mappings that is activated by the following field type definition in schema.xml:
<fieldType name="managed_en" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ManagedStopFilterFactory"
managed="english" />
<filter class="solr.ManagedSynonymFilterFactory"
managed="english" />
</analyzer>
</fieldType>
To get the map of managed English synonyms, send a GET request to:
curl "http://localhost:8983/solr/collection1/schema/analysis/synonyms/english"
This request will return a response that looks like:
{
"responseHeader":{
"status":0,
"QTime":4},
"synonymMappings":{
"initArgs":{
"ignoreCase":true,
"format":"solr"},
"initializedOn":"2014-03-31T15:46:48.77Z",
"managedMap":{
"gb":["gib","gigabyte"],
"happy":["glad","joyful"],
"tv":["television"]
}
}
}
Managed synonyms are returned under the managedMap property, which contains a JSON Map where the value of each entry is a set of synonyms for a term; for example, happy has the synonyms glad and joyful in the example above.
To add a new synonym mapping, you can PUT/POST a single mapping such as:
curl -X PUT -H 'Content-type:application/json' --data-binary
'{"mad":["angry","upset"]}'
"http://localhost:8983/solr/collection1/schema/analysis/synonyms/english"
The API will return status code 200 if the PUT request was successful. To determine the synonyms for a specific term, you send a GET request for the child resource, such as /schema/analysis/synonyms/english/mad, which would return ["angry","upset"]. Lastly, you can delete a mapping by sending a DELETE request to the managed endpoint.
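For example, the mapping added above could be removed by sending a DELETE request for the mad child resource:
curl -X DELETE "http://localhost:8983/solr/collection1/schema/analysis/synonyms/english/mad"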
Applying Changes
Changes made to managed resources via this REST API are not applied to the active Solr components until the Solr collection (or Solr core in single server mode) is reloaded. For example, after adding or deleting a stop word, you must reload the core/collection before changes become active.
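As a sketch, assuming a collection named mycollection (or, in single-node mode, the example collection1 core), a reload could be triggered with the Collections API or the Core Admin API:
curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=mycollection"
curl "http://localhost:8983/solr/admin/cores?action=RELOAD&core=collection1"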
This approach is required when running in distributed mode so that we are assured changes are applied to all cores
in a collection at the same time so that behavior is consistent and predictable. It goes without saying that you don’t
want one of your replicas working with a different set of stop words or synonyms than the others.
One subtle outcome of this apply-changes-at-reload approach is that once you make changes with the API, there is no way to read the active data. In other words, the API returns the most up-to-date data from an API perspective, which could be different from what is currently being used by Solr components. However, the intent of this API implementation is that changes will be applied using a reload within a short time frame after making them, so the time in which the data returned by the API differs from what is active in the server is intended to be negligible.
RestManager Endpoint
Metadata about registered ManagedResources is available using the /schema/managed and /config/managed endpoints. Assuming you have the managed_en field type shown above defined in your schema.xml, sending a GET request to the following resource will return metadata about which schema-related resources are being managed by the RestManager:
curl "http://localhost:8983/solr/collection1/schema/managed"
Changing things like stop words and synonym mappings typically requires re-indexing existing documents if they are used by index-time analyzers. The RestManager framework does not guard you from this; it simply makes it possible to programmatically build up a set of stop words, synonyms, etc.
The response body is a JSON document containing metadata about managed resources under the /schema root:
{
"responseHeader":{
"status":0,
"QTime":3
},
"managedResources":[
{
"resourceId":"/schema/analysis/stopwords/english",
"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource",
"numObservers":"1"
},
{
"resourceId":"/schema/analysis/synonyms/english",
"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymFilterFactory$SynonymManag
er",
"numObservers":"1"
}
]
}
You can also create a new managed resource using PUT/POST to the appropriate URL, before ever configuring anything that uses these resources.
For example, imagine we want to build up a set of German stop words. Before we can start adding stop words, we need to create the endpoint:
/solr/collection1/schema/analysis/stopwords/german
To create this endpoint, send the following PUT/POST request to the endpoint we wish to create:
curl -X PUT -H 'Content-type:application/json' --data-binary \
'{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}' \
"http://localhost:8983/solr/collection1/schema/analysis/stopwords/german"
Solr will respond with status code 200 if the request is successful. Effectively, this action registers a new endpoint
for a managed resource in the RestManager. From here you can start adding German stop words as we saw above:
curl -X PUT -H 'Content-type:application/json' --data-binary '["die"]' \
"http://localhost:8983/solr/collection1/schema/analysis/stopwords/german"
For most users, creating resources in this way should never be necessary, since managed resources are created
automatically when configured.
However, you may want to explicitly delete managed resources if they are no longer being used by a Solr component.
For instance, the managed resource for German that we created above can be deleted because there are no Solr
components that are using it, whereas the managed resource for English stop words cannot be deleted because
there is a token filter declared in schema.xml that is using it.
curl -X DELETE
"http://localhost:8983/solr/collection1/schema/analysis/stopwords/german"
Related Topics
Using Solr’s REST APIs to manage stop words and synonyms by Tim Potter @ SearchHub.org
Running Solr on HDFS
Solr has support for writing and reading its index and transaction log files to the HDFS distributed filesystem. This
does not use Hadoop Map-Reduce to process Solr data, rather it only uses the HDFS filesystem for index and
transaction log file storage.
Basic Configuration
To use HDFS rather than a local filesystem, you must be using Hadoop 2.0.x and configure solrconfig.xml properly.
You need to use an HdfsDirectoryFactory and a data dir of the form hdfs://host:port/path
You need to specify an UpdateLog location of the form hdfs://host:port/path
You should specify a lock factory type of 'hdfs' or none.
With the default configuration files, you can start Solr on HDFS with the following command:
java -Dsolr.directoryFactory=HdfsDirectoryFactory
-Dsolr.lock.type=hdfs
-Dsolr.data.dir=hdfs://host:port/path
-Dsolr.updatelog=hdfs://host:port/path -jar start.jar
SolrCloud Configuration
In SolrCloud mode, it's best to leave the data and update log directories as the defaults Solr comes with and simply
specify the solr.hdfs.home. All dynamically created collections will create the appropriate directories automatically
under the solr.hdfs.home root directory.
Set solr.hdfs.home in the form hdfs://host:port/path
You should specify a lock factory type of 'hdfs' or none.
With the default configuration files, you can start SolrCloud on HDFS with the following command:
java -Dsolr.directoryFactory=HdfsDirectoryFactory
-Dsolr.lock.type=hdfs
-Dsolr.hdfs.home=hdfs://host:port/path
The Block Cache
For performance, the HdfsDirectoryFactory uses a Directory that will cache HDFS blocks. This caching mechanism
is meant to replace the standard file system cache that Solr utilizes so much. By default, this cache is allocated off
heap. This cache will often need to be quite large and you may need to raise the off heap memory limit for the
specific JVM you are running Solr in. For the Oracle/OpenJDK JVMs, the following is an example command-line parameter that you can use to raise the limit when starting Solr:
-XX:MaxDirectMemorySize=20g
Settings
The HdfsDirectoryFactory has a number of settings.
Solr HDFS Settings
solr.hdfs.home (example value: hdfs://host:port/path/solr; default: N/A): A root location in HDFS for Solr to write collection data to. Rather than specifying an HDFS location for the data directory or update log directory, use this to specify one root location and have everything automatically created within this HDFS location.
Block Cache Settings
solr.hdfs.blockcache.enabled (default: true): Enable the blockcache.
solr.hdfs.blockcache.read.enabled (default: true): Enable the read cache.
solr.hdfs.blockcache.write.enabled (default: true): Enable the write cache.
solr.hdfs.blockcache.direct.memory.allocation (default: true): Enable direct memory allocation. If this is false, heap is used.
solr.hdfs.blockcache.slab.count (default: 1): Number of memory slabs to allocate. Each slab is 128 MB in size.
solr.hdfs.blockcache.global (default: false): Enable/disable using one global cache for all SolrCores. The settings used will be from the first HdfsDirectoryFactory created.
NRTCachingDirectory Settings
solr.hdfs.nrtcachingdirectory.enable (default: true): Enable the use of NRTCachingDirectory.
solr.hdfs.nrtcachingdirectory.maxmergesizemb (default: 16): NRTCachingDirectory maximum segment size for merges.
solr.hdfs.nrtcachingdirectory.maxcachedmb (default: 192): NRTCachingDirectory maximum cache size.
HDFS Client Configuration Settings
solr.hdfs.confdir (default: N/A): Pass the location of HDFS client configuration files; needed for HDFS HA, for example.
Example
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
<str name="solr.hdfs.home">hdfs://host:port/solr</str>
<bool name="solr.hdfs.blockcache.enabled">true</bool>
<int name="solr.hdfs.blockcache.slab.count">1</int>
<bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
<int name="solr.hdfs.blockcache.blocksperbank">16384</int>
<bool name="solr.hdfs.blockcache.read.enabled">true</bool>
<bool name="solr.hdfs.blockcache.write.enabled">true</bool>
<bool name="solr.hdfs.nrtcachingdirectory.enable">true</bool>
<int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">16</int>
<int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">192</int>
</directoryFactory>
Limitations
You must use an 'append-only' Lucene index codec because HDFS is an append-only filesystem. The default codec currently used by Solr is 'append-only' and is supported with HDFS.
AutoAddReplica Settings
Collections created using autoAddReplica=true on a shared file system have auto addition of replica enabled. The
following settings can be used to override the defaults in the solrcloud section of solr.xml.
autoReplicaFailoverWorkLoopDelay (default: 10000): The time (in ms) between clusterstate inspections by the Overseer to detect and possibly act on creation of a replacement replica.
autoReplicaFailoverWaitAfterExpiration (default: 30000): The minimum time (in ms) to wait before initiating replacement of a replica after first noticing it is not live. This is important to prevent false positives while stopping or starting the cluster.
autoReplicaFailoverBadNodeExpiration (default: 60000): The delay (in ms) after which a replica marked as down would be unmarked.
SolrCloud
Apache Solr includes the ability to set up a cluster of Solr servers that combines fault tolerance and high availability. Called SolrCloud, these capabilities provide distributed indexing and search capabilities, supporting the following features:
Central configuration for the entire cluster
Automatic load balancing and fail-over for queries
ZooKeeper integration for cluster coordination and configuration.
SolrCloud is flexible distributed search and indexing, without a master node to allocate nodes, shards and replicas.
Instead, Solr uses ZooKeeper to manage these locations, depending on configuration files and schemas.
Documents can be sent to any server and ZooKeeper will figure it out.
In this section, we'll cover everything you need to know about using Solr in SolrCloud mode. We've split up the
details into the following topics:
Getting Started with SolrCloud
How SolrCloud Works
Shards and Indexing Data in SolrCloud
Distributed Requests
Read and Write Side Fault Tolerance
NRT, Replication, and Disaster Recovery with SolrCloud
SolrCloud Configuration and Parameters
Using ZooKeeper to Manage Configuration Files
Collections API
Parameter Reference
Command Line Utilities
SolrCloud with Legacy Configuration Files
You can also find more information on the Solr wiki page on SolrCloud.
Getting Started with SolrCloud
SolrCloud is designed to provide a highly available, fault tolerant environment that
can index your data for searching. It's a system in which data is organized into
multiple pieces, or shards, that can be housed on multiple machines, with replicas
providing redundancy for both scalability and fault tolerance, and a ZooKeeper
server that helps manage the overall structure so that both indexing and search
requests can be routed properly.
This section explains SolrCloud and its inner workings in detail, but before you dive
in, it's best to have an idea of what it is you're trying to accomplish. This page
provides a simple tutorial that explains how SolrCloud works on a practical level,
If upgrading an existing Solr 4.1 instance running with SolrCloud, be aware that the way the name_node parameter is defined has changed. This may cause a situation where the name_node uses the IP address of the machine instead of the server name, and thus SolrCloud is not aware of the existing node. If this happens, you can manually edit the host parameter in solr.xml to refer to the server name, or set the host in your system environment variables (since by default solr.xml is configured to inherit the host name from the environment variables). See also the section Solr Cores and solr.xml for more information about the host parameter.
and how to take advantage of its capabilities. We'll use simple examples of configuring SolrCloud on a single machine, which is obviously not a real production environment; a real production environment would include several servers or virtual machines. In a real production environment, you'll also use the real machine names instead of "localhost", which we've used here.
In this section you will learn:
How to distribute data over multiple instances by using ZooKeeper and
creating shards.
How to create redundancy for shards by using replicas.
How to create redundancy for the overall cluster by running multiple
ZooKeeper instances.
Tutorials in this section:
Interactive SolrCloud Example
Simple Two-Shard Cluster on the Same Machine
Two-Shard Cluster with Replicas
Using Multiple ZooKeepers in an Ensemble
Interactive SolrCloud Example
The bin/solr script makes it easy to get started with SolrCloud as it walks you through the process of launching Solr nodes in cloud mode and adding a collection. To get started, simply do:
$ bin/solr -e cloud
This starts an interactive session to walk you through the steps of setting up a simple SolrCloud cluster with
embedded ZooKeeper. The script starts by asking you how many Solr nodes you want to run in your local cluster,
with the default being 2.
Welcome to the SolrCloud example!
This interactive session will help you launch a SolrCloud cluster on your local
workstation.
To begin, how many Solr nodes would you like to run in your local cluster? (specify
1-4 nodes) [2]
The script supports starting up to 4 nodes, but we recommend using the default of 2 when starting out. Next, the
script will prompt you for the port to bind each of the Solr nodes to, such as:
Please enter the port for node1 [8983]
This tutorial assumes that you're already familiar with the basics of using Solr. If you need a refresher, please visit the Getting Started section to get a grounding in Solr concepts. If you load documents as part of that exercise, you should start over with a fresh Solr installation for these SolrCloud tutorials.
Choose any available port for each node; the default for the first node is 8983 and 7574 for the second node. The
script will start each node in order and shows you the command it uses to start the server, such as:
solr start -cloud -d node1 -p 8983
The first node will also start an embedded ZooKeeper server bound to port 9983.
After starting up all nodes in the cluster, the script prompts you for the name of the collection to create:
Please provide a name for your new collection: [gettingstarted]
The suggested default is "gettingstarted" but you should choose a better name for your specific search application.
Next, the script prompts you for the number of shards to distribute the collection across. Sharding is covered in more detail later in this reference guide, so if you're unsure, we suggest using the default of 2 so that you can see how a collection is distributed across multiple nodes in a SolrCloud cluster.
Next, the script will prompt you for the number of replicas to create for each shard. Replication is covered in more detail later in the guide, so if you're unsure, then use the default of 2 so that you can see how replication is handled in SolrCloud. Lastly, the script will prompt you for the name of a configuration directory for your collection. You can choose default or schemaless. The default configuration directory is pulled from example/solr/collection1/conf and the schemaless configuration is pulled from the example/example-schemaless/solr/collection1/conf directory. The default configuration is more comprehensive and includes examples of most of Solr's core capabilities, whereas the schemaless configuration uses the field-guessing and managed schema features in Solr. The schemaless configuration is useful when you're still designing a schema for your documents and need some flexibility as you experiment with Solr.
At this point, you should have a new collection created in your local SolrCloud cluster. To verify this, you can run the
info command:
$ bin/solr -i
You can see how your collection is deployed across the cluster by visiting the cloud panel in the Solr Admin UI: http://localhost:8983/solr/#/~cloud
You can restart your SolrCloud nodes using the script. For instance, to restart node1 running on portbin/solr
8983 (with an embedded ZooKeeper server), you would do:
$ bin/solr restart -c -p 8983 -d node1
To restart node2 running on port 7574, you can do:
$ bin/solr restart -c -p 7574 -d node2 -z localhost:9983
Notice that you need to specify the ZooKeeper address (-z localhost:9983) when starting node2 so that it can join
the cluster with node1.
The next section walks you through the manual process of setting up a SolrCloud cluster instead of using the bin/solr script (which performed all the manual steps for you).
Simple Two-Shard Cluster on the Same Machine
Creating a cluster with multiple shards involves two steps:
1. Start the first node, which will include an embedded ZooKeeper server to keep track of your cluster.
2. Start any remaining shard nodes and point them to the running ZooKeeper.
In this example, you'll create two separate Solr instances on the same machine. This is not a production-ready
installation, but just a quick exercise to get you familiar with SolrCloud.
For this exercise, we'll start by creating two copies of the example directory that is part of the Solr distribution:
cd <SOLR_DIST_HOME>
cp -r example node1
cp -r example node2
These copies of the example directory can really be called anything. All we're trying to do is copy Solr's example app to the side so we can play with it and still have a stand-alone Solr example to work with later if we want.
Next, start the first Solr instance, including the -DzkRun parameter, which also starts a local ZooKeeper instance:
cd node1
java -DzkRun -DnumShards=2 -Dbootstrap_confdir=./solr/collection1/conf
-Dcollection.configName=myconf -jar start.jar
Let's look at each of these parameters:
-DzkRun Starts up a ZooKeeper server embedded within Solr. This server will manage the cluster configuration. Note that we're doing this example all on one machine; when you start working with a production system, you'll likely use multiple ZooKeepers in an ensemble (or at least a stand-alone ZooKeeper instance). In that case, you'll replace this parameter with zkHost=<ZooKeeper Host:Port>, which is the hostname:port of the stand-alone ZooKeeper.
-DnumShards Determines how many pieces you're going to break your index into. In this case we're going to break the index into two pieces, or shards, so we're setting this value to 2. The default value, if not specified, is 1.
-Dbootstrap_confdir ZooKeeper needs to get a copy of the cluster configuration, so this parameter tells it where to find that information.
-Dcollection.configName This parameter determines the name under which that configuration information is stored by ZooKeeper. We've used "myconf" as an example; it can be anything you'd like.
Make sure to run Solr from the example directory in non-SolrCloud mode at least once before beginning; this process unpacks the jar files necessary to run SolrCloud. However, do not load documents yet; just start it once and shut it down.
The -DnumShards, -Dbootstrap_confdir, and -Dcollection.configName parameters need only be specified once, the first time you start Solr in SolrCloud mode. They load your configurations into ZooKeeper; if you run them again at a later time, they will re-load your configurations and may wipe out changes you have made.
At this point you have one server running, but it represents only half the shards, so you will need to start the second one before you have a fully functional cluster. To do that, start the second instance in another window as follows:
cd node2
java -Djetty.port=7574 -DzkHost=localhost:9983 -jar start.jar
Because this node isn't running ZooKeeper, and didn't involve bootstrapping collection1, the parameters are a bit less complex:
-Djetty.port The only reason we even have to set this parameter is because we're running both servers on the
same machine, so they can't both use Jetty's default port. In this case we're choosing an arbitrary number that's
different from the default. When you start on different machines, you can use the same Jetty ports if you'd like.
-DzkHost This parameter tells Solr where to find the ZooKeeper server so that it can "report for duty". By default,
the ZooKeeper server operates on the Solr port plus 1000. (Note that if you were running an external ZooKeeper
server, you'd simply point to that.)
At this point you should have two Solr windows running, both being managed by ZooKeeper. To verify that, open the Solr Admin UI in your browser and go to the Cloud screen of the first Solr server you started: http://localhost:8983/solr/#/~cloud
You should see both node1 and node2, as in:
Now it's time to see the cluster in action. Start by indexing some data to one or both shards. You can do this any way you like, but the easiest way is to use the exampledocs, along with curl so that you can control which port (and thereby which server) gets the updates:
curl http://localhost:8983/solr/update?commit=true -H "Content-Type: text/xml" -d
"@mem.xml"
curl http://localhost:7574/solr/update?commit=true -H "Content-Type: text/xml" -d
"@monitor2.xml"
At this point each shard contains a subset of the data, but a search directed at either server should span both
shards. For example, the following searches should both return the identical set of all results:
http://localhost:8983/solr/collection1/select?q=*:*
http://localhost:7574/solr/collection1/select?q=*:*
The reason that this works is that each shard knows about the other shards, so the search is carried out on all
cores, then the results are combined and returned by the called server.
In this way you can have two cores or two hundred, with each containing a separate portion of the data.
If you want to check the number of documents on each shard, you could add distrib=false to each query and your search would not span all shards.
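For example, a per-shard count from the node on port 8983 might be obtained like this (rows=0 keeps the response small; look at numFound in the response):
curl "http://localhost:8983/solr/collection1/select?q=*:*&distrib=false&rows=0"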
But what about providing high availability, even if one of these servers goes down? To do that, you'll need to look at
replicas.
Two-Shard Cluster with Replicas
In order to provide high availability, you can create replicas, or copies of each shard that run in parallel with the main
core for that shard. The architecture consists of the original shards, which are called the leaders, and their replicas,
which contain the same data but let the leader handle all of the administrative tasks such as making sure data goes
to all of the places it should go. This way, if one copy of the shard goes down, the data is still available and the
cluster can continue to function.
Start by creating two more fresh copies of the example directory:
cd <SOLR_DIST_HOME>
cp -r example node3
cp -r example node4
Just as when we created the first two shards, you can name these copied directories whatever you want.
If you don't already have the two instances you created in the previous section up and running, go ahead and restart
them. From there, it's simply a matter of adding additional instances. Start by adding node3:
cd node3
java -Djetty.port=8900 -DzkHost=localhost:9983 -jar start.jar
Notice that the parameters are exactly the same as they were for starting the second node; you're simply pointing a
new instance at the original ZooKeeper. But if you look at the SolrCloud admin page, you'll see that it was added not
as a third shard, but as a replica for the first:
This is because the cluster already knew that there were only two shards and they were already accounted for, so
new nodes are added as replicas. Similarly, when you add the fourth instance, it's added as a replica for the second
shard:
cd node4
java -Djetty.port=7500 -DzkHost=localhost:9983 -jar start.jar
If you were to add additional instances, the cluster would continue this round-robin, adding replicas as necessary.
Replicas are attached to leaders in the order in which they are started, unless they are assigned to a specific shard
with an additional shardId parameter (as a system property, as in -DshardId=1, the value of which is the ID
number of the shard the new node should be attached to). Upon restarts, the node will still be attached to the same
leader even if the shardId is not defined again (it will always be attached to that machine).
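For instance, a hedged sketch of starting one more copy of the example directory explicitly attached to the first shard (the node5 directory name and port 8600 are illustrative, not part of the walkthrough above):
cd node5
java -Djetty.port=8600 -DzkHost=localhost:9983 -DshardId=1 -jar start.jar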
So where are we now? You now have four servers to handle your data. If you were to send data to a replica, as in:
curl http://localhost:7500/solr/update?commit=true -H "Content-Type: text/xml" -d "@money.xml"
the course of events goes like this:
1. Replica (in this case the server on port 7500) gets the request.
2. Replica forwards request to its leader (in this case the server on port 7574).
3. The leader processes the request, and makes sure that all of its replicas process the request as well.
In this way, the data is available via a request to any of the running instances, as you can see by requests to:
http://localhost:8983/solr/collection1/select?q=*:*
http://localhost:7574/solr/collection1/select?q=*:*
http://localhost:8900/solr/collection1/select?q=*:*
http://localhost:7500/solr/collection1/select?q=*:*
But how does this help provide high availability? Simply put, a cluster must have at least one server running for each
shard in order to function. To test this, shut down the server on port 7574, and then check the other servers:
http://localhost:8983/solr/collection1/select?q=*:*
http://localhost:8900/solr/collection1/select?q=*:*
http://localhost:7500/solr/collection1/select?q=*:*
You should continue to see the full set of data, even though one of the servers is missing. In fact, you can have
multiple servers down, and as long as at least one instance for each shard is running, the cluster will continue to
function. If the leader goes down – as in this example – a new leader will be "elected" from among the remaining
replicas.
Note that when we talk about servers going down, in this example it's crucial that one particular server stays up, and
that's the one running on port 8983. That's because it's the instance running ZooKeeper. If that goes down, the
cluster can continue to function under some circumstances, but it won't be able to adapt to any servers that come up
or go down.
That kind of single point of failure is obviously unacceptable. Fortunately, there is a solution for this problem: multiple
ZooKeepers.
Using Multiple ZooKeepers in an Ensemble
To truly provide high availability, we need to make sure that not only do we have at least one shard server
running at all times, but also that the cluster has a ZooKeeper running to manage it. To do that, you can set up
a cluster to use multiple ZooKeepers. This is called using a ZooKeeper ensemble.
A ZooKeeper ensemble can keep running as long as more than half of its servers are up and running, so at least
two servers in a three-ZooKeeper ensemble, three servers in a five-ZooKeeper ensemble, and so on, must be running at any
given time. These required servers are called a quorum.
In this example, you're going to set up the same two-shard cluster you were using before, but instead of a single
ZooKeeper, you'll run a ZooKeeper server on three of the instances. Start by cleaning up any ZooKeeper data from
the previous example:
cd <SOLR_DIST_DIR>
rm -r node*/solr/zoo_data
Next you're going to restart the Solr servers, but this time, rather than having them all point to a single ZooKeeper
instance, each will run ZooKeeper and listen to the rest of the ensemble for instructions.
You're using the same ports as before – 8983, 7574, 8900 and 7500 – so any ZooKeeper instances would run on
ports 9983, 8574, 9900 and 8500. You don't actually need to run ZooKeeper on every single instance, however, so
assuming you run ZooKeeper on 9983, 8574, and 9900, the ensemble would have an address of:
localhost:9983,localhost:8574,localhost:9900
This means that when you start the first instance, you'll do it like this:
cd node1
java -DzkRun -DnumShards=2 -Dbootstrap_confdir=./solr/collection1/conf \
-Dcollection.configName=myconf \
-DzkHost=localhost:9983,localhost:8574,localhost:9900 \
-jar start.jar
You'll notice a lot of error messages scrolling past; this is because the ensemble doesn't yet have a quorum of
ZooKeepers running.
To simplify setup for this example we're using the internal ZooKeeper server that comes with Solr, but in a
production environment, you will likely be using an external ZooKeeper. The concepts are the same,
however. You can find instructions on setting up an external ZooKeeper server here: http://zookeeper.apache.org/doc/r3.3.4/zookeeperStarted.html
Note that the order of the parameters matters. Make sure to specify the -DzkHost parameter after the other
ZooKeeper-related parameters.
Notice also that this step takes care of uploading the cluster's configuration information to ZooKeeper, so starting
the next server is more straightforward:
cd node2
java -Djetty.port=7574 -DzkRun -DnumShards=2 \
-DzkHost=localhost:9983,localhost:8574,localhost:9900 -jar start.jar
Once you start this instance, you should see the errors begin to disappear on both instances, as the ZooKeepers
begin to update each other, even though you only have two of the three ZooKeepers in the ensemble running.
Next start the last ZooKeeper:
cd node3
java -Djetty.port=8900 -DzkRun -DnumShards=2 \
-DzkHost=localhost:9983,localhost:8574,localhost:9900 -jar start.jar
Finally, start the last replica, which doesn't itself run ZooKeeper, but references the ensemble:
cd node4
java -Djetty.port=7500 -DzkHost=localhost:9983,localhost:8574,localhost:9900 \
-jar start.jar
Just to make sure everything's working properly, run a query:
http://localhost:8983/solr/collection1/select?q=*:*
and check the SolrCloud admin page.
Now you can go ahead and kill the server on 8983, but ZooKeeper will still work, because you have more than half
of the original servers still running. To verify, open the SolrCloud admin page on another server, such as:
http://localhost:8900/solr/#/~cloud
How SolrCloud Works
In this section, we'll discuss generally how SolrCloud works, covering these topics:
Nodes, Cores, Clusters and Leaders
Shards and Indexing Data in SolrCloud
Distributed Requests
Read and Write Side Fault Tolerance
NRT, Replication, and Disaster Recovery with SolrCloud
If you are already familiar with SolrCloud concepts and functionality, you can skip to the section covering SolrCloud
Configuration and Parameters.
Basic SolrCloud Concepts
On a single node, Solr has a core that is essentially a single index. If you want multiple indexes, you create multiple
cores. With SolrCloud, a single index can span multiple Solr instances. This means that a single index can be made
up of multiple cores on different machines.
The cores that make up one logical index are called a collection. A collection is essentially a single index that can
span many cores, both for index scaling as well as redundancy. If, for instance, you wanted to move your two-core
Solr setup to SolrCloud, you would have 2 collections, each made up of multiple individual cores.
In SolrCloud you can have multiple collections. Collections can be divided into slices. Each slice can exist in multiple
copies; these copies of the same slice are called shards. One of the shards within a slice is the leader, designated
by a leader-election process. Each shard is a physical index, so one shard corresponds to one core.
It is important to understand the distinction between a core and a collection. In classic single node Solr, a core is
basically equivalent to a collection in that it presents one logical index. In SolrCloud, the cores on multiple nodes
form a collection. This is still just one logical index, but multiple cores host different shards of the full collection. So a
core encapsulates a single physical index on an instance. A collection is a combination of all of the cores that
together provide a logical index that is distributed across many nodes.
Differences Between Solr 3.x-style Scaling and SolrCloud
In Solr 3.x, Solr included the following features:
The index and all changes to it are replicated to another Solr instance.
In distributed searches, queries are sent to multiple Solr instances and the results are combined into a single
output.
Documents are available only after committing, which may be expensive and not very timely.
Sharding must be done manually, usually through SolrJ or a similar utility, and there is no distributed
indexing: your index code must understand your sharding schema.
Replication must be manually configured and can slow down access to recent content because the system
needs to wait for a commit and the replication to be triggered and to complete.
Failure recovery may result in the loss of your ability to index, and make recovering your indexing process
difficult.
With SolrCloud, some capabilities are distributed:
SolrCloud automatically distributes index updates to the appropriate shard, distributes searches across
multiple shards, and assigns replicas to shards when replicas are available.
Near Real Time searching is supported, and if configured, documents are available after a "soft" commit.
Indexing accesses your sharding schema automatically.
Replication is automatic for backup purposes.
Recovery is robust and automatic.
ZooKeeper serves as a repository for cluster state.
Nodes, Cores, Clusters and Leaders
Nodes and Cores
In SolrCloud, a node is a Java Virtual Machine instance running Solr, commonly called a server. Each Solr core can
also be considered a node. Any node can contain both an instance of Solr and various kinds of data.
A Solr core is basically an index of the text and fields found in documents. A single Solr instance can contain
multiple "cores", which are separate from each other based on local criteria. It might be that they are going to
provide different search interfaces to users (customers in the US and customers in Canada, for example), or they
have security concerns (some users cannot have access to some documents), or the documents are really different
and just won't mix well in the same index (a shoe database and a dvd database).
When you start a new core in SolrCloud mode, it registers itself with ZooKeeper. This involves creating an
Ephemeral node that will go away if the Solr instance goes down, as well as registering information about the core
and how to contact it (such as the base Solr URL, core name, etc). Smart clients and nodes in the cluster can use
this information to determine who they need to talk to in order to fulfill a request.
New Solr cores may also be created and associated with a collection via CoreAdmin. Additional cloud-related
parameters are discussed in the Parameter Reference page. Terms used for the CREATE action are:
collection: the name of the collection to which this core belongs. Default is the name of the core.
shard: the shard id this core represents. (Optional: normally you want to be auto assigned a shard id.)
collection.<param>=<value>: causes a property of <param>=<value> to be set if a new collection is being
created. For example, use collection.configName=<configname> to point to the config for a new
collection.
For example:
curl 'http://localhost:8983/solr/admin/cores?
action=CREATE&name=mycore&collection=collection1&shard=shard2'
Clusters
A cluster is a set of Solr nodes managed by ZooKeeper as a single unit. When you have a cluster, you can always
make requests to the cluster and if the request is acknowledged, you can be sure that it will be managed as a unit
and be durable, i.e., you won't lose data. Updates can be seen right after they are made and the cluster can be
expanded or contracted.
Creating a Cluster
A cluster is created as soon as you have more than one Solr instance registered with ZooKeeper. The section Getting
Started with SolrCloud reviews how to set up a simple cluster.
Resizing a Cluster
Clusters contain a settable number of shards. You set the number of shards for a new cluster by passing a system
property, numShards, when you start up Solr. The numShards parameter must be passed on the first startup of
any Solr node, and is used to auto-assign which shard each instance should be part of. Once you have started up
more Solr nodes than numShards, the nodes will create replicas for each shard, distributing them evenly across the
nodes, as long as they all belong to the same collection.
To add more cores to your collection, simply start the new core. You can do this at any time and the new core will
sync its data with the current replicas in the shard before becoming active.
You can also avoid numShards and manually assign a core a shard ID if you choose.
The number of shards determines how the data in your index is broken up, so you cannot change the number of
shards of the index after initially setting up the cluster.
However, you do have the option of breaking your index into multiple shards to start with, even if you are only using
a single machine. You can then expand to multiple machines later. To do that, follow these steps:
1. Set up your collection by hosting multiple cores on a single physical machine (or group of machines). Each of these cores will be the leader for its shard.
2. When you're ready, you can migrate shards onto new machines by starting up a new replica for a given shard on each new machine.
3. Remove the shard from the original machine (see the sketch after this list). ZooKeeper will promote the replica to the leader for that shard.
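As a sketch of step 3, the original core could be unloaded through the CoreAdmin API once its new replica has caught up; the core name here is illustrative:
curl 'http://localhost:8983/solr/admin/cores?action=UNLOAD&core=collection1'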
Leaders and Replicas
The concept of a leader is similar to that of a master when thinking of traditional Solr replication. The leader is
responsible for making sure the replicas are up to date with the same information stored in the leader.
However, with SolrCloud, you don't simply have one master and one or more "slaves"; instead you likely have
distributed your search and index traffic to multiple machines. If you have bootstrapped Solr with numShards=2, for
example, your indexes are split across both shards. In this case, both shards are considered leaders. If you start
more Solr nodes after the initial two, these will be automatically assigned as replicas for the leaders.
Replicas are assigned to shards in the order they are started the first time they join the cluster. This is done in a
round-robin manner, unless the new node is manually assigned to a shard with the shardId parameter during
startup. This parameter is used as a system property, as in -DshardId=1, the value of which is the ID number of
the shard the new node should be attached to.
On subsequent restarts, each node joins the same shard that it was assigned to the first time the node was started
(whether that assignment happened manually or automatically). A node that was previously a replica, however, may
become the leader if the previously assigned leader is not available.
Consider this example:
Node A is started with the bootstrap parameters, pointing to a stand-alone ZooKeeper, with the numShards
parameter set to 2.
Node B is started and pointed to the stand-alone ZooKeeper.
Nodes A and B are both shards, and have fulfilled the 2 shard slots we defined when we started Node A. If we look
in the Solr Admin UI, we'll see that both nodes are considered leaders (indicated with a solid black circle).
Node C is started and pointed to the stand-alone ZooKeeper.
Node C will automatically become a replica of Node A because we didn't specify any other shard for it to belong to,
and it cannot become a new shard because we only defined two shards and those have both been taken.
Node D is started and pointed to the stand-alone ZooKeeper.
Node D will automatically become a replica of Node B, for the same reasons why Node C is a replica of Node A.
Upon restart, suppose that Node C starts before Node A. What happens? Node C will become the leader, while
Node A becomes a replica of Node C.
Shards and Indexing Data in SolrCloud
When your data is too large for one node, you can break it up and store it in sections by creating one or more
shards. Each shard is a portion of the logical index, or core, and it's the set of all nodes containing that section of the index.
A shard is a way of splitting a core over a number of "servers", or nodes. For example, you might have a shard for
data that represents each state, or different categories that are likely to be searched independently, but are often
combined.
Before SolrCloud, Solr supported Distributed Search, which allowed one query to be executed across multiple
shards, so the query was executed against the entire Solr index and no documents would be missed from the
search results. So splitting the core across shards is not exclusively a SolrCloud concept. There were, however,
several problems with the distributed approach that necessitated improvement with SolrCloud:
1. Splitting of the core into shards was somewhat manual.
2. There was no support for distributed indexing, which meant that you needed to explicitly send documents to a specific shard; Solr couldn't figure out on its own what shards to send documents to.
3. There was no load balancing or failover, so if you got a high number of queries, you needed to figure out where to send them and if one shard died it was just gone.
SolrCloud fixes all those problems. There is support for distributing both the index process and the queries
automatically, and ZooKeeper provides failover and load balancing. Additionally, every shard can also have multiple
replicas for additional robustness.
Unlike Solr 3.x, in SolrCloud there are no masters or slaves. Instead, there are leaders and replicas. Leaders are
automatically elected, initially on a first-come-first-served basis, and then based on the ZooKeeper process
described at http://zookeeper.apache.org/doc/trunk/recipes.html#sc_leaderElection.
If a leader goes down, one of its replicas is automatically elected as the new leader. As each node is started, it's
assigned to the shard with the fewest replicas. When there's a tie, it's assigned to the shard with the lowest shard
ID.
When a document is sent to a machine for indexing, the system first determines if the machine is a replica or a
leader.
If the machine is a replica, the document is forwarded to the leader for processing.
If the machine is a leader, SolrCloud determines which shard the document should go to, forwards the
document to the leader for that shard, indexes the document for this shard, and forwards the index notation to
itself and any replicas.
Document Routing
Solr offers the ability to specify the router implementation used by a collection by specifying the router.name
parameter when creating your collection. If you use the "compositeId" router, you can send documents with a prefix in
the document ID which will be used to calculate the hash Solr uses to determine the shard a document is sent to for
indexing. The prefix can be anything you'd like it to be (it doesn't have to be the shard name, for example), but it
must be consistent so Solr behaves consistently. For example, if you wanted to co-locate documents for a customer,
you could use the customer name or ID as the prefix. If your customer is "IBM", for example, with a document with
the ID "12345", you would insert the prefix into the document id field: "IBM!12345". The exclamation mark ('!') is
critical here, as it distinguishes the prefix used to determine which shard to direct the document to.
Then at query time, you include the prefix(es) in your query with the _route_ parameter (i.e., q=solr&_route_=IBM!)
to direct queries to specific shards. In some situations, this may improve query performance because it
overcomes network latency when querying all the shards.
The _route_ parameter replaces shard.keys, which has been deprecated and will be removed in a
future Solr release.
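For example, a minimal sketch of indexing a document with a composite ID and then restricting a query to the shard that holds it (the name field and document content are illustrative):
curl 'http://localhost:8983/solr/collection1/update?commit=true' -H 'Content-Type: application/json' -d '[{"id":"IBM!12345","name":"example document"}]'
http://localhost:8983/solr/collection1/select?q=solr&_route_=IBM!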
The compositeId router supports prefixes containing up to 2 levels of routing. For example: a prefix routing first by
region, then by customer: "USA!IBM!12345".
If you do not want to influence how documents are stored, you don't need to specify a prefix in your document ID.
If you created the collection and defined the "implicit" router at the time of creation, you can additionally define a
router.field parameter to use a field from each document to identify a shard where the document belongs. If the
field specified is missing in the document, however, the document will be rejected. You could also use the _route_
parameter to name a specific shard.
Shard Splitting
Until Solr 4.3, when you created a collection in SolrCloud, you had to decide on your number of shards when you
created the collection and you could not change it later. It can be difficult to know in advance the number of shards
that you need, particularly when organizational requirements can change at a moment's notice, and the cost of
finding out later that you chose wrong can be high, involving creating new cores and re-indexing all of your data.
The ability to split shards is in the Collections API. It currently allows splitting a shard into two pieces. The existing
shard is left as-is, so the split action effectively makes two copies of the data as new shards. You can delete the old
shard at a later time when you're ready.
More details on how to use shard splitting are in the section on the Collections API.
Distributed Requests
One of the advantages of using SolrCloud is the ability to distribute requests among various shards that may or may
not contain the data that you're looking for. You have the option of searching over all of your data or just parts of it.
Querying all shards for a collection should look familiar; it's as though SolrCloud didn't even come into play:
http://localhost:8983/solr/collection1/select?q=*:*
If, on the other hand, you wanted to search just one shard, you can specify that shard, as in:
http://localhost:8983/solr/collection1/select?q=*:*&shards=localhost:7574/solr
If you want to search a group of shards, you can specify them together:
http://localhost:8983/solr/collection1/select?q=*:*&shards=localhost:7574/solr,localhost:8983/solr
Or you can specify a list of servers to choose from for load balancing purposes by using the pipe symbol (|):
http://localhost:8983/solr/collection1/select?q=*:*&shards=localhost:7574/solr|localhost:7500/solr
(If you have explicitly created your shards using ZooKeeper and have shard IDs, you can use those IDs rather than
server addresses.)
You also have the option of searching multiple collections. For example:
http://localhost:8983/solr/collection1/select?collection=collection1,collection2,collection3
Read and Write Side Fault Tolerance
Read Side Fault Tolerance
With earlier versions of Solr, you had to set up your own load balancer. Now each individual node load balances
requests across the replicas in a cluster. You still need a load balancer on the 'outside' that talks to the cluster, or
you need a smart client (Solr provides a smart Java Solrj client called CloudSolrServer).
A smart client understands how to read and interact with ZooKeeper and only requests the ZooKeeper ensemble's
address to start discovering to which nodes it should send requests.
Each distributed search request is executed against all shards for a collection unless limited by the user with the
'shards' or '_route_' parameters. If one or more shards queried are unavailable then the default is to fail the request.
However, there are many use-cases where partial results are acceptable and so Solr provides a boolean
shards.tolerant parameter (default 'false'). If shards.tolerant=true then partial results may be returned. If the
returned response does not contain results from all the appropriate shards then the response header contains a
special flag called 'partialResults'. The client can specify 'shards.info' along with the 'shards.tolerant'
parameter to retrieve more fine-grained details.
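For example, an illustrative request that accepts partial results and asks for per-shard information:
http://localhost:8983/solr/collection1/select?q=*:*&shards.tolerant=true&shards.info=true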
Example response with the partialResults flag set to 'true':
{
"responseHeader": {
"status": 0,
"partialResults": true,
"QTime": 20,
"params": {
"wt": "json"
}
},
"response": {
"numFound": 77,
"start": 0,
"docs": [ ]
}
}
Write Side Fault Tolerance
SolrCloud supports near real-time actions, elasticity, high availability, and fault tolerance. What this means,
basically, is that when you have a large cluster, you can always make requests to the cluster, and if a request is
acknowledged you are sure it will be durable; i.e., you won't lose data. Updates can be seen right after they are
made and the cluster can be expanded or contracted.
Recovery
A Transaction Log is created for each node so that every change to content or organization is noted. The log is
used to determine which content in the node should be included in a replica. When a new replica is created, it refers
to the Leader and the Transaction Log to know which content to include. If it fails, it retries.
Since the Transaction Log consists of a record of updates, it allows for more robust indexing because it includes
redoing the uncommitted updates if indexing is interrupted.
If a leader goes down, it may have sent requests to some replicas and not others. So when a new potential leader is
identified, it runs a sync process against the other replicas. If this is successful, everything should be consistent,
the leader registers as active, and normal actions proceed. If a replica is too far out of sync, the system asks for
a full replication/replay-based recovery.
If an update fails because cores are reloading schemas and some have finished but others have not, the leader tells
the nodes that the update failed and starts the recovery procedure.
Achieved Replication Factor
When using a replication factor greater than one, an update request may succeed on the shard leader but fail on
one or more of the replicas. For instance, consider a collection with one shard and replication factor of three. In this
case, you have a shard leader and two additional replicas. If an update request succeeds on the leader but fails on
both replicas, for whatever reason, the update request is still considered successful from the perspective of the
client. The replicas that missed the update will sync with the leader when they recover.
Behind the scenes, this means that Solr has accepted updates that are only on one of the nodes (the current
leader). Solr supports the optional min_rf parameter on update requests, which causes the server to return the
achieved replication factor for an update request in the response. For the example scenario described above, if the
client application included min_rf >= 1, then Solr would return rf=1 in the Solr response header because the request
only succeeded on the leader. The update request will still be accepted, as the min_rf parameter only tells Solr that
the client application wishes to know what the achieved replication factor was for the update request. In other words,
min_rf does not mean Solr will enforce a minimum replication factor, as Solr does not support rolling back updates
that succeed on a subset of replicas.
On the client side, if the achieved replication factor is less than the acceptable level, then the client application can
take additional measures to handle the degraded state. For instance, a client application may want to keep a log of
which update requests were sent while the state of the collection was degraded and then resend the updates once
the problem has been resolved. In short, min_rf is an optional mechanism for a client application to be warned that
an update request was accepted while the collection is in a degraded state.
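As a sketch, a client application could ask for the achieved replication factor like this (the document content is illustrative):
curl 'http://localhost:8983/solr/collection1/update?commit=true&min_rf=2' -H 'Content-Type: application/json' -d '[{"id":"doc1"}]'
If the rf value returned in the response header is lower than 2, the client knows the update reached fewer replicas than it wanted and can react accordingly.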
NRT, Replication, and Disaster Recovery with SolrCloud
SolrCloud and Replication
Replication ensures redundancy for your data, and enables you to send an update request to any node in the shard.
If that node is a replica, it will forward the request to the leader, which then forwards it to all existing replicas, using
versioning to make sure every replica has the most up-to-date version. This architecture enables you to be certain
that your data can be recovered in the event of a disaster, even if you are using Near Real Time searching.
Near Real Time Searching
If you want to use the NearRealtimeSearch support, enable auto soft commits in your solrconfig.xml file before
storing it into ZooKeeper. Otherwise you can send explicit soft commits to the cluster as you need them.
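For example, a minimal sketch of enabling auto soft commits inside the <updateHandler> section of solrconfig.xml (the one-second interval is illustrative):
<autoSoftCommit>
  <maxTime>1000</maxTime>
</autoSoftCommit>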
SolrCloud doesn't work very well with separated data clusters connected by an expensive pipe. The root problem is
that SolrCloud's architecture sends documents to all the nodes in the cluster (on a per-shard basis), and that
architecture is really dictated by the NRT functionality.
Imagine that you have a set of servers in China and one in the US that are aware of each other. Assuming 5
replicas, a single update to a shard may make multiple trips over the expensive pipe before it's all done, probably
slowing indexing speed unacceptably.
So the SolrCloud recommendation for this situation is to maintain these clusters separately; nodes in China don't
even know that nodes exist in the US and vice-versa. When indexing, you send the update request to one node in
the US and one in China and all the node-routing after that is local to the separate clusters. Requests can go to any
node in either country and maintain a consistent view of the data.
However, if your US cluster goes down, you have to re-synchronize the down cluster with up-to-date information
from China. The process requires you to replicate the index from China to the repaired US installation and then get
everything back up and working.
Disaster Recovery for an NRT system
Use of Near Real Time (NRT) searching affects the way that systems using SolrCloud behave during disaster
recovery.
The procedure outlined below assumes that you are maintaining separate clusters, as described above. Consider,
for example, an event in which the US cluster goes down (say, because of a hurricane), but the China cluster is
intact. Disaster recovery consists of creating the new system and letting the intact cluster create a replica for each
shard on it, then promoting those replicas to be leaders of the newly created US cluster.
Here are the steps to take:
1. Take the downed system offline to all end users.
2. Take the indexing process offline.
3. Repair the system.
4. Bring up one machine per shard in the repaired system as part of the ZooKeeper cluster on the good system, and wait for replication to happen, creating a replica on that machine. (SoftCommits will not be repeated, but data will be pulled from the transaction logs if necessary.) SolrCloud will automatically use old-style replication for the bulk load; by temporarily having only one replica, you'll minimize data transfer across a slow connection.
5. Bring the machines of the repaired cluster down, and reconfigure them to be a separate ZooKeeper cluster again, optionally adding more replicas for each shard.
6. Make the repaired system visible to end users again.
7. Start the indexing program again, delivering updates to both systems.
SolrCloud Configuration and Parameters
In this section, we'll cover the various configuration options for SolrCloud.
In general, with a new Solr 4 instance, the required configuration is in the sample schema.xml and solrconfig.xml
files. However, there may be reasons to change default settings or configure the cloud elements manually.
The following sections cover these topics:
Setting Up an External ZooKeeper Ensemble
Using ZooKeeper to Manage Configuration Files
Collections API
Parameter Reference
Command Line Utilities
SolrCloud with Legacy Configuration Files
Setting Up an External ZooKeeper Ensemble
Although Solr comes bundled with Apache ZooKeeper, you should consider yourself discouraged from using this
internal ZooKeeper in production, because shutting down a redundant Solr instance will also shut down its
ZooKeeper server, which might not be quite so redundant. Because a ZooKeeper ensemble must have a quorum of
more than half its servers running at any given time, this can be a problem.
The solution to this problem is to set up an external ZooKeeper ensemble. Fortunately, while this process can seem
intimidating due to the number of powerful options, setting up a simple ensemble is actually quite straightforward.
The basic steps are as follows:
Download Apache ZooKeeper
The first step in setting up Apache ZooKeeper is, of course, to download the software. It's available from
http://zookeeper.apache.org/releases.html.
Setting Up a Single ZooKeeper
Create the instance
Creating the instance is a simple matter of extracting the files into a specific target directory. The actual directory
itself doesn't matter, as long as you know where it is, and where you'd like to have ZooKeeper store its internal data.
Configure the instance
The next step is to configure your ZooKeeper instance. To do that, create the following file:
<ZOOKEEPER_HOME>/conf/zoo.cfg. To this file, add the following information:
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
The parameters are as follows:
tickTime: Part of what ZooKeeper does is to determine which servers are up and running at any given time, and the
minimum session time out is defined as two "ticks". The tickTime parameter specifies, in milliseconds, how long
each tick should be.
dataDir: This is the directory in which ZooKeeper will store data about the cluster. This directory should start out
empty.
clientPort: This is the port on which Solr will access ZooKeeper.
Once this file is in place, you're ready to start the ZooKeeper instance.
Run the instance
To run the instance, you can simply use the ZOOKEEPER_HOME/bin/zkServer.sh script provided, as with this
command: zkServer.sh start
Again, ZooKeeper provides a great deal of power through additional configurations, but delving into them is beyond
the scope of this tutorial. For more information, see the ZooKeeper Getting Started page. For this example,
however, the defaults are fine.
When using stand-alone ZooKeeper, you need to take care to keep your version of ZooKeeper updated with
the latest version distributed with Solr. Since you are using it as a stand-alone application, it does not get
upgraded when you upgrade Solr.
Solr 4.0 uses Apache ZooKeeper v3.3.6.
Solr 4.1 through 4.7 use Apache ZooKeeper v3.4.5.
Solr 4.8 and higher uses Apache ZooKeeper v3.4.6.
Point Solr at the instance
Pointing Solr at the ZooKeeper instance you've created is a simple matter of using the -DzkHost parameter. For
example, in the Getting Started with SolrCloud example you learned how to point to the internal ZooKeeper. In this
example, you would point to the ZooKeeper you've started on port 2181.
On the first server:
cd shard1
java -DnumShards=2 -Dbootstrap_confdir=./solr/collection1/conf \
-Dcollection.configName=myconf -DzkHost=localhost:2181 -jar start.jar
On each subsequent server:
cd shard2
java -Djetty.port=7574 -DzkHost=localhost:2181 -jar start.jar
As with the Getting Started with SolrCloud example, you must first upload the configuration information, and then
you can connect a second, third, etc., instance.
Shut down ZooKeeper
To shut down ZooKeeper, use the zkServer script with the "stop" command: zkServer.sh stop.
Setting up a ZooKeeper Ensemble
In the Getting Started example, using a ZooKeeper ensemble was a simple matter of starting multiple instances and
pointing to them. With an external ZooKeeper ensemble, you need to set things up just a little more carefully.
The difference is that rather than simply starting up the servers, you need to configure them to know about and talk
to each other first. So your original zoo.cfg file might look like this:
dataDir=/var/lib/zookeeperdata/1
clientPort=2181
initLimit=5
syncLimit=2
server.1=localhost:2888:3888
server.2=localhost:2889:3889
server.3=localhost:2890:3890
Here you see three new parameters:
initLimit: The time, in ticks, the server allows for connecting to the leader. In this case, you have 5 ticks, each of
which is 2000 milliseconds long, so the server will wait as long as 10 seconds to connect.
syncLimit: The time, in ticks, the server will wait before updating itself from the leader.
server.X: These are the IDs and locations of all servers in the ensemble, and the ports on which they communicate with
each other. The server ID must additionally be stored in the <dataDir>/myid file and be located in the dataDir of
each ZooKeeper instance. The ID identifies each server, so in the case of this first instance, you would create the
file /var/lib/zookeeperdata/1/myid with the content "1".
Now, whereas with Solr you need to create entirely new directories to run multiple instances, all you need for a new
ZooKeeper instance, even if it's on the same machine for testing purposes, is a new configuration file. To complete
the example you'll create two more configuration files.
The <ZOOKEEPER_HOME>/conf/zoo2.cfg file should have the content:
tickTime=2000
dataDir=c:/sw/zookeeperdata/2
clientPort=2182
initLimit=5
syncLimit=2
server.1=localhost:2888:3888
server.2=localhost:2889:3889
server.3=localhost:2890:3890
You'll also need to create <ZOOKEEPER_HOME>/conf/zoo3.cfg:
tickTime=2000
dataDir=c:/sw/zookeeperdata/3
clientPort=2183
initLimit=5
syncLimit=2
server.1=localhost:2888:3888
server.2=localhost:2889:3889
server.3=localhost:2890:3890
Finally, create your myid files in each of the dataDir directories so that each server knows which instance it is.
The id in the myid file on each machine must match the "server.X" definition. So, the ZooKeeper instance (or
machine) named "server.1" in the above example must have a myid file containing the value "1". The myid file can
be any integer between 1 and 255, and must match the server IDs assigned in the zoo.cfg file.
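For example, on a Unix-like system the first instance's myid file from this example could be created like this (the other two instances get files containing "2" and "3" in their own dataDir directories):
echo 1 > /var/lib/zookeeperdata/1/myid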
To start the servers, you can simply explicitly reference the configuration files:
cd <ZOOKEEPER_HOME>
bin/zkServer.sh start zoo.cfg
bin/zkServer.sh start zoo2.cfg
bin/zkServer.sh start zoo3.cfg
Once these servers are running, you can reference them from Solr just as you did before:
java -DnumShards=2 -Dbootstrap_confdir=./solr/collection1/conf \
-Dcollection.configName=myconf \
-DzkHost=localhost:2181,localhost:2182,localhost:2183 -jar start.jar
For more information on getting the most power from your ZooKeeper installation, check out the ZooKeeper
Administrator's Guide.
Using ZooKeeper to Manage Configuration Files
With SolrCloud your configuration files (particularly solrconfig.xml and schema.xml) are kept in ZooKeeper.
These files are uploaded when you first start Solr in SolrCloud mode.
Startup Bootstrap Parameters
There are two different ways you can use system properties to upload your initial configuration files to ZooKeeper
the first time you start Solr. Remember that these are meant to be used only on first startup or when overwriting
configuration files. Every time you start Solr with these system properties, any current configuration files in
ZooKeeper may be overwritten when conf.set names match.
The first way is to look at solr.xml and upload the conf for each core found. The config set name will be the
collection name for that core, and collections will use the config set that has a matching name. One parameter is
used with this approach, bootstrap_conf. If you pass -Dbootstrap_conf=true on startup, each core you
have configured will have its configuration files automatically uploaded and linked to the collection containing the
core.
An alternate approach is to upload the given directory as a config set with the given name. No linking of
collection to config set is done. However, if only one conf.set exists, a collection will autolink to it. Two
parameters are used with this approach:
Parameter Default value Description
bootstrap_confdir No default If you pass -Dbootstrap_confdir=<directory> on startup, that specific directory of configuration files will be uploaded to ZooKeeper with a conf.set name defined by the system property below, collection.configName.
collection.configName Defaults to configuration1 Determines the name of the conf.set pointed to by bootstrap_confdir.
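For example, hedged sketches of the two approaches, using the port and configuration name from the earlier examples:
java -DzkHost=localhost:2181 -Dbootstrap_conf=true -jar start.jar
java -DzkHost=localhost:2181 -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -jar start.jar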
Using the ZooKeeper Command Line Interface (zkCLI), you can download and re-upload these configuration files.
Managing Your SolrCloud Configuration Files
To update or change your SolrCloud configuration files:
1. Download the latest configuration files from ZooKeeper, using the source control checkout process.
2. Make your changes.
3. Commit your changed file to source control.
4. Push the changes back to ZooKeeper.
5. Reload the collection so that the changes will be in effect.
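As a sketch of these steps, assuming the zkcli.sh script shipped under the Solr example directory, a config set named myconf, and the embedded ZooKeeper on port 9983:
example/scripts/cloud-scripts/zkcli.sh -cmd downconfig -zkhost localhost:9983 -confname myconf -confdir /tmp/myconf
# edit the files under /tmp/myconf, commit them to source control, then push them back:
example/scripts/cloud-scripts/zkcli.sh -cmd upconfig -zkhost localhost:9983 -confname myconf -confdir /tmp/myconf
curl 'http://localhost:8983/solr/admin/collections?action=RELOAD&name=collection1'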
There are some scripts available with the ZooKeeper Command Line Utility to help manage changes to
configuration files, discussed in the section on Command Line Utilities.
It's important to keep these files under version control.
By default, solr.xml is not one of the Solr configuration files managed by ZooKeeper. If you would like to
keep your solr.xml in ZooKeeper, starting with Solr 4.5 you can push it to ZooKeeper with the zkcli.sh
utility (using the putfile command). See the Command Line Utilities section for more information.
Collections API
The Collections API is used to enable you to create, remove, or reload collections, but in the context of SolrCloud
you can also use it to create collections with a specific number of shards and replicas.
API Entry Points
The base URL for all API calls below is http://<hostname>:<port>/solr.
/admin/collections?action=CREATE: create a collection
/admin/collections?action=RELOAD: reload a collection
/admin/collections?action=SPLITSHARD: split a shard into two new shards
/admin/collections?action=CREATESHARD: create a new shard
/admin/collections?action=DELETESHARD: delete an inactive shard
/admin/collections?action=CREATEALIAS: create or modify an alias for a collection
/admin/collections?action=DELETEALIAS: delete an alias for a collection
/admin/collections?action=DELETE: delete a collection
/admin/collections?action=DELETEREPLICA: delete a replica of a shard
/admin/collections?action=ADDREPLICA: add a replica of a shard
/admin/collections?action=CLUSTERPROP: add, edit, or delete a cluster-wide property
/admin/collections?action=MIGRATE: migrate documents to another collection
/admin/collections?action=ADDROLE: add a specific role to a node in the cluster
/admin/collections?action=REMOVEROLE: remove an assigned role
/admin/collections?action=OVERSEERSTATUS: get status and statistics of the overseer
/admin/collections?action=CLUSTERSTATUS: get cluster status
/admin/collections?action=REQUESTSTATUS: get the status of a previous asynchronous request
/admin/collections?action=LIST: list all collections
Create a Collection
/admin/collections?action=CREATE&name=name&numShards=number&replicationFactor=number&maxShardsPerNode=number&createNodeSet=nodelist&collection.configName=configname
Input
Query Parameters
Key Type Required Default Description
name string Yes The name of the collection to be created.
router.name string No compositeId The router name that will be used. The router defines how documents will be distributed among the shards. The value can be either implicit, which routes documents to shards you designate explicitly, or compositeId, which uses an internal default hash to assign documents to shards. When using the 'implicit' router, the shards parameter is required. When using the 'compositeId' router, the numShards parameter is required. For more information, see also the section Document Routing.
numShards integer No empty The number of shards to be created as part of the
collection. This is a required parameter when using
the 'compositeId' router.
shards string No empty A comma-separated list of shard names, e.g., shard-x,shard-y,shard-z. This is a required parameter when using the 'implicit' router.
replicationFactor integer No 1 The number of replicas to be created for each
shard.
maxShardsPerNode integer No 1 When creating collections, the shards and/or
replicas are spread across all available (i.e., live)
nodes, and two replicas of the same shard will
never be on the same node. If a node is not live
when the CREATE operation is called, it will not get
any parts of the new collection, which could lead to
too many replicas being created on a single live
node. Defining maxShardsPerNode sets a limit on
the number of replicas CREATE will spread to each
node. If the entire collection cannot be fit into the
live nodes, no collection will be created at all.
createNodeSet string No empty Allows defining the nodes to spread the new
collection across. If not provided, the CREATE
operation will create shard-replica spread across all
live Solr nodes. The format is a comma-separated
list of node_names, such as localhost:8983_solr, localhost:8984_solr, localhost:8985_solr.
collection.configName string No empty Defines the name of the configurations (which must
already be stored in ZooKeeper) to use for this
collection. If not provided, Solr will default to the
collection name as the configuration name.
router.field string No empty If this field is specified, the router will look at the value of the field in an input document to compute the hash and identify a shard instead of looking at the uniqueKey field. If the field specified is null in the document, the document will be rejected. Please note that RealTime Get or retrieval by id would also require the _route_ parameter (or shard.keys) to avoid a distributed search.
property.name=value string No Set core property name to value. See core.properties file contents.
autoAddReplicas boolean No false When set to true, enables auto addition of replicas
on shared file systems. Settings and overrides: Running Solr on HDFS#AutoAddReplica Settings
async string No Request ID to track this action, which will be processed asynchronously.
Output
Output Content
The response will include the status of the request and the new core names. If the status is anything other than
"success", an error message will explain why the request failed.
Examples
Input
http://localhost:8983/solr/admin/collections?action=CREATE&name=newCollection&numShards=2&replicationFactor=1
Output
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">3764</int>
</lst>
<lst name="success">
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">3450</int>
</lst>
<str name="core">newCollection_shard1_replica1</str>
<str name="saved">/Applications/solr-4.3.0/example/solr/solr.xml</str>
</lst>
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">3597</int>
</lst>
<str name="core">newCollection_shard2_replica1</str>
<str name="saved">/Applications/solr-4.3.0/example/solr/solr.xml</str>
</lst>
</lst>
</response>
Reload a Collection
/admin/collections?action=RELOAD&name=name
The RELOAD action is used when you have changed a configuration in ZooKeeper.
Input
Query Parameters
Key Type Required Description
name string Yes The name of the collection to reload.
Output
Output Content
The response will include the status of the request and the cores that were reloaded. If the status is anything other
than "success", an error message will explain why the request failed.
Examples
Input
http://localhost:8983/solr/admin/collections?action=RELOAD&name=newCollection
Output
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1551</int>
</lst>
<lst name="success">
<lst name="10.0.1.6:8983_solr">
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">761</int>
</lst>
</lst>
<lst name="10.0.1.4:8983_solr">
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1527</int>
</lst>
</lst>
</lst>
</response>
Split a Shard
/admin/collections?action=SPLITSHARD&collection=name&shard=shardID
Splitting a shard will take an existing shard and break it into two pieces. The original shard will continue to contain
the same data as-is but it will start re-routing requests to the new shards. The new shards will have as many replicas
as the original shard. After splitting a shard, you should issue a commit to make the documents visible, and then you
can remove the original shard (with the Core API or Solr Admin UI) when ready.
This command allows for seamless splitting and requires no downtime. A shard being split will continue to accept
query and indexing requests and will automatically start routing them to the new shards once this operation is
complete. This command can only be used for SolrCloud collections created with the "numShards" parameter, meaning
collections which rely on Solr's hash-based routing mechanism.
The split is performed by dividing the original shard's hash range into two equal partitions and dividing up the
documents in the original shard according to the new sub-ranges.
One can also specify an optional 'ranges' parameter to divide the original shard's hash range into arbitrary hash
range intervals specified in hexadecimal. For example, if the original hash range is 0-1500 then adding the
parameter: ranges=0-1f4,1f5-3e8,3e9-5dc will divide the original shard into three shards with hash range 0-500,
501-1000 and 1001-1500 respectively.
Another optional parameter 'split.key' can be used to split a shard using a route key such that all documents of the
specified route key end up in a single dedicated sub-shard. Providing the 'shard' parameter is not required in this
case because the route key is enough to figure out the right shard. A route key which spans more than one shard is
not supported. For example, suppose split.key=A! hashes to the range 12-15 and belongs to shard 'shard1' with
range 0-20 then splitting by this route key would yield three sub-shards with ranges 0-11, 12-15 and 16-20. Note that
the sub-shard with the hash range of the route key may also contain documents for other route keys whose hash
ranges overlap.
Shard splitting can be a long running process. In order to avoid timeouts, starting with Solr 4.8, you can run this as an
asynchronous call.
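For example, illustrative sketches of these optional forms (the collection and shard names are placeholders):
/admin/collections?action=SPLITSHARD&collection=collection1&shard=shard1&ranges=0-1f4,1f5-3e8,3e9-5dc
/admin/collections?action=SPLITSHARD&collection=collection1&split.key=A!
/admin/collections?action=SPLITSHARD&collection=collection1&shard=shard1&async=1000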
Input
Query Parameters
Key Type Required Description
collection string Yes The name of the collection that includes the shard to be split.
shard string Yes The name of the shard to be split.
ranges string No A comma-separated list of hash ranges in hexadecimal e.g.
ranges=0-1f4,1f5-3e8,3e9-5dc
split.key string No The key to use for splitting the index
property.name=value string No Set core property name to value. See core.properties file contents.
async string No Request ID to track this action which will be processed asynchronously
Output
Output Content
The output will include the status of the request and the new shard names, which will use the original shard as their
basis, adding an underscore and a number. For example, "shard1" will become "shard1_0" and "shard1_1". If the
status is anything other than "success", an error message will explain why the request failed.
Examples
Input
Split shard1 of the "anotherCollection" collection.
http://10.0.1.6:8983/solr/admin/collections?action=SPLITSHARD&collection=anotherCollection&shard=shard1
Output
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">6120</int>
</lst>
<lst name="success">
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">3673</int>
</lst>
<str name="core">anotherCollection_shard1_1_replica1</str>
<str name="saved">/Applications/solr-4.3.0/example/solr/solr.xml</str>
</lst>
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">3681</int>
</lst>
<str name="core">anotherCollection_shard1_0_replica1</str>
<str name="saved">/Applications/solr-4.3.0/example/solr/solr.xml</str>
</lst>
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">6008</int>
</lst>
</lst>
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">6007</int>
</lst>
</lst>
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">71</int>
</lst>
</lst>
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
</lst>
<str name="core">anotherCollection_shard1_1_replica1</str>
<str name="status">EMPTY_BUFFER</str>
</lst>
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
</lst>
<str name="core">anotherCollection_shard1_0_replica1</str>
<str name="status">EMPTY_BUFFER</str>
</lst>
</lst>
</response>
Create a Shard
Shards can only be created with this API for collections that use the 'implicit' router. Use SPLITSHARD for collections
using the 'compositeId' router. A new shard with a name can be created for an existing 'implicit' collection.
/admin/collections?action=CREATESHARD&shard=shardName&collection=name
Input
Query Parameters
Key Type Required Description
collection string Yes The name of the collection in which the new shard will be created.
shard string Yes The name of the shard to be created.
createNodeSet string No Allows defining the nodes to spread the new collection across. If not
provided, the CREATE operation will create shard-replica spread across all
live Solr nodes. The format is a comma-separated list of node_names, such
as localhost:8983_solr, localhost:8984_solr, localhost:8985_solr.
property.name=value string No Set core property name to value. See core.properties file contents.
Output
Output Content
The output will include the status of the request. If the status is anything other than "success", an error message will
explain why the request failed.
Examples
Input
Create 'shard-z' for the "anImplicitCollection" collection.
http://10.0.1.6:8983/solr/admin/collections?action=CREATESHARD&collection=anImplicitCollection&shard=shard-z
Output
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">558</int>
</lst>
</response>
Delete a Shard
Deleting a shard will unload all replicas of the shard and remove them from clusterstate.json. It will only
remove shards that are inactive, or which have no range given for custom sharding.
/admin/collections?action=DELETESHARD&shard=shardID&collection=name
Input
Query Parameters
Key Type Required Description
collection string Yes The name of the collection that includes the shard to be deleted.
shard string Yes The name of the shard to be deleted.
Output
Output Content
The output will include the status of the request. If the status is anything other than "success", an error message will
explain why the request failed.
Examples
Input
Delete 'shard1' of the "anotherCollection" collection.
http://10.0.1.6:8983/solr/admin/collections?action=DELETESHARD&collection=anotherCollection&shard=shard1
Output
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">558</int>
</lst>
<lst name="success">
<lst name="10.0.1.4:8983_solr">
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">27</int>
</lst>
</lst>
</lst>
</response>
Create or modify an Alias for a Collection
The CREATEALIAS action will create a new alias pointing to one or more collections. If an alias by the same name
already exists, this action will replace the existing alias, effectively acting like an atomic "MOVE" command.
/admin/collections?action=CREATEALIAS&name=name&collections=collectionlist
Input
Query Parameters
Key Type Required Description
name string Yes The alias name to be created.
collections string Yes The list of collections to be aliased, separated by commas.
Output
Output Content
The output will simply be a responseHeader with details of the time it took to process the request. To confirm the
creation of the alias, you can look in the Solr Admin UI, under the Cloud section, and find the aliases.json file.
Examples
Input
Create an alias named "testalias" and link it to the collections named "anotherCollection" and "testCollection".
http://10.0.1.6:8983/solr/admin/collections?action=CREATEALIAS&name=testalias&collections=anotherCollection,testCollection
Output
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">122</int>
</lst>
</response>
Delete a Collection Alias
/admin/collections?action=DELETEALIAS&name=name
Input
Query Parameters
Key Type Required Description
name string Yes The name of the alias to delete.
Output
Output Content
The output will simply be a responseHeader with details of the time it took to process the request. To confirm the
removal of the alias, you can look in the Solr Admin UI, under the Cloud section, and find the aliases.json file.
Examples
Input
Remove the alias named "testalias".
http://10.0.1.6:8983/solr/admin/collections?action=DELETEALIAS&name=testalias
Output
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">117</int>
</lst>
</response>
Delete a Collection
/admin/collections?action=DELETE&name=collection
Input
Query Parameters
Key Type Required Description
name string Yes The name of the collection to delete.
Output
Output Content
The response will include the status of the request and the cores that were deleted. If the status is anything other
than "success", an error message will explain why the request failed.
Examples
Input
Delete the collection named "newCollection".
http://10.0.1.6:8983/solr/admin/collections?action=DELETE&name=newCollection
Output
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">603</int>
</lst>
<lst name="success">
<lst name="10.0.1.6:8983_solr">
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">19</int>
</lst>
<str name="saved">/Applications/solr-4.3.0/example/solr/solr.xml</str>
</lst>
<lst name="10.0.1.4:8983_solr">
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">67</int>
</lst>
<str name="saved">/Applications/solr-4.3.0/example/solr/solr.xml</str>
</lst>
</lst>
</response>
Delete a Replica
/admin/collections?action=DELETEREPLICA&collection=collection&shard=shard&replica=replica
Delete a replica from a given collection and shard. If the corresponding core is up and running, the core is unloaded
and the entry is removed from the clusterstate. If the node/core is down, the entry is taken off the clusterstate and, if
the core comes up later, it is automatically unregistered.
Input
Query Parameters
Key Type Required Description
collection string Yes The name of the collection.
shard string Yes The name of the shard that includes the replica to be removed.
replica string Yes The name of the replica to remove.
Examples
Input
http://10.0.1.6:8983/solr/admin/collections?action=DELETEREPLICA&collection=test2&shar
d=shard2&replica=core_node3
Output
Output Content
<response>
<lst name="responseHeader"><int name="status">0</int><int
name="QTime">110</int></lst>
</response>
Add Replica
/admin/collections?action=ADDREPLICA&collection=collection&shard=shard&node=solr_node_name
Add a replica to a shard in a collection. The node name can be specified if the replica is to be created on a specific
node.
Input
Query Parameters
Key Type Required Description
collection string Yes The name of the collection.
shard string No The name of the shard to which the replica is to be added. Either shard or
_route_ must be provided.
_route_ string No If the shard name is not known, just pass the _route_ value and the system
will identify the name of the shard.
node string No The name of the node where the replica should be created
instanceDir string No The instanceDir for the core that will be created
dataDir string No The directory in which the core should be created
property.name=value string No Set core property name to value. See core.properties file contents.
async string No Request ID to track this action which will be processed asynchronously
Examples
Input
http://10.0.1.6:8983/solr/admin/collections?action=ADDREPLICA&collection=test2&shard=s
hard2&node=192.167.1.2:8983_solr
Output
Output Content
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">3764</int>
</lst>
<lst name="success">
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">3450</int>
</lst>
<str name="core">test2_shard2_replica4</str>
<str name="saved">/Applications/solr-4.8.0/example/solr/solr.xml</str>
</lst>
</response>
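If only the routing key is known, the shard parameter can be replaced by _route_ and the system resolves the shard
itself. A hypothetical example, assuming documents are routed with the key "a!":
http://10.0.1.6:8983/solr/admin/collections?action=ADDREPLICA&collection=test2&_route_=a!&node=192.167.1.2:8983_solr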
Cluster Properties
/admin/collections?action=CLUSTERPROP&name=propertyName&val=propertyValue
Add, edit or delete a cluster-wide property.
Input
Query Parameters
Key Type Required Description
name string Yes The name of the property. Only a fixed set of property names is allowed; other
names are rejected with an error. As of Solr 4.7, only the urlScheme property is
supported.
val string Yes The value of the property. If the value is empty or null, the property is unset.
Output
Output Content
The response will include the status of the request and the properties that were updated or removed. If the status is
anything other than "0", an error message will explain why the request failed.
Examples
Input
http://localhost:8983/solr/admin/collections?action=CLUSTERPROP&name=urlScheme&val=htt
ps://
Output
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
</lst>
</response>
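Because an empty or null value unsets a property, the same property could later be removed with a call such as:
http://localhost:8983/solr/admin/collections?action=CLUSTERPROP&name=urlScheme&val=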
Migrate documents to another collection
/admin/collections?action=MIGRATE&collection=name&split.key=key1!&target.collection=target_collection&forward.timeout=60
The MIGRATE command is used to migrate all documents having the given routing key to another collection. The
source collection will continue to have the same data as-is but it will start re-routing write requests to the target
collection for the number of seconds specified by the forward.timeout parameter. It is the responsibility of the user to
switch to the target collection for reads and writes after the ‘migrate’ command completes.
The routing key specified by the ‘split.key’ parameter may span multiple shards on both the source and the target
collections. The migration is performed shard-by-shard in a single thread. One or more temporary collections may
be created by this command during the ‘migrate’ process but they are cleaned up at the end automatically.
This is a synchronous operation and therefore keeping a large read timeout on the invocation is advised. The
request may still timeout due to inherent limitations of the Collection APIs but that doesn’t necessarily mean that the
operation has failed. Users should check logs, cluster state, source and target collections before invoking the
operation again.
This command works only with collections having the compositeId router. The target collection must not receive any
writes during the time the migrate command is running otherwise some writes may be lost.
Please note that the migrate API does not perform any de-duplication on the documents so if the target collection
contains documents with the same uniqueKey as the documents being migrated then the target collection will end
up with duplicate documents.
Input
Query Parameters
Key Type Required Description
collection string Yes The name of the source collection from which documents will be split.
target.collection string Yes The name of the target collection to which documents will be migrated.
split.key string Yes The routing key prefix. For example, if uniqueKey is a!123, then you would
use split.key=a!.
forward.timeout int No The timeout, in seconds, until which write requests made to the source
collection for the given split.key will be forwarded to the target shard.
The default is 60 seconds.
property.name=value string No Set core property name to value. See core.properties file contents.
async string No Request ID to track this action which will be processed asynchronously
Output
Output Content
The response will include the status of the request.
Examples
Input
http://localhost:8983/solr/admin/collections?action=MIGRATE&collection=test1&split.key
=a!&target.collection=test2
Output
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">19014</int>
</lst>
<lst name="success">
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
</lst>
<str name="core">test2_shard1_0_replica1</str>
<str name="status">BUFFERING</str>
</lst>
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">2479</int>
</lst>
<str name="core">split_shard1_0_temp_shard1_0_shard1_replica1</str>
</lst>
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1002</int>
</lst>
</lst>
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">21</int>
</lst>
</lst>
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1655</int>
</lst>
<str name="core">split_shard1_0_temp_shard1_0_shard1_replica2</str>
</lst>
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">4006</int>
</lst>
</lst>
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">17</int>
</lst>
</lst>
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
</lst>
<str name="core">test2_shard1_0_replica1</str>
<str name="status">EMPTY_BUFFER</str>
</lst>
<lst name="192.168.43.52:8983_solr">
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">31</int>
</lst>
</lst>
<lst name="192.168.43.52:8983_solr">
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">31</int>
</lst>
</lst>
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
</lst>
<str name="core">test2_shard1_1_replica1</str>
<str name="status">BUFFERING</str>
</lst>
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1742</int>
</lst>
<str name="core">split_shard1_1_temp_shard1_1_shard1_replica1</str>
</lst>
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1002</int>
</lst>
</lst>
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">15</int>
</lst>
</lst>
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1917</int>
</lst>
<str name="core">split_shard1_1_temp_shard1_1_shard1_replica2</str>
</lst>
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">5007</int>
</lst>
</lst>
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">8</int>
</lst>
</lst>
<lst>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
</lst>
<str name="core">test2_shard1_1_replica1</str>
<str name="status">EMPTY_BUFFER</str>
</lst>
<lst name="192.168.43.52:8983_solr">
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">30</int>
</lst>
</lst>
<lst name="192.168.43.52:8983_solr">
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">30</int>
</lst>
</lst>
</lst>
</response>
Add Role
/admin/collections?action=ADDROLE&role=roleName&node=nodeName
Assign a role to a given node in the cluster. The only supported role as of 4.7 is 'overseer'. Use this API to dedicate
a particular node as Overseer. Invoke it multiple times to add more nodes. This is useful in large clusters where an
Overseer is likely to get overloaded. If available, one of the nodes which are assigned the 'overseer' role
will become the overseer. The system will assign the role to any other node if none of the designated nodes
are up and running.
Input
Query Parameters
Key Type Required Description
role string Yes The name of the role. The only supported role as of now is overseer.
node string Yes The name of the node. It is possible to assign a role even before that node is started.
Output
Output Content
The response will include the status of the request and the properties that were updated or removed. If the status is
anything other than "0", an error message will explain why the request failed.
Examples
Input
http://localhost:8983/solr/admin/collections?action=ADDROLE&role=overseer&node=192.167
.1.2:8983_solr
Output
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
</lst>
</response>
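Invoking the call again with a different node name designates an additional Overseer candidate; for example (the
node name here is illustrative):
http://localhost:8983/solr/admin/collections?action=ADDROLE&role=overseer&node=192.167.1.3:8983_solr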
Remove Role
/admin/collections?action=REMOVEROLE&role=roleName&node=nodeName
Remove an assigned role. This API is used to undo the roles assigned using the ADDROLE operation.
Input
Query Parameters
Key Type Required Description
role string Yes The name of the role. The only supported role as of now is overseer.
node string Yes The name of the node.
Output
Output Content
The response will include the status of the request and the properties that were updated or removed. If the status is
anything other than "0", an error message will explain why the request failed.
Examples
Input
http://localhost:8983/solr/admin/collections?action=REMOVEROLE&role=overseer&node=192.
167.1.2:8983_solr
Output
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
</lst>
</response>
Overseer status and statistics
/admin/collections?action=OVERSEERSTATUS
Returns the current status of the overseer, performance statistics of various overseer APIs, as well as the last 10 failures
per operation type.
Examples
Input:
http://localhost:8983/solr/admin/collections?action=OVERSEERSTATUS&wt=json
{
"responseHeader":{
"status":0,
"QTime":33},
"leader":"127.0.1.1:8983_solr",
"overseer_queue_size":0,
"overseer_work_queue_size":0,
"overseer_collection_queue_size":2,
"overseer_operations":[
"createcollection",{
"requests":2,
"errors":0,
"totalTime":1.010137,
"avgRequestsPerMinute":0.7467088842794136,
"5minRateRequestsPerMinute":7.525069023276674,
"15minRateRequestsPerMinute":10.271274280947182,
"avgTimePerRequest":0.5050685,
"medianRequestTime":0.5050685,
"75thPctlRequestTime":0.519016,
"95thPctlRequestTime":0.519016,
"99thPctlRequestTime":0.519016,
"999thPctlRequestTime":0.519016},
"removeshard",{
"requests":1,
"errors":0,
"totalTime":0.26784,
"avgRequestsPerMinute":0.4639267176178192,
"5minRateRequestsPerMinute":8.179027994326175,
"15minRateRequestsPerMinute":10.560587086130052,
"avgTimePerRequest":0.26784,
"medianRequestTime":0.26784,
"75thPctlRequestTime":0.26784,
"95thPctlRequestTime":0.26784,
"99thPctlRequestTime":0.26784,
"999thPctlRequestTime":0.26784},
"updateshardstate",{
"requests":1,
"errors":0,
"totalTime":0.609256,
"avgRequestsPerMinute":0.43725644039684236,
"5minRateRequestsPerMinute":8.043840552427673,
"15minRateRequestsPerMinute":10.502079828515368,
"avgTimePerRequest":0.609256,
"medianRequestTime":0.609256,
"75thPctlRequestTime":0.609256,
"95thPctlRequestTime":0.609256,
"99thPctlRequestTime":0.609256,
"999thPctlRequestTime":0.609256},
"state",{
"requests":29,
"errors":0,
"totalTime":25.777765,
"avgRequestsPerMinute":8.911471494053579,
"5minRateRequestsPerMinute":16.77961791015292,
"15minRateRequestsPerMinute":21.299616774565774,
"avgTimePerRequest":0.888888448275862,
"medianRequestTime":0.646322,
"75thPctlRequestTime":0.7662585,
"95thPctlRequestTime":4.9277995,
"99thPctlRequestTime":6.687749,
"999thPctlRequestTime":6.687749},
"createshard",{
"requests":2,
"errors":0,
"totalTime":0.328155,
"avgRequestsPerMinute":0.8384528317300947,
"5minRateRequestsPerMinute":15.560264184036232,
"15minRateRequestsPerMinute":20.772071869612244,
"avgTimePerRequest":0.1640775,
"medianRequestTime":0.1640775,
"75thPctlRequestTime":0.198494,
"95thPctlRequestTime":0.198494,
"99thPctlRequestTime":0.198494,
"999thPctlRequestTime":0.198494},
"leader",{
"requests":15,
"errors":0,
"totalTime":1.850757,
"avgRequestsPerMinute":4.664791390089222,
"5minRateRequestsPerMinute":15.267394345445812,
"15minRateRequestsPerMinute":20.61365640511346,
"avgTimePerRequest":0.1233838,
"medianRequestTime":0.095369,
"75thPctlRequestTime":0.190858,
"95thPctlRequestTime":0.245846,
"99thPctlRequestTime":0.245846,
"999thPctlRequestTime":0.245846},
"deletecore",{
"requests":2,
"errors":0,
"totalTime":0.1644,
"avgRequestsPerMinute":0.9277190814105167,
"5minRateRequestsPerMinute":16.35805598865235,
"15minRateRequestsPerMinute":21.121174172260105,
"avgTimePerRequest":0.0822,
"medianRequestTime":0.0822,
"75thPctlRequestTime":0.114723,
"95thPctlRequestTime":0.114723,
"99thPctlRequestTime":0.114723,
"999thPctlRequestTime":0.114723}],
"collection_operations":[
"overseerstatus",{
"requests":5,
"errors":0,
"totalTime":16.602856,
"avgRequestsPerMinute":1.8002951096636433,
"5minRateRequestsPerMinute":7.878245556506509,
"15minRateRequestsPerMinute":10.39984320341109,
"avgTimePerRequest":3.3205712000000003,
"medianRequestTime":3.42046,
"75thPctlRequestTime":4.0594019999999995,
"95thPctlRequestTime":4.563145,
"99thPctlRequestTime":4.563145,
"999thPctlRequestTime":4.563145},
"createalias",{
"requests":1,
"errors":0,
"totalTime":101.364917,
"avgRequestsPerMinute":8.304550290288862,
"5minRateRequestsPerMinute":12.0,
"15minRateRequestsPerMinute":12.0,
"avgTimePerRequest":101.364917,
"medianRequestTime":101.364917,
"75thPctlRequestTime":101.364917,
"95thPctlRequestTime":101.364917,
"99thPctlRequestTime":101.364917,
"999thPctlRequestTime":101.364917},
"splitshard",{
"requests":1,
"errors":1,
"recent_failures":[{
"request":{
"operation":"splitshard",
"shard":"shard2",
"collection":"example1"},
"response":[
"Operation splitshard caused
exception:","org.apache.solr.common.SolrException:org.apache.solr.common.SolrException
: No shard with the specified name exists: shard2",
"exception",{
"msg":"No shard with the specified name exists: shard2",
"rspCode":400}]}],
"totalTime":5905.432835,
"avgRequestsPerMinute":0.8198143044809885,
"5minRateRequestsPerMinute":8.043840552427673,
"15minRateRequestsPerMinute":10.502079828515368,
"avgTimePerRequest":2952.7164175,
"medianRequestTime":2952.7164175000003,
"75thPctlRequestTime":5904.384052,
"95thPctlRequestTime":5904.384052,
"99thPctlRequestTime":5904.384052,
"999thPctlRequestTime":5904.384052},
"createcollection",{
"requests":2,
"errors":0,
"totalTime":6294.35359,
"avgRequestsPerMinute":0.7466431055563431,
"5minRateRequestsPerMinute":7.5271593686145355,
"15minRateRequestsPerMinute":10.271591296400848,
"avgTimePerRequest":3147.176795,
"medianRequestTime":3147.1767950000003,
"75thPctlRequestTime":3387.162793,
"95thPctlRequestTime":3387.162793,
"99thPctlRequestTime":3387.162793,
"999thPctlRequestTime":3387.162793},
"deleteshard",{
"requests":1,
"errors":0,
"totalTime":320.071335,
"avgRequestsPerMinute":0.4637771550349566,
"5minRateRequestsPerMinute":8.179027994326175,
"15minRateRequestsPerMinute":10.560587086130052,
"avgTimePerRequest":320.071335,
"medianRequestTime":320.071335,
"75thPctlRequestTime":320.071335,
"95thPctlRequestTime":320.071335,
"99thPctlRequestTime":320.071335,
"999thPctlRequestTime":320.071335}],
"overseer_queue":[
"peek_wait100",{
"totalTime":2775.554755,
"avgRequestsPerMinute":12.440395120289685,
"5minRateRequestsPerMinute":18.487470843855192,
"15minRateRequestsPerMinute":22.052847430688917,
"avgTimePerRequest":69.388868875,
"medianRequestTime":101.1499165,
"75thPctlRequestTime":101.43390225,
"95thPctlRequestTime":101.9976678,
"99thPctlRequestTime":102.037032,
"999thPctlRequestTime":102.037032},
"peek_wait_forever",{
"totalTime":63247.861899,
"avgRequestsPerMinute":11.64420509572364,
"5minRateRequestsPerMinute":31.572546097788198,
"15minRateRequestsPerMinute":41.688934561096204,
"avgTimePerRequest":1664.4174183947368,
"medianRequestTime":636.5281970000001,
"75thPctlRequestTime":1629.3317682499999,
"95thPctlRequestTime":13220.58495709999,
"99thPctlRequestTime":16293.17735,
"999thPctlRequestTime":16293.17735},
"remove",{
"totalTime":92.528385,
"avgRequestsPerMinute":15.979782864505227,
"5minRateRequestsPerMinute":33.37988956147563,
"15minRateRequestsPerMinute":42.49548598991928,
"avgTimePerRequest":1.7793920192307693,
"medianRequestTime":1.769479,
"75thPctlRequestTime":2.22114175,
"95thPctlRequestTime":3.148778999999998,
"99thPctlRequestTime":4.393077,
"999thPctlRequestTime":4.393077},
"poll",{
"totalTime":94.686248,
"avgRequestsPerMinute":15.97973186166097,
"5minRateRequestsPerMinute":33.37988956147563,
"15minRateRequestsPerMinute":42.49548598991928,
"avgTimePerRequest":1.8208893846153844,
"medianRequestTime":1.819817,
"75thPctlRequestTime":2.266558,
"95thPctlRequestTime":3.2130298999999978,
"99thPctlRequestTime":4.433906,
"999thPctlRequestTime":4.433906}],
"overseer_internal_queue":[
"peek",{
"totalTime":0.516668,
"avgRequestsPerMinute":0.30642572162118586,
"5minRateRequestsPerMinute":6.696421749240565,
"15minRateRequestsPerMinute":9.879502985109362,
"avgTimePerRequest":0.516668,
"medianRequestTime":0.516668,
"75thPctlRequestTime":0.516668,
"95thPctlRequestTime":0.516668,
"99thPctlRequestTime":0.516668,
"999thPctlRequestTime":0.516668},
"offer",{
"totalTime":51.784521,
"avgRequestsPerMinute":15.979724576198302,
"5minRateRequestsPerMinute":33.37988956147563,
"15minRateRequestsPerMinute":42.49548598991928,
"avgTimePerRequest":0.9958561730769231,
"medianRequestTime":0.8628875,
"75thPctlRequestTime":1.1464622500000001,
"95thPctlRequestTime":1.6499188,
"99thPctlRequestTime":6.091519,
"999thPctlRequestTime":6.091519},
"remove",{
"totalTime":143.130248,
"avgRequestsPerMinute":27.6584163855513,
"5minRateRequestsPerMinute":64.95243565926378,
"15minRateRequestsPerMinute":84.18442055101546,
"avgTimePerRequest":1.5903360888888889,
"medianRequestTime":1.660893,
"75thPctlRequestTime":2.35234925,
"95thPctlRequestTime":3.19950245,
"99thPctlRequestTime":5.01803,
"999thPctlRequestTime":5.01803},
"poll",{
"totalTime":147.837065,
"avgRequestsPerMinute":27.65837219382363,
"5minRateRequestsPerMinute":64.95243565926378,
"15minRateRequestsPerMinute":84.18442055101546,
"avgTimePerRequest":1.6426340555555554,
"medianRequestTime":1.6923249999999999,
"75thPctlRequestTime":2.40090275,
"95thPctlRequestTime":3.2569366,
"99thPctlRequestTime":5.062005,
"999thPctlRequestTime":5.062005}],
"collection_queue":[
"remove_event",{
"totalTime":37.638197,
"avgRequestsPerMinute":3.9610733603305124,
"5minRateRequestsPerMinute":9.122591857306068,
"15minRateRequestsPerMinute":10.928990808126446,
"avgTimePerRequest":3.421654272727273,
"medianRequestTime":3.411283,
"75thPctlRequestTime":4.212892,
"95thPctlRequestTime":4.720874,
"99thPctlRequestTime":4.720874,
"999thPctlRequestTime":4.720874},
"peek_wait_forever",{
"totalTime":183048.91735,
"avgRequestsPerMinute":3.677073912023291,
"5minRateRequestsPerMinute":1.5867138429776346,
"15minRateRequestsPerMinute":0.6561136902644256,
"avgTimePerRequest":15254.076445833334,
"medianRequestTime":6745.20675,
"75thPctlRequestTime":27662.958113499997,
"95thPctlRequestTime":49871.380589,
"99thPctlRequestTime":49871.380589,
"999thPctlRequestTime":49871.380589}]}
Cluster Status
/admin/collections?action=CLUSTERSTATUS
Fetch the cluster status including collections, shards, replicas as well as collection aliases and cluster properties.
Input
Query Parameters
Key Type Required Description
collection string No The collection name for which information is requested. If omitted, information on
all collections in the cluster will be returned.
shard string No The shard(s) for which information is requested. Multiple shard names can be
specified as a comma separated list.
Output
Output Content
The response will include the status of the request and the cluster status.
Examples
Input
http://localhost:8983/solr/admin/collections?action=clusterstatus&wt=json
Output
{
"responseHeader":{
"status":0,
"QTime":333},
"cluster":{
"collections":{
"collection1":{
"shards":{
"shard1":{
"range":"80000000-ffffffff",
"state":"active",
"replicas":{
"core_node1":{
"state":"active",
"core":"collection1",
"node_name":"127.0.1.1:8983_solr",
"base_url":"http://127.0.1.1:8983/solr",
"leader":"true"},
"core_node3":{
"state":"active",
"core":"collection1",
"node_name":"127.0.1.1:8900_solr",
"base_url":"http://127.0.1.1:8900/solr"}}},
"shard2":{
"range":"0-7fffffff",
"state":"active",
"replicas":{
"core_node2":{
"state":"active",
"core":"collection1",
"node_name":"127.0.1.1:7574_solr",
"base_url":"http://127.0.1.1:7574/solr",
"leader":"true"},
"core_node4":{
"state":"active",
"core":"collection1",
"node_name":"127.0.1.1:7500_solr",
"base_url":"http://127.0.1.1:7500/solr"}}}},
"maxShardsPerNode":"1",
"router":{"name":"compositeId"},
"replicationFactor":"1",
"autoCreated":"true",
"aliases":["both_collections"]},
"collection2":{
"shards":{
"shard1":{
"range":"80000000-d554ffff",
"state":"active",
"replicas":{"core_node1":{
"state":"active",
"core":"collection2_shard1_replica1",
"node_name":"127.0.1.1:8983_solr",
"base_url":"http://127.0.1.1:8983/solr",
"leader":"true"}}},
"shard2":{
"range":"d5550000-2aa9ffff",
"state":"active",
"replicas":{"core_node2":{
"state":"active",
"core":"collection2_shard2_replica1",
"node_name":"127.0.1.1:7500_solr",
"base_url":"http://127.0.1.1:7500/solr",
"leader":"true"}}},
"shard3":{
"range":"2aaa0000-7fffffff",
"state":"active",
"replicas":{"core_node3":{
"state":"active",
"core":"collection2_shard3_replica1",
"node_name":"127.0.1.1:8900_solr",
"base_url":"http://127.0.1.1:8900/solr",
"leader":"true"}}}},
"maxShardsPerNode":"1",
"router":{"name":"compositeId"},
"replicationFactor":"1",
"autoAddReplicas":"false",
"aliases":["both_collections"]}},
"aliases":{"both_collections":"collection1,collection2"},
"roles":{"overseer":["127.0.1.1:8983_solr",
"127.0.1.1:7574_solr"]},
"live_nodes":["127.0.1.1:7574_solr",
"127.0.1.1:7500_solr",
"127.0.1.1:8983_solr",
"127.0.1.1:8900_solr"]}}
Request Status
/admin/collections?action=REQUESTSTATUS&requestid=request-id
Request the status of an already submitted Asynchronous Collection API call. This call is also used to clear up the
stored statuses (see below).
Input
Query Parameters
Key Type Required Description
requestid string Yes The user-defined request-id for the request. This can be used to track the status
of the submitted asynchronous task. -1 is a special request id which is used to
clean up the stored states for all of the already completed/failed tasks.
Examples
Input: Valid Request Status
http://localhost:8983/solr/admin/collections?action=REQUESTSTATUS&requestid=1000
Output
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
</lst>
<lst name="status">
<str name="state">completed</str>
<str name="msg">found 1000 in completed tasks</str>
</lst>
</response>
Input: Invalid RequestId
http://localhost:8983/solr/admin/collections?action=REQUESTSTATUS&requestid=1004
Output
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
</lst>
<lst name="status">
<str name="state">notfound</str>
<str name="msg">Did not find taskid [1004] in any tasks queue</str>
</lst>
</response>
Input: Clearing up all the stored statuses
http://localhost:8983/solr/admin/collections?action=REQUESTSTATUS&requestid=-1
List Collections
/admin/collections?action=LIST
Fetch the names of the collections in the cluster.
Example
Input
http://localhost:8983/solr/admin/collections?action=LIST&wt=json
Output
{
"responseHeader":{
"status":0,
"QTime":2011},
"collections":["collection1",
"example1",
"example2"]}
Asynchronous Calls
Since some collection API calls can be long-running tasks (e.g., Shard Split), you can optionally have the calls run
asynchronously. Specifying async=<request-id> enables you to make an asynchronous call, the status of which can
be requested at any point using the REQUESTSTATUS call.
As of now, REQUESTSTATUS does not automatically clean up the tracking data structures, i.e., the status of
completed/failed tasks stays stored in ZooKeeper unless cleared manually. Sending a REQUESTSTATUS call with a
requestid of -1 clears the stored statuses.
Example
Input
http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=collection1&
shard=shard1&async=1000
Output
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">99</int>
</lst>
<str name="requestid">1000</str>
</response>
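The progress of the asynchronous SPLITSHARD call submitted above can then be polled with the same request id
via the REQUESTSTATUS call described earlier, for example:
http://localhost:8983/solr/admin/collections?action=REQUESTSTATUS&requestid=1000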
Parameter Reference
Cluster Parameters
numShards Defaults
to 1
The number of shards to hash documents to. There must be one leader per shard and
each leader can have N replicas.
SolrCloud Instance Parameters
These are set in solr.xml, but by default they are set up to also work with system properties.
host Defaults to the first local
host address found
If the wrong host address is found automatically, you can override
the host address with this parameter.
hostPort Defaults to the jetty.port
system property
The port that Solr is running on. By default this is found by looking
at the jetty.port system property.
hostContext Defaults to solr The context path for the Solr web application.
SolrCloud Instance ZooKeeper Parameters
zkRun Defaults to localhost:<solrPort+1001>
Causes Solr to run an embedded version of
ZooKeeper. Set to the address of ZooKeeper on this
node; this allows us to know who you are in the list of
addresses in the zkHost connect string. Use
-DzkRun to get the default value.
zkHost No default The host address for ZooKeeper. Usually this is a
comma-separated list of addresses to each node in
your ZooKeeper ensemble.
zkClientTimeout Defaults to 15000 The time a client is allowed to not talk to ZooKeeper
before its session expires.
zkRun and zkHost are set up using system properties. zkClientTimeout is set up in solr.xml by default, but
can also be set using a system property.
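As a sketch only (the parameter names come from the tables above; the values and the two-shard setup are
illustrative, not recommendations), these parameters are typically passed as system properties when starting the
example Jetty server:
cd example
java -DzkRun -DnumShards=2 -DzkClientTimeout=20000 -jar start.jar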
SolrCloud Core Parameters
shardId Defaults to being automatically assigned based on
numShards
Allows you to specify the id used to group cores
into shards.
shardId can be configured in solr.xml for each core element as an attribute.
Additional cloud-related parameters are discussed in Solr Cores and solr.xml.
Command Line Utilities
Solr's Administration page (found by default at http://hostname:8983/solr/) provides a section with menu
items for monitoring indexing and performance statistics, information about index distribution and replication, and
information on all threads running in the JVM at the time. There is also a section where you can run queries, and an
assistance area.
In addition, SolrCloud provides its own administration page (found by default at http://localhost:8983/solr/#/~cloud),
as well as a few tools available via ZooKeeper's Command Line Utility (CLI). The CLI lets you upload configuration
information to ZooKeeper, in the same two ways that were shown in the examples in Parameter Reference. It also
provides a few other commands that let you link configuration sets to collections, make ZooKeeper paths or clear them,
and download configurations from ZooKeeper to the local filesystem.
Using The ZooKeeper CLI
ZooKeeper has a utility that lets you pass command line parameters: zkcli.bat (for Windows environments) and
zkcli.sh (for Unix environments).
zkcli Parameters
Short Parameter
Usage
Meaning
-cmd <arg> CLI Command to be executed: bootstrap, upconfig, downconfig, linkconfig,
makepath, get, getfile, put, putfile, list, or clear.
This parameter is mandatory.
-z -zkhost
<locations>
ZooKeeper host address.
This parameter is mandatory for all CLI commands.
-c -collection
<name>
For linkconfig: name of the collection.
-d -confdir
<path>
For upconfig: a directory of configuration files.
-h -help Display help text.
-n -confname
<arg>
For upconfig, linkconfig: name of the configuration set.
-r -runzk
<port>
Run ZooKeeper internally by passing the Solr run port; only for clusters on one
machine.
-s -solrhome
<path>
For bootstrap or when using -runzk: the mandatory solrhome location.
The short form parameter options may be specified with a single dash (e.g.: -c mycollection).
The long form parameter options may be specified using either a single dash (e.g.: -collection mycollection)
or a double dash (e.g.: --collection mycollection).
ZooKeeper CLI Examples
Below are some examples of using the zkcli CLI:
Uploading a Configuration Directory
java -classpath example/solr-webapp/WEB-INF/lib/*
org.apache.solr.cloud.ZkCLI -cmd upconfig -zkhost 127.0.0.1:9983
-confdir example/solr/collection1/conf -confname conf1 -solrhome example/solr
Put arbitrary data into a new ZK file
java -classpath example/solr-webapp/WEB-INF/lib/*
org.apache.solr.cloud.ZkCLI -zkhost 127.0.0.1:9983 -put /data.txt 'some data'
Put a local file into a new ZK file
java -classpath example/solr-webapp/WEB-INF/lib/*
org.apache.solr.cloud.ZkCLI -zkhost 127.0.0.1:9983 -putfile /data.txt
/some/local/file.txt
Linking a Collection to a Configuration Set
java -classpath example/solr-webapp/webapp/WEB-INF/lib/*
org.apache.solr.cloud.ZkCLI -cmd linkconfig -zkhost 127.0.0.1:9983
-collection collection1 -confname conf1 -solrhome example/solr
Bootstrapping All the Configuration Directories in solr.xml
java -classpath example/solr-webapp/webapp/WEB-INF/lib/*
org.apache.solr.cloud.ZkCLI -cmd bootstrap -zkhost 127.0.0.1:9983
-solrhome example/solr
Scripts
There are scripts in example/cloud-scripts that handle the classpath and class name for you if you are using
Solr out of the box with Jetty. Commands then become:
sh zkcli.sh -cmd linkconfig -zkhost 127.0.0.1:9983
-collection collection1 -confname conf1 -solrhome example/solr
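The other commands can be shortened in the same way; for instance, the upconfig invocation shown earlier would
become (same arguments, script form assumed):
sh zkcli.sh -cmd upconfig -zkhost 127.0.0.1:9983
-confdir example/solr/collection1/conf -confname conf1 -solrhome example/solr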
SolrCloud with Legacy Configuration Files
All of the required configuration is already set up in the sample configurations shipped with Solr. You only need to
add the following if you are migrating old configuration files. Do not remove these files and parameters from a new
Solr instance if you intend to use Solr in SolrCloud mode.
These properties exist in 3 files: schema.xml, solrconfig.xml, and solr.xml.
1. In schema.xml, you must have a _version_ field defined:
<field name="_version_" type="long" indexed="true" stored="true" multiValued="false"/>
2. In solrconfig.xml, you must have an UpdateLog defined. This should be defined in the updateHandler
section.
<updateHandler>
...
<updateLog>
<str name="dir">${solr.data.dir:}</str>
</updateLog>
...
</updateHandler>
3. You must have a replication handler called /replication defined:
<requestHandler name="/replication" startup="lazy" />
There are several parameters available for this handler, discussed in the section Index Replication.
4. You must have a Realtime Get handler called "/get" defined:
<requestHandler name="/get">
<lst name="defaults">
<str name="omitHeader">true</str>
</lst>
</requestHandler>
The parameters for this handler are discussed in the section RealTime Get.
5. You must have the admin handlers defined:
<requestHandler name="/admin/" class="solr.admin.AdminHandlers" />
6. And you must leave the admin path in solr.xml as the default:
<cores adminPath="/admin/cores" />
7. The DistributedUpdateProcessor is part of the default update chain and is automatically injected into any of your
custom update chains, so you don't actually need to make any changes for this capability. However, should you wish
to add it explicitly, you can still add it to the solrconfig.xml file as part of an updateRequestProcessorChain.
For example:
<updateRequestProcessorChain name="sample">
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.DistributedUpdateProcessorFactory"/>
<processor class="my.package.UpdateFactory"/>
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
If you do not want the DistributedUpdateProcessorFactory auto-injected into your chain (for example, if you want to
use SolrCloud functionality, but you want to distribute updates yourself) then specify the
NoOpDistributingUpdateProcessorFactory update processor factory in your chain:
<updateRequestProcessorChain name="sample">
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.NoOpDistributingUpdateProcessorFactory"/>
<processor class="my.package.MyDistributedUpdateFactory"/>
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
In the update process, Solr skips updating processors that have already been run on other nodes.
Legacy Scaling and Distribution
This section describes how to set up distribution and replication in Solr. It is considered "legacy" behavior, since
while it is still supported in Solr, the SolrCloud functionality described in the previous chapter is where the current
development is headed. However, if you don't need all that SolrCloud delivers, search distribution and index
replication may be sufficient.
This section covers the following topics:
Introduction to Scaling and Distribution: Conceptual information about distribution and replication in Solr.
Distributed Search with Index Sharding: Detailed information about implementing distributed searching in Solr.
Index Replication: Detailed information about replicating your Solr indexes.
Combining Distribution and Replication: Detailed information about replicating shards in a distributed index.
Merging Indexes: Information about combining separate indexes in Solr.
Introduction to Scaling and Distribution
Both Lucene and Solr were designed to scale to support large implementations with minimal custom coding. This
section covers:
distributing an index across multiple servers
replicating an index on multiple servers
merging indexes
If you need full scale distribution of indexes and queries, as well as replication, load balancing and failover, you may
want to use SolrCloud. Full details on configuring and using SolrCloud are available in the section SolrCloud.
What Problem Does Distribution Solve?
If searches are taking too long or the index is approaching the physical limitations of its machine, you should
consider distributing the index across two or more Solr servers.
To distribute an index, you divide the index into partitions called shards, each of which runs on a separate machine.
Solr then partitions searches into sub-searches, which run on the individual shards, reporting results collectively.
The architectural details underlying index sharding are invisible to end users, who simply experience faster
performance on queries against very large indexes.
What Problem Does Replication Solve?
Replicating an index is useful when:
You have a large search volume which one machine cannot handle, so you need to distribute searches
across multiple read-only copies of the index.
There is a high volume/high rate of indexing which consumes machine resources and reduces search
performance on the indexing machine, so you need to separate indexing and searching.
You want to make a backup of the index (see ).Backing Up
Distributed Search with Index Sharding
When an index becomes too large to fit on a single system, or when a query takes too long to execute, an index can
be split into multiple shards, and Solr can query and merge results across those shards. A single shard receives the
query, distributes the query to other shards, and integrates the results. You can find additional information about
distributed search on the Solr wiki: .http://wiki.apache.org/solr/DistributedSearch
The figure below compares a single server to a distributed configuration with two shards.
Update commands may be sent to any server with distributed indexing configured correctly. Document adds and
deletes are forwarded to the appropriate server/shard based on a hash of the unique document id. commit
commands and deleteByQuery commands are sent to every server in shards.
Update reorders (i.e., replica A may see update X then Y, and replica B may see update Y then X). deleteByQuery
also handles reorders the same way, to ensure replicas are consistent. All replicas of a shard are consistent, even if
the updates arrive in a different order on different replicas.
Distributing Documents across Shards
It is up to you to get all your documents indexed on each shard of your server farm. Solr does not include
out-of-the-box support for distributed indexing, but your method can be as simple as a round robin technique. Just
index each document to the next server in the circle. (For more information about indexing, see Indexing and Basic
.)Data Operations
A simple hashing system would also work. The following should serve as an adequate hashing function.
uniqueId.hashCode() % numServers
One advantage of this approach is that it is easy to know where a document is if you need to update it or delete. In
contrast, if you are moving documents around in a round-robin fashion, you may not know where a document
actually is.
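As a minimal client-side sketch (not part of Solr; the class, the helper names, and the two shard URLs are
assumptions for illustration), the hash-based assignment described above could look like this in Java:

import java.util.Arrays;
import java.util.List;

public class ShardPicker {
    private final List<String> shardUrls;

    public ShardPicker(List<String> shardUrls) {
        this.shardUrls = shardUrls;
    }

    // Implements uniqueId.hashCode() % numServers from above; masking with
    // 0x7fffffff keeps the index non-negative for any hashCode value.
    public String shardFor(String uniqueId) {
        int idx = (uniqueId.hashCode() & 0x7fffffff) % shardUrls.size();
        return shardUrls.get(idx);
    }

    public static void main(String[] args) {
        ShardPicker picker = new ShardPicker(Arrays.asList(
                "http://localhost:8983/solr", "http://localhost:7574/solr"));
        // Always index (and later update or delete) a document on the shard this returns.
        System.out.println(picker.shardFor("doc-42"));
    }
}

Because re-hashing the unique id always points back at the same shard, the same helper answers the update and
delete question raised in the previous paragraph.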
Solr does not calculate universal term/doc frequencies. For most large-scale implementations, it is not likely to
matter that Solr calculates TF/IDF at the shard level. However, if your collection is heavily skewed in its distribution
across servers, you may find misleading relevancy results in your searches. In general, it is probably best to
randomly distribute documents to your shards.
You can directly configure aspects of the concurrency and thread-pooling used within distributed search in Solr. This
allows for finer grained control and you can tune it to target your own specific requirements. The default
configuration favors throughput over latency.
If single queries are currently fast enough and if one simply wants to expand the capacity (queries/sec) of
the search system, then standard index replication (replicating the entire index on multiple servers) should
be used instead of index sharding.
To configure the standard handler, provide a configuration like this:
<requestHandler name="standard" class="solr.SearchHandler" default="true">
<!-- other params go here -->
<shardHandlerFactory class="HttpShardHandlerFactory">
<int name="socketTimeOut">1000</int>
<int name="connTimeOut">5000</int>
</shardHandler>
</requestHandler>
The parameters that can be specified are as follows:
Parameter Default Explanation
socketTimeout 0 (use OS default) The amount of time in ms that a socket is allowed to wait.
connTimeout 0 (use OS default) The amount of time in ms that is accepted for binding /
connecting a socket
maxConnectionsPerHost 20 The maximum number of connections that is made to each
individual shard in a distributed search.
corePoolSize 0 The retained lowest limit on the number of threads used in
coordinating distributed search.
maximumPoolSize Integer.MAX_VALUE The maximum number of threads used for coordinating
distributed search.
maxThreadIdleTime 5 seconds The amount of time to wait for before threads are scaled
back in response to a reduction in load.
sizeOfQueue -1 If specified, the thread pool will use a backing queue
instead of a direct handoff buffer. High throughput systems
will want to configure this to be a direct hand off (with -1).
Systems that desire better latency will want to configure a
reasonable size of queue to handle variations in requests.
fairnessPolicy false Chooses the JVM specifics dealing with fair policy
queuing, if enabled distributed searches will be handled in
a First in First out fashion at a cost to throughput. If
disabled throughput will be favored over latency.
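Building on the parameter table above, a configuration that favors latency over raw throughput might bound the
thread pool and use a backing queue. This is only a sketch: the element names follow the table, but the specific
values are illustrative, not recommendations:
<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <shardHandlerFactory class="HttpShardHandlerFactory">
    <int name="socketTimeout">1000</int>
    <int name="connTimeout">5000</int>
    <int name="maxConnectionsPerHost">20</int>
    <int name="corePoolSize">4</int>
    <int name="maximumPoolSize">32</int>
    <int name="maxThreadIdleTime">30</int>
    <int name="sizeOfQueue">100</int>
    <bool name="fairnessPolicy">true</bool>
  </shardHandlerFactory>
</requestHandler>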
Executing Distributed Searches with the Parametershards
If a query request includes the parameter, the Solr server distributes the request across all the shards listedshards
as arguments to the parameter. The parameter uses this syntax:shards
host:port/base_url[,host:port/base_url]*
For example, the shards parameter below causes the search to be distributed across two Solr servers: solr1 and
solr2, both of which are running on port 8983:
http://localhost:8983/solr/select?shards=solr1:8983/solr,solr2:8983/solr&indent=true&
q=ipod+solr
Rather than require users to include the shards parameter explicitly, it is usually preferred to configure this
parameter as a default in the RequestHandler section of solrconfig.xml.
Currently, only query requests are distributed. This includes requests to the standard request handler (and
subclasses such as the DisMax RequestHandler), and any other handler
(org.apache.solr.handler.component.SearchHandler) using standard components that support distributed search.
Where shards.info=true, distributed responses will include information about the shard (where each shard
represents a logically different index or physical location), such as the following:
<lst name="shards.info">
<lst name="localhost:7777/solr">
<long name="numFound">1333</long>
<float name="maxScore">1.0</float>
<str name="shardAddress">http://localhost:7777/solr</str>
<long name="time">686</long>
</lst>
<lst name="localhost:8888/solr">
<long name="numFound">342</long>
<float name="maxScore">1.0</float>
<str name="shardAddress">http://localhost:8888/solr</str>
<long name="time">602</long>
</lst>
</lst>
The following components support distributed search:
The Query component, which returns documents matching a query
The Facet component, which processes facet.query and facet.field requests where facets are sorted by count
(the default).
The Highlighting component, which enables Solr to include "highlighted" matches in field values.
The Stats component, which returns simple statistics for numeric fields within the DocSet.
The Debug component, which helps with debugging.
Limitations to Distributed Search
Distributed searching in Solr has the following limitations:
Each document indexed must have a unique key.
If Solr discovers duplicate document IDs, Solr selects the first document and discards subsequent ones.
Inverse-document frequency (IDF) calculations cannot be distributed.
The index for distributed searching may become momentarily out of sync if a commit happens between the
first and second phase of the distributed search. This might cause a situation where a document that once
matched a query and was subsequently changed may no longer match the query but will still be retrieved.
This situation is expected to be quite rare, however, and is only possible for a single query request.
The number of shards is limited by the number of characters allowed for the GET method's URI; most Web servers
generally support at least 4000 characters, but many servers limit URI length to reduce their vulnerability to
Denial of Service (DoS) attacks.
TF/IDF computations are per shard. This may not matter if content is well (randomly) distributed.
Do not add the shards parameter to the standard requestHandler; otherwise, search queries may enter an
infinite loop. Instead, define a new requestHandler that uses the shards parameter, and pass distributed
search requests to that handler.
Shard information can be returned with each document in a distributed search by including fl=id,[shard]
in the search request. This returns the shard URL.
In a distributed search, the data directory from the core descriptor overrides any data directory in
solrconfig.xml.
Update commands may be sent to any server with distributed indexing configured correctly. Document adds
and deletes are forwarded to the appropriate server/shard based on a hash of the unique document id.
commit commands and deleteByQuery commands are sent to every server in shards.
Avoiding Distributed Deadlock
Each shard may also serve top-level query requests and then make sub-requests to all of the other shards. In this
configuration, care should be taken to ensure that the max number of threads serving HTTP requests in the servlet
container is greater than the possible number of requests from both top-level clients and other shards. If this is not
the case, the configuration may result in a distributed deadlock.
For example, a deadlock might occur in the case of two shards, each with just a single thread to service HTTP
requests. Both threads could receive a top-level request concurrently, and make sub-requests to each other.
Because there are no more remaining threads to service requests, the servlet containers will block the incoming
requests until the other pending requests are finished, but they will not finish since they are waiting for the
sub-requests. By ensuring that the servlets are configured to handle a sufficient number of threads, you can avoid
deadlock situations like this.
Testing Index Sharding on Two Local Servers
For simple functionality testing, it's easiest to just set up two local Solr servers on different ports. (In a production
environment, of course, these servers would be deployed on separate machines.)
1. Make a copy of the solr example directory:
cd solr
cp -r example example7574
2. Change the port number:
perl -pi -e s/8983/7574/g example7574/etc/jetty.xml \
example7574/exampledocs/post.sh
3. In the first window, start up the server on port 8983:
cd example
java -server -jar start.jar
4. In the second window, start up the server on port 7574:
cd example7574
java -server -jar start.jar
5. In the third window, index some example documents to each server:
cd example/exampledocs
./post.sh [a-m]*.xml
cd ../../example7574/exampledocs
./post.sh [n-z]*.xml
6. Now do a distributed search across both servers with your browser or curl:
curl
'http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr
&indent=true&q=ipod+solr'
Index Replication
Index Replication distributes complete copies of a master index to one or
more slave servers. The master server continues to manage updates to the
index. All querying is handled by the slaves. This division of labor enables
Solr to scale to provide adequate responsiveness to queries against large
search volumes.
The figure below shows a Solr configuration using index replication. The
master server's index is replicated on the slaves.
A Solr index can be replicated across multiple slave servers, which then
process requests.
Topics covered in this section:
Index Replication in Solr
Replication Terminology
Configuring the Replication RequestHandler on a Master Server
Configuring the Replication RequestHandler on a Slave Server
Setting Up a Repeater with the ReplicationHandler
Commit and Optimize Operations
Slave Replication
Index Replication using ssh and rsync
The Snapshot and Distribution Process
Commit and Optimization
Distribution and Optimization
Index Replication in Solr
Solr includes a Java implementation of index replication that works over HTTP.
The Lucene index format has changed with Solr 4. As a result,
once you upgrade, previous versions of Solr will no longer be able
to read the rest of your indices. In a master/slave configuration, all
searchers/slaves should be upgraded before the master. If the
master is updated first, the older searchers will not be able to read
the new index format.
For information on the ssh/rsync based replication, see Index Replication using ssh and rsync.
The Java-based implementation of index replication offers these benefits:
Replication without requiring external scripts
The configuration affecting replication is controlled by a single file, solrconfig.xml
Supports the replication of configuration files as well as index files
Works across platforms with same configuration
No reliance on OS-dependent hard links
Tightly integrated with Solr; an admin page offers fine-grained control of each aspect of replication
The Java-based replication feature is implemented as a RequestHandler. Configuring replication is therefore
similar to any normal RequestHandler.
Replication Terminology
The table below defines the key terms associated with Solr replication.
Term Definition
Collection A Lucene collection is a directory of files. These files make up the indexed and returnable data of
a Solr search repository.
Distribution The copying of a collection from the master server to all slaves. The distribution process takes
advantage of Lucene's index file structure.
Inserts and
Deletes
As inserts and deletes occur in the collection, the directory remains unchanged. Documents are
always inserted into newly created files. Documents that are deleted are not removed from the
files. They are flagged in the file, deletable, and are not removed from the files until the collection
is optimized.
Master and
Slave
The Solr distribution model uses the master/slave model. The master is the service which receives
all updates initially and keeps everything organized. Solr uses a single update master server
coupled with multiple query slave servers. All changes (such as inserts, updates, deletes, etc.) are
made against the single master server. Changes made on the master are distributed to all the
slave servers which service all query requests from the clients.
Update An update is a single change request against a single Solr instance. It may be a request to delete
a document, add a new document, change a document, delete all documents matching a query,
etc. Updates are handled synchronously within an individual Solr instance.
Optimization A process that compacts the index and merges segments in order to improve query performance.
New secondary segment(s) are created to contain documents inserted into the collection after it
has been optimized. A Lucene collection must be optimized periodically to maintain satisfactory
query performance. Optimization is run on the master server only. An optimized index will give you
a performance gain at query time of at least 10%. This gain may be more on an index that has
become fragmented over a period of time with many updates and no optimizations. Optimizations
require a much longer time than does the distribution of an optimized collection to all slaves.
Segments The number of files in a collection.
mergeFactor A parameter that controls the number of files (segments) in a collection. For example, when
mergeFactor is set to 3, Solr will fill one segment with documents until the limit maxBufferedDocs
is met, then it will start a new segment. When the number of segments specified by mergeFactor
is reached (in this example, 3) then Solr will merge all the segments into a single index file, then
begin writing new documents to a new segment.
Snapshot A directory containing hard links to the data files. Snapshots are distributed from the master server
when the slaves pull them, "smartcopying" the snapshot directory that contains the hard links to
the most recent collection data files.
Configuring the Replication RequestHandler on a Master Server
Before running a replication, you should set the following parameters on initialization of the handler:
Name Description
replicateAfter String specifying action after which replication should occur. Valid values are commit,
optimize, or startup. There can be multiple values for this parameter. If you use
"startup", you need to have a "commit" and/or "optimize" entry also if you want to
trigger replication on future commits or optimizes.
backupAfter String specifying action after which a backup should occur. Valid values are commit,
optimize, or startup. There can be multiple values for this parameter. It is not required
for replication, it just makes a backup.
maxNumberOfBackups Integer specifying how many backups to keep. This can be used to delete all but the
most recent N backups.
confFiles The configuration files to replicate, separated by a comma.
commitReserveDuration If your commits are very frequent and your network is slow, you can tweak this
parameter to increase the amount of time taken to download 5Mb from the master to a
slave. The default is 10 seconds.
The example below shows how to configure the Replication RequestHandler on a master server.
<requestHandler name="/replication" class="solr.ReplicationHandler" >
<lst name="master">
<str name="replicateAfter">optimize</str>
<str name="backupAfter">optimize</str>
<str name="confFiles">schema.xml,stopwords.txt,elevate.xml</str>
<str name="commitReserveDuration">00:00:10</str>
</lst>
<int name="maxNumberOfBackups">2</int>
</requestHandler>
Replicating solrconfig.xml
In the configuration file on the master server, include a line like the following:
<str name="confFiles">solrconfig_slave.xml:solrconfig.xml,x.xml,y.xml</str>
This ensures that the local configuration solrconfig_slave.xml will be saved as solrconfig.xml on the
slave. All other files will be saved with their original names.
On the master server, the file name of the slave configuration file can be anything, as long as the name is correctly
identified in the confFiles string; then it will be saved as whatever file name appears after the colon ':'.
Configuring the Replication RequestHandler on a Slave Server
The code below shows how to configure a ReplicationHandler on a slave.
<requestHandler name="/replication" class="solr.ReplicationHandler" >
<lst name="slave">
<!--fully qualified url for the replication handler of master. It is possible
to pass on this as
a request param for the fetchindex command-->
<str name="masterUrl">http://remote_host:port/solr/corename/replication</str>
<!--Interval in which the slave should poll master .Format is HH:mm:ss . If
this is absent slave does not
poll automatically.
But a fetchindex can be triggered from the admin or the http API -->
<str name="pollInterval">00:00:20</str>
<!-- THE FOLLOWING PARAMETERS ARE USUALLY NOT REQUIRED-->
<!--to use compression while transferring the index files. The possible values
are internal|external
if the value is 'external' make sure that your master Solr has the settings
to honor the
accept-encoding header.
See here for details: http://wiki.apache.org/solr/SolrHttpCompression
If it is 'internal' everything will be taken care of automatically.
USE THIS ONLY IF YOUR BANDWIDTH IS LOW . THIS CAN ACTUALLY SLOWDOWN
REPLICATION IN A LAN-->
<str name="compression">internal</str>
<!--The following values are used when the slave connects to the master to
download the index files.
Default values implicitly set as 5000ms and 10000ms respectively. The user
DOES NOT need to specify
these unless the bandwidth is extremely low or if there is an extremely high
latency-->
<str name="httpConnTimeout">5000</str>
<str name="httpReadTimeout">10000</str>
<!-- If HTTP Basic authentication is enabled on the master, then the slave can
be
configured with the following -->
<str name="httpBasicAuthUser">username</str>
<str name="httpBasicAuthPassword">password</str>
</lst>
</requestHandler>
If you are not using cores, then you simply omit the corename parameter in the masterUrl above. To
ensure that the URL is correct, just hit the URL with a browser. You must get a status OK response.
Setting Up a Repeater with the ReplicationHandler
A master may be able to serve only so many slaves without affecting performance. Some organizations have
deployed slave servers across multiple data centers. If each slave downloads the index from a remote data center,
the resulting download may consume too much network bandwidth. To avoid performance degradation in cases like
this, you can configure one or more slaves as repeaters. A repeater is simply a node that acts as both a master and
a slave.
To configure a server as a repeater, the definition of the Replication requestHandler in the
solrconfig.xml file must include file lists of use for both masters and slaves.
Be sure to set the replicateAfter parameter to commit, even if replicateAfter is set to optimize on
the main master. This is because on a repeater (or any slave), a commit is called only after the index is
downloaded. The optimize command is never called on slaves.
Optionally, one can configure the repeater to fetch compressed files from the master through the
compression parameter to reduce the index download time.
Here is an example of a ReplicationHandler configuration for a repeater:
<requestHandler name="/replication" class="solr.ReplicationHandler">
<lst name="master">
<str name="replicateAfter">commit</str>
<str name="confFiles">schema.xml,stopwords.txt,synonyms.txt</str>
</lst>
<lst name="slave">
<str name="masterUrl">http://master.solr.company.com:8983/solr/replication</str>
<str name="pollInterval">00:00:60</str>
</lst>
</requestHandler>
Commit and Optimize Operations
When a commit or optimize operation is performed on the master, the RequestHandler reads the list of file names
which are associated with each commit point. This relies on the replicateAfter parameter in the configuration to
decide which types of events should trigger replication.
Setting on the Master Description
commit Triggers replication whenever a commit is performed on the master index.
optimize Triggers replication whenever the master index is optimized.
startup Triggers replication whenever the master index starts up.
The replicateAfter parameter can accept multiple arguments. For example:
<str name="replicateAfter">startup</str>
<str name="replicateAfter">commit</str>
<str name="replicateAfter">optimize</str>
Slave Replication
The master is totally unaware of the slaves. The slave continuously keeps polling the master (depending on the
pollInterval parameter) to check the current index version of the master. If the slave finds out that the master has a
newer version of the index, it initiates a replication process. The steps are as follows:
1. The slave issues a filelist command to get the list of the files. This command returns the names of the
files as well as some metadata (for example, size, a lastmodified timestamp, an alias if any).
2. The slave checks with its own index if it has any of those files in the local index. It then runs the filecontent
command to download the missing files. This uses a custom format (akin to the HTTP chunked encoding) to
download the full content or a part of each file. If the connection breaks in between, the download resumes
from the point it failed. At any point, the slave tries 5 times before giving up a replication altogether.
3. The files are downloaded into a temp directory, so that if either the slave or the master crashes during the
download process, no files will be corrupted. Instead, the current replication will simply abort.
4. After the download completes, all the new files are moved to the live index directory and the file's timestamp
is the same as its counterpart on the master.
5. A commit command is issued on the slave by the Slave's ReplicationHandler and the new index is loaded.
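As mentioned in the configuration comments above, a fetchindex can also be triggered manually through the HTTP API; a minimal sketch (host, port, and core name are placeholders):
http://slave_host:port/solr/corename/replication?command=fetchindex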
Replicating Configuration Files
To replicate configuration files, list them using the confFiles parameter. Only files found in the conf directory
of the master's Solr instance will be replicated.
Solr replicates configuration files only when the index itself is replicated. That means even if a configuration file is
changed on the master, that file will be replicated only after there is a new commit/optimize on master's index.
Unlike the index files, where the timestamp is good enough to figure out if they are identical, configuration files are
compared against their checksum. The schema.xml files (on master and slave) are judged to be identical if their
checksums are identical.
As a precaution when replicating configuration files, Solr copies configuration files to a temporary directory before
moving them into their ultimate location in the conf directory. The old configuration files are then renamed and kept
in the same conf/ directory. The ReplicationHandler does not automatically clean up these old files.
If a replication involved downloading of at least one configuration file, the ReplicationHandler issues a core-reload
command instead of a commit command.
Resolving Corruption Issues on Slave Servers
If documents are added to the slave, then the slave is no longer in sync with its master. However, the slave will not
undertake any action to put itself in sync, until the master has new index data. When a commit operation takes place
on the master, the index version of the master becomes different from that of the slave. The slave then fetches the
list of files and finds that some of the files present on the master are also present in the local index but with different
sizes and timestamps. This means that the master and slave have incompatible indexes. To correct this problem,
the slave then copies all the index files from master to a new index directory and asks the core to load the fresh
index from the new directory.
HTTP API Commands for the ReplicationHandler
You can use the HTTP commands below to control the ReplicationHandler's operations.
Command Description
http://master_host:port/solr/replication?command=enablereplication
Enables replication on the master for all its slaves.
http://master_host:port/solr/replication?command=disablereplication
Disables replication on the master for all its slaves.
http://host:port/solr/replication?command=indexversion
Returns the version of the latest replicatable index on the specified master or slave.
http://slave_host:port/solr/replication?command=fetchindex
Forces the specified slave to fetch a copy of the index from its master. If you like, you can pass an extra attribute
such as masterUrl or compression (or any other parameter which is specified in the <lst name="slave"> tag)
to do a one-time replication from a master. This obviates the need for hard-coding the master in the slave.
http://slave_host:port/solr/replication?command=abortfetch
Aborts copying an index from a master to the specified slave.
http://slave_host:port/solr/replication?command=enablepoll
Enables the specified slave to poll for changes on the master.
http://slave_host:port/solr/replication?command=disablepoll
Disables the specified slave from polling for changes on the master.
http://slave_host:port/solr/replication?command=details
Retrieves configuration details and current status.
http://host:port/solr/replication?command=filelist&indexversion=<index-version-number>
Retrieves a list of Lucene files present in the specified host's index. You can discover the version number of the
index by running the indexversion command.
http://master_host:port/solr/replication?command=backup
Creates a backup on the master if there are committed index data in the server; otherwise, does nothing. This
command is useful for making periodic backups. Request parameters:
numberToKeep: can be used with the backup command unless the maxNumberOfBackups initialization
parameter has been specified on the handler – in which case maxNumberOfBackups is always used and
attempts to use the numberToKeep request parameter will cause an error.
name: (optional) Backup name. The snapshot will be created in a directory called snapshot.<name> within the
data directory of the core. By default the name is generated using the date in yyyyMMddHHmmssSSS format. If
the location parameter is passed, that will be used instead of the data directory.
location: Backup location.
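For example, a backup request that names the snapshot and directs it to a specific directory might look like the following (the name and location values shown are illustrative):
http://master_host:port/solr/replication?command=backup&name=weekly&location=/var/backups/solr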
http://master_host:port/solr/replication?command=deletebackup
Deletes any backup created using the backup command. Request parameters:
name: The name of the snapshot. A snapshot with the name snapshot.<name> must exist; if not, an error is
thrown.
location: Location where the snapshot is created.
Index Replication using ssh and rsync
Solr supports ssh/rsync-based replication. This mechanism only works on systems that support removing open
hard links.
Solr distribution is similar in concept to database replication. All collection changes come to one master Solr server.
All production queries are done against query slaves. Query slaves receive all their collection changes indirectly —
as new versions of a collection which they pull from the master. These collection downloads are polled for on a
cron'd basis.
A collection is a directory of many files. Collections are distributed to the slaves as snapshots of these files. Each
snapshot is made up of hard links to the files so copying of the actual files is not necessary when snapshots are
created. Lucene only significantly rewrites files following an optimization command. Generally, once a file is written,
it will change very little, if at all. This makes the underlying transport of rsync very useful. Files that have already
been transferred and have not changed do not need to be re-transferred with the new edition of a collection.
The Snapshot and Distribution Process
Here are the steps that Solr follows when replicating an index:
1. The snapshooter command takes snapshots of the collection on the master. It runs when invoked by Solr
after it has done a commit or an optimize.
2. The snappuller command runs on the query slaves to pull the newest snapshot from the master. This is
done via rsync in daemon mode running on the master for better performance and lower CPU utilization over
rsync using a remote shell program as the transport.
3. The snapinstaller runs on the slave after a snapshot has been pulled from the master. This signals the local
Solr server to open a new index reader, then auto-warming of the cache(s) begins (in the new reader), while
other requests continue to be served by the original index reader. Once auto-warming is complete, Solr
retires the old reader and directs all new queries to the newly cache-warmed reader.
4. All distribution activity is logged and written back to the master to be viewable on the distribution page of its
GUI.
5. Old versions of the index are removed from the master and slave servers by a cron'd snapcleaner.
If you are building an index from scratch, distribution is the final step of the process.
Manual copying of index files is not recommended; however, running distribution commands manually (that is, not
relying on crond to run them) is perfectly fine.
Snapshot Directories
Snapshots are stored in directories whose names follow this format: snapshot.yyyymmddHHMMSS
All the files in the index directory are hard links to the latest snapshot. This design offers these advantages:
The Solr implementation can keep multiple snapshots on each host without needing to keep multiple copies
of index files that have not changed.
File copying from master to slave is very fast.
Taking a snapshot is very fast as well.
Solr Distribution Scripts
For the Solr distribution scripts, the name of the index directory is defined either by the environment variable
data_dir in the configuration file solr/conf/scripts.conf or by the command line argument -d. It should match the
value used by the Solr server, which is defined in solr/conf/solrconfig.xml.
All Solr collection distribution scripts are bundled in a Solr release and reside in the directory solr/src/scripts.
It's recommended that you install the scripts in a solr/bin/ directory.
Collection distribution scripts create and prepare for distribution a snapshot of a search collection after each commit
and optimize request, if the postCommit and postOptimize event listeners are configured in solrconfig.xml to
execute snapshooter.
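A minimal sketch of such a listener configuration is shown below; the executable path and working directory are illustrative and assume the scripts were installed under solr/bin as recommended above:
<listener event="postCommit" class="solr.RunExecutableListener">
  <str name="exe">solr/bin/snapshooter</str>
  <str name="dir">.</str>
  <bool name="wait">true</bool>
</listener>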
The snapshooter script creates a directory snapshot.<ts>, where <ts> is a timestamp in the format
yyyymmddHHMMSS. It contains hard links to the data files.
Snapshots are distributed from the master server when the slaves pull them, "smartcopying" the snapshot directory
that contains the hard links to the most recent collection data files.
Name Description
snapshooter Creates a snapshot of a collection. Snapshooter is normally configured to run on the master
Solr server when a commit or optimize happens. Snapshooter can also be run manually, but
one must make sure that the index is in a consistent state, which can only be done by
pausing indexing and issuing a commit.
snappuller A shell script that runs as a cron job on a slave Solr server. The script looks for new
snapshots on the master Solr server and pulls them.
snappuller-enable Creates the file solr/logs/snappuller-enabled, whose presence enables snappuller.
snapinstaller Installs the latest snapshot (determined by the timestamp) into place, using hard links
(similar to the process of taking a snapshot). Then solr/logs/snapshot.current is
written and scp'd (securely copied) back to the master Solr server. snapinstaller then triggers
the Solr server to open a new Searcher.
snapcleaner Runs as a cron job to remove snapshots more than a configurable number of days old, or all
snapshots except for the most recent n number of snapshots. Also can be run manually.
rsyncd-start Starts the rsyncd daemon on the master Solr server which handles collection distribution
requests from the slaves.
rsyncd daemon Efficiently synchronizes a collection—between master and slaves—by copying only the files
that actually changed. In addition, rsync can optionally compress data before transmitting it.
rsyncd-stop Stops the rsyncd daemon on the master Solr server. The stop script then makes sure that the
daemon has in fact exited by trying to connect to it for up to 300 seconds. The stop script
exits with error code 2 if it fails to stop the rsyncd daemon.
rsyncd-enable Creates the file solr/logs/rsyncd-enabled, whose presence allows the rsyncd daemon
to run, allowing replication to occur.
rsyncd-disable Removes the file solr/logs/rsyncd-enabled, whose absence prevents the rsyncd
daemon from running, preventing replication.
For more information about usage arguments and syntax see the SolrCollectionDistributionScripts page on the Solr
Wiki.
Solr Distribution-related Cron Jobs
The distribution process is automated through the use of cron jobs. The cron jobs should run under the user ID that
the Solr server is running under.
Cron Job Description
snapcleaner The snapcleaner job should be run out of cron on a regular basis to clean up old snapshots.
This should be done on both the master and slave Solr servers. For example, the following cron
job runs every day at midnight and cleans up snapshots 8 days and older:
0 0 * * * <solr.solr.home>/solr/bin/snapcleaner -D 7
Additional cleanup can always be performed on-demand by running snapcleaner manually.
snappuller
snapinstaller
On the slave Solr servers, snappuller should be run out of cron regularly to get the latest index
from the master Solr server. It is a good idea to also run snapinstaller with snappuller back-to-back
in the same crontab entry to install the latest index once it has been copied over to the slave Solr
server.
For example, the following cron job runs every 5 minutes to keep the slave Solr server in sync with the master Solr
server:
0,5,10,15,20,25,30,35,40,45,50,55 * * * * <solr.solr.home>/solr/bin/snappuller;<solr.solr.home>/solr/bin/snapinstaller
(Modern cron allows this to be shortened to */5 * * * * ....)
Performance Tuning for Script-based Replication
Because fetching a master index uses the rsync utility, which transfers only the segments that have changed,
replication is normally very fast. However, if the master server has been optimized, then rsync may take a long time,
because many segments will have been changed in the process of optimization.
If replicating to multiple slaves consumes too much network bandwidth, consider the use of a repeater.
Make sure that slaves do not pull from the master so frequently that a previous replication is still running
when a new one is started. In general, it's best to allow at least a minute for the replication process to
complete. But in configurations with low network bandwidth or a very large index, even more time may be
required.
Commit and Optimization
On a very large index, adding even a few documents and then running an optimize operation causes the complete
index to be rewritten. This consumes a lot of disk I/O and impacts query performance. Optimizing a very large index
may even involve copying the index twice and calling optimize at the beginning and at the end. If some documents
have been deleted, the first optimize call will rewrite the index even before the second index is merged.
Optimization is an I/O intensive process, as the entire index is read and re-written in optimized form. Anecdotal data
shows that optimizations on modest server hardware can take around 5 minutes per GB, although this obviously
varies considerably with index fragmentation and hardware bottlenecks. We do not know what happens to query
performance on a collection that has not been optimized for a long time. We do know that it will get worse as the
collection becomes more fragmented, but how much worse is very dependent on the manner of updates and
commits to the collection. The setting of the mergeFactor attribute affects performance as well. Dividing a large
index with millions of documents into even as few as five segments may degrade search performance by as much
as 15-20%.
While optimizing has many benefits, a rapidly changing index will not retain those benefits for long, and since
optimization is an intensive process, it may be better to consider other options, such as lowering the merge factor
(discussed in this Guide in the section on Index Configuration).
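For instance, a lower merge factor can be set in the <indexConfig> section of solrconfig.xml; a minimal sketch (the value shown is illustrative):
<indexConfig>
  <mergeFactor>5</mergeFactor>
</indexConfig>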
Distribution and Optimization
The time required to optimize a master index can vary dramatically. A small index may be optimized in minutes. A
very large index may take hours. The variables include the size of the index and the speed of the hardware.
Distributing a newly optimized collection may take only a few minutes or up to an hour or more, again depending on
the size of the index and the performance capabilities of network connections and disks. During optimization the
machine is under load and does not process queries very well. Given a schedule of updates being driven a few
times an hour to the slaves, we cannot run an optimize with every committed snapshot.
Copying an optimized collection means that the entire collection will need to be transferred during the next snappull.
This is a large expense, but not nearly as huge as running the optimize everywhere. Consider this example: on a
three-slave one-master configuration, distributing a newly-optimized collection takes approximately 80 seconds total.
Rolling the change across a tier would require approximately ten minutes per machine (or machine group). If this
optimize were rolled across the query tier, and if each collection being optimized were disabled and not receiving
queries, a rollout would take at least twenty minutes and potentially as long as an hour and a half. Additionally, the
files would need to be synchronized so that the following rsync, snappull would not think that the independently
optimized files were different in any way. This would also leave the door open to independent corruption of
collections instead of each being a perfect copy of the master.
Optimizing on the master allows for a straight-forward optimization operation. No query slaves need to be taken out
of service. The optimized collection can be distributed in the background as queries are being normally serviced.
The optimization can occur at any time convenient to the application providing collection updates.
Combining Distribution and Replication
When your index is too large for a single machine and you have a query volume that single shards cannot keep up
with, it's time to replicate each shard in your distributed search setup.
The idea is to combine distributed search with replication. As shown in the figure below, a combined
distributed-replication configuration features a master server for each shard and then 1-n slaves that are replicated
from the master. As in a standard replicated configuration, the master server handles updates and optimizations
without adversely affecting query handling performance.
Query requests should be load balanced across each of the shard slaves. This gives you both increased query
handling capacity and fail-over backup if a server goes down.
A Solr configuration combining both replication and master-slave distribution.
None of the master shards in this configuration know about each other. You index to each master, the index is
replicated to each slave, and then searches are distributed across the slaves, using one slave from each
master/slave shard.
For high availability you can use a load balancer to set up a virtual IP for each shard's set of slaves. If you are new
to load balancing, HAProxy (http://haproxy.1wt.eu/) is a good open source software load-balancer. If a slave server
goes down, a good load-balancer will detect the failure using some technique (generally a heartbeat system), and
forward all requests to the remaining live slaves that served with the failed slave. A single virtual IP should then be
set up so that requests can hit a single IP, and get load balanced to each of the virtual IPs for the search slaves.
With this configuration you will have a fully load balanced, search-side fault-tolerant system (Solr does not yet
support fault-tolerant indexing). Incoming searches will be handed off to one of the functioning slaves, then the slave
will distribute the search request across a slave for each of the shards in your configuration. The slave will issue a
request to each of the virtual IPs for each shard, and the load balancer will choose one of the available slaves.
Finally, the results will be combined into a single results set and returned. If any of the slaves go down, they will be
taken out of rotation and the remaining slaves will be used. If a shard master goes down, searches can still be
served from the slaves until you have corrected the problem and put the master back into production.
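In such a setup, a distributed query typically lists one address per shard in the shards parameter, pointing at the virtual IP for each shard's slaves; a minimal sketch (host names are placeholders):
http://any_slave:8983/solr/select?q=cheese&shards=shard1-vip:8983/solr,shard2-vip:8983/solr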
Merging Indexes
If you need to combine indexes from two different projects or from multiple servers previously used in a distributed
configuration, you can use either the IndexMergeTool included in lucene-misc or the CoreAdminHandler.
To merge indexes, they must meet these requirements:
The two indexes must be compatible: their schemas should include the same fields and they should analyze
fields the same way.
The indexes must not include duplicate data.
Optimally, the two indexes should be built using the same schema.
Using IndexMergeTool
To merge the indexes, do the following:
1. Find the lucene-core and lucene-misc JAR files that your version of Solr is using. You can do this by copying
your solr.war file somewhere and unpacking it (jar xvf solr.war). These two JAR files should be in
WEB-INF/lib. They are probably called something like lucene-core-VERSION.jar and
lucene-misc-VERSION.jar.
2. Copy them somewhere easy to find.
3. Make sure that both indexes you want to merge are closed.
4. Issue this command:
java -cp /path/to/lucene-core-VERSION.jar:/path/to/lucene-misc-VERSION.jar
org/apache/lucene/misc/IndexMergeTool
/path/to/newindex
/path/to/index1
/path/to/index2
This will create a new index at /path/to/newindex that contains both index1 and index2.
5. Copy this new directory to the location of your application's Solr index (move the old one aside first, of course)
and start Solr.
For example:
java -cp /tmp/lucene-core-4.4.0.jar:
/tmp/lucene-misc-4.4.0.jar org/apache/lucene/misc/IndexMergeTool
./newindex
./app1/solr/data/index
./app2/solr/data/index
Using CoreAdmin
This method uses the CoreAdminHandler to execute the MERGEINDEXES command with either the indexDir or
srcCore parameters.
The indexDir parameter is used to define the path to the indexes for the cores that should be merged, and merge
them into a third core that must already exist prior to initiation of the merge process. The indexes must exist on the
disk of the Solr host, which may make using this in a distributed environment cumbersome. With the indexDir
parameter, a commit should be called on the cores to be merged (so the IndexWriter will close), and no writes should be
allowed on either core until the merge is complete. If writes are allowed, corruption may occur on the merged index.
Once complete, a commit should be called on the merged core to make sure the changes are visible to searchers.
The following example shows how to construct the merge command with indexDir:
http://localhost:8983/solr/admin/cores?action=mergeindexes&core=core0&indexDir=/home/
solr/core1/data/index&indexDir=/home/solr/core2/data/index
In this example, core0 (the core parameter) is the new core that is created prior to calling the merge process.
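Once the merge completes, the follow-up commit mentioned above can be issued directly against the merged core; a minimal sketch (host and core name are illustrative):
http://localhost:8983/solr/core0/update?commit=true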
The srcCore parameter is used to call the cores to be merged by name, instead of defining the path. The cores do
not need to exist on the same disk as the Solr host, and the merged core does not need to exist prior to issuing the
command. srcCore also protects against corruption during creation of the merged core index, so writes are still
possible while the merge occurs. However, srcCore can only merge Solr Cores - indexes built directly with Lucene
should be merged with either the IndexMergeTool or the indexDir parameter.
The following example shows how to construct the merge command with srcCore:
http://localhost:8983/solr/admin/cores?action=mergeindexes&core=core0&srcCore=core1&s
rcCore=core2
Client APIs
This section discusses the available client APIs for Solr. It covers the following topics:
Introduction to Client APIs: A conceptual overview of Solr client APIs.
Choosing an Output Format: Information about choosing a response format in Solr.
Using JavaScript: Explains why a client API is not needed for JavaScript responses.
Using Python: Information about Python and JSON responses.
Client API Lineup: A list of all Solr Client APIs, with links.
Using SolrJ: Detailed information about SolrJ, an API for working with Java applications.
Using Solr From Ruby: Detailed information about using Solr with Ruby applications.
MBean Request Handler: Describes the MBean request handler for programmatic access to Solr server statistics
and information.
Introduction to Client APIs
At its heart, Solr is a Web application, but because it is built on open protocols, any type of client application can use
Solr.
HTTP is the fundamental protocol used between client applications and Solr. The client makes a request and Solr
does some work and provides a response. Clients use requests to ask Solr to do things like perform queries or index
documents.
Client applications can reach Solr by creating HTTP requests and parsing the HTTP responses. Client APIs
encapsulate much of the work of sending requests and parsing responses, which makes it much easier to write
client applications.
Clients use Solr's five fundamental operations to work with Solr. The operations are query, index, delete, commit,
and optimize.
Queries are executed by creating a URL that contains all the query parameters. Solr examines the request URL,
performs the query, and returns the results. The other operations are similar, although in certain cases the HTTP
request is a POST operation and contains information beyond whatever is included in the request URL. An index
operation, for example, may contain a document in the body of the request.
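For example, a minimal query URL might look like the following (host, port, and query term are illustrative):
http://localhost:8983/solr/select?q=cheese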
Solr also features an EmbeddedSolrServer that offers a Java API without requiring an HTTP connection. For details,
see Using SolrJ.
Choosing an Output Format
Many programming environments are able to send HTTP requests and retrieve responses. Parsing the responses is
a slightly more thorny problem. Fortunately, Solr makes it easy to choose an output format that will be easy to
handle on the client side.
Specify a response format using the wt parameter in a query. The available response formats are documented in
Response Writers.
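For example, appending wt=json to a query URL asks Solr for a JSON response (the host and query are illustrative):
http://localhost:8983/solr/select?q=cheese&wt=json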
Most client APIs hide this detail for you, so for many types of client applications, you won't ever have to specify a wt
parameter. In JavaScript, however, the interface to Solr is a little closer to the metal, so you will need to add this
parameter yourself.
Client API Lineup
The Solr Wiki contains a list of client APIs at .http://wiki.apache.org/solr/IntegratingSolr
Here is the list of client APIs, current at this writing (November 2011):
Name Environment URL
SolRuby Ruby http://wiki.apache.org/solr/SolRuby
DelSolr Ruby http://delsolr.rubyforge.org/
acts_as_solr Rails http://acts-as-solr.rubyforge.org/, http://rubyforge.org/projects/background-solr/
Flare Rails http://wiki.apache.org/solr/Flare
SolPHP PHP http://wiki.apache.org/solr/SolPHP
SolrJ Java http://wiki.apache.org/solr/SolJava
Python API Python http://wiki.apache.org/solr/SolPython
PySolr Python http://code.google.com/p/pysolr/
SolPerl Perl http://wiki.apache.org/solr/SolPerl
Solr.pm Perl http://search.cpan.org/~garafola/Solr-0.03/lib/Solr.pm
SolrForrest Forrest/Cocoon http://wiki.apache.org/solr/SolrForrest
SolrSharp C# http://www.codeplex.com/solrsharp
SolColdfusion ColdFusion http://solcoldfusion.riaforge.org/
SolrNet .NET http://code.google.com/p/solrnet/
AJAX Solr AJAX http://github.com/evolvingweb/ajax-solr/wiki
Using JavaScript
Using Solr from JavaScript clients is so straightforward that it deserves a special mention. In fact, it is so
straightforward that there is no client API. You don't need to install any packages or configure anything.
HTTP requests can be sent to Solr using the standard XMLHttpRequest mechanism.
Out of the box, Solr can send JavaScript Object Notation (JSON) responses, which are easily interpreted in
JavaScript. Just add wt=json to the request URL to have responses sent as JSON.
For more information and an excellent example, read the SolJSON page on the Solr Wiki:
http://wiki.apache.org/solr/SolJSON
Using Python
Solr includes an output format specifically for Python, but JSON output is a little more robust.
Simple Python
Making a query is a simple matter. First, tell Python you will need to make HTTP connections.
from urllib2 import *
Now open a connection to the server and get a response. The wt query parameter tells Solr to return results in a
format that Python can understand.
connection = urlopen(
'http://localhost:8983/solr/select?q=cheese&wt=python')
response = eval(connection.read())
Now interpreting the response is just a matter of pulling out the information that you need.
print response['response']['numFound'], "documents found."
# Print the name of each document.
for document in response['response']['docs']:
    print "  Name =", document['name']
Python with JSON
JSON is a more robust response format, but you will need to add a Python package in order to use it. At a command
line, install the simplejson package like this:
$ sudo easy_install simplejson
Once that is done, making a query is nearly the same as before. However, notice that the wt query parameter is now
json, and the response is now digested by simplejson.load().
from urllib2 import *
import simplejson
connection = urlopen('http://localhost:8983/solr/select?q=cheese&wt=json')
response = simplejson.load(connection)
print response['response']['numFound'], "documents found."
# Print the name of each document.
for document in response['response']['docs']:
    print "  Name =", document['name']
Using SolrJ
SolrJ is an API that makes it easy for Java applications to talk to Solr. SolrJ hides a lot of the details of connecting to
Solr and allows your application to interact with Solr with simple high-level methods.
The center of SolrJ is the org.apache.solr.client.solrj package, which contains just five main classes.
Begin by creating a SolrServer, which represents the Solr instance you want to use. Then send SolrRequests
or SolrQuerys and get back SolrResponses.
SolrServer is abstract, so to connect to a remote Solr instance, you'll actually create an instance of either
HttpSolrServer or CloudSolrServer. Both communicate with Solr via HTTP; the difference is that HttpSolrServer
is configured using an explicit Solr URL, while CloudSolrServer is configured using the zkHost String for a SolrCloud
cluster.
// Single node Solr client
String urlString = "http://localhost:8983/solr";
SolrServer solr = new HttpSolrServer(urlString);

// SolrCloud client
String zkHostString = "zkServerA:2181,zkServerB:2181/solr";
SolrServer solr = new CloudSolrServer(zkHostString);
Once you have a SolrServer, you can use it by calling methods like query(), add(), and commit().
Building and Running SolrJ Applications
The SolrJ API is included with Solr, so you do not have to download or install anything else. However, in order to
build and run applications that use SolrJ, you have to add some libraries to the classpath.
At build time, the examples presented with this section require solr-solrj-4.x.x.jar to be in the classpath.
At run time, the examples in this section require the libraries found in the 'dist/solrj-lib' directory.
The Ant script bundled with this section's examples includes the libraries as appropriate when building and running.
You can sidestep a lot of the messing around with the JAR files by using Maven instead of Ant. All you will need to
do to include SolrJ in your application is to put the following dependency in the project's pom.xml:
<dependency>
<groupId>org.apache.solr</groupId>
<artifactId>solr-solrj</artifactId>
<version>4.x.0</version>
</dependency>
If you are worried about the SolrJ libraries expanding the size of your client application, you can use a code
obfuscator like ProGuard to remove APIs that you are not using.
Setting XMLResponseParser
SolrJ uses a binary format, rather than XML, as its default format. Users of earlier Solr releases who wish to
continue working with XML must explicitly set the parser to the XMLResponseParser, like so:
server.setParser(new XMLResponseParser());
Performing Queries
Use query() to have Solr search for results. You have to pass a SolrQuery object that describes the query, and
you will get back a QueryResponse (from the org.apache.solr.client.solrj.response package).
SolrQuery has methods that make it easy to add parameters to choose a request handler and send parameters to
it. Here is a very simple example that uses the default request handler and sets the q parameter:
SolrQuery parameters = new SolrQuery();
parameters.set("q", mQueryString);
To choose a different request handler, for example, just set the qt parameter like this:
parameters.set("qt", "/spellCheckCompRH");
Once you have your SolrQuery set up, submit it with query():
QueryResponse response = solr.query(parameters);
The client makes a network connection and sends the query. Solr processes the query, and the response is sent
and parsed into a QueryResponse.
The QueryResponse is a collection of documents that satisfy the query parameters. You can retrieve the
documents directly with getResults() and you can call other methods to find out information about highlighting or
facets.
SolrDocumentList list = response.getResults();
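SolrDocumentList is a List of SolrDocument objects, so the results can be iterated directly; a minimal sketch (the field name is taken from the indexing example below):
for (SolrDocument document : list) {
  // Print one stored field from each matching document.
  System.out.println(document.getFieldValue("name"));
}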
Indexing Documents
Other operations are just as simple. To index (add) a document, all you need to do is create a SolrInputDocument
and pass it along to the SolrServer's add() method.
String urlString = "http://localhost:8983/solr";
SolrServer solr = new HttpSolrServer(urlString);
SolrInputDocument document = new SolrInputDocument();
document.addField("id", "552199");
document.addField("name", "Gouda cheese wheel");
document.addField("price", "49.99");
UpdateResponse response = solr.add(document);
// Remember to commit your changes!
solr.commit();
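Deleting documents follows the same pattern; a minimal sketch (the id and query values are illustrative):
// Delete by unique key, or by query, then commit to make the deletions visible.
solr.deleteById("552199");
solr.deleteByQuery("name:Gouda*");
solr.commit();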
Uploading Content in XML or Binary Formats
SolrJ lets you upload content in XML and binary formats instead of the default XML format. Use the following to
upload using binary format, which is the same format SolrJ uses to fetch results.
server.setRequestWriter(new BinaryRequestWriter());
Using the ConcurrentUpdateSolrServer
When implementing Java applications that will be bulk loading a lot of documents at once,
ConcurrentUpdateSolrServer is an alternative to consider instead of using HttpSolrServer. The
ConcurrentUpdateSolrServer buffers all added documents and writes them into open HTTP connections. This
class is thread safe. Although any SolrServer request can be made with this implementation, it is only recommended
to use the ConcurrentUpdateSolrServer for /update requests.
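A minimal sketch of constructing one (the URL, queue size, and thread count are illustrative):
// Buffer up to 100 documents and flush them over 4 concurrent connections.
SolrServer solr = new ConcurrentUpdateSolrServer("http://localhost:8983/solr", 100, 4);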
EmbeddedSolrServer
The EmbeddedSolrServer class provides an implementation of the SolrServer client API talking directly to a
micro-instance of Solr running directly in your Java application. This embedded approach is not recommended in
most cases and is fairly limited in the set of features it supports – in particular it can not be used with SolrCloud or
Index Replication. EmbeddedSolrServer exists primarily to help facilitate testing.
For information on how to use EmbeddedSolrServer please review the SolrJ JUnit tests in the
org.apache.solr.client.solrj.embedded package of the Solr source release.
Related Topics
SolrJ API documentation
Solr Wiki page on SolrJ
Indexing and Basic Data Operations
Using Solr From Ruby
For Ruby applications, the solr-ruby gem encapsulates the fundamental Solr operations.
At a command line, install solr-ruby as follows:
$ gem install solr-ruby
Bulk updating Gem source index for: http://gems.rubyforge.org
Successfully installed solr-ruby-0.0.8
1 gem installed
Installing ri documentation for solr-ruby-0.0.8...
Installing RDoc documentation for solr-ruby-0.0.8...
This gives you a Solr::Connection class that makes it easy to add documents, perform queries, and do other
Solr stuff.
Solr-ruby takes advantage of Solr's Ruby response writer, which is a subclass of the JSON response writer. This
response writer sends information from Solr to Ruby in a form that Ruby can understand and use directly.
Performing Queries
To perform queries, you just need to get a Solr::Connection and call its query method. Here is a script that
looks for cheese. The return value from query() is an array of documents, which are dictionaries, so the script
iterates through each document and prints out a few fields.
require 'rubygems'
require 'solr'
solr = Solr::Connection.new('http://localhost:8983/solr')
response = solr.query('cheese')
response.each do |hit|
  puts hit['id'] + ' ' + hit['name'] + ' ' + hit['price'].to_s
end
An example run looks like this:
$ ruby query.rb
551299 Gouda cheese wheel 49.99
123 Fresh mozzarella cheese
Indexing Documents
Indexing is just as simple. You have to get the Solr::Connection just as before. Then call the add() and
commit() methods.
require 'rubygems'
require 'solr'
solr = Solr::Connection.new('http://localhost:8983/solr')
solr.add(:id => 123, :name => 'Fresh mozzarella cheese')
solr.commit()
More Information
For more information on solr-ruby, read the page at the Solr Wiki:
http://wiki.apache.org/solr/solr-ruby
MBean Request Handler
The MBean Request Handler offers programmatic access to the information provided on the Plugin/Stats page of
the Admin UI. You can access the MBean Request Handler here: http://localhost:8983/solr/admin/mbeans.
The MBean Request Handler accepts the following parameters:
Parameter Type Default Description
key multivalued all Restricts results by object key.
cat multivalued all Restricts results by category name.
stats boolean false Specifies whether statistics are returned with results. You can override the
stats parameter on a per-field basis.
wt multivalued xml The output format. This operates the same as the wt parameter in a query.
Examples
To return information about the CACHE category only:
http://localhost:8983/solr/admin/mbeans?cat=CACHE
To return information and statistics about the CACHE category only:
http://localhost:8983/solr/admin/mbeans?stats=true&cat=CACHE
To return information for everything, and statistics for everything except the fieldCache:
http://localhost:8983/solr/admin/mbeans?stats=true&f.fieldCache.stats=false
To return information and statistics for the fieldCache only:
http://localhost:8983/solr/admin/mbeans?key=fieldCache&stats=true
Further Assistance
There is a very active user community around Solr and Lucene. The solr-user mailing list and #solr IRC channel are
both great resources for asking questions.
To view the mailing list archives, subscribe to the list, or join the IRC channel, please see https://lucene.apache.org/
solr/discussion.html
Solr Glossary
Where possible, terms are linked to relevant parts of the Solr Reference Guide for more information.
Jump to a letter:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
A
Atomic updates
An approach to updating only one or more fields of a document, instead of reindexing the entire document.
B
Boolean operators
These control the inclusion or exclusion of keywords in a query by using operators such as AND, OR, and NOT.
C
Cluster
In Solr, a cluster is a set of Solr nodes managed as a unit. They may contain many cores, collections, shards, and/or
replicas. See also #SolrCloud.
Collection
In Solr, one or more documents grouped together in a single logical index. A collection must have a single schema,
but can be spread across multiple cores.
In #ZooKeeper, a group of cores managed together as part of a SolrCloud installation.
Commit
To make document changes permanent in the index. In the case of added documents, they would be searchable
after a commit.
Core
An individual Solr instance (represents a logical index). Multiple cores can run on a single node. See also
#SolrCloud.
Core reload
To re-initialize Solr after changes to schema.xml, solrconfig.xml or other configuration files.
D
Distributed search
Distributed search is one where queries are processed across more than one shard.
Document
A group of fields and their values. Documents are the basic unit of data in a collection. Documents are assigned to
shards using standard hashing, or by specifically assigning a shard within the document ID. Documents are
versioned after each write operation.
E
Ensemble
A #ZooKeeper term to indicate multiple ZooKeeper instances running simultaneously.
F
Facet
The arrangement of search results into categories based on indexed terms.
Field
The content to be indexed/searched along with metadata defining how the content should be processed by Solr.
I
Inverse document frequency (IDF)
A measure of the general importance of a term. It is calculated as the number of total Documents divided by the
number of Documents in the collection that a particular word occurs in. See http://en.wikipedia.org/wiki/Tf-idf and the
Lucene TFIDFSimilarity javadocs for more info on TF-IDF based scoring and Lucene scoring in particular. See also
#Term frequency.
Inverted index
A way of creating a searchable index that lists every word and the documents that contain those words, similar to an
index in the back of a book which lists words and the pages on which they can be found. When performing keyword
searches, this method is considered more efficient than the alternative, which would be to create a list of documents
paired with every word used in each document. Since users search using terms they expect to be in documents,
finding the term before the document saves processing resources and time.
L
Leader
The main node for each shard that routes document adds, updates, or deletes to other replicas in the same shard.
This is a transient responsibility assigned to a node via an election; if the current Shard Leader goes down, a new
node will be elected to take its place. See also #SolrCloud.
M
Metadata
Literally, data about data. Metadata is information about a document, such as its title, author, or location.
N
Natural language query
A search that is entered as a user would normally speak or write, as in, "What is aspirin?"
Node
A JVM instance running Solr. Also known as a Solr server.
O
Optimistic concurrency
Also known as "optimistic locking", this is an approach that allows for updates to documents currently in the index
while retaining locking or version control.
Overseer
A single node in SolrCloud that is responsible for processing actions involving the entire cluster. It keeps track of the
state of existing nodes and shards, and assigns shards to nodes. This is a transient responsibility assigned to a
node via an election; if the current Overseer goes down, a new node will be elected to take its place. See also
#SolrCloud.
Q
Query parser
A query parser processes the terms entered by a user.
R
Recall
The ability of a search engine to retrieve all of the possible matches to a user's query.
Relevance
The appropriateness of a document to the search conducted by the user.
Replica
A copy of a shard or single logical index, for use in failover or load balancing.
Replication
A method of copying a master index from one server to one or more "slave" or "child" servers.
RequestHandler
Logic and configuration parameters that tell Solr how to handle incoming "requests", whether the requests are to
return search results, to index documents, or to handle other custom situations.
S
SearchComponent
Logic and configuration parameters used by request handlers to process query requests. Examples of search
components include faceting, highlighting, and "more like this" functionality.
Shard
In SolrCloud, a logical section of a single collection. This may be spread across multiple nodes. See also
#SolrCloud.
SolrCloud
Umbrella term for a suite of functionality in Solr which allows managing a cluster of Solr servers for scalability, fault
tolerance, and high availability.
Solr Schema (schema.xml)
The Apache Solr index schema. The schema defines the fields to be indexed and the type for the field (text,
integers, etc.) The schema is stored in schema.xml and is located in the Solr home conf directory.
SolrConfig (solrconfig.xml)
The Apache Solr configuration file. Defines indexing options, RequestHandlers, highlighting, spellchecking and
various other configurations. The file, solrconfig.xml is located in the Solr home conf directory.
Spell Check
The ability to suggest alternative spellings of search terms to a user, as a check against spelling errors causing few
or zero results.
Stopwords
Generally, words that have little meaning to a user's search but which may have been entered as part of a natural
language query. Stopwords are generally very small pronouns, conjunctions and prepositions (such as "the", "with",
or "and").
Suggester
Functionality in Solr that provides the ability to suggest possible query terms to users as they type.
Synonyms
Synonyms generally are terms which are near to each other in meaning and may substitute for one another. In a
search engine implementation, synonyms may be abbreviations as well as words, or terms that are not consistently
hyphenated. Examples of synonyms in this context would be "Inc." and "Incorporated" or "iPod" and "i-pod".
T
Term frequency
The number of times a word occurs in a given document. See http://en.wikipedia.org/wiki/Tf-idf and the Lucene
TFIDFSimilarity javadocs for more info on TF-IDF based scoring and Lucene scoring in particular.
See also #Inverse document frequency (IDF).
Transaction log
An append-only log of write operations maintained by each node. This log is only required with SolrCloud
implementations and is created and managed automatically by Solr.
W
Wildcard
A wildcard allows a substitution of one or more letters of a word to account for possible variations in spelling or
tenses.
Z
ZooKeeper
Also known as Apache ZooKeeper. The system used by SolrCloud to keep track of configuration files and node
names for a cluster. A ZooKeeper cluster is used as the central configuration store for the cluster, a coordinator for
operations requiring distributed synchronization, and the system of record for cluster topology. See also #SolrCloud.
Major Changes from Solr 3 to Solr 4
Solr 4 includes some exciting new developments, and also includes many changes from Solr 3.x and earlier.
Highlights of Solr 4
Changes to Consider
System Changes
Index Format
Query Parsers
Schema Configuration
Changes to solrconfig.xml
Other Changes
Highlights of Solr 4
Solr 4 is a major release of Solr, two years in the making, and includes new features for scalability and high
performance for today's data driven, real time search applications. Some of the major improvements include:
SolrCloud
The primary new feature in Solr 4 goes by the name "SolrCloud", a suite of tools that builds scalability into your
project from day one:
Distributed indexing designed from the ground up for near real-time (NRT) and NoSQL features such as
realtime-get, optimistic locking, and durable updates.
High availability with no single points of failure.
Apache Zookeeper integration for distributed coordination and cluster metadata and configuration storage.
Immunity to split-brain issues due to Zookeeper's Paxos distributed consensus protocols.
Updates sent to any node in the cluster and are automatically forwarded to the correct shard and replicated to
multiple nodes for redundancy.
Queries sent to any node automatically perform a full distributed search across the cluster with load balancing
and fail-over.
NoSQL Features
Users wishing to use Solr as their primary data store will be interested in these features:
Update durability - A transaction log ensures that even uncommitted documents are never lost.
Real-time Get - The ability to quickly retrieve the latest version of a document, without the need to commit or
open a new searcher
Versioning and Optimistic Locking - combined with real-time get, this allows read-update-write functionality
that ensures no conflicting changes were made concurrently by other clients.
Atomic updates - the ability to add, remove, change, and increment fields of an existing document without
having to send in the complete document again.
Other Major Features
There's more:
Pivot Faceting - Multi-level or hierarchical faceting where the top constraints for one field are found for each
top constraint of a different field.
Pseudo-fields - The ability to alias fields, or to add metadata along with returned documents, such as function
query values and results of spatial distance calculations.
A spell checker implementation that can work directly from the main index instead of creating a sidecar index.
Pseudo-Join functionality - The ability to select a set of documents based on their relationship to a second set
of documents.
Function query enhancements including conditional function queries and relevancy functions.
New update processors to facilitate modifying documents prior to indexing.
A brand new web admin interface, including support for SolrCloud.
Changes to Consider
There are some major changes in Solr 4 to consider before starting to migrate your configurations and indexes.
There are many hundreds of changes, so a thorough review of the changes.txt file in your Solr instance will help you
plan migration to Solr 4.
System Changes
As of Solr 4.8, Java 1.7 is now required to run Solr. Solr versions 4.0 through 4.7 required Java 1.6.
Index Format
The Lucene index format has changed. As a result, once you upgrade to Solr 4, previous versions of Solr will
no longer be able to read your indices. In a master/slave configuration, all searchers/slaves should be
upgraded before the master. If the master is updated first, older searchers will not be able to read the new
index format.
Query Parsers
The default logic for the mm parameter of the Dismax Query Parser has changed. If no mm parameter is
specified (either in the query or as a default in solrconfig.xml), then the effective value of the q.op parameter
is used to influence the behavior (whether q.op is defined in the query, in solrconfig.xml, or from
the defaultOperator option in schema.xml). If q.op is effectively "AND" then mm=100%. If q.op is
effectively "OR" then mm=0%. If you want to force legacy behavior, set a default value for the mm parameter in
your solrconfig.xml file.
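A minimal sketch of forcing the legacy behavior in solrconfig.xml (the handler name and value are illustrative):
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="mm">100%</str>
  </lst>
</requestHandler>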
Schema Configuration
Due to low level changes to support SolrCloud, the uniqueKey field can no longer be populated via
<copyField/> or <field default=...> in schema.xml. If you want to have Solr automatically generate a
uniqueKey field value when adding documents, use an instance of solr.UUIDUpdateProcessorFactory in the
update processor chain. See SOLR-2798 for more details.
Solr is now much more strict about requiring that the uniqueKey feature (if used) must refer to a field which
is not multiValued. If you upgrade from an earlier version of Solr and see an error that your uniqueKey field
"can not be configured to be multivalued", please add multiValued="false" to the <field /> declaration
for your uniqueKey field.
Changes to the HTMLCharFilterFactory:
Known offset bugs have been fixed.
The "Mark invalid" exceptions are no longer triggered.
Newlines are now substituted instead of spaces for block-level elements; this corresponds more
closely to on-screen layout, enables sentence segmentation, and doesn't change the offsets.
Supplementary characters in tags are now recognized.
Accepted tag names have been switched from the [:XID_Start:] and [:XID_Continue:] Unicode
properties to the more relaxed [:ID_Start:] and [:ID_Continue:] properties, in order to
broaden the range of recognizable input. (The improved security afforded by the XID_* properties is
irrelevant to what a CharFilter does.)
More cases of <script> tags are now properly stripped.
CDATA sections are now recognized.
No space is substituted for inline tags (e.g. <b>, <i>, <span>). The old version substituted spaces for
all tags.
Broken MS-Word-generated processing instructions (<? ... /> instead of <? ... ?>) are now
handled.
Uppercase character entities ("&QUOT;", "&COPY;", "&LT;", "&GT;", "&REG;", and "&AMP;") are now recognized and handled as if they
were lower case.
Opening tags with unbalanced quotation marks are now properly stripped.
Literal "<" and ">" characters in opening tags, regardless of whether they appear inside quotation
marks, now inhibit recognition (and stripping) of the tags. The only exception to this is for values of
event-handler attributes, e.g. "onClick", "onLoad", "onSelect".
A newline '\n' is substituted instead of a space for stripped HTML markup.
Nothing is substituted for opening and closing inline tags - they are simply removed. The list of inline
tags is (case insensitively): <a>, <abbr>, <acronym>, <b>, <basefont>, <bdo>, <big>, <cite>, <code>,
<dfn>, <em>, <font>, <i>, <img>, <input>, <kbd>, <label>, <q>, <s>, <samp>, <select>, <small>,
<span>, <strike>, <strong>, <sub>, <sup>, <textarea>, <tt>, <u>, and <var>.
HTMLStripCharFilterFactory now handles HTMLStripCharFilter's "escapedTags" feature: opening and
closing tags with the given names, including any attributes and their values, are left intact in the output (a configuration sketch follows this list).
The replacement character U+FFFD is now used to replace numeric character entities for unpaired
UTF-16 low and high surrogates (in the range [U+D800-U+DFFF]).
Properly paired numeric character entities for UTF-16 surrogates are now converted to the
corresponding code units.
The generated scanner's parse method has been changed from the default yylex() to nextChar().
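As referenced above, the escapedTags feature is configured on the char filter in schema.xml. A minimal sketch of a field type using it; the type name, tokenizer, and the tag list "a, title" are assumptions, not part of the original example:
<fieldType name="html_text" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory" escapedTags="a, title"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>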
Changes to solrconfig.xml
The <indexDefaults> and <mainIndex> sections of solrconfig.xml have been discontinued and replaced with
the <indexConfig> section, which also has better defaults. When migrating, if you don't know what your old
settings mean, delete both the <indexDefaults> and <mainIndex> sections. If you have customized them, put
them in the <indexConfig> section with the same syntax as before.
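For illustration only, customized settings carried over from <indexDefaults> or <mainIndex> would now live in a single <indexConfig> section like the sketch below; the values shown are placeholders, not recommendations:
<indexConfig>
  <ramBufferSizeMB>100</ramBufferSizeMB>
  <mergeFactor>10</mergeFactor>
  <lockType>native</lockType>
</indexConfig>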
The PingRequestHandler no longer looks for a <healthcheck> option in the (legacy) <admin> section
of solrconfig.xml. If you want to take advantage of this feature, configure a healthcheckFile initialization
parameter directly on the PingRequestHandler. As part of this change, relative file paths have been
fixed to be resolved against the data directory. The sample solrconfig.xml has an example of this
configuration.
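A sketch of that configuration, along the lines of the sample solrconfig.xml; the ping query and the file name "server-enabled.txt" are only examples, and a relative file name is resolved against the data directory:
<requestHandler name="/admin/ping" class="solr.PingRequestHandler">
  <lst name="invariants">
    <str name="q">solrpingquery</str>
  </lst>
  <str name="healthcheckFile">server-enabled.txt</str>
</requestHandler>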
The update request parameter used to choose the Update Request Processor Chain has been renamed from
update.processor to update.chain. The old parameter was deprecated in Solr 3.x, but has now been
removed entirely.
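The chain can be selected per request with update.chain=<name>, or set as a default on the update handler, as in this sketch; the chain name "uuid" refers to the hypothetical chain shown earlier:
<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">uuid</str>
  </lst>
</requestHandler>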
The VelocityResponseWriter is no longer built into the core. Its jar and dependencies now need to be
added (via <lib> directives or solr/home lib inclusion), and it needs to be registered in solrconfig.xml like this:
<queryResponseWriter name="velocity" class="solr.VelocityResponseWriter"/>
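The <lib> directives needed to pull in the jar and its dependencies depend on where Solr is installed; a sketch based on the layout of the binary distribution, with paths that are assumptions and may need adjusting:
<lib dir="../../../contrib/velocity/lib" regex=".*\.jar"/>
<lib dir="../../../dist/" regex="solr-velocity-\d.*\.jar"/>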
Other Changes
Two of the SolrServer subclasses in SolrJ have been renamed and replaced: CommonsHttpSolrServer is
now HttpSolrServer, and StreamingUpdateSolrServer is now ConcurrentUpdateSolrServer.
Errata
Errata For This Documentation
Any mistakes found in this documentation after its release will be listed on the on-line version of this page:
https://cwiki.apache.org/confluence/display/solr/Errata
Errata For Past Versions of This Documentation
Any known mistakes in past releases of this documentation will be noted below.
