Apache Solr Reference Guide
For Solr 7.3

Written by the Apache Lucene/Solr Project
Published 2018-03-27

Table of Contents
Apache Solr Reference Guide
About This Guide
  Hosts and Port Examples
  Directory Paths
  API Examples
  Special Inline Notes
Getting Started
  Solr Tutorial
  A Quick Overview
  Solr System Requirements
  Installing Solr
Deployment and Operations
  Solr Control Script Reference
  Solr Configuration Files
  Taking Solr to Production
  Making and Restoring Backups
  Running Solr on HDFS
  SolrCloud on AWS EC2
  Upgrading a Solr Cluster
  Solr Upgrade Notes
Using the Solr Administration User Interface
  Overview of the Solr Admin UI
  Getting Assistance
  Logging
  Cloud Screens
  Collections / Core Admin
  Java Properties
  Thread Dump
  Suggestions Screen
  Collection-Specific Tools
  Core-Specific Tools
Documents, Fields, and Schema Design
  Overview of Documents, Fields, and Schema Design
  Solr Field Types
  Defining Fields
  Copying Fields
  Dynamic Fields
  Other Schema Elements
  Schema API
  Putting the Pieces Together
  DocValues
  Schemaless Mode
Understanding Analyzers, Tokenizers, and Filters
  Using Analyzers, Tokenizers, and Filters
  Analyzers
  About Tokenizers
  About Filters
  Tokenizers
  Filter Descriptions
  CharFilterFactories
  Language Analysis
  Phonetic Matching
  Running Your Analyzer
Indexing and Basic Data Operations
  Indexing Using Client APIs
  Introduction to Solr Indexing
  Post Tool
  Uploading Data with Index Handlers
  Uploading Data with Solr Cell using Apache Tika
  Uploading Structured Data Store Data with the Data Import Handler
  Updating Parts of Documents
  Detecting Languages During Indexing
  De-Duplication
  Content Streams
  UIMA Integration
Searching
  Overview of Searching in Solr
  Velocity Search UI
  Relevance
  Query Syntax and Parsing
  JSON Request API
  JSON Facet API
  Faceting
  Highlighting
  Spell Checking
  Query Re-Ranking
  Transforming Result Documents
  Suggester
  MoreLikeThis
  Pagination of Results
  Collapse and Expand Results
  Result Grouping
  Result Clustering
  Spatial Search
  The Terms Component
  The Term Vector Component
  The Stats Component
  The Query Elevation Component
  Response Writers
  Near Real Time Searching
  RealTime Get
  Exporting Result Sets
  Streaming Expressions
  Parallel SQL Interface
  Analytics Component
SolrCloud
  Getting Started with SolrCloud
  How SolrCloud Works
  SolrCloud Resilience
  SolrCloud Configuration and Parameters
  Rule-based Replica Placement
  Cross Data Center Replication (CDCR)
  SolrCloud Autoscaling
Legacy Scaling and Distribution
  Introduction to Scaling and Distribution
  Distributed Search with Index Sharding
  Index Replication
  Combining Distribution and Replication
  Merging Indexes
The Well-Configured Solr Instance
  Configuring solrconfig.xml
  Solr Cores and solr.xml
  Configuration APIs
  Implicit RequestHandlers
  Solr Plugins
  JVM Settings
  v2 API
Monitoring Solr
  Metrics Reporting
  MBean Request Handler
  Configuring Logging
  Using JMX with Solr
  Monitoring Solr with Prometheus and Grafana
  Performance Statistics Reference
Securing Solr
  Authentication and Authorization Plugins
  Enabling SSL
Client APIs
  Introduction to Client APIs
  Choosing an Output Format
  Client API Lineup
  Using JavaScript
  Using Python
  Using SolrJ
  Using Solr From Ruby
Further Assistance
Solr Glossary
  Solr Terms
Errata
  Errata For This Documentation
How to Contribute to Solr Documentation

Licenses
Licensed to the Apache Software Foundation (ASF) under one or more contributor
license agreements. See the NOTICE file distributed with this work for additional
information regarding copyright ownership. The ASF licenses this file to you under
the Apache License, Version 2.0 (the "License"); you may not use this file except in
compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0.
Unless required by applicable law or agreed to in writing, software distributed under the License is
distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied. See the License for the specific language governing permissions and limitations under the License.
Apache and the Apache feather logo are trademarks of The Apache Software Foundation. Apache Lucene,
Apache Solr and their respective logos are trademarks of the Apache Software Foundation. Please see the
Apache Trademark Policy for more information.

© 2018, Apache Software Foundation

Guide Version 7.3 - Published: 2018-03-27

Apache Solr Reference Guide
This reference guide describes Apache Solr, the open source solution for search.
Solr builds on Lucene, an open source Java library that provides indexing and search technology, as well as
spellchecking, hit highlighting and advanced analysis/tokenization capabilities. Both Solr and Lucene are
managed by the Apache Software Foundation (www.apache.org). You can download Apache Solr from the
Solr website at http://lucene.apache.org/solr/.
This Guide contains the following main sections:
Getting Started: This section guides you through the installation and setup of Solr.
Using the Solr Administration User Interface: This section introduces the Solr Web-based user interface.
From your browser you can view configuration files, submit queries, view logfile settings and Java
environment settings, and monitor and control distributed configurations.
Documents, Fields, and Schema Design: This section describes how Solr organizes its data for indexing. It
explains how a Solr schema defines the fields and field types which Solr uses to organize data within the
document files it indexes.
Understanding Analyzers, Tokenizers, and Filters: This section explains how Solr prepares text for
indexing and searching. Analyzers parse text and produce a stream of tokens, lexical units used for indexing
and searching. Tokenizers break field data down into tokens. Filters perform other transformational or
selective work on token streams.
Indexing and Basic Data Operations: This section describes the indexing process and basic index
operations, such as commit, optimize, and rollback.
Searching: This section presents an overview of the search process in Solr. It describes the main
components used in searches, including request handlers, query parsers, and response writers. It lists the
query parameters that can be passed to Solr, and it describes features such as boosting and faceting, which
can be used to fine-tune search results.
The Well-Configured Solr Instance: This section discusses performance tuning for Solr. It begins with an
overview of the solrconfig.xml file, then tells you how to configure cores with solr.xml, how to configure
the Lucene index writer, and more.
Monitoring Solr: Administration and monitoring can be performed using the web-based administration
console, through the command line interface, or using REST APIs.
Deployment and Operations: An important aspect of Solr is that all operations and deployment can be
done online, with minimal or no impact to running applications. This includes minor upgrades,
provisioning and removing nodes, backing up and restoring indexes, and editing configurations.
SolrCloud: This section describes SolrCloud, which provides comprehensive distributed capabilities for
Solr.
Securing Solr: When planning how to secure Solr, you should consider which of the available features or
approaches are right for you.
Legacy Scaling and Distribution: This section tells you how to grow a Solr distribution by dividing a large
index into sections called shards, which are then distributed across multiple servers, or by replicating a
single index across multiple servers.
Client APIs: This section tells you how to access Solr through various client APIs, including JavaScript,
Python, SolrJ, and Ruby.

About This Guide
This guide describes all of the important features and functions of Apache Solr.
Solr is free to download from http://lucene.apache.org/solr/.
Designed to provide high-level documentation, this guide is intended to be more encyclopedic and less of a
cookbook. It is structured to address a broad spectrum of needs, ranging from new developers getting
started to experienced developers extending their applications or troubleshooting. It will be of use at
any point in the application life cycle, whenever you need authoritative information about Solr.
The material as presented assumes that you are familiar with some basic search concepts and that you can
read XML. It does not assume that you are a Java programmer, although knowledge of Java is helpful when
working directly with Lucene or when developing custom extensions to a Lucene/Solr installation.

Hosts and Port Examples
The default port when running Solr is 8983. The samples, URLs and screenshots in this guide may show
different ports, because the port number that Solr uses is configurable.
If you have not customized your installation of Solr, please make sure that you use port 8983 when following
the examples, or configure your own installation to use the port numbers shown in the examples. For
information about configuring port numbers, see the section Monitoring Solr.
Similarly, URL examples use localhost throughout; if you are accessing Solr from a location remote to the
server hosting Solr, replace localhost with the proper domain or IP where Solr is running.
For example, we might provide a sample query like:

http://localhost:8983/solr/gettingstarted/select?q=brown+cow
There are several items in this URL you might need to change locally. First, if your server is running at
"www.example.com", you’ll replace "localhost" with the proper domain. If you aren’t using port 8983, you’ll
replace that also. Finally, you’ll want to replace "gettingstarted" (the collection or core name) with the
proper one in use in your implementation. The URL would then become:

http://www.example.com/solr/mycollection/select?q=brown+cow
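The substitutions described above can be sketched in a few lines of Python. This is only an illustration, not part of Solr: the helper name adapt_example_url is hypothetical, and it assumes the example URL has a path of the form /solr/<collection>/<handler>, as the sample query does.

```python
from urllib.parse import urlsplit, urlunsplit


def adapt_example_url(url, host, collection, port=None):
    """Swap the host, port, and collection (or core) name in a Solr example URL.

    Hypothetical helper for illustration; assumes a path shaped like
    /solr/<collection>/<handler>.
    """
    parts = urlsplit(url)
    segments = parts.path.split("/")  # ['', 'solr', '<collection>', ...]
    if len(segments) < 3 or segments[1] != "solr":
        raise ValueError("expected a path of the form /solr/<collection>/...")
    segments[2] = collection
    # Omit the port entirely when none is given, so the scheme default applies.
    netloc = host if port is None else f"{host}:{port}"
    return urlunsplit(
        (parts.scheme, netloc, "/".join(segments), parts.query, parts.fragment)
    )


example = "http://localhost:8983/solr/gettingstarted/select?q=brown+cow"
print(adapt_example_url(example, "www.example.com", "mycollection"))
```

With the arguments shown, the helper yields the same URL as the rewritten example above. Passing port=8984, for instance, would keep an explicit port in the result.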

Directory Paths
Path information is given relative to solr.home, which is the location under the main Solr installation where
Solr’s collections and their conf and data directories are stored.
In many cases, this is in the server/solr directory of your installation. However, there can be exceptions,
particularly if your installation has customized this.
In several places in this Guide, our examples are built from the "techproducts" example (i.e., you have
started Solr with the command bin/solr -e techproducts). In this case, solr.home will be a sub-directory
of the example/ directory created for you automatically.
See also the section Solr Home for further details on what is contained in this directory.
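As a rough sketch of how a path given relative to solr.home resolves, the Python snippet below builds the conf and data directories for one core. The installation directory /opt/solr and the core name mycore are hypothetical, and the layout assumes a standalone core whose directory sits directly under solr.home; your installation may differ.

```python
from pathlib import PurePosixPath


def core_directories(solr_home, core_name):
    """Return the (conf, data) directories for a core under solr.home.

    Illustrative only: assumes the core directory is a direct child of
    solr.home, which may not hold for every installation.
    """
    core_dir = PurePosixPath(solr_home) / core_name
    return core_dir / "conf", core_dir / "data"


# Hypothetical installation directory; substitute your own.
install_dir = PurePosixPath("/opt/solr")
solr_home = install_dir / "server" / "solr"

conf_dir, data_dir = core_directories(solr_home, "mycore")
print(conf_dir)
print(data_dir)
```

PurePosixPath is used so the sketch only manipulates path strings and never touches the file system.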

API Examples
Solr has two styles of APIs that currently co-exist. The first has grown somewhat organically as Solr has
developed over time, but the second, referred to as the "V2 API", redesigns many of the original APIs with a
modernized and self-documenting API interface.
In many cases, but not all, the parameters and outputs of API calls are the same between the two styles. In
all cases the paths and endpoints used are different.
Throughout this Guide, we have added examples of both styles with sections labeled "V1 API" and "V2 API".
As of the 7.2 version of this Guide, these examples are not yet complete - more coverage will be added as
future versions of the Guide are released.
The section V2 API provides more information about how to work with the new API structure, including how
to disable it if you choose to do so.


Special Inline Notes
Special notes are included throughout these pages. There are several types of notes:

Information blocks provide additional information that’s useful for you to know.
Important blocks provide information that we want to make sure you are aware of.
Tip blocks provide helpful tips.
Caution blocks provide details on scenarios or configurations you should be careful with.
Warning blocks are used to warn you about a possibly dangerous change or action.


Getting Started
Solr makes it easy for programmers to develop sophisticated, high-performance
search applications with advanced features.
This section introduces you to the basic Solr architecture and features to help you get up and running
quickly. It covers the following topics:
Solr Tutorial: This tutorial covers getting Solr up and running.
A Quick Overview: A high-level overview of how Solr works.
Solr System Requirements: The system requirements for running Solr.
Installing Solr: A walkthrough of the Solr installation process.


Solr Tutorial
This tutorial covers getting Solr up and running, ingesting a variety of data sources into Solr collections, and
getting a feel for the Solr administrative and search interfaces.
The tutorial is organized into three sections that each build on the one before it. The first exercise will ask
you to start Solr, create a collection, index some basic documents, and then perform some searches.
The second exercise works with a different set of data, and explores requesting facets with the dataset.
The third exercise encourages you to begin to work with your own data and start a plan for your
implementation.
Finally, we’ll introduce spatial search and show you how to get your Solr instance back into a clean state.

Before You Begin
To follow along with this tutorial, you will need…
1. To meet the system requirements
2. An Apache Solr release download. This tutorial is designed for Apache Solr 7.3.
For best results, please run the browser showing this tutorial and the Solr server on the same machine so
tutorial links will correctly point to your Solr server.

Unpack Solr
Begin by unzipping the Solr release and changing your working directory to the subdirectory where Solr was
installed. For example, with a shell in UNIX, Cygwin, or MacOS:
~$ ls solr*
solr-7.3.0.zip
~$ unzip -q solr-7.3.0.zip
~$ cd solr-7.3.0/
If you’d like to know more about Solr’s directory layout before moving to the first exercise, see the section
Directory Layout for details.

Exercise 1: Index Techproducts Example Data
This exercise will walk you through how to start Solr as a two-node cluster (both nodes on the same
machine) and create a collection during startup. Then you will index some sample data that ships with Solr
and do some basic searches.


Launch Solr in SolrCloud Mode
To launch Solr, run: bin/solr start -e cloud on Unix or MacOS; bin\solr.cmd start -e cloud on
Windows.
This will start an interactive session that will start two Solr "servers" on your machine. This command has an
option to run without prompting you for input (-noprompt), but we want to modify two of the defaults so we
won’t use that option now.
solr-7.3.0:$ ./bin/solr start -e cloud
Welcome to the SolrCloud example!
This interactive session will help you launch a SolrCloud cluster on your local workstation.
To begin, how many Solr nodes would you like to run in your local cluster? (specify 1-4 nodes)
[2]:
The first prompt asks how many nodes we want to run. Note the [2] at the end of the last line; that is the
default number of nodes. Two is what we want for this example, so you can simply press enter.
Ok, let's start up 2 Solr nodes for your example SolrCloud cluster.
Please enter the port for node1 [8983]:
This will be the port that the first node runs on. Unless you know you have something else running on port
8983 on your machine, accept this default option also by pressing enter. If something is already using that
port, you will be asked to choose another port.
Please enter the port for node2 [7574]:
This is the port the second node will run on. Again, unless you know you have something else running on
port 7574 on your machine, accept this default option also by pressing enter. If something is already using
that port, you will be asked to choose another port.
Solr will now initialize itself and start running on those two nodes. The script will print the commands it uses
for your reference.


Starting up 2 Solr nodes for your example SolrCloud cluster.
Creating Solr home directory /solr-7.3.0/example/cloud/node1/solr
Cloning /solr-7.3.0/example/cloud/node1 into
/solr-7.3.0/example/cloud/node2
Starting up Solr on port 8983 using command:
"bin/solr" start -cloud -p 8983 -s "example/cloud/node1/solr"
Waiting up to 180 seconds to see Solr running on port 8983 [\]
Started Solr server on port 8983 (pid=34942). Happy searching!

Starting up Solr on port 7574 using command:
"bin/solr" start -cloud -p 7574 -s "example/cloud/node2/solr" -z localhost:9983
Waiting up to 180 seconds to see Solr running on port 7574 [\]
Started Solr server on port 7574 (pid=35036). Happy searching!
INFO - 2017-07-27 12:28:02.835; org.apache.solr.client.solrj.impl.ZkClientClusterStateProvider;
Cluster at localhost:9983 ready
Notice that two instances of Solr have started on two nodes. Because we are starting in SolrCloud mode, and
did not define any details about an external ZooKeeper cluster, Solr launches its own ZooKeeper and
connects both nodes to it.
After startup is complete, you’ll be prompted to create a collection to use for indexing data.
Now let's create a new collection for indexing documents in your 2-node cluster.
Please provide a name for your new collection: [gettingstarted]
Here’s the first place where we’ll deviate from the default options. This tutorial will ask you to index some
sample data included with Solr, called the "techproducts" data. Let’s name our collection "techproducts" so
it’s easy to differentiate from other collections we’ll create later. Enter techproducts at the prompt and hit
enter.
How many shards would you like to split techproducts into? [2]
This is asking how many shards you want to split your index into across the two nodes. Choosing "2" (the
default) means we will split the index relatively evenly across both nodes, which is a good way to start.
Accept the default by hitting enter.
How many replicas per shard would you like to create? [2]
A replica is a copy of the index that’s used for failover (see also the Solr Glossary definition). Again, the
default of "2" is fine to start with, so accept it by hitting enter.


Please choose a configuration for the techproducts collection, available options are:
_default or sample_techproducts_configs [_default]
We’ve reached another point where we will deviate from the default option. Solr has two sample sets of
configuration files (called configSets) available out-of-the-box.
A collection must have a configSet, which at a minimum includes the two main configuration files for Solr:
the schema file (named either managed-schema or schema.xml), and solrconfig.xml. The question here is
which configSet you would like to start with. The _default is a bare-bones option, but note there’s one
whose name includes "techproducts", the same as we named our collection. This configSet is specifically
designed to support the sample data we want to use, so enter sample_techproducts_configs at the prompt
and hit enter.
At this point, Solr will create the collection and again output to the screen the commands it issues.


Uploading /solr-7.3.0/server/solr/configsets/_default/conf for config techproducts to ZooKeeper at localhost:9983
Connecting to ZooKeeper at localhost:9983 ...
INFO - 2017-07-27 12:48:59.289; org.apache.solr.client.solrj.impl.ZkClientClusterStateProvider; Cluster at localhost:9983 ready
Uploading /solr-7.3.0/server/solr/configsets/sample_techproducts_configs/conf for config techproducts to ZooKeeper at localhost:9983
Creating new collection 'techproducts' using command:
http://localhost:8983/solr/admin/collections?action=CREATE&name=techproducts&numShards=2&replicationFactor=2&maxShardsPerNode=2&collection.configName=techproducts
{
"responseHeader":{
"status":0,
"QTime":5460},
"success":{
"192.168.0.110:7574_solr":{
"responseHeader":{
"status":0,
"QTime":4056},
"core":"techproducts_shard1_replica_n1"},
"192.168.0.110:8983_solr":{
"responseHeader":{
"status":0,
"QTime":4056},
"core":"techproducts_shard2_replica_n2"}}}
Enabling auto soft-commits with maxTime 3 secs using the Config API
POSTing request to Config API: http://localhost:8983/solr/techproducts/config
{"set-property":{"updateHandler.autoSoftCommit.maxTime":"3000"}}
Successfully set-property updateHandler.autoSoftCommit.maxTime to 3000
SolrCloud example running, please visit: http://localhost:8983/solr
Congratulations! Solr is ready for data!
You can see that Solr is running by launching the Solr Admin UI in your web browser:
http://localhost:8983/solr/. This is the main starting point for administering Solr.
Solr will now be running two "nodes", one on port 7574 and one on port 8983. There is one collection
created automatically, techproducts, a two shard collection, each with two replicas.
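
The CREATE command printed by the script is an ordinary HTTP request with query parameters. As a rough sketch, the same URL can be assembled in Python; create_collection_url is a hypothetical helper for this illustration, and the parameter names are taken from the command output shown above.

```python
from urllib.parse import urlencode


def create_collection_url(base, name, shards, replicas, max_per_node, config):
    """Assemble a Collections API CREATE URL like the one the script printed."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": shards,
        "replicationFactor": replicas,
        "maxShardsPerNode": max_per_node,
        "collection.configName": config,
    }
    return f"{base}/admin/collections?" + urlencode(params)


print(create_collection_url(
    "http://localhost:8983/solr", "techproducts", 2, 2, 2, "techproducts"))
```

The sketch only builds the URL; it does not contact Solr.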
The Cloud tab in the Admin UI diagrams the collection nicely:


SolrCloud Diagram

Index the Techproducts Data
Your Solr server is up and running, but it doesn’t contain any data yet, so we can’t do any queries.
Solr includes the bin/post tool in order to facilitate indexing various types of documents easily. We’ll use
this tool for the indexing examples below.
You’ll need a command shell to run some of the following examples, rooted in the Solr install directory; the
shell from where you launched Solr works just fine.



Currently the bin/post tool does not have a comparable Windows script, but the
underlying Java program invoked is available. We’ll show examples below for Windows, but
you can also see the Windows section of the Post Tool documentation for more details.

The data we will index is in the example/exampledocs directory. The documents are in a mix of document
formats (JSON, CSV, etc.), and fortunately we can index them all at once:
Linux/Mac
solr-7.3.0:$ bin/post -c techproducts example/exampledocs/*
Windows
C:\solr-7.3.0> java -jar -Dc=techproducts -Dauto example\exampledocs\post.jar example\exampledocs\*
You should see output similar to the following:


SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/techproducts/update...
Entering auto mode. File endings considered are
xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file books.csv (text/csv) to [base]
POSTing file books.json (application/json) to [base]/json/docs
POSTing file gb18030-example.xml (application/xml) to [base]
POSTing file hd.xml (application/xml) to [base]
POSTing file ipod_other.xml (application/xml) to [base]
POSTing file ipod_video.xml (application/xml) to [base]
POSTing file manufacturers.xml (application/xml) to [base]
POSTing file mem.xml (application/xml) to [base]
POSTing file money.xml (application/xml) to [base]
POSTing file monitor.xml (application/xml) to [base]
POSTing file monitor2.xml (application/xml) to [base]
POSTing file more_books.jsonl (application/json) to [base]/json/docs
POSTing file mp500.xml (application/xml) to [base]
POSTing file post.jar (application/octet-stream) to [base]/extract
POSTing file sample.html (text/html) to [base]/extract
POSTing file sd500.xml (application/xml) to [base]
POSTing file solr-word.pdf (application/pdf) to [base]/extract
POSTing file solr.xml (application/xml) to [base]
POSTing file test_utf8.sh (application/octet-stream) to [base]/extract
POSTing file utf8-example.xml (application/xml) to [base]
POSTing file vidcard.xml (application/xml) to [base]
21 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/techproducts/update...
Time spent: 0:00:00.822
Congratulations again! You have data in your Solr!
Now we’re ready to start searching.
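
The sample output above shows that, in auto mode, each file is sent with a content type and to an endpoint that depend on its extension. The following Python sketch mirrors only the routing visible in that output; it is not the tool's actual implementation, and the names ROUTES, FALLBACK, and route are hypothetical.

```python
import os

# Routing observed in the sample output above; [base] is the collection's
# /update URL. Only extensions that appear in that output are listed.
ROUTES = {
    ".xml": ("application/xml", ""),
    ".csv": ("text/csv", ""),
    ".json": ("application/json", "/json/docs"),
    ".jsonl": ("application/json", "/json/docs"),
    ".html": ("text/html", "/extract"),
    ".pdf": ("application/pdf", "/extract"),
}
# Files such as post.jar and test_utf8.sh were sent as octet-stream to /extract.
FALLBACK = ("application/octet-stream", "/extract")


def route(filename, base="http://localhost:8983/solr/techproducts/update"):
    """Return (content_type, url) for a file, mirroring the output above."""
    ext = os.path.splitext(filename)[1].lower()
    content_type, suffix = ROUTES.get(ext, FALLBACK)
    return content_type, base + suffix


for name in ("books.csv", "books.json", "solr-word.pdf", "post.jar"):
    print(name, route(name))
```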

Basic Searching
Solr can be queried via REST clients, curl, wget, Chrome POSTMAN, etc., as well as via native clients available
for many programming languages.
The Solr Admin UI includes a query builder interface via the Query tab for the techproducts collection (at
http://localhost:8983/solr/#/techproducts/query). If you click the [ Execute Query ] button without
changing anything in the form, you’ll get 10 documents in JSON format:


Query Screen
The URL sent by the Admin UI to Solr is shown in light grey near the top right of the above screenshot. If you
click on it, your browser will show you the raw response.
To use curl, give the same URL shown in your browser in quotes on the command line:

curl "http://localhost:8983/solr/techproducts/select?indent=on&q=*:*"
What’s happening here is that we are using Solr’s query parameter (q) with a special syntax that requests all
documents in the index (*:*). Not all of the documents are returned to us, however, because of the default
for a parameter called rows, which you can see in the form is 10. You can change the parameter in the UI or
in the defaults if you wish.
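
To request more than the default 10 documents, the rows parameter can be added to the URL. The sketch below builds such a URL in Python; match_all_url is a hypothetical helper, and the value 25 is only an example.

```python
from urllib.parse import urlencode


def match_all_url(collection, rows=10, base="http://localhost:8983/solr"):
    """Build a match-all query URL with an explicit rows value."""
    params = {"q": "*:*", "rows": rows}
    # safe="*:" keeps the match-all syntax readable instead of percent-encoding it.
    return f"{base}/{collection}/select?" + urlencode(params, safe="*:")


print(match_all_url("techproducts", rows=25))
```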
Solr has very powerful search options, and this tutorial won’t be able to cover all of them. But we can cover
some of the most common types of queries.
Search for a Single Term
To search for a term, enter it as the q parameter value in the Solr Admin UI Query screen, replacing *:* with
the term you want to find.
Enter "foundation" and hit [ Execute Query ] again.


If you prefer curl, enter something like this:

curl "http://localhost:8983/solr/techproducts/select?q=foundation"
You’ll see something like this:
{
"responseHeader":{
"zkConnected":true,
"status":0,
"QTime":8,
"params":{
"q":"foundation"}},
"response":{"numFound":4,"start":0,"maxScore":2.7879646,"docs":[
{
"id":"0553293354",
"cat":["book"],
"name":"Foundation",
"price":7.99,
"price_c":"7.99,USD",
"inStock":true,
"author":"Isaac Asimov",
"author_s":"Isaac Asimov",
"series_t":"Foundation Novels",
"sequence_i":1,
"genre_s":"scifi",
"_version_":1574100232473411586,
"price_c____l_ns":799}]
}}
The response indicates that there are 4 hits ("numFound":4). We’ve only included one document in the above
sample output, but since 4 hits is fewer than the rows parameter default of 10, you should see
all 4 of them.
Note the responseHeader before the documents. This header will include the parameters you have set for
the search. By default it shows only the parameters you have set for this query, which in this case is only
your query term.
The documents we got back include all the fields for each document that were indexed. This is, again,
default behavior. If you want to restrict the fields in the response, you can use the fl param, which takes a
comma-separated list of field names. This is one of the available fields on the query form in the Admin UI.
Put "id" (without quotes) in the "fl" box and hit [ Execute Query ] again. Or, to specify it with curl:

curl "http://localhost:8983/solr/techproducts/select?q=foundation&fl=id"
You should only see the IDs of the matching records returned.
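
A client application usually reads these values from the JSON rather than from the screen. The following is a minimal Python sketch that parses an abbreviated response of the shape shown earlier; SAMPLE is a shortened copy of that output and summarize is a hypothetical helper.

```python
import json

# An abbreviated response in the shape shown above (most fields omitted).
SAMPLE = """
{
  "responseHeader": {"status": 0, "QTime": 8, "params": {"q": "foundation"}},
  "response": {"numFound": 4, "start": 0, "docs": [
    {"id": "0553293354", "name": "Foundation", "price": 7.99}
  ]}
}
"""


def summarize(raw):
    """Return (numFound, ids of the returned docs) from a JSON response body."""
    body = json.loads(raw)
    response = body["response"]
    return response["numFound"], [doc["id"] for doc in response["docs"]]


print(summarize(SAMPLE))
```

Note that numFound counts all matches, while docs holds only the documents actually returned, so the two can differ.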
Field Searches
All Solr queries look for documents using some field. Often you want to query across multiple fields at the
same time, and this is what we’ve done so far with the "foundation" query. This is possible with the use of
copy fields, which are set up already with this set of configurations. We’ll cover copy fields a little bit more in
Exercise 2.
Sometimes, though, you want to limit your query to a single field. This can make your queries more efficient
and the results more relevant for users.
Much of the data in our small sample data set is related to products. Let’s say we want to find all the
"electronics" products in the index. In the Query screen, enter "electronics" (without quotes) in the q box
and hit [ Execute Query ]. You should get 14 results, such as:
{
"responseHeader":{
"zkConnected":true,
"status":0,
"QTime":6,
"params":{
"q":"electronics"}},
"response":{"numFound":14,"start":0,"maxScore":1.5579545,"docs":[
{
"id":"IW-02",
"name":"iPod & iPod Mini USB 2.0 Cable",
"manu":"Belkin",
"manu_id_s":"belkin",
"cat":["electronics",
"connector"],
"features":["car power adapter for iPod, white"],
"weight":2.0,
"price":11.5,
"price_c":"11.50,USD",
"popularity":1,
"inStock":false,
"store":"37.7752,-122.4232",
"manufacturedate_dt":"2006-02-14T23:55:59Z",
"_version_":1574100232554151936,
"price_c____l_ns":1150}]
}}
This search finds all documents that contain the term "electronics" anywhere in the indexed fields. However,
we can see from the above there is a cat field (for "category"). If we limit our search for only documents
with the category "electronics", the results will be more precise for our users.
Update your query in the q field of the Admin UI so it’s cat:electronics. Now you get 12 results:


{
"responseHeader":{
"zkConnected":true,
"status":0,
"QTime":6,
"params":{
"q":"cat:electronics"}},
"response":{"numFound":12,"start":0,"maxScore":0.9614112,"docs":[
{
"id":"SP2514N",
"name":"Samsung SpinPoint P120 SP2514N - hard drive - 250 GB - ATA-133",
"manu":"Samsung Electronics Co. Ltd.",
"manu_id_s":"samsung",
"cat":["electronics",
"hard drive"],
"features":["7200RPM, 8MB cache, IDE Ultra ATA-133",
"NoiseGuard, SilentSeek technology, Fluid Dynamic Bearing (FDB) motor"],
"price":92.0,
"price_c":"92.0,USD",
"popularity":6,
"inStock":true,
"manufacturedate_dt":"2006-02-13T15:26:37Z",
"store":"35.0752,-97.032",
"_version_":1574100232511160320,
"price_c____l_ns":9200}]
}}
Using curl, this query would look like this:

curl "http://localhost:8983/solr/techproducts/select?q=cat:electronics"
Phrase Search
To search for a multi-term phrase, enclose it in double quotes: q="multiple terms here". For example,
search for "CAS latency" by entering that phrase in quotes in the q box in the Admin UI.
If you’re following along with curl, note that the space between terms must be converted to "+" in a URL, as
so:

curl "http://localhost:8983/solr/techproducts/select?q=\"CAS+latency\""
We get 2 results:


{
"responseHeader":{
"zkConnected":true,
"status":0,
"QTime":7,
"params":{
"q":"\"CAS latency\""}},
"response":{"numFound":2,"start":0,"maxScore":5.937691,"docs":[
{
"id":"VDBDB1A16",
"name":"A-DATA V-Series 1GB 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) System Memory - OEM",
"manu":"A-DATA Technology Inc.",
"manu_id_s":"corsair",
"cat":["electronics",
"memory"],
"features":["CAS latency 3, 2.7v"],
"popularity":0,
"inStock":true,
"store":"45.18414,-93.88141",
"manufacturedate_dt":"2006-02-13T15:26:37Z",
"payloads":"electronics|0.9 memory|0.1",
"_version_":1574100232590852096},
{
"id":"TWINX2048-3200PRO",
"name":"CORSAIR XMS 2GB (2 x 1GB) 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) Dual Channel Kit System Memory - Retail",
"manu":"Corsair Microsystems Inc.",
"manu_id_s":"corsair",
"cat":["electronics",
"memory"],
"features":["CAS latency 2, 2-3-3-6 timing, 2.75v, unbuffered, heat-spreader"],
"price":185.0,
"price_c":"185.00,USD",
"popularity":5,
"inStock":true,
"store":"37.7752,-122.4232",
"manufacturedate_dt":"2006-02-13T15:26:37Z",
"payloads":"electronics|6.0 memory|3.0",
"_version_":1574100232584560640,
"price_c____l_ns":18500}]
}}
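
When a phrase query is built in code rather than typed into a shell, the double quotes and the space both need URL encoding. A small Python sketch, where phrase_param is a hypothetical helper:

```python
from urllib.parse import quote_plus, unquote_plus


def phrase_param(phrase):
    """Wrap a phrase in double quotes and URL-encode it for use as q=..."""
    return quote_plus(f'"{phrase}"')


encoded = phrase_param("CAS latency")
print(encoded)                # %22CAS+latency%22
print(unquote_plus(encoded))  # "CAS latency"
```

Here %22 is the percent-encoded form of the double quote, and the space becomes "+", matching the curl example above.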
Combining Searches
By default, when you search for multiple terms and/or phrases in a single query, Solr will only require that
one of them is present in order for a document to match. Documents containing more terms will be sorted
higher in the results list.
You can require that a term or phrase is present by prefixing it with a +; conversely, to disallow the presence
of a term or phrase, prefix it with a -.


To find documents that contain both terms "electronics" and "music", enter +electronics +music in the q
box in the Admin UI Query tab.
If you’re using curl, you must encode the + character because it has a reserved purpose in URLs (encoding
the space character). The encoding for + is %2B as in:

curl "http://localhost:8983/solr/techproducts/select?q=%2Belectronics%20%2Bmusic"
You should only get a single result.
To search for documents that contain the term "electronics" but don’t contain the term "music", enter
+electronics -music in the q box in the Admin UI. For curl, again, URL encode + as %2B as in:

curl "http://localhost:8983/solr/techproducts/select?q=%2Belectronics+-music"
This time you get 13 results.
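
The encoding rules for + and - can be checked with Python's standard library. This sketch uses quote_plus, which encodes a literal + as %2B, encodes a space as +, and leaves - unchanged; encode_query is a hypothetical wrapper for illustration.

```python
from urllib.parse import quote_plus


def encode_query(query):
    """URL-encode a query string for use as the value of the q parameter."""
    return quote_plus(query)


print(encode_query("+electronics +music"))  # %2Belectronics+%2Bmusic
print(encode_query("+electronics -music"))  # %2Belectronics+-music
```

The first result uses + for the space where the curl example above uses %20; both are accepted encodings of a space in a query string.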
More Information on Searching
We have only scratched the surface of the search options available in Solr. For more Solr search options, see
the section on Searching.

Exercise 1 Wrap Up
At this point, you’ve seen how Solr can index data and have done some basic queries. You can choose now
to continue to the next example which will introduce more Solr concepts, such as faceting results and
managing your schema, or you can strike out on your own.
If you decide not to continue with this tutorial, the data we’ve indexed so far is likely of little value to you.
You can delete your installation and start over, or you can use the bin/solr script we started out with to
delete this collection:

bin/solr delete -c techproducts
And then create a new collection:

bin/solr create -c <yourCollection> -s 2 -rf 2
To stop both of the Solr nodes we started, issue the command:

bin/solr stop -all
For more information on start/stop and collection options with bin/solr, see Solr Control Script Reference.

Exercise 2: Modify the Schema and Index Films Data
This exercise will build on the last one and introduce you to the index schema and Solr’s powerful faceting
features.

Restart Solr
Did you stop Solr after the last exercise? No? Then go ahead to the next section.
If you did, though, and need to restart Solr, issue these commands:


./bin/solr start -c -p 8983 -s example/cloud/node1/solr
This starts the first node. When it’s done, start the second node and tell it how to connect to ZooKeeper:

./bin/solr start -c -p 7574 -s example/cloud/node2/solr -z localhost:9983

Create a New Collection
We’re going to use a whole new data set in this exercise, so it would be better to have a new collection
instead of trying to reuse the one we had before.
One reason for this is we’re going to use a feature in Solr called "field guessing", where Solr attempts to
guess what type of data is in a field while it’s indexing it. It also automatically creates new fields in the
schema for new fields that appear in incoming documents. This mode is called "Schemaless". We’ll see the
benefits and limitations of this approach to help you decide how and where to use it in your real application.

What is a "schema" and why do I need one?
Solr’s schema is a single file (in XML) that stores the details about the fields and field types Solr is
expected to understand. The schema defines not only the field or field type names, but also any
modifications that should happen to a field before it is indexed. For example, if you want to ensure that
a user who enters "abc" and a user who enters "ABC" can both find a document containing the term
"ABC", you will want to normalize (lower-case it, in this case) "ABC" when it is indexed, and normalize
the user query to be sure of a match. These rules are defined in your schema.
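
The idea of normalizing at both index time and query time can be shown with a toy example. This Python sketch is not how Solr implements analysis; normalize, build_index, and search are hypothetical names, and lower-casing stands in for whatever rules a schema defines.

```python
def normalize(term):
    """Lower-case a term; stands in for an analysis step defined in a schema."""
    return term.lower()


def build_index(docs):
    """Map each normalized term to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.split():
            index.setdefault(normalize(term), set()).add(doc_id)
    return index


def search(index, user_input):
    """Normalize the query the same way before looking it up."""
    return index.get(normalize(user_input), set())


index = build_index({"doc1": "ABC widget"})
print(search(index, "abc"), search(index, "ABC"))
```

Because the same normalization is applied on both sides, "abc" and "ABC" find the same document.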
Earlier in the tutorial we mentioned copy fields, which are fields made up of data that originated from
other fields. You can also define dynamic fields, which use wildcards (such as *_t or *_s) to dynamically
create fields of a specific field type. These types of rules are also defined in the schema.
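
Wildcard matching for dynamic fields can be pictured as follows. This is a toy Python sketch under stated assumptions: the pattern-to-type pairs in DYNAMIC_FIELDS are hypothetical, and resolve_type is not a Solr function.

```python
from fnmatch import fnmatchcase

# Hypothetical pattern-to-type rules for illustration; a real schema defines its own.
DYNAMIC_FIELDS = [("*_t", "text"), ("*_s", "string")]


def resolve_type(field_name, explicit_fields):
    """Prefer an explicitly defined field, then the first matching wildcard rule."""
    if field_name in explicit_fields:
        return explicit_fields[field_name]
    for pattern, field_type in DYNAMIC_FIELDS:
        if fnmatchcase(field_name, pattern):
            return field_type
    return None


print(resolve_type("series_t", {}))
print(resolve_type("author_s", {}))
```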
When you initially started Solr in the first exercise, we had a choice of a configSet to use. The one we chose
had a schema that was pre-defined for the data we later indexed. This time, we’re going to use a configSet
that has a very minimal schema and let Solr figure out from the data what fields to add.
The data you’re going to index is related to movies, so start by creating a collection named "films" that uses
the _default configSet:

bin/solr create -c films -s 2 -rf 2
Whoa, wait. We didn’t specify a configSet! That’s fine, the _default is appropriately named, since it’s the
default and is used if you don’t specify one at all.
We did, however, set two parameters -s and -rf. Those are the number of shards to split the collection
across (2) and how many replicas to create (2). This is equivalent to the options we had during the
interactive example from the first exercise.
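
As a quick check of what these two values imply, the arithmetic can be written out. The Python sketch below assumes replicas are spread evenly across nodes, which is an assumption made for illustration; replica_layout is a hypothetical helper.

```python
def replica_layout(shards, replicas_per_shard, nodes):
    """Return (total replicas, replicas per node if spread evenly)."""
    total = shards * replicas_per_shard
    # Ceiling division: the busiest node's share when replicas are spread evenly.
    per_node = -(-total // nodes)
    return total, per_node


print(replica_layout(2, 2, 2))  # (4, 2)
```

Two shards with two replicas each give four replicas in total, or two per node on a two-node cluster, which is consistent with the maxShardsPerNode=2 value in the CREATE command printed during the first exercise.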
You should see output like:


WARNING: Using _default configset. Data driven schema functionality is enabled by default, which is
NOT RECOMMENDED for production use.
To turn it off:
curl http://localhost:7574/solr/films/config -d '{"set-user-property": {"update.autoCreateFields":"false"}}'
Connecting to ZooKeeper at localhost:9983 ...
INFO - 2017-07-27 15:07:46.191; org.apache.solr.client.solrj.impl.ZkClientClusterStateProvider; Cluster at localhost:9983 ready
Uploading /7.3.0/server/solr/configsets/_default/conf for config films to ZooKeeper at localhost:9983
Creating new collection 'films' using command:
http://localhost:7574/solr/admin/collections?action=CREATE&name=films&numShards=2&replicationFactor=2&maxShardsPerNode=2&collection.configName=films
{
"responseHeader":{
"status":0,
"QTime":3830},
"success":{
"192.168.0.110:8983_solr":{
"responseHeader":{
"status":0,
"QTime":2076},
"core":"films_shard2_replica_n1"},
"192.168.0.110:7574_solr":{
"responseHeader":{
"status":0,
"QTime":2494},
"core":"films_shard1_replica_n2"}}}
The first thing the command printed was a warning about not using this configSet in production. That’s due
to some of the limitations we’ll cover shortly.
Otherwise, though, the collection should be created. If we go to the Admin UI at
http://localhost:8983/solr/#/films/collection-overview we should see the overview screen.
Preparing Schemaless for the Films Data
There are two parallel things happening with the schema that comes with the _default configSet.
First, we are using a "managed schema", which is configured to only be modified by Solr’s Schema API. That
means we should not hand-edit it so there isn’t confusion about which edits come from which source. Solr’s
Schema API allows us to make changes to fields, field types, and other types of schema rules.
Second, we are using "field guessing", which is configured in the solrconfig.xml file (the file that also holds most of
Solr’s various configuration settings). Field guessing is designed to allow us to start using Solr without
having to define all the fields we think will be in our documents before trying to index them. This is why we
call it "schemaless", because you can start quickly and let Solr create fields for you as it encounters them in
documents.
Sounds great! Well, not really, there are limitations. It’s a bit brute force, and if it guesses wrong, you can’t
change much about a field after data has been indexed without having to reindex. If you only have a few
thousand documents that might not be bad, but if you have millions and millions of documents, or, worse,
don’t have access to the original data anymore, this can be a real problem.
For these reasons, the Solr community does not recommend going to production without a schema that you
have defined yourself. By this we mean that the schemaless features are fine to start with, but you should
still always make sure your schema matches your expectations for how you want your data indexed and how
users are going to query it.
It is possible to mix schemaless features with a defined schema. Using the Schema API, you can define a few
fields that you know you want to control, and let Solr guess others that are less important or which you are
confident (through testing) will be guessed to your satisfaction. That’s what we’re going to do here.
Create the "names" Field

The films data we are going to index has a small number of fields for each movie: an ID, director name(s),
film name, release date, and genre(s).
If you look at one of the files in example/films, you’ll see the first film is named .45, released in 2006. Because
this is the first document in the dataset, Solr will guess the field type from the data in this record. If we go
ahead and index this data, that first film name will indicate to Solr that the field type is a "float"
numeric field, and Solr will create a "name" field with the type FloatPointField. All data after this record will be
expected to be a float.
Well, that’s not going to work. We have titles like A Mighty Wind and Chicken Run, which are strings, decidedly
not numeric and not floats. If we let Solr guess that the "name" field is a float, later
titles will cause an error and indexing will fail. That’s not going to get us very far.
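
The problem can be illustrated with a toy type guesser. This Python sketch is not Solr's field-guessing logic; guess_type and index_names are hypothetical, and the rule "the first value decides the type" is a simplification of the behavior described above.

```python
def guess_type(value):
    """Guess a field type from one string value: integer, float, or string."""
    for type_name, parser in (("integer", int), ("float", float)):
        try:
            parser(value)
            return type_name
        except ValueError:
            continue
    return "string"


def index_names(names):
    """Fix the type from the first value, then list later values that do not fit."""
    field_type = guess_type(names[0])
    if field_type == "string":
        return field_type, []
    rejected = [n for n in names[1:] if guess_type(n) == "string"]
    return field_type, rejected


print(index_names([".45", "A Mighty Wind", "Chicken Run"]))
```

With ".45" first, the field is guessed as a float and both later titles are rejected, which is the failure this section sets out to avoid.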
What we can do is set up the "name" field in Solr before we index the data to be sure Solr always interprets
it as a string. At the command line, enter this curl command:
curl -X POST -H 'Content-type:application/json' \
  --data-binary '{"add-field": {"name":"name", "type":"text_general", "multiValued":false, "stored":true}}' \
  http://localhost:8983/solr/films/schema
This command uses the Schema API to explicitly define a field named "name" that has the field type
"text_general" (a text field). It will not be permitted to have multiple values, but it will be stored (meaning it
can be retrieved by queries).
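As a rough illustration, the same Schema API request could also be prepared from a script. The sketch below uses only Python's standard library; the helper name build_add_field_request is made up for this example, and it assumes the same local Solr URL and "films" collection used in the curl command. It builds the request without sending it.

```python
import json
import urllib.request

SCHEMA_URL = "http://localhost:8983/solr/films/schema"  # same URL as the curl example


def build_add_field_request(name, field_type, multi_valued=False, stored=True):
    """Build (but do not send) a Schema API request that adds one field.

    Hypothetical helper for illustration; it mirrors the JSON body
    used in the curl command above.
    """
    body = {
        "add-field": {
            "name": name,
            "type": field_type,
            "multiValued": multi_valued,
            "stored": stored,
        }
    }
    return urllib.request.Request(
        SCHEMA_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-type": "application/json"},
        method="POST",
    )


request = build_add_field_request("name", "text_general")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To send it to a running Solr instance: urllib.request.urlopen(request)
```

Keeping the payload as a Python dictionary and serializing it with json.dumps avoids the shell quoting issues that can come up when the JSON is typed inline.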
You can also use the Admin UI to create fields, but it offers a bit less control over the properties of your field.
It will work for our case, though:


Creating a field
Create a "catchall" Copy Field

There’s one more change to make before we start indexing.
In the first exercise when we queried the documents we had indexed, we didn’t have to specify a field to
search because the configuration we used was set up to copy fields into a text field, and that field was the
default when no other field was defined in the query.
The configuration we’re using now doesn’t have that rule. We would need to define a field to search for
every query. We can, however, set up a "catchall field" by defining a copy field that will take all data from all
fields and index it into a field named _text_. Let’s do that now.
You can use either the Admin UI or the Schema API for this.
At the command line, use the Schema API again to define a copy field:
curl -X POST -H 'Content-type:application/json' \
  --data-binary '{"add-copy-field" : {"source":"*","dest":"_text_"}}' \
  http://localhost:8983/solr/films/schema
In the Admin UI, choose [ Add Copy Field ], then fill out the source and destination for your field, as in this
screenshot.


Creating a copy field
What this does is make a copy of all fields and put the data into the "_text_" field.



It can be very expensive to do this with your production data because it tells Solr to
effectively index everything twice. It will make indexing slower, and make your index larger.
With your production data, you will want to be sure you only copy fields that really warrant
it for your application.

OK, now we’re ready to index the data and start playing around with it.

Index Sample Film Data
The films data we will index is located in the example/films directory of your installation. It comes in three
formats: JSON, XML and CSV. Pick one of the formats and index it into the "films" collection (in each
example, one command is for Unix/MacOS and the other is for Windows):
To Index JSON Format
bin/post -c films example/films/films.json
C:\solr-7.3.0> java -jar -Dc=films -Dauto example\exampledocs\post.jar example\films\*.json
To Index XML Format
bin/post -c films example/films/films.xml
C:\solr-7.3.0> java -jar -Dc=films -Dauto example\exampledocs\post.jar example\films\*.xml


To Index CSV Format
bin/post -c films example/films/films.csv -params "f.genre.split=true&f.directed_by.split=true&f.genre.separator=|&f.directed_by.separator=|"
C:\solr-7.3.0> java -jar -Dc=films -Dparams=f.genre.split=true&f.directed_by.split=true&f.genre.separator=|&f.directed_by.separator=| -Dauto example\exampledocs\post.jar example\films\*.csv
Each command includes these main parameters:
• -c films: this is the Solr collection to index data to.
• example/films/films.json (or films.xml or films.csv): this is the path to the data file to index. You
could simply supply the directory where this file resides, but since you know the format you want to
index, specifying the exact file for that format is more efficient.
Note the CSV command includes extra parameters. This is to ensure multi-valued entries in the "genre" and
"directed_by" columns are split by the pipe (|) character, used in this file as a separator. Telling Solr to split
these columns this way will ensure proper indexing of the data.
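To make the effect of those split parameters more concrete, here is a small Python sketch of the idea. It is not how Solr's CSV loader is implemented, and the sample row is invented for illustration; it only shows a pipe-delimited cell becoming a list of values for the multi-valued columns.

```python
import csv
import io

# Invented sample data in the same shape as the films CSV: multi-valued
# cells use the pipe character as an internal separator.
SAMPLE_CSV = (
    "name,directed_by,genre\n"
    "Example Film,Director One|Director Two,Comedy|Drama\n"
)

MULTI_VALUED = {"directed_by": "|", "genre": "|"}  # column -> separator


def split_multi_valued(row, separators=MULTI_VALUED):
    """Return a copy of row with the configured columns split into lists."""
    result = dict(row)
    for column, separator in separators.items():
        if column in result:
            result[column] = result[column].split(separator)
    return result


rows = [split_multi_valued(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
print(rows[0]["genre"])        # ['Comedy', 'Drama']
print(rows[0]["directed_by"])  # ['Director One', 'Director Two']
```

Without the split step, the whole cell ("Comedy|Drama") would be treated as a single value, which is why the extra parameters matter for the CSV format.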
Each command will produce output similar to the below seen while indexing JSON:
$ ./bin/post -c films example/films/films.json
/bin/java -classpath /solr-7.3.0/dist/solr-core-7.3.0.jar -Dauto=yes -Dc=films -Ddata=files
org.apache.solr.util.SimplePostTool example/films/films.json
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/films/update...
Entering auto mode. File endings considered are
xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file films.json (application/json) to [base]/json/docs
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/films/update...
Time spent: 0:00:00.878
Hooray!
If you go to the Query screen in the Admin UI for films (http://localhost:8983/solr/#/films/query) and hit
[ Execute Query ] you should see 1100 results, with the first 10 returned to the screen.
Let’s do a query to see if the "catchall" field worked properly. Enter "comedy" in the q box and hit [ Execute
Query ] again. You should get 417 results. Feel free to play around with other searches before we move
on to faceting.

Faceting
One of Solr’s most popular features is faceting. Faceting allows the search results to be arranged into
subsets (or buckets, or categories), providing a count for each subset. There are several types of faceting:
field values, numeric and date ranges, pivots (decision tree), and arbitrary query faceting.


Field Facets
In addition to providing search results, a Solr query can return the number of documents that contain each
unique value in the whole result set.
On the Admin UI Query tab, if you check the facet checkbox, you’ll see a few facet-related options appear:

Facet options in the Query screen
To see facet counts from all documents (q=*:*): turn on faceting (facet=true), and specify the field to facet
on via the facet.field param. If you only want facets, and no document contents, specify rows=0. The curl
command below will return facet counts for the genre_str field:

curl "http://localhost:8983/solr/films/select?q=*:*&rows=0&facet=true&facet.field=genre_str"
In your terminal, you’ll see something like:


{
"responseHeader":{
"zkConnected":true,
"status":0,
"QTime":11,
"params":{
"q":"*:*",
"facet.field":"genre_str",
"rows":"0",
"facet":"true"}},
"response":{"numFound":1100,"start":0,"maxScore":1.0,"docs":[]
},
"facet_counts":{
"facet_queries":{},
"facet_fields":{
"genre_str":[
"Drama",552,
"Comedy",389,
"Romance Film",270,
"Thriller",259,
"Action Film",196,
"Crime Fiction",170,
"World cinema",167]},
"facet_ranges":{},
"facet_intervals":{},
"facet_heatmaps":{}}}
We’ve truncated the output here a little bit, but in the facet_counts section, you see by default you get a
count of the number of documents using each genre for every genre in the index. Solr has a parameter
facet.mincount that you could use to limit the facets to only those that contain a certain number of
documents (this parameter is not shown in the UI). Or, perhaps you do want all the facets, and you’ll let your
application’s front-end control how it’s displayed to users.
If you wanted to control the number of items in a bucket, you could do something like this:

curl "http://localhost:8983/solr/films/select?q=*:*&facet.field=genre_str&facet.mincount=200&facet=on&rows=0"
You should only see 4 facets returned.
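If your application consumes this response, note that each entry in facet_fields is a flat list that alternates values and counts. The following Python sketch pairs them up and applies a minimum count on the client side, using the truncated genre_str data from the response shown earlier. The function names are invented for this example, and Solr itself applies facet.mincount on the server.

```python
def facet_pairs(flat):
    """Turn Solr's flat [value, count, value, count, ...] list into pairs."""
    return list(zip(flat[0::2], flat[1::2]))


def apply_mincount(flat, mincount):
    """Keep only the facet values whose count is at least mincount."""
    return [(value, count) for value, count in facet_pairs(flat) if count >= mincount]


# The (truncated) genre_str counts from the response shown earlier.
genre_str = [
    "Drama", 552,
    "Comedy", 389,
    "Romance Film", 270,
    "Thriller", 259,
    "Action Film", 196,
    "Crime Fiction", 170,
    "World cinema", 167,
]

for value, count in apply_mincount(genre_str, 200):
    print(value, count)
```

With a minimum count of 200, only Drama, Comedy, Romance Film, and Thriller remain, which matches the four facets the query returns.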
There are many other parameters available to help you control how Solr constructs the facets and
facet lists. We’ll cover some of them in this exercise, but you can also see the section Faceting for more
detail.
Range Facets
For numerics or dates, it’s often desirable to partition the facet counts into ranges rather than discrete
values. A prime example of numeric range faceting, using the example techproducts data from our previous
exercise, is price. In the /browse UI, it looks like this:


Range facets
The films data includes the release date for films, and we could use that to create date range facets, which
are another common use for range facets.
The Solr Admin UI doesn’t yet support range facet options, so you will need to use curl or a similar
command-line tool for the following examples.
We can construct a query that looks like this:
curl 'http://localhost:8983/solr/films/select?q=*:*&rows=0'\
'&facet=true'\
'&facet.range=initial_release_date'\
'&facet.range.start=NOW-20YEAR'\
'&facet.range.end=NOW'\
'&facet.range.gap=%2B1YEAR'
This will request all films and ask for them to be grouped by year starting with 20 years ago (our earliest
release date is in 2000) and ending today. Note that this query again URL encodes a + as %2B.
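If you build this kind of URL in code rather than by hand, a URL-encoding helper takes care of the plus sign for you. The sketch below uses Python's standard urllib.parse.urlencode with the same parameters as the curl command; it only builds and prints the URL and assumes the same local Solr address.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8983/solr/films/select"

params = {
    "q": "*:*",
    "rows": 0,
    "facet": "true",
    "facet.range": "initial_release_date",
    "facet.range.start": "NOW-20YEAR",
    "facet.range.end": "NOW",
    "facet.range.gap": "+1YEAR",  # the literal plus sign must survive encoding
}

# urlencode percent-encodes reserved characters, so "+" becomes "%2B"
# instead of being read by the server as an encoded space.
url = BASE_URL + "?" + urlencode(params)
print(url)
```

The printed URL also encodes the characters in *:*, which is equivalent to the unencoded form from the server's point of view.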
In the terminal you will see:


{
"responseHeader":{
"zkConnected":true,
"status":0,
"QTime":8,
"params":{
"facet.range":"initial_release_date",
"facet.limit":"300",
"q":"*:*",
"facet.range.gap":"+1YEAR",
"rows":"0",
"facet":"on",
"facet.range.start":"NOW-20YEAR",
"facet.range.end":"NOW"}},
"response":{"numFound":1100,"start":0,"maxScore":1.0,"docs":[]
},
"facet_counts":{
"facet_queries":{},
"facet_fields":{},
"facet_ranges":{
"initial_release_date":{
"counts":[
"1997-07-28T17:12:06.919Z",0,
"1998-07-28T17:12:06.919Z",0,
"1999-07-28T17:12:06.919Z",48,
"2000-07-28T17:12:06.919Z",82,
"2001-07-28T17:12:06.919Z",103,
"2002-07-28T17:12:06.919Z",131,
"2003-07-28T17:12:06.919Z",137,
"2004-07-28T17:12:06.919Z",163,
"2005-07-28T17:12:06.919Z",189,
"2006-07-28T17:12:06.919Z",92,
"2007-07-28T17:12:06.919Z",26,
"2008-07-28T17:12:06.919Z",7,
"2009-07-28T17:12:06.919Z",3,
"2010-07-28T17:12:06.919Z",0,
"2011-07-28T17:12:06.919Z",0,
"2012-07-28T17:12:06.919Z",1,
"2013-07-28T17:12:06.919Z",1,
"2014-07-28T17:12:06.919Z",1,
"2015-07-28T17:12:06.919Z",0,
"2016-07-28T17:12:06.919Z",0],
"gap":"+1YEAR",
"start":"1997-07-28T17:12:06.919Z",
"end":"2017-07-28T17:12:06.919Z"}},
"facet_intervals":{},
"facet_heatmaps":{}}}
Pivot Facets
Another faceting type is pivot facets, also known as "decision trees", allowing two or more fields to be
nested for all the various possible combinations. Using the films data, pivot facets can be used to see how
many of the films in the "Drama" category (the genre_str field) were directed by each director. Here’s how to get
at the raw data for this scenario:

curl "http://localhost:8983/solr/films/select?q=*:*&rows=0&facet=on&facet.pivot=genre_str,directed_by_str"
This results in the following response, which shows a facet for each category and director combination:
{"responseHeader":{
"zkConnected":true,
"status":0,
"QTime":1147,
"params":{
"q":"*:*",
"facet.pivot":"genre_str,directed_by_str",
"rows":"0",
"facet":"on"}},
"response":{"numFound":1100,"start":0,"maxScore":1.0,"docs":[]
},
"facet_counts":{
"facet_queries":{},
"facet_fields":{},
"facet_ranges":{},
"facet_intervals":{},
"facet_heatmaps":{},
"facet_pivot":{
"genre_str,directed_by_str":[{
"field":"genre_str",
"value":"Drama",
"count":552,
"pivot":[{
"field":"directed_by_str",
"value":"Ridley Scott",
"count":5},
{
"field":"directed_by_str",
"value":"Steven Soderbergh",
"count":5},
{
"field":"directed_by_str",
"value":"Michael Winterbottom",
"count":4}]}]}}}
We’ve truncated this output as well - you will see a lot of genres and directors on your screen.
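Because pivot results are nested, client code usually walks them recursively. The following Python sketch flattens the truncated facet_pivot data from the response into (path, count) rows. The function name flatten_pivot is invented for this example; the "value", "count", and "pivot" keys are the ones visible in the response.

```python
def flatten_pivot(entries, path=()):
    """Yield (values_path, count) for every node in a facet_pivot tree."""
    for entry in entries:
        current = path + (entry["value"],)
        yield current, entry["count"]
        # Nested levels, when present, live under the "pivot" key.
        yield from flatten_pivot(entry.get("pivot", []), current)


# Truncated facet_pivot data copied from the response shown earlier.
pivot = [{
    "field": "genre_str", "value": "Drama", "count": 552,
    "pivot": [
        {"field": "directed_by_str", "value": "Ridley Scott", "count": 5},
        {"field": "directed_by_str", "value": "Steven Soderbergh", "count": 5},
        {"field": "directed_by_str", "value": "Michael Winterbottom", "count": 4},
    ],
}]

for values, count in flatten_pivot(pivot):
    print(" > ".join(values), count)
```

The same function works for pivots over more than two fields, since each extra field simply adds another level of "pivot" nesting.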

Exercise 2 Wrap Up
In this exercise, we learned a little bit more about how Solr organizes data in the indexes, and how to work
with the Schema API to manipulate the schema file. We also learned a bit about facets in Solr, including
range facets and pivot facets. In both of these things, we’ve only scratched the surface of the available
options. If you can dream it, it might be possible!
Like our previous exercise, this data may not be relevant to your needs. We can clean up our work by
deleting the collection. To do that, issue this command at the command line:

bin/solr delete -c films

Exercise 3: Index Your Own Data
For this last exercise, work with a dataset of your choice. This can be files on your local hard drive, a set of
data you have worked with before, or maybe a sample of the data you intend to index to Solr for your
production application.
This exercise is intended to get you thinking about what you will need to do for your application:
• What sorts of data do you need to index?
• What will you need to do to prepare Solr for your data (such as create specific fields, set up copy fields,
determine analysis rules, etc.)?
• What kinds of search options do you want to provide to users?
• How much testing will you need to do to ensure everything works the way you expect?

Create Your Own Collection
Before you get started, create a new collection, named whatever you’d like. In this example, the collection
will be named "localDocs"; if you choose a different name, substitute it in the commands that follow.

./bin/solr create -c localDocs -s 2 -rf 2
Again, as we saw from Exercise 2 above, this will use the _default configSet and all the schemaless features
it provides. As we noted previously, this may cause problems when we index our data. You may need to
iterate on indexing a few times before you get the schema right.

Indexing Ideas
Solr has lots of ways to index data. Choose one of the approaches below and try it out with your system:
Local Files with bin/post
If you have a local directory of files, the Post Tool (bin/post) can index a directory of files. We saw this in
action in our first exercise.
We used only JSON, XML and CSV in our exercises, but the Post Tool can also handle HTML, PDF, Microsoft
Office formats (such as MS Word), plain text, and more.
In this example, assume there is a directory named "Documents" locally. To index it, we would issue a
command like this (correcting the collection name after the -c parameter as needed):

./bin/post -c localDocs ~/Documents
You may get errors as it works through your documents. These might be caused by the field guessing, or
the file type may not be supported. Indexing content such as this demonstrates the need to plan Solr for
your data, which requires understanding it and perhaps also some trial and error.
DataImportHandler
Solr includes a tool called the Data Import Handler (DIH) which can connect to databases (if you have a
JDBC driver), mail servers, or other structured data sources. There are several examples included for
feeds, GMail, and a small HSQL database.
The README.txt file in example/example-DIH will give you details on how to start working with this tool.
SolrJ
SolrJ is a Java-based client for interacting with Solr. Use SolrJ for JVM-based languages or other Solr clients
to programmatically create documents to send to Solr.
Documents Screen
Use the Admin UI Documents tab (at http://localhost:8983/solr/#/localDocs/documents) to paste in a
document to be indexed, or select Document Builder from the Document Type dropdown to build a
document one field at a time. Click on the [ Submit Document ] button below the form to index your
document.

Updating Data
You may notice that even if you index content in this tutorial more than once, it does not duplicate the
results found. This is because the example Solr schema (a file named either managed-schema or schema.xml)
specifies a uniqueKey field called id. Whenever you POST commands to Solr to add a document with the
same value for the uniqueKey as an existing document, it automatically replaces it for you.
You can see that this has happened by looking at the values for numDocs and maxDoc in the core-specific
Overview section of the Solr Admin UI.

numDocs represents the number of searchable documents in the index (and will be larger than the number
of XML, JSON, or CSV files since some files contained more than one document). The maxDoc value may be
larger as the maxDoc count includes logically deleted documents that have not yet been physically removed
from the index. You can re-post the sample files over and over again as much as you want and numDocs will
never increase, because the new documents will constantly be replacing the old.
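The relationship between numDocs and maxDoc can be pictured with a toy model. The Python class below is only a sketch for building intuition, not a description of how Solr or Lucene store documents; the class and method names are invented. It assumes, as described above, that re-adding a document with an existing id replaces the old version, and that replaced versions still count toward maxDoc until they are physically removed.

```python
class ToyIndex:
    """Toy model of uniqueKey replacement; not Solr's actual implementation."""

    def __init__(self):
        self.live = {}    # uniqueKey value -> current document
        self.deleted = 0  # replaced documents not yet physically removed

    def add(self, doc):
        if doc["id"] in self.live:
            self.deleted += 1  # the old version is only marked as deleted
        self.live[doc["id"]] = doc

    @property
    def num_docs(self):
        return len(self.live)

    @property
    def max_doc(self):
        return len(self.live) + self.deleted

    def purge(self):
        """Stand-in for the physical removal of logically deleted documents."""
        self.deleted = 0


index = ToyIndex()
for _ in range(3):  # post the same two documents three times
    index.add({"id": "doc-1", "name": "first"})
    index.add({"id": "doc-2", "name": "second"})

print(index.num_docs, index.max_doc)  # 2 6
```

In this model, posting the same two documents three times leaves numDocs at 2 while maxDoc grows to 6, because four older versions are still waiting to be removed.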
Go ahead and edit any of the existing example data files, change some of the data, and re-run the PostTool
(bin/post). You’ll see your changes reflected in subsequent searches.

Deleting Data
If you need to iterate a few times to get your schema right, you may want to delete documents to clear out
the collection and try again. Note, however, that merely removing documents doesn’t change the
underlying field definitions. Essentially, this will allow you to reindex your data after making changes to
fields for your needs.
You can delete data by POSTing a delete command to the update URL and specifying the value of the
document’s unique key field, or a query that matches multiple documents (be careful with that one!). We
can use bin/post to delete documents also if we structure the request properly.
Execute the following command to delete a specific document:


bin/post -c localDocs -d "<delete><id>SP2514N</id></delete>"
To delete all documents, you can use the "delete-by-query" command like this:

bin/post -c localDocs -d "<delete><query>*:*</query></delete>"
You can also modify the above to only delete documents that match a specific query.
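Solr's XML update format expresses deletes as delete messages that wrap either an id element or a query element. If you generate these messages from a script, it is worth escaping the value so that characters such as & or < do not break the XML. The Python sketch below shows one way to do that; the helper names are invented, and the strings it prints are what you would pass to the -d option of bin/post.

```python
from xml.sax.saxutils import escape


def delete_by_id(doc_id):
    """Build an XML delete message for a single uniqueKey value."""
    return "<delete><id>{}</id></delete>".format(escape(doc_id))


def delete_by_query(query):
    """Build an XML delete message for every document matching a query."""
    return "<delete><query>{}</query></delete>".format(escape(query))


print(delete_by_id("SP2514N"))
print(delete_by_query("*:*"))           # matches everything: be careful
print(delete_by_query("name:R&B"))      # "&" is escaped as "&amp;"
```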

Exercise 3 Wrap Up
At this point, you’re ready to start working on your own.
Jump ahead to the overall wrap up when you’re ready to stop Solr and remove all the examples you worked
with and start fresh.

Spatial Queries
Solr has sophisticated geospatial support, including searching within a specified distance range of a given
location (or within a bounding box), sorting by distance, or even boosting results by the distance.
Some of the example techproducts documents we indexed in Exercise 1 have locations associated with them
to illustrate the spatial capabilities. To re-index this data, see Exercise 1.
Spatial queries can be combined with any other types of queries, such as in this example of querying for
"ipod" within 10 kilometers of San Francisco:

Spatial queries and results
This is from Solr’s example search UI (called /browse), which has a nice feature to show a map for each item
and allow easy selection of the location to search near. You can see this yourself by opening the following URL in a browser:
http://localhost:8983/solr/techproducts/browse?q=ipod&pt=37.7752%2C-122.4232&d=10&sfield=store&fq=%7B%21bbox%7D&queryOpts=spatial&queryOpts=spatial
To learn more about Solr’s spatial capabilities, see the section Spatial Search.
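The percent-encoded URL is easier to read once decoded. As a small illustration, the Python snippet below uses the standard urllib.parse module to split the /browse URL from this section into its parameters. Nothing is sent to Solr; it only shows which values the URL carries.

```python
from urllib.parse import urlsplit, parse_qs

URL = ("http://localhost:8983/solr/techproducts/browse"
       "?q=ipod&pt=37.7752%2C-122.4232&d=10&sfield=store"
       "&fq=%7B%21bbox%7D&queryOpts=spatial&queryOpts=spatial")

params = parse_qs(urlsplit(URL).query)
latitude, longitude = (float(x) for x in params["pt"][0].split(","))

print("query:        ", params["q"][0])
print("center point: ", latitude, longitude)
print("distance (km):", params["d"][0])
print("spatial field:", params["sfield"][0])
print("filter query: ", params["fq"][0])
```

Decoded, the filter query is {!bbox}, which restricts results to a bounding box around the point in pt, using the distance in d and the location field named in sfield.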


Wrapping Up
If you’ve run the full set of commands in this quick start guide, you have done the following:
• Launched Solr in SolrCloud mode with two nodes and two collections, including shards and replicas
• Indexed several types of files
• Used the Schema API to modify your schema
• Opened the admin console, used its query interface to get results
• Opened the /browse interface to explore Solr’s features in a more friendly and familiar interface
Nice work!

Cleanup
As you work through this tutorial, you may want to stop Solr and reset the environment back to the starting
point. The following command line will stop Solr and remove the directories for each of the two nodes that
were created all the way back in Exercise 1:

bin/solr stop -all ; rm -Rf example/cloud/

Where to next?
This Guide will be your best resource for learning more about Solr.
Solr also has a robust community made up of people happy to help you get started. For more information,
check out the Solr website’s Resources page.


A Quick Overview
Solr is a search server built on top of Apache Lucene, an open source, Java-based, information retrieval
library. It is designed to drive powerful document retrieval applications - wherever you need to serve data to
users based on their queries, Solr can work for you.
Here is an example of how Solr could integrate with an application:

Solr integration with applications
In the scenario above, Solr runs alongside other server applications. For example, an online store application
would provide a user interface, a shopping cart, and a way to make purchases for end users; while an
inventory management application would allow store employees to edit product information. The product
metadata would be kept in some kind of database, as well as in Solr.
Solr makes it easy to add the capability to search through the online store through the following steps:
1. Define a schema. The schema tells Solr about the contents of documents it will be indexing. In the online
store example, the schema would define fields for the product name, description, price, manufacturer,
and so on. Solr’s schema is powerful and flexible and allows you to tailor Solr’s behavior to your
application. See Documents, Fields, and Schema Design for all the details.
2. Feed Solr documents for which your users will search.
3. Expose search functionality in your application.
Because Solr is based on open standards, it is highly extensible. Solr queries are simple HTTP request URLs
and the response is a structured document: mainly JSON, but it could also be XML, CSV, or other formats.
This means that a wide variety of clients will be able to use Solr, from other web applications to browser
clients, rich client applications, and mobile devices. Any platform capable of HTTP can talk to Solr. See Client
APIs for details on client APIs.


Solr offers support for the simplest keyword searching through to complex queries on multiple fields and
faceted search results. Searching has more information about searching and queries.
If Solr’s capabilities are not impressive enough, its ability to handle very high-volume applications should do
the trick.
A relatively common scenario is that you have so much data, or so many queries, that a single Solr server is
unable to handle your entire workload. In this case, you can scale up the capabilities of your application
using SolrCloud to better distribute the data, and the processing of requests, across many servers. Multiple
options can be mixed and matched depending on the scalability you need.
For example: "Sharding" is a scaling technique in which a collection is split into multiple logical pieces called
"shards" in order to scale up the number of documents in a collection beyond what could physically fit on a
single server. Incoming queries are distributed to every shard in the collection, which respond with merged
results. Another technique available is to increase the "Replication Factor" of your collection, which allows
you to add servers with additional copies of your collection to handle higher concurrent query load by
spreading the requests around to multiple machines. Sharding and replication are not mutually exclusive,
and together make Solr an extremely powerful and scalable platform.
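The sharding idea can be sketched in a few lines of Python. This is a toy model for intuition only: the hash-based routing, the scoring by term count, and all function names are assumptions made for the example and do not reflect how SolrCloud routes or scores documents. It shows documents being spread across shards, each shard answering a query on its own, and the partial results being merged into one list.

```python
import heapq
import zlib


def shard_for(doc_id, num_shards):
    """Pick a shard from a stable hash of the id (toy routing, not Solr's)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def index_documents(docs, num_shards):
    """Distribute documents across num_shards in-memory lists."""
    shards = [[] for _ in range(num_shards)]
    for doc in docs:
        shards[shard_for(doc["id"], num_shards)].append(doc)
    return shards


def search(shards, term, rows=3):
    """Ask every shard for its best matches, then merge the partial results."""
    def score(doc):
        # Number of occurrences of the term: a stand-in for relevance.
        return doc["text"].count(term)

    partial = []
    for shard in shards:
        hits = [d for d in shard if term in d["text"]]
        partial.append(heapq.nlargest(rows, hits, key=score))
    merged = [d for hits in partial for d in hits]
    return heapq.nlargest(rows, merged, key=score)


docs = [{"id": "doc-%d" % i, "text": "solr " * (i % 4) + "search"} for i in range(12)]
shards = index_documents(docs, num_shards=3)
top = search(shards, "solr")
print(sorted(d["id"] for d in top))
```

Each shard only needs to return its own top results, so the merge step handles at most rows entries per shard, no matter how many documents the collection holds.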
Best of all, this talk about high-volume applications is not just hypothetical: some of the famous Internet
sites that use Solr today are Macy’s, eBay, and Zappos. For more examples, take a look at
https://wiki.apache.org/solr/PublicServers.


Solr System Requirements
You can install Solr in any system where a suitable Java Runtime Environment (JRE) is available, as detailed
below.
Currently this includes Linux, MacOS/OS X, and Microsoft Windows.

Installation Requirements
Java Requirements
You will need the Java Runtime Environment (JRE) version 1.8 or higher. At a command line, check your Java
version like this:
$ java -version
java version "1.8.0_60"
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
The exact output will vary, but you need to make sure you meet the minimum version requirement. We also
recommend choosing a version that is not end-of-life from its vendor. Oracle or OpenJDK are the most
tested JREs and are recommended. It’s also recommended to use the latest available official release when
possible.
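If you want to check the minimum version from a script, the version string needs a little care, because older releases report themselves as "1.8.0_60" (meaning Java 8) while newer ones use strings such as "11.0.2". The Python sketch below parses text in the format shown above. The function name is invented for this example, and the sample text is taken from the output in this section.

```python
import re


def java_major_version(version_output):
    """Extract the major Java version from `java -version` style output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("no version string found")
    first, second = match.group(1), match.group(2)
    # Legacy scheme: "1.8.0_60" means Java 8. Newer scheme: "11.0.2" means Java 11.
    if first == "1" and second is not None:
        return int(second)
    return int(first)


sample = 'java version "1.8.0_60"\nJava(TM) SE Runtime Environment (build 1.8.0_60-b27)'
major = java_major_version(sample)
print(major, "OK" if major >= 8 else "too old for Solr 7.3")
```

When capturing the real output, keep in mind that java -version writes to standard error rather than standard output.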
Some versions of Java VM have bugs that may impact your implementation. To be sure, check the page
Lucene JavaBugs.
If you don’t have the required version, or if the java command is not found, download and install the latest
version from Oracle at http://www.oracle.com/technetwork/java/javase/downloads/index.html.

Supported Operating Systems
Solr is tested on several versions of Linux, MacOS, and Windows.


Installing Solr
Installation of Solr on Unix-compatible or Windows servers generally requires simply extracting (or
unzipping) the download package.
Please be sure to review the Solr System Requirements before starting Solr.

Available Solr Packages
Solr is available from the Solr website. Download the latest release from https://lucene.apache.org/solr/mirrors-solr-latest-redir.html.
There are three separate packages:
• solr-7.3.0.tgz for Linux/Unix/OSX systems
• solr-7.3.0.zip for Microsoft Windows systems
• solr-7.3.0-src.tgz, the packaged Solr source code. This is useful if you want to develop on Solr without
using the official Git repository.

Preparing for Installation
When getting started with Solr, all you need to do is extract the Solr distribution archive to a directory of
your choosing. This will suffice as an initial development environment, but take care not to overtax this "toy"
installation before setting up your true development and production environments.
When you’ve progressed past initial evaluation of Solr, you’ll want to take care to plan your implementation.
You may need to reinstall Solr on another server or make a clustered SolrCloud environment.
When you’re ready to set up Solr for a production environment, please refer to the instructions provided on
the Taking Solr to Production page.
What Size Server Do I Need?



How to size your Solr installation is a complex question that relies on a number of factors,
including the number and structure of documents, how many fields you intend to store, the
number of users, etc.
It’s highly recommended that you spend a bit of time thinking about the factors that will
impact hardware sizing for your Solr implementation. A very good blog post that discusses
the issues to consider is Sizing Hardware in the Abstract: Why We Don’t have a Definitive
Answer.

One thing to note when planning your installation is that a hard limit exists in Lucene for the number of
documents in a single index: approximately 2.14 billion documents (2,147,483,647 to be exact). In practice, it
is highly unlikely that such a large number of documents would fit and perform well in a single index, and
you will likely need to distribute your index across a cluster before you ever approach this number. If you
know you will exceed this number of documents in total before you’ve even started indexing, it’s best to
plan your installation with SolrCloud as part of your design from the start.
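A quick back-of-the-envelope calculation shows what this limit implies for planning. The Python sketch below computes the smallest number of shards needed to keep each one at or under a per-shard cap. The function name and the 200 million "practical" cap in the second call are arbitrary values chosen for illustration, not recommendations from this guide.

```python
MAX_DOCS_PER_INDEX = 2**31 - 1  # 2,147,483,647, the hard limit quoted above


def minimum_shards(total_docs, docs_per_shard=MAX_DOCS_PER_INDEX):
    """Smallest shard count that keeps every shard at or under the cap."""
    if total_docs <= 0:
        return 1
    # Integer ceiling division avoids floating-point rounding on large counts.
    return (total_docs + docs_per_shard - 1) // docs_per_shard


print(MAX_DOCS_PER_INDEX)                           # 2147483647
print(minimum_shards(5_000_000_000))                # 3
print(minimum_shards(5_000_000_000, 200_000_000))   # 25 (hypothetical cap)
```

The first result is only the theoretical floor; as noted above, an index is likely to need distributing well before it approaches the hard limit.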


Package Installation
To keep things simple for now, extract the Solr distribution archive to your local home directory, for instance
on Linux, do:
cd ~/
tar zxf solr-7.3.0.tgz
Once extracted, you are now ready to run Solr using the instructions provided in the Starting Solr section
below.

Directory Layout
After installing Solr, you’ll see the following directories and files within them:
bin/
This directory includes several important scripts that will make using Solr easier.
solr and solr.cmd
This is Solr’s Control Script, also known as bin/solr (*nix) / bin/solr.cmd (Windows). This script is the
preferred tool to start and stop Solr. You can also create collections or cores, configure authentication,
and work with configuration files when running in SolrCloud mode.
post
The PostTool, which provides a simple command line interface for POSTing content to Solr.
solr.in.sh and solr.in.cmd
These are property files for *nix and Windows systems, respectively. System-level properties for Java,
Jetty, and Solr are configured here. Many of these settings can be overridden when using bin/solr /
bin/solr.cmd, but this allows you to set all the properties in one place.
install_solr_services.sh
This script is used on *nix systems to install Solr as a service. It is described in more detail in the
section Taking Solr to Production.
contrib/
Solr’s contrib directory includes add-on plugins for specialized features of Solr.
dist/
The dist directory contains the main Solr .jar files.
docs/
The docs directory includes a link to online Javadocs for Solr.
example/
The example directory includes several types of examples that demonstrate various Solr capabilities. See
the section Solr Examples below for more details on what is in this directory.
licenses/
The licenses directory includes all of the licenses for 3rd party libraries used by Solr.
server/
This directory is where the heart of the Solr application resides. A README in this directory provides a
detailed overview, but here are some highlights:
• Solr’s Admin UI (server/solr-webapp)
• Jetty libraries (server/lib)
• Log files (server/logs) and log configurations (server/resources). See the section Configuring
Logging for more details on how to customize Solr’s default logging.
• Sample configsets (server/solr/configsets)

Solr Examples
Solr includes a number of example documents and configurations to use when getting started. If you ran
through the Solr Tutorial, you have already interacted with some of these files.
Here are the examples included with Solr:
exampledocs
This is a small set of simple CSV, XML, and JSON files that can be used with bin/post when first getting
started with Solr. For more information about using bin/post with these files, see Post Tool.
example-DIH
This directory includes a few example DataImport Handler (DIH) configurations to help you get started
with importing structured content in a database, an email server, or even an Atom feed. Each example
will index a different set of data; see the README there for more details about these examples.
files
The files directory provides a basic search UI for documents such as Word or PDF that you may have
stored locally. See the README there for details on how to use this example.
films
The films directory includes a robust set of data about movies in three formats: CSV, XML, and JSON. See
the README there for details on how to use this dataset.

Starting Solr
Solr includes a command line interface tool called bin/solr (Linux/macOS) or bin\solr.cmd (Windows). This
tool allows you to start and stop Solr, create cores and collections, configure authentication, and check the
status of your system.
To use it to start Solr you can simply enter:
bin/solr start

If you are running Windows, you can start Solr by running bin\solr.cmd instead.
bin\solr.cmd start
This will start Solr in the background, listening on port 8983.
When you start Solr in the background, the script will wait to make sure Solr starts correctly before returning
to the command line prompt.



All of the options for the Solr CLI are described in the section Solr Control Script Reference.

Start Solr with a Specific Bundled Example
Solr also provides a number of useful examples to help you learn about key features. You can launch the
examples using the -e flag. For instance, to launch the "techproducts" example, you would do:
bin/solr -e techproducts
Currently, the available examples you can run are: techproducts, dih, schemaless, and cloud. See the section
Running with Example Configurations for details on each example.



Getting Started with SolrCloud
Running the cloud example starts Solr in SolrCloud mode. For more information on
starting Solr in cloud mode, see the section Getting Started with SolrCloud.

Check if Solr is Running
If you’re not sure if Solr is running locally, you can use the status command:
bin/solr status
This will search for running Solr instances on your computer and then gather basic information about them,
such as the version and memory usage.
That’s it! Solr is running. If you need convincing, use a Web browser to see the Admin Console.

http://localhost:8983/solr/

The Solr Admin interface.
If Solr is not running, your browser will complain that it cannot connect to the server. Check your port
number and try again.

Create a Core
If you did not start Solr with an example configuration, you need to create a core before you can index
and search. You can do so by running:
bin/solr create -c <name>
This will create a core that uses a data-driven schema which tries to guess the correct field type when you
add documents to the index.
To see all available options for creating a new core, execute:
bin/solr create -help

Deployment and Operations
An important aspect of Solr is that all operations and deployment can be done online, with minimal or no
impact to running applications. This includes minor upgrades, provisioning and removing nodes,
backing up and restoring indexes, and editing configurations.
Common administrative tasks include:
Solr Control Script Reference: This section provides information about all of the options available to the
bin/solr / bin\solr.cmd scripts, which can start and stop Solr, configure authentication, and create or
remove collections and cores.
Solr Configuration Files: Overview of the installation layout and major configuration files.
Taking Solr to Production: Detailed steps to help you install Solr as a service and take your application to
production.
Making and Restoring Backups: Describes backup strategies for your Solr indexes.
Running Solr on HDFS: How to use HDFS to store your Solr indexes and transaction logs.
SolrCloud on AWS EC2: A tutorial on deploying Solr in Amazon Web Services (AWS) using EC2 instances.
Upgrading a Solr Cluster: Information for upgrading a production SolrCloud cluster.
Solr Upgrade Notes: Information about changes made in Solr releases.

Solr Control Script Reference
Solr includes a script known as “bin/solr” that allows you to perform many common operations on your Solr
installation or cluster.
You can start and stop Solr, create and delete collections or cores, perform operations on ZooKeeper and
check the status of Solr and configured shards.
You can find the script in the bin/ directory of your Solr installation. The bin/solr script makes Solr easier
to work with by providing simple commands and options to quickly accomplish common goals.
More examples of bin/solr in use are available throughout the Solr Reference Guide, but particularly in the
sections Starting Solr and Getting Started with SolrCloud.

Starting and Stopping
Start and Restart
The start command starts Solr. The restart command restarts Solr whether it is currently running or has
already been stopped.
The start and restart commands have several options to allow you to run in SolrCloud mode, use an
example configuration set, start with a hostname or port that is not the default and point to a local
ZooKeeper ensemble.

bin/solr start [options]
bin/solr start -help
bin/solr restart [options]
bin/solr restart -help
When using the restart command, you must pass all of the parameters you initially passed when you
started Solr. Behind the scenes, a stop request is initiated, so Solr will be stopped before being started again.
If no nodes are already running, restart will skip the step to stop and proceed to starting Solr.
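Because restart needs the same options that were passed at startup, one way to keep the two in step is to hold the options in a single variable. The sketch below only prints the two command lines rather than running them. The option values and the names START_OPTS and solr_cmdline are illustrative assumptions, not part of bin/solr.

```shell
#!/bin/sh
# Options shared by start and restart (illustrative values only).
START_OPTS="-c -p 8983 -m 1g"

# Print the bin/solr command line for the given action ("start" or
# "restart") so that both actions always receive the same options.
solr_cmdline() {
  printf 'bin/solr %s %s\n' "$1" "$START_OPTS"
}

solr_cmdline start
solr_cmdline restart
```

To run a command instead of printing it, the output could be passed to the shell, for example with eval "$(solr_cmdline restart)".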
Start Parameters
The bin/solr script provides many options to allow you to customize the server in common ways, such as
changing the listening port. However, most of the defaults are adequate for most Solr installations,
especially when just getting started.

-a "<string>"
Start Solr with additional JVM parameters, such as those starting with -X. If you are passing JVM
parameters that begin with "-D", you can omit the -a option.
Example:
bin/solr start -a "-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=1044"

-cloud
Start Solr in SolrCloud mode, which will also launch the embedded ZooKeeper instance included with Solr.
This option can be shortened to simply -c.
If you are already running a ZooKeeper ensemble that you want to use instead of the embedded
(single-node) ZooKeeper, you should also pass the -z parameter.
For more details, see the section SolrCloud Mode below.
Example: bin/solr start -c

-d <dir>
Define a server directory, defaults to server (as in, $SOLR_HOME/server). It is uncommon to override this
option. When running multiple instances of Solr on the same host, it is more common to use the same
server directory for each instance and use a unique Solr home directory using the -s option.
Example: bin/solr start -d newServerDir

-e <name>
Start Solr with an example configuration. These examples are provided to help you get started faster with
Solr generally, or just try a specific feature.
The available options are:
• cloud
• techproducts
• dih
• schemaless
See the section Running with Example Configurations below for more details on the example
configurations.
Example: bin/solr start -e schemaless

-f
Start Solr in the foreground; you cannot use this option when running examples with the -e option.
Example: bin/solr start -f

-h <hostname>
Start Solr with the defined hostname. If this is not specified, 'localhost' will be assumed.
Example: bin/solr start -h search.mysolr.com

-m <memory>
Start Solr with the defined value as the min (-Xms) and max (-Xmx) heap size for the JVM.
Example: bin/solr start -m 1g

-noprompt
Start Solr and suppress any prompts that may be seen with another option. This would have the side
effect of accepting all defaults implicitly.
For example, when using the "cloud" example, an interactive session guides you through several options
for your SolrCloud cluster. If you want to accept all of the defaults, you can simply add the -noprompt
option to your request.
Example: bin/solr start -e cloud -noprompt

-p <port>
Start Solr on the defined port. If this is not specified, '8983' will be used.
Example: bin/solr start -p 8655

-s <dir>
Sets the solr.solr.home system property; Solr will create core directories under this directory. This
allows you to run multiple Solr instances on the same host while reusing the same server directory set
using the -d parameter.
If set, the specified directory should contain a solr.xml file, unless solr.xml exists in ZooKeeper. The
default value is server/solr.
This parameter is ignored when running examples (-e), as the solr.solr.home depends on which
example is run.
Example: bin/solr start -s newHome

-v
Be more verbose. This changes the logging level of log4j from INFO to DEBUG, having the same effect as if
you edited log4j.properties accordingly.
Example: bin/solr start -f -v

-q
Be more quiet. This changes the logging level of log4j from INFO to WARN, having the same effect as if you
edited log4j.properties accordingly. This can be useful in a production setting where you want to limit
logging to warnings and errors.
Example: bin/solr start -f -q

-V
Start Solr with verbose messages from the start script.
Example: bin/solr start -V

-z <zkHost>
Start Solr with the defined ZooKeeper connection string. This option is only used with the -c option, to
start Solr in SolrCloud mode. If this option is not provided, Solr will start the embedded ZooKeeper
instance and use that instance for SolrCloud operations.
Example: bin/solr start -c -z server1:2181,server2:2181

-force
If attempting to start Solr as the root user, the script will exit with a warning that running Solr as "root"
can cause problems. It is possible to override this warning with the -force parameter.
Example: sudo bin/solr start -force
To see how the default settings work, note that the following commands are equivalent:

bin/solr start
bin/solr start -h localhost -p 8983 -d server -s solr -m 512m
It is not necessary to define all of the options when starting if the defaults are fine for your needs.
Setting Java System Properties
The bin/solr script will pass any additional parameters that begin with -D to the JVM, which allows you to
set arbitrary Java system properties.
For example, to set the auto soft-commit frequency to 3 seconds, you can do:

bin/solr start -Dsolr.autoSoftCommit.maxTime=3000
SolrCloud Mode
The -c and -cloud options are equivalent:

bin/solr start -c
bin/solr start -cloud
If you specify a ZooKeeper connection string, such as -z 192.168.1.4:2181, then Solr will connect to
ZooKeeper and join the cluster.
If you do not specify the -z option when starting Solr in cloud mode, then Solr will launch an embedded
ZooKeeper server listening on the Solr port + 1000, i.e., if Solr is running on port 8983, then the embedded
ZooKeeper will be listening on port 9983.
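The port rule above is simple arithmetic, and a small helper can save a lookup when you later need a -z value for the embedded ZooKeeper. This is a minimal sketch; the function name embedded_zk_port is an assumption made for illustration.

```shell
#!/bin/sh
# Print the port the embedded ZooKeeper listens on for a given Solr port
# (Solr port + 1000, as described above).
embedded_zk_port() {
  echo $(( $1 + 1000 ))
}

embedded_zk_port 8983   # prints 9983
```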



If your ZooKeeper connection string uses a chroot, such as localhost:2181/solr, then you
need to create the /solr znode before launching SolrCloud using the bin/solr script.
To do this, use the mkroot command outlined below, for example:
bin/solr zk mkroot /solr -z 192.168.1.4:2181
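If the connection string is kept in one variable, the chroot and the host list can be separated with plain shell parameter expansion before calling mkroot. The following is a sketch only; the helper names zk_chroot and zk_hosts are hypothetical and are not provided by Solr.

```shell
#!/bin/sh
# Print the chroot portion of a ZooKeeper connection string, or nothing
# if the string has no chroot.
zk_chroot() {
  case "$1" in
    */*) printf '/%s\n' "${1#*/}" ;;
    *)   ;;
  esac
}

# Print the host list with any chroot removed.
zk_hosts() {
  printf '%s\n' "${1%%/*}"
}

zk_chroot localhost:2181/solr   # prints /solr
zk_hosts localhost:2181/solr    # prints localhost:2181
```

With a connection string in a variable ZK, the mkroot call could then be written as bin/solr zk mkroot "$(zk_chroot "$ZK")" -z "$(zk_hosts "$ZK")".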
When starting in SolrCloud mode, the interactive script session will prompt you to choose a configset to use.
For more information about starting Solr in SolrCloud mode, see also the section Getting Started with
SolrCloud.
Running with Example Configurations

bin/solr start -e <name>
The example configurations allow you to get started quickly with a configuration that mirrors what you hope
to accomplish with Solr.

Each example launches Solr with a managed schema, which allows use of the Schema API to make schema
edits, but does not allow manual editing of a Schema file.
If you would prefer to manually modify a schema.xml file directly, you can change this default as described
in the section Schema Factory Definition in SolrConfig.
Unless otherwise noted in the descriptions below, the examples do not enable SolrCloud nor schemaless
mode.
The following examples are provided:
• cloud: This example starts a 1-4 node SolrCloud cluster on a single machine. When chosen, an interactive
session will start to guide you through options to select the initial configset to use, the number of nodes
for your example cluster, the ports to use, and name of the collection to be created.
When using this example, you can choose from any of the available configsets found in
$SOLR_HOME/server/solr/configsets.
• techproducts: This example starts Solr in standalone mode with a schema designed for the sample
documents included in the $SOLR_HOME/example/exampledocs directory.
The configset used can be found in $SOLR_HOME/server/solr/configsets/sample_techproducts_configs.
• dih: This example starts Solr in standalone mode with the DataImportHandler (DIH) enabled and several
example dataconfig.xml files pre-configured for different types of data supported with DIH (such as
database contents, email, and RSS feeds).
The configset used is customized for DIH, and is found in $SOLR_HOME/example/example-DIH/solr/conf.
For more information about DIH, see the section Uploading Structured Data Store Data with the Data
Import Handler.
• schemaless: This example starts Solr in standalone mode using a managed schema, as described in the
section Schema Factory Definition in SolrConfig, and provides a very minimal pre-defined schema. Solr
will run in Schemaless Mode with this configuration, where Solr will create fields in the schema on the fly
and will guess field types used in incoming documents.
The configset used can be found in $SOLR_HOME/server/solr/configsets/_default.



The run in-foreground option (-f) is not compatible with the -e option since the script
needs to perform additional tasks after starting the Solr server.

Stop
The stop command sends a STOP request to a running Solr node, which allows it to shut down gracefully.
The command will wait up to 180 seconds for Solr to stop gracefully and then will forcefully kill the process
(kill -9).

bin/solr stop [options]
bin/solr stop -help

Stop Parameters

-p <port>
Stop Solr running on the given port. If you are running more than one instance, or are running in
SolrCloud mode, you either need to specify the ports in separate requests or use the -all option.
Example: bin/solr stop -p 8983

-all
Stop all running Solr instances that have a valid PID.
Example: bin/solr stop -all

-k <key>
Stop key used to protect from stopping Solr inadvertently; default is "solrrocks".
Example: bin/solr stop -k solrrocks

System Information
Version
The version command simply returns the version of Solr currently installed and immediately exits.
$ bin/solr version
X.Y.0

Status
The status command displays basic JSON-formatted information for any Solr nodes found running on the
local system.
The status command uses the SOLR_PID_DIR environment variable to locate Solr process ID files to find
running Solr instances, which defaults to the bin directory.

bin/solr status
The output will include a status of each node of the cluster, as in this example:

Found 2 Solr nodes:

Solr process 39920 running on port 7574
{
  "solr_home":"/Applications/Solr/example/cloud/node2/solr/",
  "version":"X.Y.0",
  "startTime":"2015-02-10T17:19:54.739Z",
  "uptime":"1 days, 23 hours, 55 minutes, 48 seconds",
  "memory":"77.2 MB (%15.7) of 490.7 MB",
  "cloud":{
    "ZooKeeper":"localhost:9865",
    "liveNodes":"2",
    "collections":"2"}}

Solr process 39827 running on port 8865
{
  "solr_home":"/Applications/Solr/example/cloud/node1/solr/",
  "version":"X.Y.0",
  "startTime":"2015-02-10T17:19:49.057Z",
  "uptime":"1 days, 23 hours, 55 minutes, 54 seconds",
  "memory":"94.2 MB (%19.2) of 490.7 MB",
  "cloud":{
    "ZooKeeper":"localhost:9865",
    "liveNodes":"2",
    "collections":"2"}}
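When the status output feeds another script, the "Solr process ... running on port ..." lines are the easiest part to pick out. The sketch below extracts one port per node with sed. It assumes the output keeps the line format shown in the sample above; the function name solr_ports is hypothetical.

```shell
#!/bin/sh
# Read `bin/solr status` output on stdin and print the port of each
# running node, one per line. Relies on the "Solr process <pid> running
# on port <port>" line format shown in the sample output.
solr_ports() {
  sed -n 's/^ *Solr process [0-9][0-9]* running on port \([0-9][0-9]*\).*$/\1/p'
}

# Demonstration using two lines copied from the sample output.
solr_ports <<'EOF'
Found 2 Solr nodes:
Solr process 39920 running on port 7574
Solr process 39827 running on port 8865
EOF
```

Against a live installation the call would be bin/solr status | solr_ports.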

Assert
The assert command sanity checks common issues with Solr installations. These include checking the
ownership/existence of particular directories, and ensuring Solr is available on the expected URL. The
command can either output a specified error message, or change its exit code to indicate errors.
As an example:
bin/solr assert --exists /opt/bin/solr
Results in the output below:
ERROR: Directory /opt/bin/solr does not exist.
Use bin/solr assert -help for a full list of options.
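In a provisioning script, one simple way to react to a failed assertion is to look for the "ERROR:" prefix shown in the output above. The helper below is a sketch built on that assumption about the message format; the name no_assert_errors is hypothetical.

```shell
#!/bin/sh
# Succeed only if the text on stdin has no line beginning with "ERROR:".
no_assert_errors() {
  ! grep -q '^ERROR:'
}
```

A call might look like bin/solr assert --exists /opt/bin/solr 2>&1 | no_assert_errors || exit 1, where 2>&1 is included because this sketch does not assume which stream the message is written to.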

Healthcheck
The healthcheck command generates a JSON-formatted health report for a collection when running in
SolrCloud mode. The health report provides information about the state of every replica for all shards in a
collection, including the number of committed documents and its current state.

bin/solr healthcheck [options]
bin/solr healthcheck -help
Healthcheck Parameters

-c <collection>
Name of the collection to run a healthcheck against (required).
Example: bin/solr healthcheck -c gettingstarted

-z <zkHost>
ZooKeeper connection string, defaults to localhost:9983. If you are running Solr on a port other than
8983, you will have to specify the ZooKeeper connection string. By default, this will be the Solr port + 1000.
Example: bin/solr healthcheck -z localhost:2181
Below is an example healthcheck request and response using a non-standard ZooKeeper connect string,
with 2 nodes running:

$ bin/solr healthcheck -c gettingstarted -z localhost:9865

{
  "collection":"gettingstarted",
  "status":"healthy",
  "numDocs":0,
  "numShards":2,
  "shards":[
    {
      "shard":"shard1",
      "status":"healthy",
      "replicas":[
        {
          "name":"core_node1",
          "url":"http://10.0.1.10:8865/solr/gettingstarted_shard1_replica2/",
          "numDocs":0,
          "status":"active",
          "uptime":"2 days, 1 hours, 18 minutes, 48 seconds",
          "memory":"25.6 MB (%5.2) of 490.7 MB",
          "leader":true},
        {
          "name":"core_node4",
          "url":"http://10.0.1.10:7574/solr/gettingstarted_shard1_replica1/",
          "numDocs":0,
          "status":"active",
          "uptime":"2 days, 1 hours, 18 minutes, 42 seconds",
          "memory":"95.3 MB (%19.4) of 490.7 MB"}]},
    {
      "shard":"shard2",
      "status":"healthy",
      "replicas":[
        {
          "name":"core_node2",
          "url":"http://10.0.1.10:8865/solr/gettingstarted_shard2_replica2/",
          "numDocs":0,
          "status":"active",
          "uptime":"2 days, 1 hours, 18 minutes, 48 seconds",
          "memory":"25.8 MB (%5.3) of 490.7 MB"},
        {
          "name":"core_node3",
          "url":"http://10.0.1.10:7574/solr/gettingstarted_shard2_replica1/",
          "numDocs":0,
          "status":"active",
          "uptime":"2 days, 1 hours, 18 minutes, 42 seconds",
          "memory":"95.4 MB (%19.4) of 490.7 MB",
          "leader":true}]}]}
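For monitoring, it can be useful to reduce a report like the one above to just the replicas that need attention. The sketch below lists the URL of every replica whose status is not "active". It assumes the report is laid out one key per line with "url" appearing before "status" inside each replica, as in the sample; a real JSON parser would be more robust. The function name inactive_replicas is hypothetical.

```shell
#!/bin/sh
# Read a healthcheck report on stdin and print the URL of each replica
# whose status is anything other than "active".
inactive_replicas() {
  awk '
    /"url":/ {
      # Remember the URL of the replica currently being read.
      url = $0
      sub(/^.*"url":"/, "", url)
      sub(/".*$/, "", url)
    }
    /"status":/ && url != "" {
      # A status line that follows a URL belongs to that replica.
      status = $0
      sub(/^.*"status":"/, "", status)
      sub(/".*$/, "", status)
      if (status != "active") print url
      url = ""
    }
  '
}
```

A call might look like bin/solr healthcheck -c gettingstarted | inactive_replicas. Collection-level and shard-level status lines are skipped because no replica URL precedes them.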

Collections and Cores
The bin/solr script can also help you create new collections (in SolrCloud mode) or cores (in standalone
mode), or delete collections.

Create a Core or Collection
The create command detects the mode that Solr is running in (standalone or SolrCloud) and then creates a
core or collection depending on the mode.

bin/solr create [options]
bin/solr create -help
Create Core or Collection Parameters

-c <name>
Name of the core or collection to create (required).
Example: bin/solr create -c mycollection

-d <confdir>
The configuration directory. This defaults to _default.
See the section Configuration Directories and SolrCloud below for more details about this option when
running in SolrCloud mode.
Example: bin/solr create -d _default

-n <configName>
The configuration name. This defaults to the same name as the core or collection.
Example: bin/solr create -n basic

-p <port>
Port of a local Solr instance to send the create command to; by default the script tries to detect the port
by looking for running Solr instances.
This option is useful if you are running multiple standalone Solr instances on the same host, thus
requiring you to be specific about which instance to create the core in.
Example: bin/solr create -p 8983

-s <shards> or -shards
Number of shards to split a collection into, default is 1; only applies when Solr is running in SolrCloud
mode.
Example: bin/solr create -s 2

-rf <replicas> or -replicationFactor
Number of copies of each document in the collection. The default is 1 (no replication).
Example: bin/solr create -rf 2

-force
If attempting to run create as "root" user, the script will exit with a warning that running Solr or actions
against Solr as "root" can cause problems. It is possible to override this warning with the -force
parameter.
Example: bin/solr create -c foo -force
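The -s and -rf values multiply: each shard is stored once per replica, so a collection needs shards times replicationFactor cores across the cluster. A quick sketch of that arithmetic, with a hypothetical helper name:

```shell
#!/bin/sh
# Print how many cores a collection needs across the cluster.
# $1 = number of shards (-s), $2 = replication factor (-rf)
total_cores() {
  echo $(( $1 * $2 ))
}

total_cores 2 2   # prints 4
```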
Configuration Directories and SolrCloud
Before creating a collection in SolrCloud, the configuration directory used by the collection must be
uploaded to ZooKeeper. The create command supports several use cases for how collections and
configuration directories work. The main decision you need to make is whether a configuration directory in
ZooKeeper should be shared across multiple collections.
Let’s work through a few examples to illustrate how configuration directories work in SolrCloud.
First, if you don’t provide the -d or -n options, then the default configuration
($SOLR_HOME/server/solr/configsets/_default/conf) is uploaded to ZooKeeper using the same name as
the collection.
For example, the following command will result in the _default configuration being uploaded to
/configs/contacts in ZooKeeper: bin/solr create -c contacts.
If you create another collection with bin/solr create -c contacts2, then another copy of the _default
directory will be uploaded to ZooKeeper under /configs/contacts2.
Any changes you make to the configuration for the contacts collection will not affect the contacts2
collection. Put simply, the default behavior creates a unique copy of the configuration directory for each
collection you create.
You can override the name given to the configuration directory in ZooKeeper by using the -n option. For
instance, the command bin/solr create -c logs -d _default -n basic will upload the
server/solr/configsets/_default/conf directory to ZooKeeper as /configs/basic.
Here the -d option names the configuration to upload explicitly, even though _default is already the
default. Solr provides several built-in configurations under server/solr/configsets. However, you can
also provide the path to your own configuration directory using the -d option. For instance, the
command bin/solr create -c mycoll -d /tmp/myconfigs will upload /tmp/myconfigs into ZooKeeper under
/configs/mycoll.
To reiterate, the configuration directory is named after the collection unless you override it using the -n
option.
Other collections can share the same configuration by specifying the name of the shared configuration
using the -n option. For instance, the following command will create a new collection that shares the basic
configuration created previously: bin/solr create -c logs2 -n basic.
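The naming rule described in this section can be summarized in one line of shell: the configuration lives under /configs/ and takes the -n value if one is given, otherwise the collection name. The helper below is only a model of that rule for illustration; config_znode is not a Solr command.

```shell
#!/bin/sh
# Print the ZooKeeper path a collection's configuration is stored under.
# $1 = collection name, $2 = optional value passed with -n
config_znode() {
  printf '/configs/%s\n' "${2:-$1}"
}

config_znode contacts      # prints /configs/contacts
config_znode logs basic    # prints /configs/basic
config_znode logs2 basic   # prints /configs/basic (shared with logs)
```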
Data-driven Schema and Shared Configurations
The _default schema can mutate as data is indexed, since it has schemaless functionality (i.e., data-driven
changes to the schema). Consequently, we recommend that you do not share data-driven configurations
between collections unless you are certain that all collections should inherit the changes made when
indexing data into one of the collections. You can turn off schemaless functionality (i.e., data-driven changes
to the schema) for a collection by the following (assuming the collection name is mycollection):
curl http://host:8983/solr/mycollection/config -d '{"set-user-property":
{"update.autoCreateFields":"false"}}'

Delete Core or Collection
The delete command detects the mode that Solr is running in (standalone or SolrCloud) and then deletes
the specified core (standalone) or collection (SolrCloud) as appropriate.

bin/solr delete [options]
bin/solr delete -help
If running in SolrCloud mode, the delete command checks if the configuration directory used by the
collection you are deleting is being used by other collections. If not, then the configuration directory is also
deleted from ZooKeeper.
For example, if you created a collection with bin/solr create -c contacts, then the delete command
bin/solr delete -c contacts will check to see if the /configs/contacts configuration directory is being
used by any other collections. If not, then the /configs/contacts directory is removed from ZooKeeper.
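The check described above can be modeled as a lookup over collection and configuration pairs. The sketch below is not how the script itself is implemented; it only illustrates the rule, using a hypothetical helper config_fate that reads "collection configName" pairs on stdin.

```shell
#!/bin/sh
# stdin: one "<collection> <configName>" pair per line.
# $1: the collection being deleted.
# Prints whether the configuration would be kept or removed.
config_fate() {
  awk -v target="$1" '
    NF >= 2 { conf[$1] = $2 }
    END {
      if (!(target in conf)) { print "unknown collection " target; exit 2 }
      for (c in conf) {
        if (c != target && conf[c] == conf[target]) {
          # Another collection still uses this configuration.
          print "keep /configs/" conf[target]
          exit 0
        }
      }
      print "remove /configs/" conf[target]
    }
  '
}

printf 'contacts contacts\nlogs basic\nlogs2 basic\n' | config_fate logs
printf 'contacts contacts\nlogs basic\nlogs2 basic\n' | config_fate contacts
```

The first call prints "keep /configs/basic" because logs2 shares the configuration; the second prints "remove /configs/contacts".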
Delete Core or Collection Parameters

-c <name>
Name of the core / collection to delete (required).
Example: bin/solr delete -c mycoll

-deleteConfig
Whether or not the configuration directory should also be deleted from ZooKeeper. The default is true.
If the configuration directory is being used by another collection, then it will not be deleted even if you
pass -deleteConfig as true.
Example: bin/solr delete -deleteConfig false

-p <port>
The port of a local Solr instance to send the delete command to. By default the script tries to detect the
port by looking for running Solr instances.
This option is useful if you are running multiple standalone Solr instances on the same host, thus
requiring you to be specific about which instance to delete the core from.
Example: bin/solr delete -p 8983

Authentication
The bin/solr script allows enabling or disabling Basic Authentication, allowing you to configure
authentication from the command line.
Currently, this script only enables Basic Authentication, and is only available when using SolrCloud mode.

Enabling Basic Authentication
The command bin/solr auth enable configures Solr to use Basic Authentication for the Admin UI,
bin/solr commands, and API requests.

For more information about Solr’s authentication plugins, see the section Securing Solr. For
more information on Basic Authentication support specifically, see the section Basic
Authentication Plugin.

The bin/solr auth enable command makes several changes to enable Basic Authentication:
• Creates a security.json file and uploads it to ZooKeeper. The security.json file will look similar to:
{
  "authentication":{
    "blockUnknown": false,
    "class":"solr.BasicAuthPlugin",
    "credentials":{"user":"vgGVo69YJeUg/O6AcFiowWsdyOUdqfQvOLsrpIPMCzk= 7iTnaKOWe+Uj5ZfGoKKK2G6hrcF10h6xezMQK+LBvpI="}
  },
  "authorization":{
    "class":"solr.RuleBasedAuthorizationPlugin",
    "permissions":[
      {"name":"security-edit", "role":"admin"},
      {"name":"collection-admin-edit", "role":"admin"},
      {"name":"core-admin-edit", "role":"admin"}
    ],
    "user-role":{"user":"admin"}
  }
}
• Adds two lines to bin/solr.in.sh or bin\solr.in.cmd to set the authentication type, and the path to
basicAuth.conf:
# The following lines added by ./solr for enabling BasicAuth
SOLR_AUTH_TYPE="basic"
SOLR_AUTHENTICATION_OPTS="-Dsolr.httpclient.config=/path/to/solr7.3.0/server/solr/basicAuth.conf"
• Creates the file server/solr/basicAuth.conf to store the credential information that is used with
bin/solr commands.
The command takes the following parameters:

-credentials
The username and password in the format of username:password of the initial user.
If you prefer not to pass the username and password as an argument to the script, you can choose the
-prompt option. Either -credentials or -prompt must be specified.

-prompt
If prompt is preferred, pass true as a parameter to request the script to prompt the user to enter a
username and password.
Either -credentials or -prompt must be specified.

-blockUnknown
When true, blocks all unauthenticated users from accessing Solr. This defaults to false, which means
unauthenticated users will still be able to access Solr.

-updateIncludeFileOnly
When true, only the settings in bin/solr.in.sh or bin\solr.in.cmd will be updated, and security.json
will not be created.

-z
Defines the ZooKeeper connect string. This is useful if you want to enable authentication before all your
Solr nodes have come up.

-d
Defines the Solr server directory, by default $SOLR_HOME/server. You rarely need to override the
default; do so only if you have customized the $SOLR_HOME directory path.

-s
Defines the location of solr.solr.home, which by default is server/solr. If you have multiple instances
of Solr on the same host, or if you have customized the $SOLR_HOME directory path, you likely need to
define this.

Disabling Basic Authentication
You can disable Basic Authentication with bin/solr auth disable.
If the -updateIncludeFileOnly option is set to true, then only the settings in bin/solr.in.sh or
bin\solr.in.cmd will be updated, and security.json will not be removed.
If the -updateIncludeFileOnly option is set to false, then the settings in bin/solr.in.sh or
bin\solr.in.cmd will be updated, and security.json will be removed. However, the basicAuth.conf file is
not removed with either option.

ZooKeeper Operations
The bin/solr script allows certain operations affecting ZooKeeper. These operations are for SolrCloud mode
only. The operations are available as sub-commands, which each have their own set of options.

bin/solr zk [sub-command] [options]
bin/solr zk -help



Solr should have been started at least once before issuing these commands to initialize
ZooKeeper with the znodes Solr expects. Once ZooKeeper is initialized, Solr doesn’t need to
be running on any node to use these commands.

Upload a Configuration Set
Use the zk upconfig command to upload one of the pre-configured configuration sets or a customized
configuration set to ZooKeeper.

ZK Upload Parameters
All parameters below are required.

-n <name>
Name of the configuration set in ZooKeeper. This command will upload the configuration set to the
"configs" ZooKeeper node giving it the name specified.
You can see all uploaded configuration sets in the Admin UI via the Cloud screens. Choose Cloud -> Tree
-> configs to see them.
If a pre-existing configuration set is specified, it will be overwritten in ZooKeeper.
Example: -n myconfig

-d <configset dir>
The path of the configuration set to upload. It should have a conf directory immediately below it that in
turn contains solrconfig.xml etc.
If just a name is supplied, $SOLR_HOME/server/solr/configsets will be checked for this name. An
absolute path may be supplied instead.
Examples:
• -d directory_under_configsets
• -d /path/to/configset/source

-z <zkHost>
The ZooKeeper connection string. Unnecessary if ZK_HOST is defined in solr.in.sh or solr.in.cmd.
Example: -z 123.321.23.43:2181
An example of this command with all of the parameters is:
bin/solr zk upconfig -z 111.222.333.444:2181 -n mynewconfig -d /path/to/configset

Reload Collections When Changing Configurations



This command does not automatically make changes effective! It simply uploads the
configuration sets to ZooKeeper. You can use the Collection API’s RELOAD command to
reload any collections that use this configuration set.
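A reload is a single HTTP request to the Collections API. The sketch below builds the RELOAD URL for one collection; the host, port, and collection name are placeholders, and the helper name reload_url is hypothetical.

```shell
#!/bin/sh
# Print the Collections API URL that reloads one collection.
# $1 = host:port of any Solr node, $2 = collection name
reload_url() {
  printf 'http://%s/solr/admin/collections?action=RELOAD&name=%s\n' "$1" "$2"
}

reload_url localhost:8983 mycoll
```

After an upconfig, the reload could then be issued with curl "$(reload_url localhost:8983 mycoll)", repeated for each collection that uses the configuration set.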

Download a Configuration Set
Use the zk downconfig command to download a configuration set from ZooKeeper to the local filesystem.
ZK Download Parameters
All parameters listed below are required.

-n <name>
Name of config set in ZooKeeper to download. The Admin UI Cloud -> Tree -> configs node lists all
available configuration sets.
Example: -n myconfig

-d <configset dir>
The path to write the downloaded configuration set into. If just a name is supplied,
$SOLR_HOME/server/solr/configsets will be the parent. An absolute path may be supplied as well.
In either case, pre-existing configurations at the destination will be overwritten!
Examples:
• -d directory_under_configsets
• -d /path/to/configset/destination

-z <zkHost>
The ZooKeeper connection string. Unnecessary if ZK_HOST is defined in solr.in.sh or solr.in.cmd.
Example: -z 123.321.23.43:2181
An example of this command with all parameters is:
bin/solr zk downconfig -z 111.222.333.444:2181 -n mynewconfig -d /path/to/configset
A "best practice" is to keep your configuration sets in some form of version control as the system-of-record.
In that scenario, downconfig should rarely be used.

Copy between Local Files and ZooKeeper znodes
Use the zk cp command for transferring files and directories between ZooKeeper znodes and your local
drive. This command will copy from the local drive to ZooKeeper, from ZooKeeper to the local drive or from
ZooKeeper to ZooKeeper.
ZK Copy Parameters

-r
Optional. Do a recursive copy. The command will fail if <src> has children unless '-r' is specified.
Example: -r

<src>
The file or path to copy from. If prefixed with zk: then the source is presumed to be ZooKeeper. If no
prefix or the prefix is 'file:' this is the local drive. At least one of <src> or <dest> must be prefixed by 'zk:'
or the command will fail.
Examples:
• zk:/configs/myconfigs/solrconfig.xml
• file:/Users/apache/configs/src

<dest>
The file or path to copy to. If prefixed with zk: then the destination is presumed to be ZooKeeper. If no prefix
or the prefix is file: this is the local drive.
At least one of <src> or <dest> must be prefixed by zk: or the command will fail. If <dest> ends in a slash
character it names a directory.
Examples:
• zk:/configs/myconfigs/solrconfig.xml
• file:/Users/apache/configs/src

-z <zkHost>
The ZooKeeper connection string. Unnecessary if ZK_HOST is defined in solr.in.sh or solr.in.cmd.
Example: -z 123.321.23.43:2181
Examples of this command with the parameters are:
Recursively copy a directory from local to ZooKeeper.

bin/solr zk cp -r file:/apache/configs/whatever/conf zk:/configs/myconf -z 111.222.333.444:2181
Copy a single file from ZooKeeper to local.

bin/solr zk cp zk:/configs/myconf/managed-schema /configs/myconf/managed-schema -z 111.222.333.444:2181
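The rule that at least one of the source and destination must carry the zk: prefix can be checked before calling the command. The following is a sketch only: the helper name has_zk_endpoint and the myconf_copy znode are hypothetical. It prints a ZooKeeper-to-ZooKeeper copy, the third direction the command supports.

```shell
# Return success when at least one of the two arguments starts with "zk:".
has_zk_endpoint() {
  case "$1" in zk:*) return 0 ;; esac
  case "$2" in zk:*) return 0 ;; esac
  return 1
}

src=zk:/configs/myconf
dest=zk:/configs/myconf_copy   # hypothetical destination znode

if has_zk_endpoint "$src" "$dest"; then
  # Printed rather than executed; drop the echo to run it against ZooKeeper.
  echo bin/solr zk cp -r "$src" "$dest" -z 111.222.333.444:2181
else
  echo "neither path is prefixed with zk:; zk cp would fail" >&2
fi
```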

Remove a znode from ZooKeeper
Use the zk rm command to remove a znode (and optionally all child nodes) from ZooKeeper.
ZK Remove Parameters

-r
Optional. Do a recursive removal. The command will fail if <path> has children unless '-r' is specified.
Example: -r

<path>
The path to remove from ZooKeeper, either a parent or leaf node.
There are limited safety checks: you cannot remove the '/' or '/zookeeper' nodes.
The path is assumed to be a ZooKeeper node; no zk: prefix is necessary.
Examples:
• /configs
• /configs/myconfigset
• /configs/myconfigset/solrconfig.xml

-z <zkHost>
The ZooKeeper connection string. Unnecessary if ZK_HOST is defined in solr.in.sh or solr.in.cmd.
Example: -z 123.321.23.43:2181
Examples of this command with the parameters are:

bin/solr zk rm -r /configs
bin/solr zk rm /configs/myconfigset/schema.xml

Move One ZooKeeper znode to Another (Rename)
Use the zk mv command to move (rename) a ZooKeeper znode.
ZK Move Parameters

<src>
The znode to rename. The zk: prefix is assumed.
Example: /configs/oldconfigset

<dest>
The new name of the znode. The zk: prefix is assumed.
Example: /configs/newconfigset

-z <zkHost>
The ZooKeeper connection string. Unnecessary if ZK_HOST is defined in solr.in.sh or solr.in.cmd.
Example: -z 123.321.23.43:2181
An example of this command is:

bin/solr zk mv /configs/oldconfigset /configs/newconfigset

List a ZooKeeper znode’s Children
Use the zk ls command to see the children of a znode.
ZK List Parameters

-r
Optional. Recursively list all descendants of a znode.
Example: -r

<path>
The path on ZooKeeper to list.
Example: /collections/mycollection

-z <zkHost>
The ZooKeeper connection string. Unnecessary if ZK_HOST is defined in solr.in.sh or solr.in.cmd.
Example: -z 123.321.23.43:2181
An example of this command with the parameters is:

bin/solr zk ls -r /collections/mycollection
bin/solr zk ls /collections

Create a znode (supports chroot)
Use the zk mkroot command to create a znode. The primary use case for this command is to support
ZooKeeper’s "chroot" concept. However, it can also be used to create arbitrary paths.
Create znode Parameters

<path>
The path on ZooKeeper to create. Intermediate znodes will be created if necessary. A leading slash is
assumed even if not specified.
Example: /solr

-z <zkHost>
The ZooKeeper connection string. Unnecessary if ZK_HOST is defined in solr.in.sh or solr.in.cmd.
Example: -z 123.321.23.43:2181
Examples of this command:

bin/solr zk mkroot /solr -z 123.321.23.43:2181
bin/solr zk mkroot /solr/production

Solr Configuration Files
Solr has several configuration files that you will interact with during your implementation.
Many of these files are in XML format, although APIs that interact with configuration settings tend to accept
JSON for programmatic access as needed.

Solr Home
When Solr runs, it needs access to a home directory.
When you first install Solr, your home directory is server/solr. However, some examples may change this
location (for example, if you run bin/solr start -e cloud, your home directory will be example/cloud).
The home directory contains important configuration information and is the place where Solr will store its
index. The layout of the home directory will look a little different when you are running Solr in standalone
mode vs. when you are running in SolrCloud mode.
The crucial parts of the Solr home directory are shown in these examples:
Standalone Mode
<solr-home-directory>/
   solr.xml
   core_name1/
      core.properties
      conf/
         solrconfig.xml
         managed-schema
      data/
   core_name2/
      core.properties
      conf/
         solrconfig.xml
         managed-schema
      data/
SolrCloud Mode
<solr-home-directory>/
   solr.xml
   core_name1/
      core.properties
      data/
   core_name2/
      core.properties
      data/
You may see other files, but the main ones you need to know are discussed in the next section.

Configuration Files
Inside Solr’s Home, you’ll find these files:
• solr.xml specifies configuration options for your Solr server instance. For more information on
solr.xml see Solr Cores and solr.xml.
• Per Solr Core:
◦ core.properties defines specific properties for each core such as its name, the collection the core
belongs to, the location of the schema, and other parameters. For more details on core.properties,
see the section Defining core.properties.
◦ solrconfig.xml controls high-level behavior. You can, for example, specify an alternate location for
the data directory. For more information on solrconfig.xml, see Configuring solrconfig.xml.
◦ managed-schema (or schema.xml instead) describes the documents you will ask Solr to index. The
Schema defines a document as a collection of fields. You get to define both the field types and the
fields themselves. Field type definitions are powerful and include information about how Solr
processes incoming field values and query values. For more information on Solr Schemas, see
Documents, Fields, and Schema Design and the Schema API.
◦ data/ is the directory containing the low-level index files.
Note that the SolrCloud example does not include a conf directory for each Solr Core (so there is no
solrconfig.xml or Schema file). This is because the configuration files usually found in the conf directory
are stored in ZooKeeper so they can be propagated across the cluster.
If you are using SolrCloud with the embedded ZooKeeper instance, you may also see zoo.cfg and zoo.data
which are ZooKeeper configuration and data files. However, if you are running your own ZooKeeper
ensemble, you would supply your own ZooKeeper configuration file when you start it and the copies in Solr
would be unused. For more information about SolrCloud, see the section SolrCloud.

Taking Solr to Production
This section provides guidance on how to set up Solr to run in production on *nix platforms, such as Ubuntu.
Specifically, we’ll walk through the process of setting up to run a single Solr instance on a Linux host and
then provide tips on how to support multiple Solr nodes running on the same host.

Service Installation Script
Solr includes a service installation script (bin/install_solr_service.sh) to help you install Solr as a service
on Linux. Currently, the script only supports CentOS, Debian, Red Hat, SUSE and Ubuntu Linux distributions.
Before running the script, you need to determine a few parameters about your setup. Specifically, you need
to decide where to install Solr and which system user should be the owner of the Solr files and process.

Planning Your Directory Structure
We recommend separating your live Solr files, such as logs and index files, from the files included in the Solr
distribution bundle, as that makes it easier to upgrade Solr and is considered a good practice to follow as a
system administrator.
Solr Installation Directory
By default, the service installation script will extract the distribution archive into /opt. You can change this
location using the -i option when running the installation script. The script will also create a symbolic link to
the versioned directory of Solr. For instance, if you run the installation script for Solr 7.3.0, then the following
directory structure will be used:
/opt/solr-7.3.0
/opt/solr -> /opt/solr-7.3.0
Using a symbolic link insulates any scripts from being dependent on the specific Solr version. If, down the
road, you need to upgrade to a later version of Solr, you can just update the symbolic link to point to the
upgraded version of Solr. We’ll use /opt/solr to refer to the Solr installation directory in the remaining
sections of this page.
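The symbolic-link switch described above can be tried safely in a scratch directory. This is a sketch under stated assumptions: solr-7.3.1 stands in for whichever later version you install, and a temporary directory stands in for /opt. On a real host the same ln -sfn call would be applied to /opt/solr.

```shell
# Demonstrates the symlink switch in a scratch directory instead of /opt.
# solr-7.3.1 is a placeholder for the upgraded version (an assumption).
demo=$(mktemp -d)
mkdir "$demo/solr-7.3.0" "$demo/solr-7.3.1"

ln -s "$demo/solr-7.3.0" "$demo/solr"      # the link as the install script creates it
ln -sfn "$demo/solr-7.3.1" "$demo/solr"    # upgrade: re-point the link in place

readlink "$demo/solr"                      # prints the path ending in solr-7.3.1
```

The -n flag makes ln replace the existing link itself rather than create a new link inside the directory it points to.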
Separate Directory for Writable Files
You should also separate writable Solr files into a different directory; by default, the installation script uses
/var/solr, but you can override this location using the -d option. With this approach, the files in /opt/solr
will remain untouched and all files that change while Solr is running will live under /var/solr.

Create the Solr User
Running Solr as root is not recommended for security reasons, and the control script start command will
refuse to do so. Consequently, you should determine the username of a system user that will own all of the
Solr files and the running Solr process. By default, the installation script will create the solr user, but you can
override this setting using the -u option. If your organization has specific requirements for creating new
user accounts, then you should create the user before running the script. The installation script will make
the Solr user the owner of the /opt/solr and /var/solr directories.

You are now ready to run the installation script.

Run the Solr Installation Script
To run the script, you’ll need to download the latest Solr distribution archive and then do the following:
tar xzf solr-7.3.0.tgz solr-7.3.0/bin/install_solr_service.sh --strip-components=2
The previous command extracts the install_solr_service.sh script from the archive into the current
directory. If installing on Red Hat, please make sure lsof is installed before running the Solr installation
script (sudo yum install lsof). The installation script must be run as root:
sudo bash ./install_solr_service.sh solr-7.3.0.tgz
By default, the script extracts the distribution archive into /opt, configures Solr to write files into /var/solr,
and runs Solr as the solr user. Consequently, the following command produces the same result as the
previous command:
sudo bash ./install_solr_service.sh solr-7.3.0.tgz -i /opt -d /var/solr -u solr -s solr -p 8983
You can customize the service name, installation directories, port, and owner using options passed to the
installation script. To see available options, simply do:
sudo bash ./install_solr_service.sh -help
Once the script completes, Solr will be installed as a service and running in the background on your server
(on port 8983). To verify, you can do:
sudo service solr status
If you do not want to start the service immediately, pass the -n option. You can then start the service
manually later, e.g., after completing the configuration setup.
We’ll cover some additional configuration settings you can make to fine-tune your Solr setup in a moment.
Before moving on, let’s take a closer look at the steps performed by the installation script. This gives you a
better overview and will help you understand important details about your Solr installation when reading
other pages in this guide; such as when a page refers to Solr home, you’ll know exactly where that is on your
system.
Solr Home Directory
The Solr home directory (not to be confused with the Solr installation directory) is where Solr manages core
directories with index files. By default, the installation script uses /var/solr/data. If the -d option is used on
the install script, then this will change to the data subdirectory in the location given to the -d option. Take a
moment to inspect the contents of the Solr home directory on your system. If you do not store solr.xml in
ZooKeeper, the home directory must contain a solr.xml file. When Solr starts up, the Solr Control Script
passes the location of the home directory using the -Dsolr.solr.home=… system property.
Environment Overrides Include File
The service installation script creates an environment-specific include file that overrides defaults used by the
bin/solr script. The main advantage of using an include file is that it provides a single location where all of
your environment-specific overrides are defined. Take a moment to inspect the contents of the
/etc/default/solr.in.sh file, which is the default path set up by the installation script. If you used the -s
option on the install script to change the name of the service, then the first part of the filename will be
different. For a service named solr-demo, the file will be named /etc/default/solr-demo.in.sh. There are
many settings that you can override using this file. However, at a minimum, this script needs to define the
SOLR_PID_DIR and SOLR_HOME variables, such as:
SOLR_PID_DIR=/var/solr
SOLR_HOME=/var/solr/data
The SOLR_PID_DIR variable sets the directory where the control script will write out a file containing the Solr
server’s process ID.
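A quick way to confirm that an include file meets this minimum is to source it in a subshell and test the two variables. This is a sketch only: the function name check_include is hypothetical, and the default path is the one the installation script uses.

```shell
# Succeeds only if the include file at $1 defines both required variables.
check_include() {
  (
    unset SOLR_PID_DIR SOLR_HOME      # ignore values inherited from the caller
    . "$1" || exit 1
    [ -n "${SOLR_PID_DIR:-}" ] && [ -n "${SOLR_HOME:-}" ]
  )
}

include=${1:-/etc/default/solr.in.sh}
if [ -r "$include" ] && check_include "$include"; then
  echo "$include defines SOLR_PID_DIR and SOLR_HOME"
else
  echo "$include is missing, unreadable, or lacks SOLR_PID_DIR/SOLR_HOME" >&2
fi
```

Because the file is sourced in a subshell, none of its settings leak into the calling shell.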
Log Settings
Solr uses Apache Log4J for logging. The installation script copies
/opt/solr/server/resources/log4j.properties to /var/solr/log4j.properties. Take a moment to
verify that the Solr include file is configured to send logs to the correct location by checking the following
settings in /etc/default/solr.in.sh:
LOG4J_PROPS=/var/solr/log4j.properties
SOLR_LOGS_DIR=/var/solr/logs
For more information about Log4J configuration, please see: Configuring Logging
init.d Script
When running a service like Solr on Linux, it’s common to set up an init.d script so that system administrators
can control Solr using the service tool, such as: service solr start. The installation script creates a very
basic init.d script to help you get started. Take a moment to inspect the /etc/init.d/solr file, which is the
default script name set up by the installation script. If you used the -s option on the install script to change
the name of the service, then the filename will be different. Notice that the following variables are set up for
your environment based on the parameters passed to the installation script:
SOLR_INSTALL_DIR=/opt/solr
SOLR_ENV=/etc/default/solr.in.sh
RUNAS=solr
The SOLR_INSTALL_DIR and SOLR_ENV variables should be self-explanatory. The RUNAS variable sets the
owner of the Solr process, such as solr; if you don’t set this value, the script will run Solr as root, which is
not recommended for production. You can use the /etc/init.d/solr script to start Solr by doing the
following as root:

service solr start
The /etc/init.d/solr script also supports the stop, restart, and status commands. Please keep in mind
that the init script that ships with Solr is very basic and is intended to show you how to set up Solr as a
service. However, it’s also common to use more advanced tools like supervisord or upstart to control Solr
as a service on Linux. While showing how to integrate Solr with tools like supervisord is beyond the scope of
this guide, the init.d/solr script should provide enough guidance to help you get started. Also, the
installation script sets the Solr service to start automatically when the host machine initializes.

Progress Check
In the next section, we cover some additional environment settings to help you fine-tune your production
setup. However, before we move on, let’s review what we’ve achieved thus far. Specifically, you should be
able to control Solr using /etc/init.d/solr. Please verify the following commands work with your setup:
sudo service solr restart
sudo service solr status
The status command should give some basic information about the running Solr node that looks similar to:
Solr process PID running on port 8983
{
"version":"5.0.0 - ubuntu - 2014-12-17 19:36:58",
"startTime":"2014-12-19T19:25:46.853Z",
"uptime":"0 days, 0 hours, 0 minutes, 8 seconds",
"memory":"85.4 MB (%17.4) of 490.7 MB"}
If the status command is not successful, look for error messages in /var/solr/logs/solr.log.

Fine-Tune Your Production Setup
Dynamic Defaults for ConcurrentMergeScheduler
The Merge Scheduler is configured in solrconfig.xml and defaults to ConcurrentMergeScheduler. This
scheduler uses multiple threads to merge Lucene segments in the background.
By default, the ConcurrentMergeScheduler auto-detects whether the underlying disk drive is rotational or an
SSD and sets defaults for maxThreadCount and maxMergeCount accordingly. If the disk drive is determined to
be rotational then maxThreadCount is set to 1 and maxMergeCount is set to 6. Otherwise, maxThreadCount
is set to half the number of processors available to the JVM, but no less than 1 and no more than 4, and
maxMergeCount is set to maxThreadCount+5.
This auto-detection works only on Linux and even then it is not guaranteed to be correct. On all other
platforms, the disk is assumed to be rotational. Therefore, if the auto-detection fails or is incorrect then
indexing performance can suffer badly due to the wrong defaults.
The auto-detected value is exposed by the Metrics API with the key
solr.node:CONTAINER.fs.coreRoot.spins. A value of true denotes that the disk is detected to be a
rotational or spinning disk.
It is safer to explicitly set values for maxThreadCount and maxMergeCount in the IndexConfig section of
solrconfig.xml so that values appropriate to your hardware are used.
Alternatively, the boolean system property lucene.cms.override_spins can be set in the SOLR_OPTS
variable in the include file to override the auto-detected value. Similarly, the system property
lucene.cms.override_core_count can be set to the number of CPU cores to override the auto-detected
processor count.
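As a sketch, the two overrides could be appended to SOLR_OPTS in the include file as follows. The values false and 8 are assumptions for a host with SSD storage and 8 CPU cores; substitute what matches your hardware.

```shell
# Lines as they might appear in the include file (for example
# /etc/default/solr.in.sh). The values false and 8 are illustrative assumptions.
SOLR_OPTS="${SOLR_OPTS:-} -Dlucene.cms.override_spins=false"
SOLR_OPTS="$SOLR_OPTS -Dlucene.cms.override_core_count=8"

# Show the resulting option string.
echo "$SOLR_OPTS"
```

Appending to the existing value keeps any options that were set earlier in the file.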

Memory and GC Settings
By default, the bin/solr script sets the maximum Java heap size to 512M (-Xmx512m), which is fine for
getting started with Solr. For production, you’ll want to increase the maximum heap size based on the
memory requirements of your search application; values between 10 and 20 gigabytes are not uncommon
for production servers. When you need to change the memory settings for your Solr server, use the
SOLR_JAVA_MEM variable in the include file, such as:
SOLR_JAVA_MEM="-Xms10g -Xmx10g"
Also, the Solr Control Script comes with a set of pre-configured Java Garbage Collection settings that have
been shown to work well with Solr for a number of different workloads. However, these settings may not work
well for your specific use of Solr. Consequently, you may need to change the GC settings, which should also
be done with the GC_TUNE variable in the /etc/default/solr.in.sh include file. For more information about
tuning your memory and garbage collection settings, see: JVM Settings.
Out-of-Memory Shutdown Hook
The bin/solr script registers the bin/oom_solr.sh script to be called by the JVM if an OutOfMemoryError
occurs. The oom_solr.sh script will issue a kill -9 to the Solr process that experiences the
OutOfMemoryError. This behavior is recommended when running in SolrCloud mode so that ZooKeeper is
immediately notified that a node has experienced a non-recoverable error. Take a moment to inspect the
contents of the /opt/solr/bin/oom_solr.sh script so that you are familiar with the actions the script will
perform if it is invoked by the JVM.

Going to Production with SolrCloud
To run Solr in SolrCloud mode, you need to set the ZK_HOST variable in the include file to point to your
ZooKeeper ensemble. Running the embedded ZooKeeper is not supported in production environments. For
instance, if you have a ZooKeeper ensemble hosted on the following three hosts on the default client port
2181 (zk1, zk2, and zk3), then you would set:
ZK_HOST=zk1,zk2,zk3
When the ZK_HOST variable is set, Solr will launch in "cloud" mode.

ZooKeeper chroot
If you’re using a ZooKeeper instance that is shared by other systems, it’s recommended to isolate the
SolrCloud znode tree using ZooKeeper’s chroot support. For instance, to ensure all znodes created by
SolrCloud are stored under /solr, you can put /solr on the end of your ZK_HOST connection string, such as:
ZK_HOST=zk1,zk2,zk3/solr
Before using a chroot for the first time, you need to create the root path (znode) in ZooKeeper by using the
Solr Control Script. We can use the mkroot command for that:
bin/solr zk mkroot /solr -z <ZK_node>:<ZK_PORT>



If you also want to bootstrap ZooKeeper with an existing solr_home, you can instead use the
zkcli.sh / zkcli.bat bootstrap command, which will also create the chroot path if it does
not exist. See Command Line Utilities for more info.
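The two steps of the chroot setup can be sketched together: create the znode using the plain ensemble address, then append the chroot to ZK_HOST. The helper name zk_host_with_chroot is hypothetical, and the commands are printed rather than executed.

```shell
# Append a chroot path to a comma-separated list of ZooKeeper hosts.
# A missing leading slash on the chroot is added.
zk_host_with_chroot() {
  hosts=$1
  chroot=$2
  case "$chroot" in
    /*) ;;
    *)  chroot="/$chroot" ;;
  esac
  printf '%s%s\n' "$hosts" "$chroot"
}

ensemble=zk1,zk2,zk3

# mkroot is given the ensemble without the chroot; it creates the /solr znode.
echo bin/solr zk mkroot /solr -z "$ensemble"

# The include file then uses the ensemble with the chroot appended.
echo "ZK_HOST=$(zk_host_with_chroot "$ensemble" /solr)"
```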

Solr Hostname
Use the SOLR_HOST variable in the include file to set the hostname of the Solr server.
SOLR_HOST=solr1.example.com
Setting the hostname of the Solr server is recommended, especially when running in SolrCloud mode, as this
determines the address of the node when it registers with ZooKeeper.

Override Settings in solrconfig.xml
Solr allows configuration properties to be overridden using Java system properties passed at startup using
the -Dproperty=value syntax. For instance, in solrconfig.xml, the default auto soft commit settings are set
to:

<autoSoftCommit>
  <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
</autoSoftCommit>

In general, whenever you see a property in a Solr configuration file that uses the
${solr.PROPERTY:DEFAULT_VALUE} syntax, then you know it can be overridden using a Java system property.
For instance, to set the maxTime for soft commits to 10 seconds, you can start Solr with
-Dsolr.autoSoftCommit.maxTime=10000, such as:
bin/solr start -Dsolr.autoSoftCommit.maxTime=10000
The bin/solr script simply passes options starting with -D on to the JVM during startup. For running in
production, we recommend setting these properties in the SOLR_OPTS variable defined in the include file.

Keeping with our soft-commit example, in /etc/default/solr.in.sh, you would do:
SOLR_OPTS="$SOLR_OPTS -Dsolr.autoSoftCommit.maxTime=10000"

File Handles and Processes (ulimit settings)
Two common settings that result in errors on *nix systems are file handles and user processes.
The default limits for the number of processes and file handles are often too
low for a large Solr installation. The required number of each of these will increase based on a combination
of the number of replicas hosted per node and the number of segments in the index for each replica.
The usual recommendation is to make processes and file handles at least 65,000 each, unlimited if possible.
On most *nix systems, this command will show the currently-defined limits:
ulimit -a
It is strongly recommended that file handle and process limits be permanently raised as above. The exact
form of the command will vary per operating system, and some systems require editing configuration files
and restarting your server. Consult your system administrators for guidance in your particular environment.



If these limits are exceeded, the problems reported by Solr vary depending on the specific
operation responsible for exceeding the limit. Errors such as "too many open files",
"connection error", and "max processes exceeded" have been reported, as well as
SolrCloud recovery failures.
Since exceeding these limits can result in such varied symptoms it is strongly recommended
that these limits be permanently raised as recommended above.
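The recommended minimum can be checked with a short script. This is a sketch: the function name limit_ok is hypothetical, 65000 is the figure recommended above, and ulimit -u is not available in every shell, so a failed query is reported as unknown.

```shell
# Succeeds when a ulimit value is "unlimited" or at least the given minimum.
limit_ok() {
  [ "$1" = "unlimited" ] && return 0
  case "$1" in
    ''|*[!0-9]*) return 1 ;;      # not a number (for example, the query failed)
  esac
  [ "$1" -ge "$2" ]
}

min=65000
files=$(ulimit -n 2>/dev/null) || files=unknown
procs=$(ulimit -u 2>/dev/null) || procs=unknown   # -u is not available in every shell

for pair in "open files:$files" "user processes:$procs"; do
  name=${pair%%:*}
  value=${pair#*:}
  if limit_ok "$value" "$min"; then
    echo "$name: $value (ok)"
  else
    echo "$name: $value (below $min or unknown)"
  fi
done
```

Limits can differ between users and sessions, so the check is most meaningful when run as the user that owns the Solr process.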

Running Multiple Solr Nodes per Host
The bin/solr script is capable of running multiple instances on one machine, but for a typical installation,
this is not a recommended setup. Extra CPU and memory resources are required for each additional
instance. A single instance is easily capable of handling multiple indexes.

When to ignore the recommendation
For every recommendation, there are exceptions. For the recommendation above, that
exception is mostly applicable when discussing extreme scalability. The best reason for
running multiple Solr nodes on one host is decreasing the need for extremely large heaps.



When the Java heap gets very large, it can result in extremely long garbage collection
pauses, even with the GC tuning that the startup script provides by default. The exact point
at which the heap is considered "very large" will vary depending on how Solr is used. This
means that there is no hard number that can be given as a threshold, but if your heap is
reaching the neighborhood of 16 to 32 gigabytes, it might be time to consider splitting
nodes. Ideally this would mean more machines, but budget constraints might make that
impossible.
There is another issue once the heap reaches 32GB. Below 32GB, Java is able to use
compressed pointers, but above that point, larger pointers are required, which uses more
memory and slows down the JVM.
Because of the potential garbage collection issues and the particular issues that happen at
32GB, if a single instance would require a 64GB heap, performance is likely to improve
greatly if the machine is set up with two nodes that each have a 31GB heap.

If your use case requires multiple instances, at a minimum you will need unique Solr home directories for
each node you want to run; ideally, each home should be on a different physical disk so that multiple Solr
nodes don’t have to compete with each other when accessing files on disk. Having different Solr home
directories implies that you’ll need a different include file for each node. Moreover, if using the
/etc/init.d/solr script to control Solr as a service, then you’ll need a separate script for each node. The
easiest approach is to use the service installation script to add multiple services on the same host, such as:
sudo bash ./install_solr_service.sh solr-7.3.0.tgz -s solr2 -p 8984
The command shown above will add a service named solr2 running on port 8984 using /var/solr2 for
writable (aka "live") files; the second server will still be owned and run by the solr user and will use the Solr
distribution files in /opt. After installing the solr2 service, verify it works correctly by doing:
sudo service solr2 restart
sudo service solr2 status

Making and Restoring Backups
If you are worried about data loss, and of course you should be, you need a way to back up your Solr indexes
so that you can recover quickly in case of catastrophic failure.
Solr provides two approaches to backing up and restoring Solr cores or collections, depending on how you
are running Solr. If you run in SolrCloud mode, you will use the Collections API. If you run Solr in standalone
mode, you will use the replication handler.

SolrCloud Backups
Support for backups when running SolrCloud is provided with the Collections API. This allows the backups to
be generated across multiple shards, and restored to the same number of shards and replicas as the
original collection.
Two commands are available:
• action=BACKUP: This command backs up Solr indexes and configurations. More information is available
in the section Backup Collection.
• action=RESTORE: This command restores Solr indexes and configurations. More information is available
in the section Restore Collection.

Standalone Mode Backups
Backups and restoration use Solr’s replication handler. Out of the box, Solr includes implicit support for
replication so this API can be used. Configuration of the replication handler can, however, be customized by
defining your own replication handler in solrconfig.xml. For details on configuring the replication handler,
see the section Configuring the ReplicationHandler.

Backup API
The backup API requires sending a command to the /replication handler to back up the system.
You can trigger a back-up with an HTTP command like this (replace "gettingstarted" with the name of the
core you are working with):
Backup API Example
http://localhost:8983/solr/gettingstarted/replication?command=backup
The backup command is an asynchronous call, and it will represent data from the latest index commit point.
All indexing and search operations will continue to be executed against the index as usual.
Only one backup call can be made against a core at any point in time. While a backup operation is
in progress, subsequent calls for restoring will throw an exception.
The backup request can also take the following additional parameters:

location
The path where the backup will be created. If the path is not absolute then the backup path will be
relative to Solr’s instance directory.

name
The snapshot will be created in a directory called snapshot.<name>. If a name is not specified then the
directory name would have the following format: snapshot.<yyyyMMddHHmmssSSS>.

numberToKeep
The number of backups to keep. If maxNumberOfBackups has been specified on the replication handler in
solrconfig.xml, maxNumberOfBackups is always used and attempts to use numberToKeep will cause an
error. Also, this parameter is not taken into consideration if the backup name is specified. More
information about maxNumberOfBackups can be found in the section Configuring the ReplicationHandler.

repository
The name of the repository to be used for the backup. If no repository is specified then the local
filesystem repository will be used automatically.

commitName
The name of the commit which was used while taking a snapshot using the CREATESNAPSHOT command.
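As a minimal sketch of how the optional parameters combine with the base command, the following Python helper assembles a backup request URL. The helper name build_backup_url and the default base URL are assumptions made for illustration; only the parameter names come from the list above, and nothing is sent to Solr.

```python
from urllib.parse import urlencode

# Parameter names taken from the list above; everything else is illustrative.
BACKUP_PARAMS = ("location", "name", "numberToKeep", "repository", "commitName")

def build_backup_url(core, base="http://localhost:8983/solr", **params):
    """Assemble a /replication backup URL for one core (hypothetical helper)."""
    unknown = sorted(set(params) - set(BACKUP_PARAMS))
    if unknown:
        raise ValueError("unsupported backup parameter(s): " + ", ".join(unknown))
    query = {"command": "backup"}
    # Keep only the parameters the caller actually supplied.
    query.update({key: value for key, value in params.items() if value is not None})
    return "{}/{}/replication?{}".format(base.rstrip("/"), core, urlencode(query))

print(build_backup_url("gettingstarted", name="my_backup", location="/var/backups/solr"))
```

The resulting URL can be requested with any HTTP client. Values such as paths are percent-encoded by urlencode.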

Backup Status
The backup operation can be monitored to see if it has completed by sending the details command to the
/replication handler, as in this example:
Status API Example
http://localhost:8983/solr/gettingstarted/replication?command=details&wt=xml
Output Snippet

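<lst name="backup">
  <str name="startTime">Sun Apr 12 16:22:50 DAVT 2015</str>
  <int name="fileCount">10</int>
  <str name="status">success</str>
  <str name="snapshotCompletedAt">Sun Apr 12 16:22:50 DAVT 2015</str>
  <str name="snapshotName">my_backup</str>
</lst>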

If it failed then a snapShootException will be sent in the response.
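A script that monitors a backup needs to pull the backup section out of the details response. The sketch below assumes the response uses Solr's usual XML layout, in which named lst, str and int elements carry the values, and that the backup section is a lst named backup. The sample document is hand-written for illustration and is not captured server output.

```python
import xml.etree.ElementTree as ET

def backup_status(details_xml):
    """Return the backup section of a details response as a dict, or None."""
    root = ET.fromstring(details_xml)
    backup = root.find(".//lst[@name='backup']")
    if backup is None:
        return None  # no backup information reported
    return {child.get("name"): child.text for child in backup}

# Hand-written sample in the assumed response layout.
SAMPLE_DETAILS = """<response>
  <lst name="details">
    <lst name="backup">
      <str name="startTime">Sun Apr 12 16:22:50 DAVT 2015</str>
      <int name="fileCount">10</int>
      <str name="status">success</str>
      <str name="snapshotName">my_backup</str>
    </lst>
  </lst>
</response>"""

info = backup_status(SAMPLE_DETAILS)
failed = info is not None and "snapShootException" in info
print(info["status"], failed)
```

All values are returned as strings. Treating the presence of a snapShootException entry as the failure signal follows the sentence above, but its exact placement in the response is an assumption.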

Restore API
Restoring the backup requires sending the restore command to the /replication handler, followed by the
name of the backup to restore.
You can restore from a backup with a command like this:
Example Usage
http://localhost:8983/solr/gettingstarted/replication?command=restore&name=backup_name

This will restore the named index snapshot into the current core. Searches will start reflecting the snapshot
data once the restore is complete.
The restore request can take these additional parameters:

location
The location of the backup snapshot file. If not specified, it looks for backups in Solr’s data directory.

name
The name of the backed-up index snapshot to be restored. If the name is not provided, Solr looks for
backups with the snapshot.<timestamp> format in the location directory and picks the backup with the
latest timestamp.

repository
The name of the repository to be used for the backup. If no repository is specified then the local
filesystem repository will be used automatically.
The restore command is an asynchronous call. Once the restore is complete, the core will reflect the data of the
backed-up index that was restored.
Only one restore call can be made against a core at any point in time. While a restore
operation is in progress, subsequent calls for restoring will throw an exception.

Restore Status API
You can also check the status of a restore operation by sending the restorestatus command to the
/replication handler, as in this example:
Status API Example
http://localhost:8983/solr/gettingstarted/replication?command=restorestatus&wt=xml
Status API Output


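<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
  </lst>
  <lst name="restorestatus">
    <str name="snapshotName">snapshot.<name></str>
    <str name="status">success</str>
  </lst>
</response>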


The status value can be "In Progress", "success" or "failed". If it failed then an "exception" will also be sent
in the response.
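Because restore is asynchronous, a client typically polls restorestatus until the status leaves "In Progress". The sketch below shows one way to do that. It assumes the status value sits in a str element named status inside a lst named restorestatus. The fetch argument is a caller-supplied function returning the response XML, so the polling logic stays independent of any particular HTTP client.

```python
import time
import xml.etree.ElementTree as ET

def restore_state(status_xml):
    """Return the status string from a restorestatus response, or None."""
    root = ET.fromstring(status_xml)
    node = root.find(".//lst[@name='restorestatus']/str[@name='status']")
    return node.text if node is not None else None

def wait_for_restore(fetch, interval=1.0, max_polls=60):
    """Call fetch() repeatedly until the restore succeeds or fails."""
    for _ in range(max_polls):
        state = restore_state(fetch())
        if state in ("success", "failed"):
            return state
        time.sleep(interval)  # still "In Progress" (or no status yet)
    raise TimeoutError("restore still in progress after %d polls" % max_polls)
```

In practice fetch would request the restorestatus URL shown above and return the response body. The interval and max_polls defaults are arbitrary illustration values.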

Create Snapshot API
The snapshot functionality is different from the backup functionality as the index files aren’t copied
anywhere. The index files are snapshotted in the same index directory and can be referenced while taking
backups.
You can trigger a snapshot command with an HTTP command like this (replace "techproducts" with the
name of the core you are working with):
Create Snapshot API Example
http://localhost:8983/solr/admin/cores?action=CREATESNAPSHOT&core=techproducts&commitName=commit1
The CREATESNAPSHOT request parameters are:

commitName
The name to store the snapshot as.

core
The name of the core to perform the snapshot on.

async
Request ID to track this action which will be processed asynchronously.

List Snapshot API
The LISTSNAPSHOTS command lists all snapshots taken for a particular core.
You can trigger a list snapshot command with an HTTP command like this (replace "techproducts" with the
name of the core you are working with):
List Snapshot API Example
http://localhost:8983/solr/admin/cores?action=LISTSNAPSHOTS&core=techproducts
The list snapshot request parameters are:

core
The name of the core whose snapshots we want to list.

async
Request ID to track this action which will be processed asynchronously.

Delete Snapshot API
The DELETESNAPSHOT command deletes a snapshot for a particular core.
You can trigger a delete snapshot with an HTTP command like this (replace "techproducts" with the name of
the core you are working with):
Delete Snapshot API Example
http://localhost:8983/solr/admin/cores?action=DELETESNAPSHOT&core=techproducts&commitName=commit1

The delete snapshot request parameters are:

commitName
The name of the commit (snapshot) to be deleted.

core
The name of the core whose snapshot we want to delete.

async
Request ID to track this action which will be processed asynchronously
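The three snapshot commands share the same endpoint and differ only in the action and in whether commitName applies. The following sketch builds the corresponding URLs. The function name snapshot_url is hypothetical, and treating commitName as mandatory for create and delete is an assumption of this sketch rather than something stated above.

```python
from urllib.parse import urlencode

SNAPSHOT_ACTIONS = ("CREATESNAPSHOT", "LISTSNAPSHOTS", "DELETESNAPSHOT")

def snapshot_url(action, core, commit_name=None, async_id=None,
                 base="http://localhost:8983/solr"):
    """Build a CoreAdmin snapshot request URL (illustrative helper)."""
    if action not in SNAPSHOT_ACTIONS:
        raise ValueError("unknown snapshot action: " + action)
    takes_commit = action != "LISTSNAPSHOTS"
    if takes_commit and not commit_name:
        raise ValueError("commitName is needed for " + action)
    query = {"action": action, "core": core}
    if takes_commit:
        query["commitName"] = commit_name
    if async_id:
        query["async"] = async_id  # request ID for asynchronous tracking
    return "{}/admin/cores?{}".format(base.rstrip("/"), urlencode(query))

print(snapshot_url("CREATESNAPSHOT", "techproducts", "commit1"))
print(snapshot_url("LISTSNAPSHOTS", "techproducts"))
```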

Backup/Restore Storage Repositories
Solr provides interfaces to plug in different storage systems for backing up and restoring. For example, you
can have a Solr cluster running on a local filesystem like EXT3 but back up the indexes to an HDFS
filesystem, or vice versa.
The repository interfaces need to be configured in the solr.xml file. When running backup/restore
commands, you can specify the repository to be used.
If no repository is configured then the local filesystem repository will be used automatically.
Example solr.xml section to configure a repository like HDFS:


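<backup>
  <repository name="hdfs" class="org.apache.solr.core.backup.repository.HdfsBackupRepository" default="false">
    <str name="location">${solr.hdfs.default.backup.path}</str>
    <str name="solr.hdfs.home">${solr.hdfs.home:}</str>
    <str name="solr.hdfs.confdir">${solr.hdfs.confdir:}</str>
  </repository>
</backup>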




Running Solr on HDFS
Solr has support for writing and reading its index and transaction log files to the HDFS distributed
filesystem.
This does not use Hadoop MapReduce to process Solr data; rather, it only uses the HDFS filesystem for index
and transaction log file storage.
To use HDFS rather than a local filesystem, you must be using Hadoop 2.x and you will need to instruct Solr
to use the HdfsDirectoryFactory. There are also several additional parameters to define. These can be set
in one of three ways:
• Pass JVM arguments to the bin/solr script. These would need to be passed every time you start Solr
with bin/solr.
• Modify solr.in.sh (or solr.in.cmd on Windows) to pass the JVM arguments automatically when using
bin/solr without having to set them manually.
• Define the properties in solrconfig.xml. These configuration changes would need to be repeated for
every collection, so this is a good option if you only want some of your collections stored in HDFS.

Starting Solr on HDFS
Standalone Solr Instances
For standalone Solr instances, there are a few parameters you should modify before starting Solr. These can
be set in solrconfig.xml (more on that below), or passed to the bin/solr script at startup.
• You need to use an HdfsDirectoryFactory and a data directory in the form hdfs://host:port/path
• You need to specify an updateLog location in the form hdfs://host:port/path
• You should specify a lock factory type of 'hdfs' or none.
If you do not modify solrconfig.xml, you can instead start Solr on HDFS with the following command:
bin/solr start -Dsolr.directoryFactory=HdfsDirectoryFactory \
-Dsolr.lock.type=hdfs \
-Dsolr.data.dir=hdfs://host:port/path \
-Dsolr.updatelog=hdfs://host:port/path
This example will start Solr in standalone mode, using the defined JVM properties (explained in more detail
below).

SolrCloud Instances
In SolrCloud mode, it’s best to leave the data and update log directories as the defaults Solr comes with and
simply specify the solr.hdfs.home. All dynamically created collections will create the appropriate directories
automatically under the solr.hdfs.home root directory.
• Set solr.hdfs.home in the form hdfs://host:port/path
• You should specify a lock factory type of 'hdfs' or none.
bin/solr start -c -Dsolr.directoryFactory=HdfsDirectoryFactory \
-Dsolr.lock.type=hdfs \
-Dsolr.hdfs.home=hdfs://host:port/path
This command starts Solr in SolrCloud mode, using the defined JVM properties.
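For deployments that launch Solr from a script, the standalone and SolrCloud variants above can be generated from one place. This is a minimal sketch: the function name and the hdfs:// prefix check are assumptions of the sketch, while the flags are the ones shown in the commands above.

```python
def hdfs_start_command(hdfs_home=None, data_dir=None, update_log=None, cloud=False):
    """Build the bin/solr argument list for starting Solr on HDFS (illustrative)."""
    for value in (hdfs_home, data_dir, update_log):
        if value is not None and not value.startswith("hdfs://"):
            raise ValueError("expected an hdfs://host:port/path location, got " + value)
    command = ["bin/solr", "start"]
    if cloud:
        command.append("-c")  # SolrCloud mode
    command += ["-Dsolr.directoryFactory=HdfsDirectoryFactory", "-Dsolr.lock.type=hdfs"]
    if hdfs_home:
        command.append("-Dsolr.hdfs.home=" + hdfs_home)
    if data_dir:
        command.append("-Dsolr.data.dir=" + data_dir)
    if update_log:
        command.append("-Dsolr.updatelog=" + update_log)
    return command

# Placeholder namenode address; replace with your own host, port and path.
print(" ".join(hdfs_start_command(hdfs_home="hdfs://namenode:8020/solr", cloud=True)))
```

The list form can be passed directly to subprocess.run from the Solr installation directory.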

Modifying solr.in.sh (*nix) or solr.in.cmd (Windows)
The examples above assume you will pass JVM arguments as part of the start command every time you use
bin/solr to start Solr. However, bin/solr looks for an include file named solr.in.sh (solr.in.cmd on
Windows) to set environment variables. By default, this file is found in the bin directory, and you can modify
it to permanently add the HdfsDirectoryFactory settings and ensure they are used every time Solr is
started.
For example, to set JVM arguments to always use HDFS when running in SolrCloud mode (as shown above),
you would add a section such as this:
# Set HDFS DirectoryFactory & Settings
SOLR_OPTS="$SOLR_OPTS -Dsolr.directoryFactory=HdfsDirectoryFactory \
-Dsolr.lock.type=hdfs \
-Dsolr.hdfs.home=hdfs://host:port/path"

The Block Cache
For performance, the HdfsDirectoryFactory uses a Directory that will cache HDFS blocks. This caching
mechanism replaces the standard file system cache that Solr utilizes. By default, this cache is allocated
off-heap. This cache will often need to be quite large and you may need to raise the off-heap memory limit for
the specific JVM you are running Solr in. For the Oracle/OpenJDK JVMs, the following is an example
command-line parameter that you can use to raise the limit when starting Solr:
-XX:MaxDirectMemorySize=20g

HdfsDirectoryFactory Parameters
The HdfsDirectoryFactory has a number of settings defined as part of the directoryFactory
configuration.

Solr HDFS Settings
solr.hdfs.home
A root location in HDFS for Solr to write collection data to. Rather than specifying an HDFS location for the
data directory or update log directory, use this to specify one root location and have everything
automatically created within this HDFS location. The structure of this parameter is
hdfs://host:port/path/solr.

Block Cache Settings
solr.hdfs.blockcache.enabled
Enable the blockcache. The default is true.

solr.hdfs.blockcache.read.enabled
Enable the read cache. The default is true.

solr.hdfs.blockcache.direct.memory.allocation
Enable direct memory allocation. If this is false, heap is used. The default is true.

solr.hdfs.blockcache.slab.count
Number of memory slabs to allocate. Each slab is 128 MB in size. The default is 1.

solr.hdfs.blockcache.global
Enable/Disable using one global cache for all SolrCores. The settings used will be from the first
HdfsDirectoryFactory created. The default is true.
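Since each slab is 128 MB, the block cache size follows directly from solr.hdfs.blockcache.slab.count, and it has to fit within the JVM's direct memory limit when direct allocation is enabled. The sketch below works through that arithmetic. It assumes "128 MB" means 128 × 1024 × 1024 bytes, and the function names are illustrative.

```python
SLAB_BYTES = 128 * 1024 * 1024  # one slab, assuming 128 MB means 2**27 bytes

def block_cache_bytes(slab_count):
    """Total block cache size for a given solr.hdfs.blockcache.slab.count."""
    return slab_count * SLAB_BYTES

def max_slabs(direct_memory_gb):
    """Largest slab count that fits in -XX:MaxDirectMemorySize=<n>g."""
    return (direct_memory_gb * 1024 ** 3) // SLAB_BYTES

print(block_cache_bytes(1))  # the default of one slab
print(max_slabs(20))         # upper bound for the 20g example above
```

The upper bound is theoretical: other components of the JVM also use direct memory, so a real configuration would leave headroom below it.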

NRTCachingDirectory Settings
solr.hdfs.nrtcachingdirectory.enable
Enable the use of NRTCachingDirectory. The default is true.

solr.hdfs.nrtcachingdirectory.maxmergesizemb
NRTCachingDirectory max segment size for merges. The default is 16.

solr.hdfs.nrtcachingdirectory.maxcachedmb
NRTCachingDirectory max cache size. The default is 192.

HDFS Client Configuration Settings
solr.hdfs.confdir
Pass the location of HDFS client configuration files - needed for HDFS HA for example.

Kerberos Authentication Settings
Hadoop can be configured to use the Kerberos protocol to verify user identity when trying to access core
services like HDFS. If your HDFS directories are protected using Kerberos, then you need to configure Solr’s
HdfsDirectoryFactory to authenticate using Kerberos in order to read and write to HDFS. To enable Kerberos
authentication from Solr, you need to set the following parameters:

solr.hdfs.security.kerberos.enabled
Set to true to enable Kerberos authentication. The default is false.

solr.hdfs.security.kerberos.keytabfile
A keytab file contains pairs of Kerberos principals and encrypted keys, which allow for password-less
authentication when Solr attempts to authenticate with secure Hadoop.
This file will need to be present on all Solr servers at the same path provided in this parameter.

solr.hdfs.security.kerberos.principal
The Kerberos principal that Solr should use to authenticate to secure Hadoop; the format of a typical
Kerberos V5 principal is: primary/instance@realm.

Example solrconfig.xml for HDFS
Here is a sample solrconfig.xml configuration for storing Solr indexes on HDFS:

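<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <str name="solr.hdfs.home">hdfs://host:port/solr</str>
  <bool name="solr.hdfs.blockcache.enabled">true</bool>
  <int name="solr.hdfs.blockcache.slab.count">1</int>
  <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
  <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
  <bool name="solr.hdfs.blockcache.read.enabled">true</bool>
  <bool name="solr.hdfs.nrtcachingdirectory.enable">true</bool>
  <int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">16</int>
  <int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">192</int>
</directoryFactory>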

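If using Kerberos, you will need to add the three Kerberos-related properties to the <directoryFactory>
element in solrconfig.xml, such as:

<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  ...
  <bool name="solr.hdfs.security.kerberos.enabled">true</bool>
  <str name="solr.hdfs.security.kerberos.keytabfile">/etc/krb5.keytab</str>
  <str name="solr.hdfs.security.kerberos.principal">solr/admin@KERBEROS.COM</str>
</directoryFactory>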


Automatically Add Replicas in SolrCloud
The ability to automatically add new replicas when the Overseer notices that a shard has gone down was
previously only available to users running Solr in HDFS, but it is now available to all users via Solr’s
autoscaling framework. See the section Auto Add Replicas Trigger for details on how to enable and disable
this feature.

The ability to enable or disable the autoAddReplicas feature with cluster properties has
been deprecated and will be removed in a future version. All users of this feature who have
previously used that approach are encouraged to change their configurations to use the
autoscaling framework to ensure continued operation of this feature in their Solr
installations.
For users using this feature with the deprecated configuration, you can temporarily disable
it cluster-wide by setting the cluster property autoAddReplicas to false, as in these
examples:
V1 API
http://localhost:8983/solr/admin/collections?action=CLUSTERPROP&name=autoAddReplicas&val=false



V2 API
curl -X POST -H 'Content-type: application/json' -d '{"set-property":
{"name":"autoAddReplicas", "val":false}}' http://localhost:8983/api/cluster
Re-enable the feature by unsetting the autoAddReplicas cluster property. When no val
parameter is provided, the cluster property is unset:
V1 API
http://localhost:8983/solr/admin/collections?action=CLUSTERPROP&name=autoAddReplicas
V2 API
curl -X POST -H 'Content-type: application/json' -d '{"set-property":
{"name":"autoAddReplicas"}}' http://localhost:8983/api/cluster
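The set and unset forms of the request differ only in whether a value is supplied. The following sketch generates both the V1 URL and the V2 request body from the same inputs. The function names are hypothetical and the default base URL is the localhost address used in the examples above.

```python
import json
from urllib.parse import urlencode

def clusterprop_v1_url(name, val=None, base="http://localhost:8983/solr"):
    """V1 CLUSTERPROP URL; omitting val unsets the property."""
    query = {"action": "CLUSTERPROP", "name": name}
    if val is not None:
        # Booleans are written in lowercase, as in the examples above.
        query["val"] = str(val).lower() if isinstance(val, bool) else val
    return "{}/admin/collections?{}".format(base.rstrip("/"), urlencode(query))

def clusterprop_v2_body(name, val=None):
    """JSON body to POST to /api/cluster; omitting val unsets the property."""
    prop = {"name": name}
    if val is not None:
        prop["val"] = val
    return json.dumps({"set-property": prop})

print(clusterprop_v1_url("autoAddReplicas", False))
print(clusterprop_v2_body("autoAddReplicas", False))
```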

SolrCloud on AWS EC2
This guide is a tutorial on how to set up a multi-node SolrCloud cluster on Amazon Web Services (AWS) EC2
instances for early development and design.
This tutorial is not meant for production systems. For one, it uses Solr’s embedded ZooKeeper instance, and
for production you should have at least 3 ZooKeeper nodes in an ensemble. There are additional steps you
should take for a production installation; refer to Taking Solr to Production for how to deploy Solr in
production.
In this guide we are going to:
1. Launch multiple AWS EC2 instances
◦ Create new Security Group
◦ Configure instances and launch
2. Install, configure and start Solr on newly launched EC2 instances
◦ Install system prerequisites: Java 1.8 or later
◦ Download latest version of Solr
◦ Start the Solr nodes in cloud mode
3. Create a collection, index documents and query the system
◦ Create collection with multiple shards and replicas
◦ Index documents to the newly created collection
◦ Verify document presence by querying the collection

Before You Start
To use this guide, you must have the following:
• An AWS account.
• Familiarity with setting up a single-node SolrCloud on a local machine. Refer to the Solr Tutorial if you have
never used Solr before.

Launch EC2 instances
Create new Security Group
1. Navigate to the AWS EC2 console and to the region of your choice.
2. Configure an AWS security group which will limit access to the installation and allow our launched EC2
instances to talk to each other without restrictions.
a. From the EC2 Dashboard, click [ Security Groups ] from the left-hand menu, under "Network &
Security".
b. Click [ Create Security Group ] under the Security Groups section. Give your security group a
descriptive name.

c. You can select one of the existing VPCs or create a new one.
d. We need two ports open for our cloud here:
i. Solr port. In this example we will use Solr’s default port 8983.
ii. ZooKeeper Port: We’ll use Solr’s embedded ZooKeeper, so we’ll use the default port 9983 (see
Deploying with External ZooKeeper to configure an external ZooKeeper).
e. Click [ Inbound ] to set inbound network rules, then select [ Add Rule ]. Select "Custom TCP" as the
type. Enter 8983 for the "Port Range" and choose "My IP" for the Source, then enter your public IP.
Create a second rule with the same type and source, but enter 9983 for the port.
This will limit access to your current machine. If you want wider access to the instance in order to
collaborate with others, you can specify that, but make sure you only allow as much access as
needed. A Solr instance should not be exposed to general Internet traffic.
f. Add another rule for SSH access. Choose "SSH" as the type, and again "My IP" for the source and
again enter your public IP. You need SSH access on all instances to install and configure Solr.
g. Review the details, your group configuration should look like this:

h. Click [ Create ] when finished.
i. We need to modify the rules so that instances that are part of the group can talk to all other instances
that are part of the same group. We could not do this while creating the group, so we need to edit the
group after creating it to add this.
i. Select the newly created group in the Security Group overview table. Under the "Inbound" tab,
click [ Edit ].
ii. Click [ Add rule ]. Choose All TCP from the pulldown list for the type, and enter 0-65535 for the
port range. For the source, specify the current Security Group (solr-sample in this example).
j. Review the details, your group configuration should now look like this:

k. Click [ Save ] when finished.
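For readers who prefer to script the group instead of clicking through the console, the inbound rules above can be written down as data. The sketch below only builds the rule list and sends nothing to AWS. Its shape follows the IpPermissions structure of the EC2 AuthorizeSecurityGroupIngress API. The function name, the example IP address and the group ID are placeholders.

```python
def solr_ingress_rules(my_ip, group_id, solr_port=8983, zk_port=9983):
    """Inbound rules from the steps above, as an IpPermissions-style list."""
    mine = [{"CidrIp": my_ip + "/32"}]  # restrict access to a single address

    def tcp(first_port, last_port, **source):
        rule = {"IpProtocol": "tcp", "FromPort": first_port, "ToPort": last_port}
        rule.update(source)
        return rule

    return [
        tcp(solr_port, solr_port, IpRanges=mine),   # Solr
        tcp(zk_port, zk_port, IpRanges=mine),       # embedded ZooKeeper
        tcp(22, 22, IpRanges=mine),                 # SSH
        # All TCP between members of the same security group.
        tcp(0, 65535, UserIdGroupPairs=[{"GroupId": group_id}]),
    ]

for rule in solr_ingress_rules("203.0.113.10", "sg-0123456789abcdef0"):
    print(rule)
```

A list like this could be supplied as the IpPermissions argument of an ingress-authorization call in an AWS SDK. That step is left out here because it requires credentials and an existing group.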

Configure Instances and Launch
Once the security group is in place, you can choose [ Instances ] from the left-hand navigation menu.
Under Instances, click [ Launch Instance ] button and follow the wizard steps:
1. Choose your Amazon Machine Image (AMI): Both commercial and community-based AMIs are available. For
our purposes, Amazon Linux AMI (HVM), SSD Volume Type is a good choice. Click [ Select ] next to the
image you choose.
2. The next screen asks you to choose the instance type; t2.medium is sufficient. Choose it from the list,
then click [ Configure Instance Details ].
3. Configure the instance. Enter 2 in the "Number of instances" field. Make sure the setting for
"Auto-assign Public IP" is "Enabled".
4. When finished, click [ Add Storage ]. The default of 8 GB for size and General Purpose SSD for the
volume type is sufficient for running this quick start. Optionally select "Delete on termination" if you
know you won’t need the data stored in Solr indexes after you terminate the instances.
5. When finished, click [ Add Tags ]. You do not have to add any tags for this quick start, but you can add
them if you want.
6. Click [ Configure Security Group ]. Choose Select an existing security group and select the security
group you created earlier: solr-sample. You should see the expected inbound rules at the bottom of the
page.
7. Click [ Review ].
8. If everything looks correct, click [ Launch ].
9. Select an existing “private key file” or create a new one and download it to your local machine so you will
be able to log in to the instances via SSH.

10. On the instances list, you can watch the states change. You cannot use the instances until they become
“running”.

Install, Configure and Start
1. Locate the Public DNS record for the instance by selecting the instance from the list of instances, and log
on to each machine one by one.
Using SSH, if your AWS identity key file is aws-key.pem and the AMI uses ec2-user as login user, do the
following for each AWS instance:
$ ssh-add aws-key.pem
$ ssh -A ec2-user@<instance-public-dns>
2. While logged in to each of the AWS EC2 instances, configure Java 1.8 and download Solr:
# verify default java version packaged with AWS instances is 1.7
$ java -version
$ sudo yum install java-1.8.0
$ sudo /usr/sbin/alternatives --config java
# select jdk-1.8
# verify default java version to java-1.8
$ java -version
# download desired version of Solr
$ wget http://archive.apache.org/dist/lucene/solr/7.3.0/solr-7.3.0.tgz
# untar
$ tar -zxvf solr-7.3.0.tgz
# set SOLR_HOME
$ export SOLR_HOME=$PWD/solr-7.3.0
# put the env variable in .bashrc
# vim ~/.bashrc
export SOLR_HOME=/home/ec2-user/solr-7.3.0
3. Resolve the Public DNS to simpler hostnames.
Let’s assume the AWS instances have the following public DNS names and IPv4 public IPs:
◦ ec2-54-1-2-3.us-east-2.compute.amazonaws.com: 54.1.2.3
◦ ec2-54-4-5-6.us-east-2.compute.amazonaws.com: 54.4.5.6
Edit /etc/hosts, and add entries for the above machines:
$ sudo vim /etc/hosts
54.1.2.3 solr-node-1
54.4.5.6 solr-node-2
4. Configure Solr in running EC2 instances.
In this case, one of the machines will host the embedded ZooKeeper along with a Solr node, say,
ec2-54-1-2-3.us-east-2.compute.amazonaws.com (aka solr-node-1).
See Deploying with External ZooKeeper to configure an external ZooKeeper.
On ec2-54-1-2-3.us-east-2.compute.amazonaws.com (solr-node-1):
$ cd $SOLR_HOME
# start Solr node on 8983; the embedded ZooKeeper will start on 8983 + 1000 = 9983
$ bin/solr start -c -p 8983 -h solr-node-1
On the other node, ec2-54-4-5-6.us-east-2.compute.amazonaws.com (solr-node-2):
$ cd $SOLR_HOME
# start Solr node on 8983 and connect to ZooKeeper running on first node
$ bin/solr start -c -p 8983 -h solr-node-2 -z solr-node-1:9983
5. Inspect and Verify. Inspect the state of the Solr nodes from a browser on your local machine.
Go to:
http://ec2-54-1-2-3.us-east-2.compute.amazonaws.com:8983/solr (solr-node-1:8983/solr)
http://ec2-54-4-5-6.us-east-2.compute.amazonaws.com:8983/solr (solr-node-2:8983/solr)
You should be able to see the Solr UI dashboard for both nodes.
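The start commands in step 4 follow a simple pattern: the first node runs the embedded ZooKeeper on the Solr port plus 1000, and every other node points at it with -z. The sketch below derives the commands for any list of hostnames under that assumption. The function names are illustrative.

```python
def embedded_zk_port(solr_port):
    """The embedded ZooKeeper listens on the Solr port plus 1000."""
    return solr_port + 1000

def start_commands(hosts, solr_port=8983):
    """bin/solr start commands, with the first host running embedded ZooKeeper."""
    zk_address = "{}:{}".format(hosts[0], embedded_zk_port(solr_port))
    commands = []
    for index, host in enumerate(hosts):
        command = ["bin/solr", "start", "-c", "-p", str(solr_port), "-h", host]
        if index > 0:
            command += ["-z", zk_address]  # join the first node's ZooKeeper
        commands.append(command)
    return commands

for command in start_commands(["solr-node-1", "solr-node-2"]):
    print(" ".join(command))
```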

Create Collection, Index and Query
Refer to the Solr Tutorial for an extensive walkthrough on creating collections with multiple shards and
replicas, indexing data via different methods and querying documents accordingly.

Deploying with External ZooKeeper
If you want to configure an external ZooKeeper ensemble to avoid using the embedded single-instance
ZooKeeper that runs in the same JVM as the Solr node, you need to make a few tweaks to the steps listed
above, as follows.
• When creating the security group, instead of opening port 9983 for ZooKeeper, you’ll open 2181 (or
whatever port you are using for ZooKeeper: its default is 2181).
• When configuring the number of instances to launch, choose to open 3 instances instead of 2.
• When modifying the /etc/hosts on each machine, add a third line for the 3rd instance and give it a
recognizable name:
$ sudo vim /etc/hosts
54.1.2.3 solr-node-1
54.4.5.6 solr-node-2
54.7.8.9 zookeeper-node
• You’ll need to install ZooKeeper manually, as described in the next section.

Install ZooKeeper
These steps will help you install and configure a single instance of ZooKeeper on AWS. This is not sufficient
for production use, however, where a ZooKeeper ensemble of at least three nodes is recommended. See
the section Setting Up an External ZooKeeper Ensemble for information about how to change this single
instance into an ensemble.
1. Download a stable version of ZooKeeper. In this example we’re using ZooKeeper v3.4.6. On the node
you’re using to host ZooKeeper (zookeeper-node), download the package and untar it:
# download stable version of ZooKeeper, here 3.4.6
$ wget https://archive.apache.org/dist/zookeeper/zookeeper-3.4.6/zookeeper-3.4.6.tar.gz
# untar
$ tar -zxvf zookeeper-3.4.6.tar.gz

Add an environment variable for ZooKeeper’s home directory (ZOO_HOME) to the .bashrc for the user
who will be running the process. The rest of the instructions assume you have set this variable. Correct
the path to the ZooKeeper installation as appropriate if your location does not match the one below.
$ export ZOO_HOME=$PWD/zookeeper-3.4.6
# put the env variable in .bashrc
# vim ~/.bashrc
export ZOO_HOME=/home/ec2-user/zookeeper-3.4.6
2. Change directories to ZOO_HOME, and create the ZooKeeper configuration by using the template provided
by ZooKeeper.
$ cd $ZOO_HOME
# create ZooKeeper config by using zoo_sample.cfg
$ cp conf/zoo_sample.cfg conf/zoo.cfg
3. Create the ZooKeeper data directory in the filesystem, and edit the zoo.cfg file to uncomment the
autopurge parameters and define the location of the data directory.
# create data dir for ZooKeeper, edit zoo.cfg, uncomment autopurge parameters
$ mkdir data
$ vim conf/zoo.cfg
# -- uncomment --
autopurge.snapRetainCount=3
autopurge.purgeInterval=1
# -- edit --
dataDir=data
4. Start ZooKeeper.
$ cd $ZOO_HOME
# start ZooKeeper, default port: 2181
$ bin/zkServer.sh start
5. On the first node being used for Solr (solr-node-1), start Solr and tell it where to find ZooKeeper.
$ cd $SOLR_HOME
# start Solr node on 8983 and connect to ZooKeeper running on ZooKeeper node
$ bin/solr start -c -p 8983 -h solr-node-1 -z zookeeper-node:2181
6. On the second Solr node (solr-node-2), again start Solr and tell it where to find ZooKeeper.
$ cd $SOLR_HOME
# start Solr node on 8983 and connect to ZooKeeper running on ZooKeeper node
$ bin/solr start -c -p 8983 -h solr-node-2 -z zookeeper-node:2181
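Step 3 above (uncommenting the autopurge settings and pointing dataDir at the new directory) can also be scripted instead of edited by hand. The sketch below applies both changes to the text of a configuration file. It assumes the template contains the autopurge settings as commented-out lines beginning with #autopurge. and a dataDir= line. The sample text is a shortened stand-in, not the full zoo_sample.cfg.

```python
def configure_zoo_cfg(sample_text, data_dir="data"):
    """Return zoo.cfg text with autopurge enabled and dataDir replaced."""
    result = []
    for line in sample_text.splitlines():
        uncommented = line.lstrip("#").strip()
        if line.startswith("#") and uncommented.startswith("autopurge."):
            result.append(uncommented)             # uncomment autopurge.* settings
        elif line.startswith("dataDir="):
            result.append("dataDir=" + data_dir)   # point at the new data directory
        else:
            result.append(line)                    # keep everything else as-is
    return "\n".join(result) + "\n"

# Shortened stand-in for the template file.
SAMPLE_CFG = (
    "tickTime=2000\n"
    "dataDir=/tmp/zookeeper\n"
    "clientPort=2181\n"
    "# purge settings\n"
    "#autopurge.snapRetainCount=3\n"
    "#autopurge.purgeInterval=1\n"
)
print(configure_zoo_cfg(SAMPLE_CFG))
```

Ordinary comment lines are left untouched because only commented lines whose content starts with autopurge. are rewritten.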

As noted earlier, a single ZooKeeper node is not sufficient for a production installation. See
these additional resources for more information about deploying Solr in production, which
can be used once you have the EC2 instances up and running:
• Taking Solr to Production
• Setting Up an External ZooKeeper Ensemble

Upgrading a Solr Cluster
This page covers how to upgrade an existing Solr cluster that was installed using the service installation
scripts.



The steps outlined on this page assume you use the default service name of solr. If you
use an alternate service name or Solr installation directory, some of the paths and
commands mentioned below will have to be modified accordingly.

Planning Your Upgrade
Here is a checklist of things you need to prepare before starting the upgrade process:
1. Examine the Solr Upgrade Notes to determine if any behavior changes in the new version of Solr will
affect your installation.
2. If not using replication (i.e., collections with a replicationFactor of 1), then you should make a
backup of each collection. If all of your collections use replication, then you don’t technically need to
make a backup since you will be upgrading and verifying each node individually.
3. Determine which Solr node is currently hosting the Overseer leader process in SolrCloud, as you should
upgrade this node last. To determine which node is the Overseer, use the Overseer Status API; see Collections API.
4. Plan to perform your upgrade during a system maintenance window if possible. You’ll be doing a rolling
restart of your cluster (each node, one-by-one), but we still recommend doing the upgrade when system
usage is minimal.
5. Verify the cluster is currently healthy and all replicas are active, as you should not perform an upgrade
on a degraded cluster.
6. Re-build and test all custom server-side components against the new Solr JAR files.
7. Determine the values of the following variables that are used by the Solr Control Scripts:
◦ ZK_HOST: The ZooKeeper connection string your current SolrCloud nodes use to connect to
ZooKeeper; this value will be the same for all nodes in the cluster.
◦ SOLR_HOST: The hostname each Solr node used to register with ZooKeeper when joining the
SolrCloud cluster; this value will be used to set the host Java system property when starting the new
Solr process.
◦ SOLR_PORT: The port each Solr node is listening on, such as 8983.
◦ SOLR_HOME: The absolute path to the Solr home directory for each Solr node; this directory must
contain a solr.xml file. This value will be passed to the new Solr process using the solr.solr.home
system property, see: Solr Cores and solr.xml.
If you are upgrading from an installation of Solr 5.x or later, these values can typically be found in
either /var/solr/solr.in.sh or /etc/default/solr.in.sh.
You should now be ready to upgrade your cluster. Please verify this process in a test or staging cluster
before doing it in production.
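Item 3 of the checklist, upgrading the Overseer node last, can be expressed as a small ordering step once the Overseer Status response has been decoded. The sketch below assumes the decoded JSON response exposes the Overseer's node name under a top-level "leader" key and that it uses the same node-name format as your node list. The node names shown are made up.

```python
def upgrade_order(nodes, overseer_status):
    """Order nodes for a rolling upgrade so that the Overseer leader comes last."""
    leader = overseer_status.get("leader")
    others = [node for node in nodes if node != leader]
    if leader in nodes:
        return others + [leader]
    return others  # leader unknown or not in the list: keep the given order

# Made-up node names and a trimmed, hand-written status response.
nodes = ["solr-node-1:8983_solr", "solr-node-2:8983_solr", "solr-node-3:8983_solr"]
status = {"leader": "solr-node-1:8983_solr"}
print(upgrade_order(nodes, status))
```

The Overseer role can move between nodes, so it may be worth checking the status again shortly before the final node is upgraded.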

Upgrade Process
The approach we recommend is to perform the upgrade of each Solr node, one-by-one. In other words, you
will need to stop a node, upgrade it to the new version of Solr, and restart it before moving on to the next
node. This means that for a short period of time, there will be a mix of "Old Solr" and "New Solr" nodes
running in your cluster. We also assume that you will point the new Solr node to your existing Solr home
directory where the Lucene index files are managed for each collection on the node. This means that you
won’t need to move any index files around to perform the upgrade.

Step 1: Stop Solr
Begin by stopping the Solr node you want to upgrade. After stopping the node, if using replication (i.e.,
collections with replicationFactor greater than 1), verify that all leaders hosted on the downed node have
successfully migrated to other replicas; you can do this by visiting the Cloud panel in the Solr Admin UI. If
not using replication, then any collections with shards hosted on the downed node will be temporarily offline.

Step 2: Install Solr as a Service
Please follow the instructions to install Solr as a Service on Linux documented at Taking Solr to Production.
Use the -n parameter to avoid automatic start of Solr by the installer script. You need to update the
/etc/default/solr.in.sh include file in the next step to complete the upgrade process.



If you have a /var/solr/solr.in.sh file for your existing Solr install, running the
install_solr_service.sh script will move this file to its new location:
/etc/default/solr.in.sh (see SOLR-8101 for more details)

Step 3: Set Environment Variable Overrides
Open /etc/default/solr.in.sh with a text editor and verify that the following variables are set correctly,
or add them to the bottom of the include file as needed:
ZK_HOST=
SOLR_HOST=
SOLR_PORT=
SOLR_HOME=
Make sure the user you plan to own the Solr process is the owner of the SOLR_HOME directory. For instance, if
you plan to run Solr as the "solr" user and SOLR_HOME is /var/solr/data, then you would do:
sudo chown -R solr: /var/solr/data
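As a quick sanity check before starting the node, the following sketch reports which of the four variables from this step lack an uncommented, non-empty assignment in the include file. The name check_solr_include is a hypothetical helper and not part of Solr; it only inspects the file with grep and assumes simple VAR=value lines.

```shell
# Sketch only: check_solr_include is a hypothetical helper, not shipped with Solr.
# It prints "ok" when all four variables have a non-empty, uncommented assignment,
# otherwise "missing: <names>" and returns a non-zero status.
check_solr_include() {
  include_file="$1"
  missing=""
  for var in ZK_HOST SOLR_HOST SOLR_PORT SOLR_HOME; do
    # Match e.g. SOLR_PORT=8983 or ZK_HOST="zk1:2181"; skip commented or empty values
    if ! grep -Eq "^[[:space:]]*${var}=[\"']?[^\"'[:space:]]" "$include_file"; then
      missing="$missing $var"
    fi
  done
  if [ -n "$missing" ]; then
    echo "missing:$missing"
    return 1
  fi
  echo "ok"
}
```

For example, check_solr_include /etc/default/solr.in.sh could be run after editing the file and before Step 4.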

Step 4: Start Solr
You are now ready to start the upgraded Solr node by doing: sudo service solr start. The upgraded
instance will join the existing cluster because you’re using the same SOLR_HOME, SOLR_PORT, and SOLR_HOST
settings used by the old Solr node; thus, the new server will look like the old node to the running cluster. Be
sure to look in /var/solr/logs/solr.log for errors during startup.

Step 5: Run Healthcheck
You should run the Solr healthcheck command for all collections that are hosted on the upgraded node
before proceeding to upgrade the next node in your cluster. For instance, if the newly upgraded node hosts
a replica for the MyDocuments collection, then you can run the following command (replace ZK_HOST with
the ZooKeeper connection string):
/opt/solr/bin/solr healthcheck -c MyDocuments -z ZK_HOST
Look for any problems reported about any of the replicas for the collection.
Lastly, repeat Steps 1-5 for all nodes in your cluster.
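The per-node sequence above can be written down as a dry-run checklist. The sketch below only prints the commands for Steps 1, 4 and 5 rather than running them; print_node_plan is a hypothetical helper name, "sudo service solr stop" is assumed to be the counterpart of the start command shown in Step 4, and Steps 2 and 3 appear as a reminder because they involve the installer and manual edits.

```shell
# Sketch only: print_node_plan is a hypothetical helper that prints, but does not
# execute, the commands for one node. First argument: ZooKeeper connection string.
# Remaining arguments: collections hosted on the node.
print_node_plan() {
  zk_host="$1"; shift
  echo "sudo service solr stop"
  echo "# Steps 2-3: install the new version with -n, then review /etc/default/solr.in.sh"
  echo "sudo service solr start"
  for collection in "$@"; do
    echo "/opt/solr/bin/solr healthcheck -c $collection -z $zk_host"
  done
}
```

Calling print_node_plan zk1:2181 MyDocuments prints four lines, ending with the healthcheck command for MyDocuments. Review the output before running anything by hand.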

IndexUpgrader Tool
The Lucene distribution includes a tool that upgrades an index from previous Lucene versions to the current
file format.
The tool can be used from the command line, or it can be instantiated and executed in Java.
Indexes can only be upgraded from the previous major release version to the current
major release version.



This means that the IndexUpgrader Tool in any Solr 7.x release, for example, can only work
with indexes from 6.x releases, but cannot work with indexes from Solr 5.x or earlier.
If you are currently using an earlier release such as 5.x and want to move more than one
major version ahead, you need to first upgrade your indexes to the next major version
(6.x), then again to the major version after that (7.x), etc.
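The one-major-version-at-a-time rule can be expressed as simple arithmetic. The sketch below lists the major versions an index would have to pass through; upgrade_path is a hypothetical helper and does not touch any index.

```shell
# Sketch only: upgrade_path is a hypothetical helper. Given a starting and a target
# major version, it prints every major version the index must be upgraded through.
upgrade_path() {
  from="$1"; to="$2"
  if [ "$from" -gt "$to" ]; then
    echo "cannot downgrade from $from to $to" >&2
    return 1
  fi
  chain="$from"
  v="$from"
  while [ "$v" -lt "$to" ]; do
    v=$((v + 1))
    chain="$chain -> $v"
  done
  echo "$chain"
}
```

Each arrow in the output corresponds to one run of the IndexUpgrader Tool from the release on the right-hand side of that arrow.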

In a Solr distribution, the Lucene files are located in ./server/solr-webapp/webapp/WEB-INF/lib. You will
need to include the lucene-core-<version>.jar and lucene-backward-codecs-<version>.jar files on the
classpath when running the tool.
java -cp lucene-core-7.3.0.jar:lucene-backward-codecs-7.3.0.jar \
  org.apache.lucene.index.IndexUpgrader [-delete-prior-commits] [-verbose] /path/to/index
This tool keeps only the last commit in an index. For this reason, if the incoming index has more than one
commit, the tool refuses to run by default. Specify -delete-prior-commits to override this, allowing the tool
to delete all but the last commit.
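To avoid typing version numbers by hand, the classpath can be assembled from whatever jar versions are present in the lib directory. This is a minimal sketch; build_upgrader_classpath is a hypothetical helper, and it assumes exactly one version of each jar is present.

```shell
# Sketch only: build_upgrader_classpath is a hypothetical helper. It finds the
# lucene-core and lucene-backward-codecs jars in the given directory and prints
# them joined with ':' for use with java -cp.
build_upgrader_classpath() {
  lib_dir="$1"
  core_jar=$(ls "$lib_dir"/lucene-core-*.jar 2>/dev/null | head -n 1)
  codecs_jar=$(ls "$lib_dir"/lucene-backward-codecs-*.jar 2>/dev/null | head -n 1)
  if [ -z "$core_jar" ] || [ -z "$codecs_jar" ]; then
    echo "required Lucene jars not found in $lib_dir" >&2
    return 1
  fi
  echo "$core_jar:$codecs_jar"
}
```

A possible invocation is java -cp "$(build_upgrader_classpath ./server/solr-webapp/webapp/WEB-INF/lib)" org.apache.lucene.index.IndexUpgrader -verbose /path/to/index. Stop the Solr node first and keep a backup of the index.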
Upgrading large indexes may take a long time. As a rule of thumb, the upgrade processes about 1 GB per
minute.



This tool may reorder documents if the index was partially upgraded before execution (e.g.,
documents were added). If your application relies on monotonicity of document IDs (i.e.,
the order in which the documents were added to the index is preserved), do a full optimize
instead.

Solr Upgrade Notes
The following notes describe changes to Solr in recent releases that you should be aware of before
upgrading.
These notes highlight the biggest changes that may impact the largest number of implementations. This is not
a comprehensive list of all changes to Solr in any release.
When planning your Solr upgrade, consider the customizations you have made to your system and review
the CHANGES.txt file found in your Solr package. That file includes all the changes and updates that may
affect your existing implementation.
Detailed steps for upgrading a Solr cluster are in the section Upgrading a Solr Cluster.

Upgrading to 7.x Releases
Solr 7.3
See the 7.3 Release Notes for an overview of the main new features in Solr 7.3.
When upgrading to Solr 7.3, users should be aware of the following major changes from v7.2:
• Collections created without specifying a configset name have used a copy of the _default configset
since Solr 7.0. Before 7.3, the copied configset was named the same as the collection name, but from 7.3
onwards it will be named with a new ".AUTOCREATED" suffix. This is to prevent overwriting custom
configset names.
• The rq parameter used with Learning to Rank rerank query parsing no longer considers the defType
parameter. See Running a Rerank Query for more information about this parameter.
• The default value of autoReplicaFailoverWaitAfterExpiration, used with the AutoAddReplicas
feature, has increased to 120 seconds from the previous default of 30 seconds. This affects how soon Solr
adds new replicas to replace the replicas on nodes which have either crashed or shut down.
• The default Solr log file size and number of backups have been raised to 32MB and 10 respectively. See
the section Configuring Logging for more information about how to configure logging.
• The old Leader-In-Recovery implementation (implemented in Solr 4.9) is now deprecated and replaced.
Solr will support rolling upgrades from old 7.x versions of Solr to future 7.x releases until the last release
of the 7.x major version.
This means to upgrade to Solr 8 in the future, you will need to be on Solr 7.3 or higher.
• Replicas which are not up-to-date are no longer allowed to become leader. Use the FORCELEADER
command of the Collections API to allow these replicas to become leader.
• The autoscaling system now pauses all triggers from execution between the start of
actions and the end of a cool down period. The triggers will resume after the cool down period expires.
Previously, the cool down period was a fixed period started after actions for a trigger event completed,
and during this time all triggers continued to run but any events were rejected and tried later.
• The throttling mechanism used to limit the rate of autoscaling events processed has been removed. This
deprecates the actionThrottlePeriodSeconds setting in the set-properties Autoscaling API which is
now non-operational. Use the triggerCooldownPeriodSeconds parameter instead to pause event
processing.
• If you are using the spatial JTS library with Solr, you must upgrade to 1.15.0. This new version of JTS is
now dual-licensed to include a BSD style license. See the section on Spatial Search for more information.
• The top-level  element in solrconfig.xml is now officially deprecated in favour of the
equivalent  syntax. This element has been out of use in default Solr installations for
several releases already.
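For the FORCELEADER item in the list above, the request is an ordinary Collections API call. The sketch below builds the URL; forceleader_url is a hypothetical helper, and the base URL, collection, and shard names used with it are placeholders.

```shell
# Sketch only: forceleader_url is a hypothetical helper that prints the Collections
# API URL for FORCELEADER. Arguments: Solr base URL, collection name, shard name.
forceleader_url() {
  base_url="$1"; collection="$2"; shard="$3"
  echo "${base_url}/admin/collections?action=FORCELEADER&collection=${collection}&shard=${shard}"
}
```

It could be used as curl "$(forceleader_url http://localhost:8983/solr MyDocuments shard1)". Forcing a replica that is not up-to-date to lead can lose updates, so treat it as a last resort.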

Solr 7.2
See the 7.2 Release Notes for an overview of the main new features in Solr 7.2.
When upgrading to Solr 7.2, users should be aware of the following major changes from v7.1:
• Starting a query string with local parameters {!myparser …} is used to switch from one query parser to
another, and is intended for use by Solr system developers, not end users doing searches. To reduce
negative side-effects of unintended hack-ability, Solr now limits the cases when local parameters will be
parsed to only contexts in which the default parser is "lucene" or "func".
So, if defType=edismax then q={!myparser …} won’t work. In that example, put the desired query parser
into the defType parameter.
Another example is if defType=edismax then hl.q={!myparser …} won’t work for the same reason. In
this example, either put the desired query parser into the hl.qparser parameter or set
hl.qparser=lucene. Most users won’t run into these cases but some will need to change.
If you must have full backwards compatibility, use luceneMatchVersion=7.1.0 or an earlier version.
• The eDisMax parser by default no longer allows subqueries that specify a Solr parser using either local
parameters, or the older _query_ magic field trick.
For example, {!prefix f=myfield v=enterp} or _query_:"{!prefix f=myfield v=enterp}" are not
supported by default any longer. If you want to allow power-users to do this, set uf=* _query_ or some
other value that includes _query_.
If you need full backwards compatibility for the time being, use luceneMatchVersion=7.1.0 or
something earlier.
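The local-parameters rule described above can be summarized as a small decision function. This is a simplified sketch for checking client requests, not Solr's own logic: local_params_parsed is a hypothetical helper, it looks only at the default parser and the start of the query string, and it ignores the luceneMatchVersion override.

```shell
# Sketch only: local_params_parsed is a hypothetical helper that mirrors the rule in
# the text. First argument: the effective default parser (empty means "lucene").
# Second argument: the query string.
local_params_parsed() {
  def_type="${1:-lucene}"
  query="$2"
  case "$query" in
    '{!'*) ;;                               # query starts with local parameters
    *) echo "no-local-params"; return 0 ;;
  esac
  case "$def_type" in
    lucene|func) echo "parsed" ;;           # local parameters are honoured
    *) echo "not-parsed" ;;                 # e.g. edismax: set defType to the parser instead
  esac
}
```

For instance, local_params_parsed edismax '{!myparser}foo' prints not-parsed, which signals that the request should name the parser in defType (or hl.qparser for hl.q) instead.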

Solr 7.1
See the 7.1 Release Notes for an overview of the main new features of Solr 7.1.
When upgrading to Solr 7.1, users should be aware of the following major changes from v7.0:
• The feature to automatically add replicas if a replica goes down, previously available only when storing
indexes in HDFS, has been ported to the autoscaling framework. Due to this, autoAddReplicas is now
available to all users even if their indexes are on local disks.
Existing users of this feature should not have to change anything. However, they should note these
changes:
◦ Behavior: Changing the autoAddReplicas property from disabled (false) to enabled (true) using
MODIFYCOLLECTION API no longer replaces down replicas for the collection immediately. Instead,
replicas are only added if a node containing them went down while autoAddReplicas was enabled.
The parameters autoReplicaFailoverBadNodeExpiration and autoReplicaFailoverWorkLoopDelay
are no longer used.
◦ Deprecations: Enabling/disabling autoAddReplicas cluster-wide with the API will be deprecated; use
suspend/resume trigger APIs with name=".auto_add_replicas" instead.
More information about the changes to this feature can be found in the section SolrCloud
Automatically Adding Replicas.
• Shard and cluster metric reporter configurations now require a class attribute.
◦ If a reporter configures the group="shard" attribute then please also configure the
class="org.apache.solr.metrics.reporters.solr.SolrShardReporter" attribute.
◦ If a reporter configures the group="cluster" attribute then please also configure the
class="org.apache.solr.metrics.reporters.solr.SolrClusterReporter" attribute.
See the section Shard and Cluster Reporters for more information.
• All Stream Evaluators in solrj.io.eval have been refactored to have a simpler and more robust
structure. This simplifies and condenses the code required to implement a new Evaluator and makes it
much easier for evaluators to handle differing data types (primitives, objects, arrays, lists, and so forth).
• In the ReplicationHandler, the master.commitReserveDuration sub-element is deprecated. Instead
please configure a direct commitReserveDuration element for use in all modes (master, slave, cloud).
• The RunExecutableListener was removed for security reasons. If you want to listen to events caused by
updates, commits, or optimize, write your own listener as a native Java class as part of a Solr plugin.
• In the XML query parser (defType=xmlparser or {!xmlparser … }) the resolving of external entities is
now disallowed by default.
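For the autoAddReplicas deprecation noted in the list above, the suspend and resume trigger commands are JSON bodies sent to the Autoscaling API. The sketch below builds that body; auto_add_replicas_command is a hypothetical helper, and the endpoint in the usage note is an assumption to verify against your installation.

```shell
# Sketch only: auto_add_replicas_command is a hypothetical helper that prints the
# JSON body for suspending or resuming the .auto_add_replicas trigger.
auto_add_replicas_command() {
  case "$1" in
    suspend|resume) ;;
    *) echo "usage: auto_add_replicas_command suspend|resume" >&2; return 1 ;;
  esac
  printf '{"%s-trigger": {"name": ".auto_add_replicas"}}\n' "$1"
}
```

A possible use is curl -X POST -H 'Content-type:application/json' -d "$(auto_add_replicas_command suspend)" http://localhost:8983/solr/admin/autoscaling, with the host and port as placeholders.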

Upgrading to 7.x from Any 6.x Release
The upgrade from Solr 6.x to Solr 7.0 introduces several major changes that you should be aware of before
upgrading. Please do a thorough review of the section Major Changes in Solr 7 before starting your
upgrade.

Upgrading to 7.x from pre-6.x Versions of Solr
Users upgrading from versions of Solr prior to 6.x are strongly encouraged to consult CHANGES.txt for the
details of all changes since the version they are upgrading from.
A summary of the significant changes between Solr 5.x and Solr 6.0 is in the section Major Changes from Solr
5 to Solr 6.

Major Changes in Solr 7
Solr 7 is a major new release of Solr which introduces new features and a number of other changes that may
impact your existing installation.

Upgrade Planning
There are major changes in Solr 7 to consider before starting to migrate your configurations and indexes.
This page is designed to highlight the biggest changes - new features you may want to be aware of, but also
changes in default behavior and deprecated features that have been removed.
There are many hundreds of changes in Solr 7, however, so a thorough review of the Solr Upgrade Notes as
well as the CHANGES.txt file in your Solr instance will help you plan your migration to Solr 7. This section
attempts to highlight some of the major changes you should be aware of.
You should also consider all changes that have been made to Solr in any version you have not upgraded to
already. For example, if you are currently using Solr 6.2, you should review changes made in all subsequent
6.x releases in addition to changes for 7.0.
Re-indexing your data is considered the best practice and you should try to do so if possible. However, if
re-indexing is not feasible, keep in mind you can only upgrade one major version at a time. Thus, Solr 6.x
indexes will be compatible with Solr 7 but Solr 5.x indexes will not be.
If you do not re-index now, keep in mind that you will need to either re-index your data or upgrade your
indexes before you will be able to move to Solr 8 when it is released in the future. See the section
IndexUpgrader Tool for more details on how to upgrade your indexes.
See also the section Upgrading a Solr Cluster for details on how to upgrade a SolrCloud cluster.

New Features & Enhancements
Replication Modes
Until Solr 7, the SolrCloud model for replicas has been to allow any replica to become a leader when a leader
is lost. This is highly effective for most users, providing reliable failover in case of issues in the cluster.
However, it comes at a cost in large clusters because all replicas must be in sync at all times.
To provide additional flexibility, two new types of replicas have been added, named TLOG & PULL. These new
types provide options to have replicas which only sync with the leader by copying index segments from the
leader. The TLOG type has an additional benefit of maintaining a transaction log (the "tlog" of its name),
which would allow it to recover and become a leader if necessary; the PULL type does not maintain a
transaction log, so cannot become a leader.
As part of this change, the traditional type of replica is now named NRT. If you do not explicitly define a
number of TLOG or PULL replicas, Solr defaults to creating NRT replicas. If this model is working for you, you
will not have to change anything.
See the section Types of Replicas for more details on the new replica modes, and how to define the replica type
in your cluster.
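As an illustration of mixing replica types, the sketch below builds a Collections API CREATE request. The parameter names nrtReplicas, tlogReplicas and pullReplicas are not given in this section and are stated here as my understanding of the CREATE command; create_collection_url is a hypothetical helper and all values are placeholders.

```shell
# Sketch only: create_collection_url is a hypothetical helper that prints a CREATE
# URL with an explicit count per replica type.
# Arguments: base URL, collection name, numShards, NRT count, TLOG count, PULL count.
create_collection_url() {
  base_url="$1"; name="$2"; shards="$3"; nrt="$4"; tlog="$5"; pull="$6"
  echo "${base_url}/admin/collections?action=CREATE&name=${name}&numShards=${shards}&nrtReplicas=${nrt}&tlogReplicas=${tlog}&pullReplicas=${pull}"
}
```

For example, create_collection_url http://localhost:8983/solr MyDocuments 1 0 2 1 describes one shard with two TLOG replicas, either of which can become leader, and one PULL replica that serves queries only.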
Autoscaling
Solr autoscaling is a new suite of features in Solr to make managing a SolrCloud cluster easier and more
automated.
At its core, Solr autoscaling provides users with a rule syntax to define preferences and policies for how to
distribute nodes and shards in a cluster, with the goal of maintaining a balance in the cluster. As of Solr 7,
Solr will take any policy or preference rules into account when determining where to place new shards and
replicas created or moved with various Collections API commands.
See the section SolrCloud Autoscaling for details on the options available in 7.0. Expect more features to be
released in subsequent 7.x releases in this area.
Other Features & Enhancements
• The Analytics Component has been refactored.
◦ The documentation for this component is in progress; until it is available, please refer to SOLR-11144
for more details.
• There were several other new features released in earlier 6.x releases, which you may have missed:
◦ Learning to Rank
◦ Unified Highlighter
◦ Metrics API. See also information about related deprecations in the section JMX Support and MBeans
below.
◦ Payload queries
◦ Streaming Evaluators
◦ /v2 API
◦ Graph streaming expressions

Configuration and Default Changes
New Default ConfigSet
Several changes have been made to configSets that ship with Solr; not only their content but how Solr
behaves in regard to them:
• The data_driven_configset and basic_configset have been removed, and replaced by the _default
configset. The sample_techproducts_configset also remains, and is designed for use with the example
documents shipped with Solr in the example/exampledocs directory.
• When creating a new collection, if you do not specify a configSet, the _default will be used.
◦ If you use SolrCloud, the _default configSet will be automatically uploaded to ZooKeeper.
◦ If you use standalone mode, the instanceDir will be created automatically, using the _default
configSet as its basis.
Schemaless Improvements
To improve the functionality of Schemaless Mode, Solr now behaves differently when it detects that data in
an incoming field should have a text-based field type.
• Incoming fields will be indexed as text_general by default (you can change this). The name of the field
will be the same as the field name defined in the document.
• A copy field rule will be inserted into your schema to copy the new text_general field to a new field
whose name is the original field name with an _str suffix. This field’s type will be a strings field (to
allow for multiple values). The first 256 characters of the text field will be inserted to the new strings field.

This behavior can be customized if you wish to remove the copy field rule, or to change the number of
characters inserted to the string field, or the field type used. See the section Schemaless Mode for details.



Because copy field rules can slow indexing and increase index size, it’s recommended you
only use copy fields when you need to. If you do not need to sort or facet on a field, you
should remove the automatically-generated copy field rule.

Automatic field creation can be disabled with the update.autoCreateFields property. To do this, you can
use the Config API with a command such as:
V1 API
curl http://host:8983/solr/mycollection/config -d '{"set-user-property":
{"update.autoCreateFields":"false"}}'

V2 API
curl http://host:8983/api/collections/mycollection/config -d '{"set-user-property":
{"update.autoCreateFields":"false"}}'

Changes to Default Behaviors
• JSON is now the default response format. If you rely on XML responses, you must now define wt=xml in
your request. In addition, line indentation is enabled by default (indent=on).
• The sow parameter (short for "Split on Whitespace") now defaults to false, which allows support for
multi-word synonyms out of the box. This parameter is used with the eDismax and standard/"lucene"
query parsers. If this parameter is not explicitly specified as true, query text will not be split on
whitespace before analysis.
• The legacyCloud parameter now defaults to false. If an entry for a replica does not exist in state.json,
that replica will not get registered.
This may affect users who bring up replicas and rely on them being automatically registered as part of a shard. It
is possible to fall back to the old behavior by setting the property legacyCloud=true, in the cluster
properties using the following command:

./server/scripts/cloud-scripts/zkcli.sh -zkhost 127.0.0.1:2181 -cmd clusterprop -name
legacyCloud -val true
• The eDismax query parser parameter lowercaseOperators now defaults to false if the
luceneMatchVersion in solrconfig.xml is 7.0.0 or above. Behavior for luceneMatchVersion lower than
7.0.0 is unchanged (so, true). This means that clients must send boolean operators (such as AND, OR and
NOT) in upper case in order to be recognized, or you must explicitly set this parameter to true.
• The handleSelect parameter in solrconfig.xml now defaults to false if the luceneMatchVersion is
7.0.0 or above. This causes Solr to ignore the qt parameter if it is present in a request. If you have
request handlers without a leading '/', you can set handleSelect="true" or consider migrating your
configuration.
The qt parameter is still used as a SolrJ special parameter that specifies the request handler (tail URL
path) to use.
• The lucenePlusSort query parser (aka the "Old Lucene Query Parser") has been deprecated and is no
longer implicitly defined. If you wish to continue using this parser until Solr 8 (when it will be removed),
you must register it in your solrconfig.xml, as in: .
• The name of TemplateUpdateRequestProcessorFactory is changed to template from Template and the
name of AtomicUpdateProcessorFactory is changed to atomic from Atomic.
◦ Also, TemplateUpdateRequestProcessorFactory now uses {} instead of ${} for template.
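Clients that depend on the pre-7 request defaults listed above can pass them explicitly on each request. The sketch below assembles such a parameter string from the defaults named in this list; pre7_request_params is a hypothetical helper, and indent=off is assumed to restore the earlier unindented output.

```shell
# Sketch only: pre7_request_params is a hypothetical helper that prints request
# parameters restoring the pre-7 defaults described above. Pass "edismax" to also
# include lowercaseOperators, which applies only to the eDismax parser.
pre7_request_params() {
  params="wt=xml&indent=off&sow=true"
  if [ "$1" = "edismax" ]; then
    params="${params}&lowercaseOperators=true"
  fi
  echo "$params"
}
```

The output can be appended to a query URL, or the same values can be set as defaults on a request handler in solrconfig.xml.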

Deprecations and Removed Features
Point Fields Are Default Numeric Types
Solr has implemented *PointField types across the board, to replace Trie* based numeric fields. All Trie*
fields are now considered deprecated, and will be removed in Solr 8.
If you are using Trie* fields in your schema, you should consider moving to PointFields as soon as feasible.
Changing to the new PointField types will require you to re-index your data.
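To see how much of a schema is affected, you can search it for Trie* field type declarations. The sketch below is a plain grep over a schema file; find_trie_field_types is a hypothetical helper, and it assumes the types are declared with the short solr.Trie*Field class names.

```shell
# Sketch only: find_trie_field_types is a hypothetical helper that prints, with line
# numbers, every declaration in the given schema file whose class is a Trie* field.
find_trie_field_types() {
  grep -En 'class="solr\.Trie(Int|Long|Float|Double|Date)?Field"' "$1"
}
```

For example, find_trie_field_types managed-schema lists the declarations to review. An empty result with a non-zero exit status means no matching declarations were found.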
Spatial Fields
The following spatial-related fields have been deprecated:
• LatLonType
• GeoHashField
• SpatialVectorFieldType
• SpatialTermQueryPrefixTreeFieldType
Choose one of these field types instead:
• LatLonPointSpatialField
• SpatialRecursivePrefixTreeField
• RptWithGeometrySpatialField
See the section Spatial Search for more information.
JMX Support and MBeans
• The  element in solrconfig.xml has been removed in favor of  elements
defined in solr.xml.
Limited back-compatibility is offered by automatically adding a default instance of SolrJmxReporter if
it’s missing AND when a local MBean server is found. A local MBean server can be activated either via
ENABLE_REMOTE_JMX_OPTS in solr.in.sh or via system properties, e.g.,
-Dcom.sun.management.jmxremote. This default instance exports all Solr metrics from all registries as
hierarchical MBeans.
This behavior can also be disabled by specifying a SolrJmxReporter configuration with a boolean init
argument enabled set to false. For more fine-grained control, users should explicitly specify at least
one SolrJmxReporter configuration.
See also the section The  Element, which describes how to set up Metrics Reporters
in solr.xml. Note that back-compatibility support may be removed in Solr 8.
• MBean names and attributes now follow the hierarchical names used in metrics. This is reflected also in
/admin/mbeans and /admin/plugins output, and can be observed in the UI Plugins tab, because now all
these APIs get their data from the metrics API. The old (mostly flat) JMX view has been removed.
SolrJ
The following changes were made in SolrJ.
• HttpClientInterceptorPlugin is now HttpClientBuilderPlugin and must work with a
SolrHttpClientBuilder rather than an HttpClientConfigurer.
• HttpClientUtil now allows configuring HttpClient instances via SolrHttpClientBuilder rather than
an HttpClientConfigurer. The environment variable SOLR_AUTHENTICATION_CLIENT_CONFIGURER no longer
works; use SOLR_AUTHENTICATION_CLIENT_BUILDER instead.
• SolrClient implementations now use their own internal configuration for socket timeouts, connect
timeouts, and allowing redirects rather than what is set as the default when building the HttpClient
instance. Use the appropriate setters on the SolrClient instance.
• HttpSolrClient#setAllowCompression has been removed and compression must be enabled as a
constructor param.
• HttpSolrClient#setDefaultMaxConnectionsPerHost and HttpSolrClient#setMaxTotalConnections
have been removed. These now default very high and can only be changed via parameter when creating
an HttpClient instance.
Other Deprecations and Removals
• The defaultOperator parameter in the schema is no longer supported. Use the q.op parameter instead.
This option had been deprecated for several releases. See the section Standard Query Parser Parameters
for more information.
• The defaultSearchField parameter in the schema is no longer supported. Use the df parameter
instead. This option had been deprecated for several releases. See the section Standard Query Parser
Parameters for more information.
• The mergePolicy, mergeFactor and maxMergeDocs parameters have been removed and are no longer
supported. You should define a mergePolicyFactory instead. See the section on mergePolicyFactory for
more information.
• The PostingsSolrHighlighter has been deprecated. It’s recommended that you move to using the
UnifiedHighlighter instead. See the section Unified Highlighter for more information about this
highlighter.
• Index-time boosts have been removed from Lucene, and are no longer available from Solr. If any boosts
are provided, they will be ignored by the indexing chain. As a replacement, index-time scoring factors
should be indexed in a separate field and combined with the query score using a function query. See the
section Function Queries for more information.
• The StandardRequestHandler is deprecated. Use SearchHandler instead.
• To improve parameter consistency in the Collections API, the parameter names fromNode for the
MOVEREPLICA command and source, target for the REPLACENODE command have been deprecated
and replaced with sourceNode and targetNode instead. The old names will continue to work for
back-compatibility but they will be removed in Solr 8.
• The unused valType option has been removed from ExternalFileField; if you have this in your schema
you can safely remove it.

Major Changes in Earlier 6.x Versions
The following summary of changes in earlier 6.x releases highlights significant changes released between
Solr 6.0 and 6.6 that were listed in earlier versions of this Guide. Mentions of deprecations are likely
superseded by removal in Solr 7, as noted in the above sections.
Note again that this is not a complete list of all changes that may impact your installation, so a thorough
review of CHANGES.txt is highly recommended if upgrading from any version earlier than 6.6.
• The Solr contribs map-reduce, morphlines-core and morphlines-cell have been removed.
• JSON Facet API now uses hyper-log-log for numBuckets cardinality calculation and calculates cardinality
before filtering buckets by any mincount greater than 1.
• If you use historical dates, specifically on or before the year 1582, you should re-index for better date
handling.
• If you use the JSON Facet API (json.facet) with method=stream, you must now set sort='index asc' to
get the streaming behavior; otherwise it won’t stream. Reminder: method is a hint that doesn’t change
defaults of other parameters.
• If you use the JSON Facet API (json.facet) to facet on a numeric field and if you use mincount=0 or if you
set the prefix, you will now get an error as these options are incompatible with numeric faceting.
• Solr’s logging verbosity at the INFO level has been greatly reduced, and you may need to update the log
configs to use the DEBUG level to see all the logging messages you used to see at INFO level before.
• We are no longer backing up solr.log and solr_gc.log files in date-stamped copies forever. If you
relied on the solr_log_ or solr_gc_log_ copies being in the logs folder, that will no longer be the
case. See the section Configuring Logging for details on how log rotation works as of Solr 6.3.
• The create/deleteCollection methods on MiniSolrCloudCluster have been deprecated. Clients should
instead use the CollectionAdminRequest API. In addition,
MiniSolrCloudCluster#uploadConfigDir(File, String) has been deprecated in favour of
#uploadConfigSet(Path, String).
• The bin/solr.in.sh (bin/solr.in.cmd on Windows) is now completely commented by default.
Previously, this wasn’t so, which had the effect of masking existing environment variables.
• The _version_ field is no longer indexed and is now defined with indexed=false by default, because the
field has DocValues enabled.
• The /export handler has been changed so it no longer returns zero (0) for numeric fields that are not in
the original document. One consequence of this change is that you must be aware that some tuples will
not have values if there were none in the original document.
• Metrics-related classes in org.apache.solr.util.stats have been removed in favor of the Dropwizard
metrics library. Any custom plugins using these classes should be changed to use the equivalent classes
from the metrics library. As part of this, the following changes were made to the output of Overseer
Status API:
◦ The "totalTime" metric has been removed because it is no longer supported.
◦ The metrics "75thPctlRequestTime", "95thPctlRequestTime", "99thPctlRequestTime" and
"999thPctlRequestTime" in Overseer Status API have been renamed to "75thPcRequestTime",
"95thPcRequestTime" and so on for consistency with stats output in other parts of Solr.
◦ The metrics "avgRequestsPerMinute", "5minRateRequestsPerMinute" and
"15minRateRequestsPerMinute" have been replaced by corresponding per-second rates viz.
"avgRequestsPerSecond", "5minRateRequestsPerSecond" and "15minRateRequestsPerSecond" for
consistency with stats output in other parts of Solr.
• A new highlighter named UnifiedHighlighter has been added. You are encouraged to try out the
UnifiedHighlighter by setting hl.method=unified and report feedback. It’s more efficient/faster than the
other highlighters, especially compared to the original Highlighter. See HighlightParams.java for a
listing of highlight parameters annotated with which highlighters use them.
hl.useFastVectorHighlighter is now considered deprecated in lieu of hl.method=fastVector.
• The maxWarmingSearchers parameter now defaults to 1, and more importantly commits will now block if
this limit is exceeded instead of throwing an exception (a good thing). Consequently there is no longer a
risk in overlapping commits. Nonetheless users should continue to avoid excessive committing. Users
are advised to remove any pre-existing maxWarmingSearchers entries from their solrconfig.xml files.
• The Complex Phrase query parser now supports leading wildcards. Because such queries can be expensive,
users are encouraged to use ReversedWildcardFilter in index time analysis.
• The JMX metric "avgTimePerRequest" (and the corresponding metric in the metrics API for each handler)
used to be a simple non-decaying average based on total cumulative time and the number of requests.
The Codahale Metrics implementation applies exponential decay to this value, which heavily biases the
average towards the last 5 minutes.
• Parallel SQL now uses Apache Calcite as its SQL framework. As part of this change the default
aggregation mode has been changed to facet rather than map_reduce. There have also been changes to
the SQL aggregate response and some SQL syntax changes. Consult the Parallel SQL Interface
documentation for full details.
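For the json.facet streaming item in the list above, the sort must be given explicitly alongside method=stream. The sketch below prints such a facet definition; stream_facet_json is a hypothetical helper, and the facet and field names passed to it are placeholders.

```shell
# Sketch only: stream_facet_json is a hypothetical helper that prints a terms facet
# requesting method=stream together with the sort it requires.
# Arguments: facet label, field name.
stream_facet_json() {
  printf '{"%s": {"type": "terms", "field": "%s", "method": "stream", "sort": "index asc"}}\n' "$1" "$2"
}
```

It could be sent with curl http://localhost:8983/solr/mycollection/query -d 'q=*:*' -d 'rows=0' --data-urlencode "json.facet=$(stream_facet_json categories cat)", where the host and collection are placeholders.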

Major Changes from Solr 5 to Solr 6
There are some major changes in Solr 6 to consider before starting to migrate your configurations and
indexes.
There are many hundreds of changes, so a thorough review of the Solr Upgrade Notes section as well as the
CHANGES.txt file in your Solr instance will help you plan your migration to Solr 6. This section attempts to
highlight some of the major changes you should be aware of.

Highlights of New Features in Solr 6
Some of the major improvements in Solr 6 include:

Guide Version 7.3 - Published: 2018-03-27

© 2018, Apache Software Foundation

Apache Solr Reference Guide 7.3

Page 107 of 1195

Streaming Expressions
Introduced in Solr 5, Streaming Expressions allow querying Solr and getting results as a stream of data,
sorted and aggregated as requested.
Several new expression types have been added in Solr 6:
• Parallel expressions using a MapReduce-like shuffling for faster throughput of high-cardinality fields.
• Daemon expressions to support continuous push or pull streaming.
• Advanced parallel relational algebra like distributed joins, intersections, unions and complements.
• Publish/Subscribe messaging.
• JDBC connections to pull data from other systems and join with documents in the Solr index.
Parallel SQL Interface
Built on streaming expressions, the Parallel SQL interface, new in Solr 6, lets you send SQL queries to
Solr. SQL statements are compiled to streaming expressions on the fly, providing the full range of
aggregations available to streaming expression requests. A JDBC driver is included, which allows using SQL
clients and database visualization tools to query your Solr index and import data to other systems.
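As a rough illustration of sending SQL over HTTP, the sketch below builds the URL and form body for a collection's /sql handler. The host, the "films" collection, and the helper name `build_sql_request` are assumptions for the example; nothing is sent over the network.

```python
from urllib.parse import urlencode


def build_sql_request(base_url, collection, statement, aggregation_mode="facet"):
    """Return (url, form_body) for a POST to the collection's /sql handler."""
    if aggregation_mode not in ("facet", "map_reduce"):
        raise ValueError("aggregation_mode must be 'facet' or 'map_reduce'")
    url = f"{base_url}/{collection}/sql?aggregationMode={aggregation_mode}"
    # The SQL statement travels in the 'stmt' parameter, URL-encoded.
    body = urlencode({"stmt": statement})
    return url, body


# Hypothetical host and collection.
sql_url, sql_body = build_sql_request(
    "http://localhost:8983/solr",
    "films",
    "SELECT genre, count(*) FROM films GROUP BY genre",
)
print(sql_url)
print(sql_body)
```

The resulting pair can be handed to any HTTP client; the body should be sent as application/x-www-form-urlencoded.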
Cross Data Center Replication
Replication across data centers is now possible with Cross Data Center Replication. Using an active-passive
model, a SolrCloud cluster can be replicated to another data center, and monitored with a new API.
Graph QueryParser
A new graph query parser makes it possible to do graph traversal queries of Directed (Cyclic) Graphs
modelled using Solr documents.
DocValues
Most non-text field types in the Solr sample configsets now default to using DocValues.

Java 8 Required
The minimum supported version of Java for Solr 6 (and the SolrJ client libraries) is now Java 8.

Index Format Changes
Solr 6 has no support for reading Lucene/Solr 4.x and earlier indexes. Be sure to run the Lucene
IndexUpgrader included with Solr 5.5 if you might still have old 4.x-formatted segments in your index.
Alternatively, fully optimize your index with Solr 5.5 to make sure it consists of only one up-to-date index
segment.

Managed Schema is now the Default
Solr’s default behavior when a solrconfig.xml does not explicitly define a <schemaFactory/> is now
dependent on the luceneMatchVersion specified in that solrconfig.xml. When luceneMatchVersion <
6.0, ClassicIndexSchemaFactory will continue to be used for back compatibility, otherwise an instance of
ManagedIndexSchemaFactory will be used.
The most notable impacts of this change are:
• Existing solrconfig.xml files that are modified to use luceneMatchVersion >= 6.0, but do not have an
explicitly configured ClassicIndexSchemaFactory, will have their schema.xml file automatically
upgraded to a managed-schema file.
• Schema modifications via the Schema API will now be enabled by default.
Please review the Schema Factory Definition in SolrConfig section for more details.
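The defaulting rule described above can be restated as a small function. This is only a paraphrase of the rule in code, not Solr's implementation. The helper names are hypothetical, and the sketch assumes luceneMatchVersion is written as a dotted numeric string such as "6.0.0".

```python
def parse_version(text):
    """Turn a version string such as '6.0.0' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def default_schema_factory(lucene_match_version, explicit_factory=None):
    """Return the schema factory name that would apply under the rule."""
    if explicit_factory is not None:
        # An explicitly configured factory always wins.
        return explicit_factory
    if parse_version(lucene_match_version) < (6, 0):
        return "ClassicIndexSchemaFactory"
    return "ManagedIndexSchemaFactory"


print(default_schema_factory("5.5.0"))
print(default_schema_factory("6.0.0"))
print(default_schema_factory("7.3.0", explicit_factory="ClassicIndexSchemaFactory"))
```

The third call shows the way to keep a hand-edited schema.xml after raising luceneMatchVersion: declare ClassicIndexSchemaFactory explicitly.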

Default Similarity Changes
Solr’s default behavior when a Schema does not explicitly define a global <similarity/> is now dependent
on the luceneMatchVersion specified in the solrconfig.xml. When luceneMatchVersion < 6.0, an
instance of ClassicSimilarityFactory will be used, otherwise an instance of SchemaSimilarityFactory
will be used. Most notably this change means that users can take advantage of per Field Type similarity
declarations, without needing to also explicitly declare a global usage of SchemaSimilarityFactory.
Regardless of whether it is explicitly declared, or used as an implicit global default,
SchemaSimilarityFactory's implicit behavior when a Field Type does not declare an explicit <similarity/>
has also been changed to depend on the luceneMatchVersion. When luceneMatchVersion < 6.0, an
instance of ClassicSimilarity will be used, otherwise an instance of BM25Similarity will be used. A
defaultSimFromFieldType init option may be specified on the SchemaSimilarityFactory declaration to
change this behavior. Please review the SchemaSimilarityFactory javadocs for more details.
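The two levels of defaulting described above (the global factory, then the per Field Type similarity) can be summarized as follows. The function names are hypothetical, the sketch assumes a dotted numeric luceneMatchVersion, and it deliberately ignores the defaultSimFromFieldType option.

```python
def is_pre_6(lucene_match_version):
    """True when the major version of luceneMatchVersion is below 6."""
    return int(lucene_match_version.split(".")[0]) < 6


def global_similarity_factory(lucene_match_version, declared=None):
    """Factory used when the schema has no global similarity declaration."""
    if declared is not None:
        return declared
    if is_pre_6(lucene_match_version):
        return "ClassicSimilarityFactory"
    return "SchemaSimilarityFactory"


def field_type_similarity(lucene_match_version, declared=None):
    """Similarity SchemaSimilarityFactory falls back to for a Field Type."""
    if declared is not None:
        return declared
    if is_pre_6(lucene_match_version):
        return "ClassicSimilarity"
    return "BM25Similarity"


print(global_similarity_factory("7.3.0"), field_type_similarity("7.3.0"))
print(global_similarity_factory("5.5.0"), field_type_similarity("5.5.0"))
```

In practice this means relevance scores can change after raising luceneMatchVersion, because BM25 replaces the classic TF-IDF similarity wherever nothing is declared.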

Replica & Shard Delete Command Changes
DELETESHARD and DELETEREPLICA now default to deleting the instance directory, data directory, and index
directory for any replica they delete. Please review the Collection API documentation for details on new
request parameters to prevent this behavior if you wish to keep all data on disk when using these
commands.
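As an illustration, a DELETEREPLICA request that keeps data on disk might be assembled as below. The parameter names deleteInstanceDir, deleteDataDir and deleteIndex are given here from memory of the Collections API and should be checked against that documentation; the host, collection, shard and replica names are hypothetical.

```python
from urllib.parse import urlencode


def delete_replica_url(base_url, collection, shard, replica, keep_data=True):
    """Build a Collections API DELETEREPLICA URL (nothing is sent)."""
    params = {
        "action": "DELETEREPLICA",
        "collection": collection,
        "shard": shard,
        "replica": replica,
    }
    if keep_data:
        # Opt out of the new default, which removes these directories.
        params.update(
            {
                "deleteInstanceDir": "false",
                "deleteDataDir": "false",
                "deleteIndex": "false",
            }
        )
    return f"{base_url}/admin/collections?{urlencode(params)}"


print(delete_replica_url("http://localhost:8983/solr", "films", "shard1", "core_node2"))
```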

facet.date.* Parameters Removed
The facet.date parameter (and associated facet.date.* parameters) that were deprecated in Solr 3.x have
been removed completely. If you have not yet switched to using the equivalent facet.range functionality
you must do so now before upgrading.
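For applications that still build facet.date requests, the renaming can be sketched as a simple key rewrite. This assumes each facet.date.* parameter in use has a same-named facet.range.* counterpart; review the Faceting documentation for any parameter whose semantics you rely on. The field name "created" is hypothetical.

```python
def migrate_facet_date_params(params):
    """Rename facet.date* keys to facet.range* and leave other keys alone."""
    migrated = {}
    for name, value in params.items():
        # Also covers per-field overrides such as f.<field>.facet.date.gap.
        migrated[name.replace("facet.date", "facet.range")] = value
    return migrated


old_params = {
    "facet": "true",
    "facet.date": "created",
    "facet.date.start": "NOW/DAY-7DAYS",
    "facet.date.end": "NOW/DAY",
    "facet.date.gap": "+1DAY",
}
print(migrate_facet_date_params(old_params))
```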

Using the Solr Administration User Interface
This section discusses the Solr Administration User Interface ("Admin UI").
The Overview of the Solr Admin UI explains the basic features of the user interface, what’s on the initial
Admin UI page, and how to configure the interface. In addition, there are pages describing each screen of
the Admin UI:
• Getting Assistance shows you how to get more information about the UI.
• Logging shows recent messages logged by this Solr node and provides a way to change logging levels
for specific classes.
• Cloud Screens display information about nodes when running in SolrCloud mode.
• Collections / Core Admin explains how to get management information about each core.
• Java Properties shows the Java information about each core.
• Thread Dump lets you see detailed information about each thread, along with state information.
• Suggestions Screen displays the state of the system with regard to the autoscaling policies that are in
place.
• Collection-Specific Tools is a section explaining additional screens available for each collection.
◦ Analysis - lets you analyze the data found in specific fields.
◦ Dataimport - shows you information about the current status of the Data Import Handler.
◦ Documents - provides a simple form allowing you to execute various Solr indexing commands
directly from the browser.
◦ Files - shows the current core configuration files such as solrconfig.xml.
◦ Query - lets you submit a structured query about various elements of a core.
◦ Stream - allows you to submit streaming expressions and see results and parsing explanations.
◦ Schema Browser - displays schema data in a browser window.
• Core-Specific Tools is a section explaining additional screens available for each named core.
◦ Ping - lets you ping a named core and determine whether the core is active.
◦ Plugins/Stats - shows statistics for plugins and other installed components.
◦ Replication - shows you the current replication status for the core, and lets you enable/disable
replication.
◦ Segments Info - Provides a visualization of the underlying Lucene index segments.

Overview of the Solr Admin UI
Solr features a Web interface that makes it easy for Solr administrators and programmers to view Solr
configuration details, run queries and analyze document fields in order to fine-tune a Solr configuration and
access online documentation and other help.

Solr Dashboard
Accessing the URL http://hostname:8983/solr/ will show the main dashboard, which is divided into two
parts.
The left side of the screen is a menu under the Solr logo that provides the navigation through the screens of
the UI. The first set of links are for system-level information and configuration and provide access to
Logging, Collection/Core Administration, and Java Properties, among other things. At the end of this
information is at least one pulldown listing Solr cores configured for this instance. On SolrCloud nodes, an
additional pulldown list shows all collections in this cluster. Clicking on a collection or core name shows
secondary menus of information for the specified collection or core, such as a Schema Browser, Config Files,
Plugins & Statistics, and an ability to perform Queries on indexed data.
The center of the screen shows the detail of the option selected. This may include a sub-navigation for the
option or text or graphical representation of the requested data. See the sections in this guide for each
screen for more details.
Under the covers, the Solr Admin UI re-uses the same HTTP APIs available to all clients to access Solr-related
data to drive an external interface.

The path to the Solr Admin UI given above is http://hostname:port/solr, which redirects
to http://hostname:port/solr/#/ in the current version. A convenience redirect is also
supported, so simply accessing the Admin UI at http://hostname:port/ will also redirect
to http://hostname:port/solr/#/.

Getting Assistance
At the bottom of each screen of the Admin UI is a set of links that can be used to get more assistance with
configuring and using Solr.

Assistance icons
These icons include the following links.
Link | Description
--- | ---
Documentation | Navigates to the Apache Solr documentation hosted on https://lucene.apache.org/solr/.
Issue Tracker | Navigates to the JIRA issue tracking server for the Apache Solr project. This server resides at https://issues.apache.org/jira/browse/SOLR.
IRC Channel | Navigates to Solr’s IRC live-chat room: http://webchat.freenode.net/?channels=#solr.
Community forum | Navigates to the Apache Wiki page which has further information about ways to engage in the Solr User community mailing lists: https://wiki.apache.org/solr/UsingMailingLists.
Solr Query Syntax | Navigates to the section Query Syntax and Parsing in this Reference Guide.

These links cannot be modified without editing the index.html in the server/solr/solr-webapp directory
that contains the Admin UI files.

Logging
The Logging page shows recent messages logged by this Solr node.
When you click the link for "Logging", a page similar to the one below will be displayed:

The Main Logging Screen, including an example of an error due to a bad document sent by a client
While this example shows logged messages for only one core, if you have multiple cores in a single instance,
they will each be listed, with the level for each.

Selecting a Logging Level
When you select the Level link on the left, you see the hierarchy of classpaths and classnames for your
instance. A row highlighted in yellow indicates that the class has logging capabilities. Click on a highlighted
row, and a menu will appear to allow you to change the log level for that class. Characters in boldface
indicate that the class will not be affected by level changes to root.

Log level selection
For an explanation of the various logging levels, see Configuring Logging.
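Log levels can also be changed without the UI by calling the logging endpoint with a set parameter of the form logger:LEVEL. The sketch below only builds such a URL; the host and the logger name are examples, and the helper name is hypothetical.

```python
from urllib.parse import urlencode


def set_log_level_url(base_url, logger_name, level):
    """Build a URL that asks this node to change one logger's level."""
    query = urlencode({"wt": "json", "set": f"{logger_name}:{level}"})
    return f"{base_url}/admin/info/logging?{query}"


print(set_log_level_url("http://localhost:8983/solr", "org.apache.solr.core", "WARN"))
```

As with changes made on the Level screen, treat a level set this way as a runtime adjustment on a single node rather than a permanent configuration change.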

Cloud Screens
When running in SolrCloud mode, a "Cloud" option will appear in the Admin UI between Logging and
Collections/Core Admin.
This screen provides status information about each collection & node in your cluster, as well as access to the
low level data being stored in ZooKeeper.



Only Visible When using SolrCloud
The "Cloud" menu option is only available on Solr instances running in SolrCloud mode.
Single node or master/slave replication instances of Solr will not display this option.

Click on the Cloud option in the left-hand navigation, and a small sub-menu appears with options called
"Tree", "Graph", "Graph (Radial)" and "Dump". The default view ("Graph") shows a graph of each collection,
the shards that make up those collections, and the addresses of each replica for each shard.
This example shows the very simple two-node cluster created using the bin/solr -e cloud -noprompt
example command. In addition to the 2 shard, 2 replica "gettingstarted" collection, there is an additional
"films" collection consisting of a single shard/replica:

The "Graph (Radial)" option provides a different visual view of each node. Using the same example cluster,
the radial graph view looks like:

The "Tree" option shows a directory structure of the data in ZooKeeper, including cluster wide information
regarding the live_nodes and overseer status, as well as collection specific information such as the
state.json, current shard leaders, and configuration files in use. In this example, we see the state.json file
definition for the "films" collection:

The final option is "Dump", which returns a JSON document containing all nodes, their contents and their
children (recursively). This can be used to export a snapshot of all the data that Solr has kept inside
ZooKeeper and can aid in debugging SolrCloud problems.

Collections / Core Admin
The Collections screen provides some basic functionality for managing your Collections, powered by the
Collections API.



If you are running a single node Solr instance, you will not see a Collections option in the
left nav menu of the Admin UI.
You will instead see a "Core Admin" screen that supports some comparable Core level
information & manipulation via the CoreAdmin API instead.

The main display of this page provides a list of collections that exist in your cluster. Clicking on a collection
name provides some basic metadata about how the collection is defined, and its current shards & replicas,
with options for adding and deleting individual replicas.
The buttons at the top of the screen let you make various collection-level changes to your cluster, from adding
new collections or aliases to reloading or deleting a single collection.

Replicas can be deleted by clicking the red "X" next to the replica name.
If the shard is inactive, for example after a SPLITSHARD action, an option to delete the shard will appear as a
red "X" next to the shard name.

Java Properties
The Java Properties screen provides easy access to one of the most essential components of a top-performing Solr system. With the Java Properties screen, you can see all the properties of the JVM running
Solr, including the class paths, file encodings, JVM memory settings, operating system, and more.

Java Properties Screen

Thread Dump
The Thread Dump screen lets you inspect the currently active threads on your server.
Each thread is listed and access to the stacktraces is available where applicable. Icons to the left indicate the
state of the thread: for example, threads with a green check-mark in a green circle are in a "RUNNABLE"
state. On the right of the thread name, a down-arrow means you can expand to see the stacktrace for that
thread.

List of Threads
When you move your cursor over a thread name, a box floats over the name with the state for that thread.
Thread states can be:
State | Meaning
--- | ---
NEW | A thread that has not yet started.
RUNNABLE | A thread executing in the Java virtual machine.
BLOCKED | A thread that is blocked waiting for a monitor lock.
WAITING | A thread that is waiting indefinitely for another thread to perform a particular action.
TIMED_WAITING | A thread that is waiting for another thread to perform an action for up to a specified waiting time.
TERMINATED | A thread that has exited.

When you click on one of the threads that can be expanded, you’ll see the stacktrace, as in the example
below:

Inspecting a Thread
You can also check the Show all Stacktraces button to automatically enable expansion for all threads.

Suggestions Screen
The Suggestions screen shows violations of an autoscaling policy that exist in the
system, and allows you to take action to correct the violations.
This screen is a visual representation of the output of the Suggestions API.
When there are no violations or other suggestions, the screen will appear somewhat blank:

When the system is in violation of an aspect of a policy, each violation will be shown, as in this screenshot:

A line is shown for each violation. In this case, we have defined a policy where no replica can exist on a node
that has less than 500 GB of available disk space. In this example, 4 replicas in our sample cluster violate this
rule.
In the "Action" column, the green button allows you to execute the recommended change to allow the
system to return to compliance with the policy. If you hover your mouse over this button, you will see the
recommended Collections API command:

In this case, the recommendation is to issue a MOVEREPLICA command to move this replica to a node with
more available disk space.



Since autoscaling features are only available in SolrCloud mode, this screen will only appear
when running Solr in SolrCloud mode.

Collection-Specific Tools
In the left-hand navigation bar, you will see a pull-down menu titled "Collection Selector" that can be used to
access collection specific administration screens.
Only Visible When Using SolrCloud



The "Collection Selector" pull-down menu is only available on Solr instances running in
SolrCloud mode.
Single node or master/slave replication instances of Solr will not display this menu; instead,
the Collection specific UI pages described in this section will be available in the Core
Selector pull-down menu.

Clicking on the Collection Selector pull-down menu will show a list of the collections in your Solr cluster, with
a search box that can be used to find a specific collection by name. When you select a collection from the
pull-down, the main display of the page will display some basic metadata about the collection, and a
secondary menu will appear in the left nav with links to additional collection specific administration screens.

The collection-specific UI screens are listed below, with a link to the section of this guide to find out more:
• Analysis - lets you analyze the data found in specific fields.
• Dataimport - shows you information about the current status of the Data Import Handler.
• Documents - provides a simple form allowing you to execute various Solr indexing commands directly
from the browser.
• Files - shows the current core configuration files such as solrconfig.xml.
• Query - lets you submit a structured query about various elements of a core.
• Stream - allows you to submit streaming expressions and see results and parsing explanations.
• Schema Browser - displays schema data in a browser window.

Analysis Screen
The Analysis screen lets you inspect how data will be handled according to the field, field type and dynamic
field configurations found in your Schema. You can analyze how content would be handled during indexing
or during query processing and view the results separately or at the same time. Ideally, you would want
content to be handled consistently, and this screen allows you to validate the settings in the field type or
field analysis chains.
Enter content in one or both boxes at the top of the screen, and then choose the field or field type
definitions to use for analysis.

If you click the Verbose Output check box, you see more detail on the transformations applied to the
input (such as converting to lower case or stripping extra characters), including the raw
bytes, type and detailed position information at each stage. The information displayed will vary depending
on the settings of the field or field type. Each step of the process is displayed in a separate section, with an
abbreviation for the tokenizer or filter that is applied in that step. Hover or click on the abbreviation, and
you’ll see the name and path of the tokenizer or filter.

In the example screenshot above, several transformations are applied to the input "Running is a sport." The
words "is" and "a" have been removed and the word "running" has been changed to its basic form, "run".
This is because we are using the field type text_en in this scenario, which is configured to remove stop
words (small words that usually do not provide a great deal of context) and "stem" terms when possible to
find more possible matches (this is particularly helpful with plural forms of words). If you click the question
mark next to the Analyze Fieldname/Field Type pull-down menu, the Schema Browser window will open,
showing you the settings for the field specified.
The section Understanding Analyzers, Tokenizers, and Filters describes in detail what each option is and how
it may transform your data and the section Running Your Analyzer has specific examples for using the
Analysis screen.
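The same analysis can be requested over HTTP from the field analysis handler. The sketch below builds such a request for the "Running is a sport." example; the host and the "films" collection are assumptions, the helper name is hypothetical, and the analysis.* parameter names should be confirmed against the handler's documentation.

```python
from urllib.parse import urlencode


def field_analysis_url(base_url, collection, field_type, index_text, query_text=None):
    """Build a URL for the collection's /analysis/field handler."""
    params = {
        "analysis.fieldtype": field_type,
        "analysis.fieldvalue": index_text,  # text analyzed as at index time
        "wt": "json",
    }
    if query_text is not None:
        params["analysis.query"] = query_text  # text analyzed as at query time
    return f"{base_url}/{collection}/analysis/field?{urlencode(params)}"


print(
    field_analysis_url(
        "http://localhost:8983/solr", "films", "text_en", "Running is a sport."
    )
)
```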

Dataimport Screen
The Dataimport screen shows the configuration of the DataImportHandler (DIH) and allows you to start, and
monitor the status of, import commands as defined by the options selected on the screen and defined in the
configuration file.

The Dataimport Screen
This screen also lets you adjust various options to control how the data is imported to Solr, and view the
data import configuration file that controls the import.
For more information about data importing with DIH, see the section on Uploading Structured Data Store
Data with the Data Import Handler.

Documents Screen
The Documents screen provides a simple form allowing you to execute various Solr indexing commands in a
variety of formats directly from the browser.

The Documents Screen
The screen allows you to:

• Submit JSON, CSV or XML documents in Solr-specific format for indexing
• Upload documents (in JSON, CSV or XML) for indexing
• Construct documents by selecting fields and field values
There are other ways to load data; see also these sections:



• Uploading Data with Index Handlers
• Uploading Data with Solr Cell using Apache Tika

Common Fields
• Request-Handler: The first step is to define the RequestHandler. By default /update will be defined.
Change the request handler to /update/extract to use Solr Cell.
• Document Type: Select the Document Type to define the format of document to load. The remaining
parameters may change depending on the document type selected.
• Document(s): Enter a properly-formatted Solr document corresponding to the Document Type selected.
XML and JSON documents must be formatted in a Solr-specific format; a small illustrative document will
be shown. CSV files should have headers corresponding to fields defined in the schema. More details can
be found at: Uploading Data with Index Handlers.
• Commit Within: Specify the number of milliseconds between the time the document is submitted and
when it is available for searching.
• Overwrite: If true, the new document will replace an existing document with the same value in the id
field. If false, multiple documents with the same id can be added.



Setting Overwrite to false is very rare in production situations; the default is true.
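The Common Fields above map onto an ordinary update request. As a sketch, the function below assembles the URL, headers and JSON body that would correspond to a JSON submission with Commit Within and Overwrite set. The host, collection, document fields and the default of 1000 ms are assumptions for the example, and nothing is sent.

```python
import json
from urllib.parse import urlencode


def build_update_request(base_url, collection, documents,
                         commit_within_ms=1000, overwrite=True):
    """Return (url, headers, body) for posting JSON documents to /update."""
    query = urlencode(
        {"commitWithin": commit_within_ms, "overwrite": str(overwrite).lower()}
    )
    url = f"{base_url}/{collection}/update?{query}"
    headers = {"Content-Type": "application/json"}
    body = json.dumps(documents)  # a JSON array of Solr-format documents
    return url, headers, body


update_url, update_headers, update_body = build_update_request(
    "http://localhost:8983/solr",
    "films",
    [{"id": "film-1", "name": "Example"}],
)
print(update_url)
print(update_body)
```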

CSV, JSON and XML Documents
When using these document types the functionality is similar to submitting documents via curl or similar.
The document structure must be in a Solr-specific format appropriate for the document type. Examples are
illustrated in the Document(s) text box when you select the various types.
These options will only add or overwrite documents; for other update tasks, see the Solr Command option.

Document Builder
The Document Builder provides a wizard-like interface to enter fields of a document.

File Upload
The File Upload option allows choosing a prepared file and uploading it. If using /update for the RequestHandler option, you will be limited to XML, CSV, and JSON.
Other document types (e.g., Word, PDF, etc.) can be indexed using the ExtractingRequestHandler (aka, Solr
Cell). You must modify the RequestHandler to /update/extract, which must be defined in your
solrconfig.xml file with your desired defaults. You should also add &literal.id shown in the "Extracting
Request Handler Params" field so the file chosen is given a unique id. More information can be found at:
Uploading Data with Solr Cell using Apache Tika

Solr Command
The Solr Command option allows you to use the /update request handler with XML or JSON formatted
commands to perform specific actions. A few examples are:
• Deleting documents
• Updating only certain fields of documents
• Issuing commit commands on the index
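For reference, JSON bodies for the three kinds of commands listed above might look like the following. The document id and field values are hypothetical; the shapes follow Solr's JSON update format as described in Uploading Data with Index Handlers and the atomic update syntax.

```python
import json

# Delete a single document by id, or every document matching a query.
delete_by_id = {"delete": {"id": "film-42"}}
delete_by_query = {"delete": {"query": "genre:Fantasy"}}

# Atomic update: change only the 'genre' field of an existing document.
atomic_update = [{"id": "film-42", "genre": {"set": "Adventure"}}]

# Explicit commit so that earlier changes become visible to searches.
commit = {"commit": {}}

for label, command in [
    ("delete by id", delete_by_id),
    ("delete by query", delete_by_query),
    ("atomic update", atomic_update),
    ("commit", commit),
]:
    print(label, json.dumps(command))
```

Each printed body can be pasted into the Document(s) box with the Solr Command document type selected.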

Files Screen
The Files screen lets you browse & view the various configuration files (such as solrconfig.xml and the
schema file) for the collection you selected.

The Files Screen
If you are using SolrCloud, the files displayed are the configuration files for this collection stored in
ZooKeeper. In a standalone Solr installation, all files in the conf directory are displayed.
While solrconfig.xml defines the behavior of Solr as it indexes content and responds to queries, the
Schema allows you to define the types of data in your content (field types), the fields your documents will be
broken into, and any dynamic fields that should be generated based on patterns of field names in the
incoming documents. Any other configuration files are used depending on how they are referenced in either
solrconfig.xml or your schema.

Configuration files cannot be edited with this screen, so a text editor of some kind must be used.
This screen is related to the Schema Browser Screen, in that they both can display information from the
schema, but the Schema Browser provides a way to drill into the analysis chain and displays linkages
between field types, fields, and dynamic field rules.
Many of the options defined in these configuration files are described throughout the rest of this Guide. In
particular, you will want to review these sections:
• Indexing and Basic Data Operations
• Searching
• The Well-Configured Solr Instance
• Documents, Fields, and Schema Design

Query Screen
You can use the Query screen to submit a search query to a Solr collection and analyze the results.
In the example in the screenshot, a query has been submitted, and the screen shows the query results sent
to the browser as JSON.

JSON Results of a Query

In this example, a query for genre:Fantasy was sent to a "films" collection. Defaults were used for all other
options in the form, which are explained briefly in the table below, and covered in detail in later parts of this
Guide.
The response is shown to the right of the form. Requests to Solr are simply HTTP requests, and the query
submitted is shown in light type above the results; if you click on this it will open a new browser window with
just this request and response (without the rest of the Solr Admin UI). The rest of the response is shown in
JSON, which is the default output format.
The response has at least two sections, but may have several more depending on the options chosen. The
two sections it always has are the responseHeader and the response. The responseHeader includes the
status of the search (status), the processing time (QTime), and the parameters (params) that were used to
process the query.
The response includes the documents that matched the query, in doc sub-sections. The fields returned depend
on the parameters of the query (and the defaults of the request handler used). The number of results is also
included in this section.
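To make the structure described above concrete, the sketch below parses a small, made-up JSON response and pulls out the pieces mentioned: status, QTime, params, the number of results, and the matching documents. The sample values are invented; only the key names reflect Solr's JSON response layout.

```python
import json

# Hypothetical response text, shaped like a Solr JSON query response.
sample_response = """
{
  "responseHeader": {"status": 0, "QTime": 3, "params": {"q": "genre:Fantasy"}},
  "response": {
    "numFound": 2,
    "start": 0,
    "docs": [
      {"id": "film-1", "name": "Example A"},
      {"id": "film-2", "name": "Example B"}
    ]
  }
}
"""


def summarize(response_text):
    """Extract the header fields and document ids from a query response."""
    data = json.loads(response_text)
    header = data["responseHeader"]
    body = data["response"]
    return {
        "ok": header["status"] == 0,          # 0 means the request succeeded
        "qtime_ms": header["QTime"],          # processing time in milliseconds
        "params": header.get("params", {}),   # parameters used for the query
        "num_found": body["numFound"],        # total number of matches
        "ids": [doc["id"] for doc in body["docs"]],
    }


print(summarize(sample_response))
```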
This screen allows you to experiment with different query options, and inspect how your documents were
indexed. The query parameters available on the form are some basic options that most users want to have
available, but there are dozens more available which could be simply added to the basic request by hand (if
opened in a browser). The following parameters are available:
Request-handler (qt)
Specifies the query handler for the request. If a query handler is not specified, Solr processes the
request with the standard query handler.
q
The query event. See Searching for an explanation of this parameter.
fq
The filter queries. See Common Query Parameters for more information on this parameter.
sort
Sorts the response to a query in either ascending or descending order based on the response’s score or
another specified characteristic.
start, rows

start is the offset into the query result starting at which documents should be returned. The default
value is 0, meaning that the query should return results starting with the first document that matches.
This field accepts the same syntax as the start query parameter, which is described in Searching. rows is
the number of rows to return.
fl
Defines the fields to return for each document. You can explicitly list the stored fields, functions, and doc
transformers you want to have returned by separating them with either a comma or a space.
wt
Specifies the Response Writer to be used to format the query response. Defaults to JSON if not specified.

indent
Click this button to request that the Response Writer use indentation to make the responses more
readable.
debugQuery
Click this button to augment the query response with debugging information, including "explain info" for
each document returned. This debugging information is intended to be intelligible to the administrator or
programmer.
dismax
Click this button to enable the Dismax query parser. See The DisMax Query Parser for further
information.
edismax
Click this button to enable the Extended DisMax query parser. See The Extended DisMax Query Parser for further
information.
hl
Click this button to enable highlighting in the query response. See Highlighting for more information.
facet
Enables faceting, the arrangement of search results into categories based on indexed terms. See Faceting
for more information.
spatial
Click this button to enable the use of location data in spatial or geospatial searches. See Spatial Search
for more information.
spellcheck
Click this button to enable the Spellchecker, which provides inline query suggestions based on other,
similar, terms. See Spell Checking for more information.

Stream Screen
The Stream screen allows you to enter a streaming expression and see the results. It is very similar to the
Query Screen, except the input box is at the top and all options must be declared in the expression.
The screen will insert everything up to the streaming expression itself, so you do not need to enter the full
URI with the hostname, port, collection, etc. Simply input the expression after the expr= part, and the URL
will be constructed dynamically as appropriate.
Under the input box, the Execute button will run the expression. The "with explanation" option will show the
parts of the streaming expression that were executed. Under this, the streamed results are shown. A URL for
viewing the output in a browser is also available.


Stream Screen with query and results

Schema Browser Screen
The Schema Browser screen lets you review schema data in a browser window.
If you have accessed this window from the Analysis screen, it will be opened to a specific field, dynamic field
rule or field type. If there is nothing chosen, use the pull-down menu to choose the field or field type.


Schema Browser Screen
The screen provides a great deal of useful information about each particular field and fieldtype in the
Schema, and provides a quick UI for adding fields or fieldtypes using the Schema API (if enabled). In the
example above, we have chosen the cat field. On the left side of the main view window, we see the field
name, that it is copied to the _text_ field (because of a copyField rule), and that it uses the strings
fieldtype. Click on one of those field or fieldtype names, and you can see the corresponding definitions.
In the right part of the main view, we see the specific properties of how the cat field is defined – either
explicitly or implicitly via its fieldtype, as well as how many documents have populated this field. Then we see
the analyzer used for indexing and query processing. Click the icon to the left of either of those, and you’ll
see the definitions for the tokenizers and/or filters that are used. The output of these processes is the
information you see when testing how content is handled for a particular field with the Analysis Screen.
Under the analyzer information is a button to Load Term Info. Clicking that button will show the top N
terms that are in a sample shard for that field, as well as a histogram showing the number of terms with a
given frequency in the field. Click on a term, and you will be taken to the Query Screen to see the results of
a query of that term in that field. If you want to always see the term information for a field, choose
Autoload and it will always appear when there are terms for a field.



Term Information is loaded from a single, arbitrarily selected core from the collection, to
provide a representative sample for the collection. Full Field Facet query results are needed
to see precise term counts across the entire collection.


Core-Specific Tools
The Core-Specific tools are a group of UI screens that allow you to see core-level information.
In the left-hand navigation bar, you will see a pull-down menu titled "Core Selector". Clicking on the menu
will show a list of Solr cores hosted on this Solr node, with a search box that can be used to find a specific
core by name.
When you select a core from the pull-down, the main display of the page will show some basic metadata
about the core, and a secondary menu will appear in the left nav with links to additional core specific
administration screens.

Core overview screen
The core-specific UI screens are listed below, with a link to the section of this guide to find out more:
• Ping - lets you ping a named core and determine whether the core is active.
• Plugins/Stats - shows statistics for plugins and other installed components.
• Replication - shows you the current replication status for the core, and lets you enable/disable
replication.
• Segments Info - provides a visualization of the underlying Lucene index segments.
If you are running a single node instance of Solr, additional UI screens normally displayed on a
per-collection basis will also be listed:
• Analysis - lets you analyze the data found in specific fields.
• Dataimport - shows you information about the current status of the Data Import Handler.
• Documents - provides a simple form allowing you to execute various Solr indexing commands directly
from the browser.
• Files - shows the current core configuration files such as solrconfig.xml.
• Query - lets you submit a structured query about various elements of a core.
• Stream - allows you to submit streaming expressions and see results and parsing explanations.
• Schema Browser - displays schema data in a browser window.


Ping
Choosing Ping under a core name issues a ping request to check whether the core is up and responding to
requests.

Ping Option in Core Dropdown
The search executed by a Ping is configured with the Request Parameters API. See Implicit RequestHandlers
for the paramset to use for the /admin/ping endpoint.
The Ping option doesn’t open a page, but the status of the request can be seen on the core overview page
shown when clicking on a collection name. The length of time the request has taken is displayed next to the
Ping option, in milliseconds.

Ping API Examples
While the UI screen makes it easy to see the ping response time, the underlying ping command can be more
useful when executed by remote monitoring tools:
Input
http://localhost:8983/solr/<core-name>/admin/ping
This command pings the named core for a response.
Input
http://localhost:8983/solr/<collection-name>/admin/ping?distrib=true&wt=xml
This command pings all replicas of the named collection for a response.
Sample Output




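<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">13</int>
    <lst name="params">
      <str name="q">{!lucene}*:*</str>
      <str name="distrib">false</str>
      <str name="df">_text_</str>
      <str name="rows">10</str>
      <str name="echoParams">all</str>
    </lst>
  </lst>
  <str name="status">OK</str>
</response>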

Both API calls have the same output. A status=OK indicates that the nodes are responding.
SolrJ Example
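import org.apache.solr.client.solrj.request.SolrPing;
import org.apache.solr.client.solrj.response.SolrPingResponse;

// solrClient (a SolrClient) and collectionName (a String) are assumed to be defined elsewhere
SolrPing ping = new SolrPing();
ping.getParams().add("distrib", "true"); // make it a distributed request against a collection
SolrPingResponse rsp = ping.process(solrClient, collectionName);
int status = rsp.getStatus();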

Plugins & Stats Screen
The Plugins screen shows information and statistics about the status and performance of various plugins
running in each Solr core. You can find information about the performance of the Solr caches, the state of
Solr’s searchers, and the configuration of Request Handlers and Search Components.
Choose an area of interest on the right, and then drill down into more specifics by clicking on one of the
names that appear in the central part of the window. In this example, we’ve chosen to look at the Searcher
stats, from the Core area:

Searcher Statistics


The display is a snapshot taken when the page is loaded. You can get updated status by choosing to either
Watch Changes or Refresh Values. Watching the changes will highlight those areas that have changed,
while refreshing the values will reload the page with updated information.

Replication Screen
The Replication screen shows you the current replication state for the core you have specified. SolrCloud has
supplanted much of this functionality, but if you are still using Master-Slave index replication, you can use
this screen to:
1. View the replicatable index state (on a master node).
2. View the current replication status (on a slave node).
3. Disable replication (on a master node).
Caution When Using SolrCloud
When using SolrCloud, do not attempt to disable replication via this screen.



More details on how to configure replication are available in the section called Index Replication.

Segments Info
The Segments Info screen lets you see a visualization of the various segments in the underlying Lucene
index for this core, with information about the size of each segment (both in bytes and in number of
documents) as well as other basic metadata about those segments. Most visible is the number of
deleted documents, but you can hover your mouse over the segments to see additional numeric details.


This information can help you decide on the optimal merge settings for your data.


Documents, Fields, and Schema Design
This section discusses how Solr organizes its data into documents and fields, as well as how to work with a
schema in Solr.
This section includes the following topics:
Overview of Documents, Fields, and Schema Design: An introduction to the concepts covered in this section.
Solr Field Types: Detailed information about field types in Solr, including the field types in the default Solr
schema.
Defining Fields: Describes how to define fields in Solr.
Copying Fields: Describes how to populate fields with data copied from another field.
Dynamic Fields: Information about using dynamic fields in order to catch and index fields that do not exactly
conform to other field definitions in your schema.
Schema API: Use curl commands to read various parts of a schema or create new fields and copyField rules.
Other Schema Elements: Describes other important elements in the Solr schema.
Putting the Pieces Together: A higher-level view of the Solr schema and how its elements work together.
DocValues: Describes how to create a docValues index for faster lookups.
Schemaless Mode: Automatically add previously unknown schema fields using value-based field type
guessing.


Overview of Documents, Fields, and Schema Design
The fundamental premise of Solr is simple. You give it a lot of information, then later you can ask it
questions and find the piece of information you want. The part where you feed in all the information is
called indexing or updating. When you ask a question, it’s called a query.
One way to understand how Solr works is to think of a loose-leaf book of recipes. Every time you add a
recipe to the book, you update the index at the back. You list each ingredient and the page number of the
recipe you just added. Suppose you add one hundred recipes. Using the index, you can very quickly find all
the recipes that use garbanzo beans, or artichokes, or coffee, as an ingredient. Using the index is much
faster than looking through each recipe one by one. Imagine a book of one thousand recipes, or one million.
Solr allows you to build an index with many different fields, or types of entries. The example above shows
how to build an index with just one field, ingredients. You could have other fields in the index for the
recipe’s cooking style, like Asian, Cajun, or vegan, and you could have an index field for preparation times.
Solr can answer questions like "What Cajun-style recipes that have blood oranges as an ingredient can be
prepared in fewer than 30 minutes?"
The schema is the place where you tell Solr how it should build indexes from input documents.

How Solr Sees the World
Solr’s basic unit of information is a document, which is a set of data that describes something. A recipe
document would contain the ingredients, the instructions, the preparation time, the cooking time, the tools
needed, and so on. A document about a person, for example, might contain the person’s name, biography,
favorite color, and shoe size. A document about a book could contain the title, author, year of publication,
number of pages, and so on.
In the Solr universe, documents are composed of fields, which are more specific pieces of information. Shoe
size could be a field. First name and last name could be fields.
Fields can contain different kinds of data. A name field, for example, is text (character data). A shoe size field
might be a floating point number so that it could contain values like 6 and 9.5. Obviously, the definition of
fields is flexible (you could define a shoe size field as a text field rather than a floating point number, for
example), but if you define your fields correctly, Solr will be able to interpret them correctly and your users
will get better results when they perform a query.
You can tell Solr about the kind of data a field contains by specifying its field type. The field type tells Solr how
to interpret the field and how it can be queried.
When you add a document, Solr takes the information in the document’s fields and adds that information to
an index. When you perform a query, Solr can quickly consult the index and return the matching documents.
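The person document described above could be sent to Solr in the XML update format roughly as follows. This is only a sketch: the field names and values are hypothetical, and each field would need a matching definition (or dynamic field rule) in the schema.

```xml
<!-- Hypothetical person document; field names and values are illustrative only -->
<add>
  <doc>
    <field name="id">person-001</field>
    <field name="first_name">Pat</field>
    <field name="last_name">Example</field>
    <field name="biography">Ketchup collector and amateur cryptographer.</field>
    <field name="favorite_color">green</field>
    <field name="shoe_size">9.5</field>
  </doc>
</add>
```

Each field element carries one piece of information, and the schema decides how each named field is interpreted.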

Field Analysis
Field analysis tells Solr what to do with incoming data when building an index. A more accurate name for this
process would be processing or even digestion, but the official name is analysis.
Consider, for example, a biography field in a person document. Every word of the biography must be
indexed so that you can quickly find people whose lives have had anything to do with ketchup, or
dragonflies, or cryptography.
However, a biography will likely contain lots of words you don’t care about and don’t want clogging up
your index: words like "the", "a", "to", and so forth. Furthermore, suppose the biography contains the word
"Ketchup", capitalized at the beginning of a sentence. If a user makes a query for "ketchup", you want Solr
to tell you about the person even though the biography contains the capitalized word.
The solution to both these problems is field analysis. For the biography field, you can tell Solr how to break
apart the biography into words. You can tell Solr that you want to make all the words lower case, and you
can tell Solr to remove accent marks.
Field analysis is an important part of a field type. Understanding Analyzers, Tokenizers, and Filters is a
detailed description of field analysis.
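As a rough sketch of what such analysis looks like in a schema, the field type below splits text into words, drops stop words, lower-cases the tokens, and folds accented characters to their ASCII equivalents. The type name text_biography and the stopwords.txt file are assumptions made for illustration; the tokenizer and filter factories are standard Solr classes.

```xml
<!-- Hypothetical field type for the biography example -->
<fieldType name="text_biography" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Break the text into word tokens -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Drop common words listed in stopwords.txt (assumed to exist in the configset) -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <!-- "Ketchup" and "ketchup" become the same token -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Remove accent marks -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>

<field name="biography" type="text_biography" indexed="true" stored="true"/>
```

Because a single analyzer element is used, the same chain applies both when indexing and when querying.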

Solr’s Schema File
Solr stores details about the field types and fields it is expected to understand in a schema file. The name
and location of this file may vary depending on how you initially configured Solr or if you modified it later.
• managed-schema is the name for the schema file Solr uses by default to support making Schema changes
at runtime via the Schema API, or Schemaless Mode features. You may explicitly configure the managed
schema features to use an alternative filename if you choose, but the contents of the files are still
updated automatically by Solr.
• schema.xml is the traditional name for a schema file which can be edited manually by users who use the
ClassicIndexSchemaFactory.
• If you are using SolrCloud you may not be able to find any file by these names on the local filesystem.
You will only be able to see the schema through the Schema API (if enabled) or through the Solr Admin
UI’s Cloud Screens.
Whatever the name of the file in use in your installation, the structure of the file is the same. However, the
way you interact with the file will differ. If you are using the managed schema, it is expected that you only
interact with the file through the Schema API, and never make manual edits. If you do not use the managed
schema, you will only be able to make manual edits to the file; the Schema API will not support any
modifications.
Note that if you are not using the Schema API yet you do use SolrCloud, you will need to interact with
schema.xml through ZooKeeper using upconfig and downconfig commands to make a local copy and upload
your changes. The options for doing this are described in Solr Control Script Reference and Using ZooKeeper
to Manage Configuration Files.
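Which of the two schema files is in use is determined by the schemaFactory declared in solrconfig.xml. A minimal sketch of both alternatives follows; the values shown for the managed schema are its usual defaults and are spelled out here only for illustration.

```xml
<!-- Managed schema: changed at runtime through the Schema API -->
<schemaFactory class="ManagedIndexSchemaFactory">
  <bool name="mutable">true</bool>
  <str name="managedSchemaResourceName">managed-schema</str>
</schemaFactory>

<!-- Alternative: a hand-edited schema.xml (use instead of the block above) -->
<!-- <schemaFactory class="ClassicIndexSchemaFactory"/> -->
```

Only one schemaFactory should be active in a given solrconfig.xml.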


Solr Field Types
The field type defines how Solr should interpret data in a field and how the field can be queried. There are
many field types included with Solr by default, and they can also be defined locally.
Topics covered in this section:
• Field Type Definitions and Properties
• Field Types Included with Solr
• Working with Currencies and Exchange Rates
• Working with Dates
• Working with Enum Fields
• Working with External Files and Processes
• Field Properties by Use Case



See also the FieldType Javadoc.

Field Type Definitions and Properties
A field type defines the analysis that will occur on a field when documents are indexed or queries are sent to
the index.
A field type definition can include four types of information:
• The name of the field type (mandatory).
• An implementation class name (mandatory).
• If the field type is TextField, a description of the field analysis for the field type.
• Field type properties - depending on the implementation class, some properties may be mandatory.

Field Type Definitions in schema.xml
Field types are defined in schema.xml. Each field type is defined between fieldType elements. They can
optionally be grouped within a types element. Here is an example of a field type definition for a type called
text_general:
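A definition of this shape, modeled on the text_general type found in Solr's sample configsets, is sketched here. The stopwords.txt and synonyms.txt file names are assumptions about what the configset contains, and the ① and ② comments mark the lines discussed in the callouts.

```xml
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100"> <!-- ① -->
  <analyzer type="index"> <!-- ② -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <!-- Synonyms are applied at query time only in this sketch -->
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```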















① The first line in the example above contains the field type name, text_general, and the name of the
implementing class, solr.TextField.

② The rest of the definition is about field analysis, described in Understanding Analyzers, Tokenizers, and
Filters.

The implementing class is responsible for making sure the field is handled correctly. In the class names in
schema.xml, the string solr is shorthand for org.apache.solr.schema or org.apache.solr.analysis.
Therefore, solr.TextField is really org.apache.solr.schema.TextField.

Field Type Properties
The field type class determines most of the behavior of a field type, but optional properties can also be
defined. For example, the following definition of a date field type defines two properties, sortMissingLast
and omitNorms.
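A one-line definition along these lines would do it; the type name date is an illustrative choice, and DatePointField is one of the date classes listed later in this section.

```xml
<!-- Date field type with two optional properties set -->
<fieldType name="date" class="solr.DatePointField"
           sortMissingLast="true" omitNorms="true"/>
```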

The properties that can be specified for a given field type fall into three major categories:
• Properties specific to the field type’s class.
• General Properties Solr supports for any field type.
• Field Default Properties that can be specified on the field type that will be inherited by fields that use this
type instead of the default behavior.
General Properties
These are the general properties for fields:


name
The name of the fieldType. This value gets used in field definitions, in the "type" attribute. It is strongly
recommended that names consist of alphanumeric or underscore characters only and not start with a
digit. This is not currently strictly enforced.

class
The class name that gets used to store and index the data for this type. Note that you may prefix included
class names with "solr." and Solr will automatically figure out which packages to search for the class - so
solr.TextField will work.
If you are using a third-party class, you will probably need to have a fully qualified class name. The fully
qualified equivalent for solr.TextField is org.apache.solr.schema.TextField.

positionIncrementGap
For multivalued fields, specifies a distance between multiple values, which prevents spurious phrase
matches.

autoGeneratePhraseQueries
For text fields. If true, Solr automatically generates phrase queries for adjacent terms. If false, terms
must be enclosed in double-quotes to be treated as phrases.

synonymQueryStyle
The query style used to combine scores of overlapping query terms (i.e., synonyms). Consider a search for
"blue tee" with query-time synonyms tshirt,tee.
Use as_same_term (default) to blend terms, i.e., SynonymQuery(tshirt,tee), where each term will be
treated as equally important. Use pick_best to select the most significant synonym when scoring, i.e.,
Dismax(tee,tshirt). Use as_distinct_terms to bias scoring towards the most significant synonym, i.e.,
(pants OR slacks).

as_same_term is appropriate when terms are true synonyms (television, tv). Use pick_best or
as_distinct_terms when synonyms are expanding to hyponyms (q=jeans w/ jeans=>jeans,pants)
and you want exact matches to come before parent and sibling concepts. See this blog article.
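A sketch of a field type that sets this property is shown below. The type name text_synonyms and the synonyms.txt file are hypothetical; synonymQueryStyle is set as an attribute of the fieldType element.

```xml
<!-- Hypothetical text type that scores by the best-matching synonym -->
<fieldType name="text_synonyms" class="solr.TextField"
           positionIncrementGap="100" synonymQueryStyle="pick_best">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- synonyms.txt is assumed to contain a line such as: tshirt,tee -->
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Switching the attribute to as_same_term or as_distinct_terms changes only how the expanded terms are scored, not which documents match.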

enableGraphQueries
For text fields, applicable when querying with sow=false (which is the default for the sow parameter). Use
true, the default, for field types with query analyzers including graph-aware filters, e.g., Synonym Graph
Filter and Word Delimiter Graph Filter.
Use false for field types with query analyzers including filters that can match docs when some tokens are
missing, e.g., Shingle Filter.

docValuesFormat
Defines a custom DocValuesFormat to use for fields of this type. This requires that a schema-aware codec,
such as SchemaCodecFactory, has been configured in solrconfig.xml.

postingsFormat
Defines a custom PostingsFormat to use for fields of this type. This requires that a schema-aware codec,
such as SchemaCodecFactory, has been configured in solrconfig.xml.


Lucene index back-compatibility is only supported for the default codec. If you choose to
customize the postingsFormat or docValuesFormat in your schema.xml, upgrading to a
future version of Solr may require you to either switch back to the default codec and
optimize your index to rewrite it into the default codec before upgrading, or re-build your
entire index from scratch after upgrading.

Field Default Properties
These are properties that can be specified either on the field types, or on individual fields to override the
values provided by the field types.
The default values for each property depend on the underlying FieldType class, which in turn may depend
on the version attribute of the <schema/>. The table below includes the default value for most FieldType
implementations provided by Solr, assuming a schema.xml that declares version="1.6".
Property | Description | Values | Implicit Default
indexed | If true, the value of the field can be used in queries to retrieve matching documents. | true or false | true
stored | If true, the actual value of the field can be retrieved by queries. | true or false | true
docValues | If true, the value of the field will be put in a column-oriented DocValues structure. | true or false | false
sortMissingFirst, sortMissingLast | Control the placement of documents when a sort field is not present. | true or false | false
multiValued | If true, indicates that a single document might contain multiple values for this field type. | true or false | false
omitNorms | If true, omits the norms associated with this field (this disables length normalization for the field, and saves some memory). Defaults to true for all primitive (non-analyzed) field types, such as int, float, date, bool, and string. Only full-text fields need norms. | true or false | *
omitTermFreqAndPositions | If true, omits term frequency, positions, and payloads from postings for this field. This can be a performance boost for fields that don’t require that information. It also reduces the storage space required for the index. Queries that rely on position that are issued on a field with this option will silently fail to find documents. This property defaults to true for all field types that are not text fields. | true or false | *
omitPositions | Similar to omitTermFreqAndPositions but preserves term frequency information. | true or false | *
termVectors, termPositions, termOffsets, termPayloads | These options instruct Solr to maintain full term vectors for each document, optionally including position, offset and payload information for each term occurrence in those vectors. These can be used to accelerate highlighting and other ancillary functionality, but impose a substantial cost in terms of index size. They are not necessary for typical uses of Solr. | true or false | false
required | Instructs Solr to reject any attempts to add a document which does not have a value for this field. This property defaults to false. | true or false | false
useDocValuesAsStored | If the field has docValues enabled, setting this to true would allow the field to be returned as if it were a stored field (even if it has stored=false) when matching “*” in an fl parameter. | true or false | true
large | Large fields are always lazy loaded and will only take up space in the document cache if the actual value is < 512KB. This option requires stored="true" and multiValued="false". It’s intended for fields that might have very large values so that they don’t get cached in memory. | true or false | false
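The sketch below shows how these properties can be set once on a field type and then overridden on individual fields. The field names sku and category are hypothetical.

```xml
<!-- Defaults for every field of this type: docValues on, missing values sort last -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true" docValues="true"/>

<!-- Inherits the type's properties and additionally rejects documents without a value -->
<field name="sku" type="string" indexed="true" stored="true" required="true"/>

<!-- Overrides the implicit defaults: not stored, and may hold several values per document -->
<field name="category" type="string" indexed="true" stored="false" multiValued="true"/>
```

A property given on a field element takes precedence over the same property on its field type.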

Field Type Similarity
A field type may optionally specify a <similarity/> that will be used when scoring documents that refer to
fields with this type, as long as the "global" similarity for the collection allows it.
By default, any field type which does not define a similarity uses BM25Similarity. For more details, and
examples of configuring both global & per-type Similarities, please see Other Schema Elements.
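As an illustration, a per-type similarity might be declared as in the sketch below. The type name text_short and the b value are hypothetical, and the global SchemaSimilarityFactory is written out explicitly here because it is a global similarity that permits per-type overrides.

```xml
<!-- Global similarity that defers to per-field-type similarities -->
<similarity class="solr.SchemaSimilarityFactory"/>

<!-- Hypothetical type for short text, with weaker length normalization -->
<fieldType name="text_short" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <similarity class="solr.BM25SimilarityFactory">
    <float name="k1">1.2</float>
    <float name="b">0.3</float>
  </similarity>
</fieldType>
```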

Field Types Included with Solr
The following table lists the field types that are available in Solr. The org.apache.solr.schema package
includes all the classes listed in this table.
Class

Description

BinaryField

Binary data.

© 2018, Apache Software Foundation

Guide Version 7.3 - Published: 2018-03-27

Page 148 of 1195

Apache Solr Reference Guide 7.3

Class

Description

BoolField

Contains either true or false. Values of 1, t, or T in the first character are
interpreted as true. Any other values in the first character are interpreted as
false.

CollationField

Supports Unicode collation for sorting and range queries. The ICUCollationField
is a better choice if you can use ICU4J. See the section Unicode Collation for
more information.

CurrencyField

Deprecated. Use CurrencyFieldType instead.

CurrencyFieldType

Supports currencies and exchange rates. See the section Working with
Currencies and Exchange Rates for more information.

DateRangeField

Supports indexing date ranges, to include point in time date instances as well
(single-millisecond durations). See the section Working with Dates for more
detail on using this field type. Consider using this field type even if it’s just for
date instances, particularly when the queries typically fall on UTC
year/month/day/hour, etc., boundaries.

DatePointField

Date field. Represents a point in time with millisecond precision, encoded using
a "Dimensional Points" based data structure that allows for very efficient
searches for specific values, or ranges of values. See the section Working with
Dates for more details on the supported syntax. For single valued fields,
docValues="true" must be used to enable sorting.

DoublePointField

Double field (64-bit IEEE floating point). This class encodes double values using
a "Dimensional Points" based data structure that allows for very efficient
searches for specific values, or ranges of values. For single valued fields,
docValues="true" must be used to enable sorting.

ExternalFileField

Pulls values from a file on disk. See the section Working with External Files and
Processes for more information.

EnumField

Deprecated. Use EnumFieldType instead.

EnumFieldType

Allows defining an enumerated set of values which may not be easily sorted by
either alphabetic or numeric order (such as a list of severities, for example). This
field type takes a configuration file, which lists the proper order of the field
values. See the section Working with Enum Fields for more information.

FloatPointField

Floating point field (32-bit IEEE floating point). This class encodes float values
using a "Dimensional Points" based data structure that allows for very efficient
searches for specific values, or ranges of values. For single valued fields,
docValues="true" must be used to enable sorting.

ICUCollationField

Supports Unicode collation for sorting and range queries. See the section
Unicode Collation for more information.

IntPointField

Integer field (32-bit signed integer). This class encodes int values using a
"Dimensional Points" based data structure that allows for very efficient
searches for specific values, or ranges of values. For single valued fields,
docValues="true" must be used to enable sorting.

Guide Version 7.3 - Published: 2018-03-27

© 2018, Apache Software Foundation

Apache Solr Reference Guide 7.3

Page 149 of 1195

Class

Description

LatLonPointSpatialField

A latitude/longitude coordinate pair; possibly multi-valued for multiple points.
Usually it’s specified as "lat,lon" order with a comma. See the section Spatial
Search for more information.

LatLonType

Deprecated. Consider using the LatLonPointSpatialField instead. A singlevalued latitude/longitude coordinate pair. Usually it’s specified as "lat,lon" order
with a comma. See the section Spatial Search for more information.

LongPointField

Long field (64-bit signed integer). This class encodes foo values using a
"Dimensional Points" based data structure that allows for very efficient
searches for specific values, or ranges of values. For single valued fields,
docValues="true" must be used to enable sorting.

PointType

A single-valued n-dimensional point. It’s both for sorting spatial data that is not
lat-lon, and for some more rare use-cases. (NOTE: this is not related to the
"Point" based numeric fields). See Spatial Search for more information.

PreAnalyzedField

Provides a way to send to Solr serialized token streams, optionally with
independent stored values of a field, and have this information stored and
indexed without any additional text processing.
Configuration and usage of PreAnalyzedField is documented in the section
Working with External Files and Processes.

RandomSortField

Does not contain a value. Queries that sort on this field type will return results in
random order. Use a dynamic field to use this feature.

SpatialRecursivePrefixTre (RPT for short) Accepts latitude comma longitude strings or other shapes in
eFieldType
WKT format. See Spatial Search for more information.
StrField

String (UTF-8 encoded string or Unicode). Strings are intended for small fields
and are not tokenized or analyzed in any way. They have a hard limit of slightly
less than 32K.

SortableTextField

A specialized version of TextField that allows (and defaults to)
docValues="true" for sorting on the first 1024 characters of the original string
prior to analysis. The number of characters used for sorting can be overridden
with the maxCharsForDocValues attribute.

TextField

Text, usually multiple words or tokens.

TrieDateField

Deprecated. Use DatePointField instead.

TrieDoubleField

Deprecated. Use DoublePointField instead.

TrieFloatField

Deprecated. Use FloatPointField instead.

TrieIntField

Deprecated. Use IntPointField instead.

TrieLongField

Deprecated. Use LongPointField instead.

TrieField

Deprecated. This field takes a type parameter to define the specific class of
Trie* field to use; Use an appropriate Point Field type instead.

UUIDField

Universally Unique Identifier (UUID). Pass in a value of NEW and Solr will create a
new UUID.
Note: configuring a UUIDField instance with a default value of NEW is not
advisable for most users when using SolrCloud (and not possible if the UUID
value is configured as the unique key field) since the result will be that each
replica of each document will get a unique UUID value. Using
UUIDUpdateProcessorFactory to generate UUID values when documents are
added is recommended instead.
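
As a sketch of the recommended alternative, an update request processor chain in solrconfig.xml can assign the UUID when a document is added. The chain name uuid and the target field id are assumptions made for illustration.

```xml
<!-- solrconfig.xml: hypothetical chain name "uuid"; "id" is an assumed field name -->
<updateRequestProcessorChain name="uuid">
  <processor class="solr.UUIDUpdateProcessorFactory">
    <str name="fieldName">id</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

A chain defined this way would typically be selected per request with the update.chain parameter, or configured as the default chain.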



All Trie* numeric and date field types have been deprecated in favor of *Point field types.
Point field types are better at range queries (speed, memory, disk), however simple
field:value queries underperform relative to Trie. Either accept this, or continue to use Trie
fields. This shortcoming may be addressed in a future release.

Working with Currencies and Exchange Rates
The currency FieldType provides support for monetary values to Solr/Lucene with query-time currency
conversion and exchange rates. The following features are supported:
• Point queries
• Range queries
• Function range queries
• Sorting
• Currency parsing by either currency code or symbol
• Symmetric & asymmetric exchange rates (asymmetric exchange rates are useful if there are fees
associated with exchanging the currency)
• Range faceting (using either facet.range or type:range in json.facet) as long as the start and end
values are specified in the same Currency.

Configuring Currencies



CurrencyField has been Deprecated
CurrencyField has been deprecated in favor of CurrencyFieldType; all configuration
examples below use CurrencyFieldType.

The currency field type is defined in schema.xml. This is the default configuration of this type.
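
A definition consistent with the description that follows might look like this sketch. The type name currency and the sub-field suffixes _l_ns and _s_ns are illustrative choices, not requirements.

```xml
<!-- schema.xml: sketch of a currency field type; name and suffix values are illustrative -->
<fieldType name="currency" class="solr.CurrencyFieldType"
           amountLongSuffix="_l_ns" codeStrSuffix="_s_ns"
           defaultCurrency="USD" currencyConfig="currency.xml"/>
```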

In this example, we have defined the name and class of the field type, and defined the defaultCurrency as
"USD", for U.S. Dollars. We have also defined a currencyConfig to use a file called "currency.xml". This is a
file of exchange rates between our default currency and other currencies. There is an alternate
implementation that would allow regular downloading of currency data. See Exchange Rates below for
more.
Many of the example schemas that ship with Solr include a dynamic field that uses this type, such as this
example:
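
A dynamic field rule of this kind could be written as follows. The type name currency is assumed to refer to a CurrencyFieldType defined elsewhere in the schema.

```xml
<!-- Any field whose name ends in _c becomes a currency field -->
<dynamicField name="*_c" type="currency" indexed="true" stored="true"/>
```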


This dynamic field would match any field that ends in _c and make it a currency typed field.
At indexing time, money fields can be indexed in a native currency. For example, if a product on an
e-commerce site is listed in Euros, indexing the price field as "1000,EUR" will index it appropriately. The price
should be separated from the currency by a comma, and the price must be encoded with a floating point
value (a decimal point).
During query processing, range and point queries are both supported.
Sub-field Suffixes
You must specify parameters amountLongSuffix and codeStrSuffix, corresponding to dynamic fields to be
used for the raw amount and the currency dynamic sub-fields, e.g.:
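
The sketch below shows one way the two suffix parameters and their backing dynamic fields might fit together. The type names plong and string are assumptions and must match field types that exist in your schema.

```xml
<fieldType name="currency" class="solr.CurrencyFieldType"
           amountLongSuffix="_l_ns" codeStrSuffix="_s_ns"
           defaultCurrency="USD" currencyConfig="currency.xml"/>

<!-- Backing sub-fields; left unstored so that Atomic Updates keep working -->
<dynamicField name="*_l_ns" type="plong"  indexed="true" stored="false"/>
<dynamicField name="*_s_ns" type="string" indexed="true" stored="false"/>
```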

In the above example, the raw amount field will use the "*_l_ns" dynamic field, which must exist in the
schema and use a long field type, i.e., one that extends LongValueFieldType. The currency code field will
use the "*_s_ns" dynamic field, which must exist in the schema and use a string field type, i.e., one that is or
extends StrField.
Atomic Updates won’t work if dynamic sub-fields are stored



As noted on Updating Parts of Documents, stored dynamic sub-fields will cause indexing to
fail when you use Atomic Updates. To avoid this problem, specify stored="false" on those
dynamic fields.

Exchange Rates
You configure exchange rates by specifying a provider. Natively, two provider types are supported:
FileExchangeRateProvider or OpenExchangeRatesOrgProvider.
FileExchangeRateProvider
This provider requires you to provide a file of exchange rates. It is the default, meaning that to use this
provider you only need to specify the file path and name as a value for currencyConfig in the definition for
this type.
There is a sample currency.xml file included with Solr, found in the same directory as the schema.xml file.

Here is a small snippet from this file:








rate="0.869914" />
rate="7.800095" />
rate="8.966508" />
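
For orientation, a rates file generally has the following shape. This is a minimal sketch: the currency pairs, the rate values, and the comment attribute are illustrative assumptions, not the contents of the file shipped with Solr.

```xml
<!-- currency.xml: currency pairs and rates below are illustrative only -->
<currencyConfig version="1.0">
  <rates>
    <!-- Rate from the default currency to another currency -->
    <rate from="USD" to="EUR" rate="0.743676" comment="European Euro"/>

    <!-- Cross-rate between two non-default currencies -->
    <rate from="EUR" to="GBP" rate="0.869914"/>

    <!-- Explicit entry for the reverse direction, making the pair asymmetric -->
    <rate from="EUR" to="USD" rate="0.5"/>
  </rates>
</currencyConfig>
```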





OpenExchangeRatesOrgProvider
You can configure Solr to download exchange rates from OpenExchangeRates.Org, which updates rates
between USD and 170 currencies hourly. These rates are symmetrical only.
In this case, you need to specify the providerClass in the definitions for the field type and sign up for an API
key. Here is an example:
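
A configuration along these lines is one possibility. The app_id value is a placeholder for your own API key, and the type name and suffixes are illustrative.

```xml
<!-- app_id value is a placeholder; substitute your own API key -->
<fieldType name="currency" class="solr.CurrencyFieldType"
           amountLongSuffix="_l_ns" codeStrSuffix="_s_ns"
           providerClass="solr.OpenExchangeRatesOrgProvider"
           refreshInterval="60"
           ratesFileLocation="http://www.openexchangerates.org/api/latest.json?app_id=yourPersonalAppIdKey"/>
```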

The refreshInterval is specified in minutes, so the above example will download the newest rates every 60 minutes.
The refresh interval may be increased, but not decreased.

Working with Dates
Date Formatting
Solr’s date fields (DatePointField, DateRangeField and the deprecated TrieDateField) represent "dates"
as a point in time with millisecond precision. The format used is a restricted form of the canonical
representation of dateTime in the XML Schema specification – a restricted subset of ISO-8601. For those
familiar with Java 8, Solr uses DateTimeFormatter.ISO_INSTANT for formatting, and parsing too with
"leniency".

YYYY-MM-DDThh:mm:ssZ

• YYYY is the year.
• MM is the month.
• DD is the day of the month.
• hh is the hour of the day as on a 24-hour clock.
• mm is minutes.
• ss is seconds.
• Z is a literal 'Z' character indicating that this string representation of the date is in UTC.
Note that no time zone can be specified; the String representations of dates are always expressed in
Coordinated Universal Time (UTC). Here is an example value:

1972-05-20T17:33:18Z
You can optionally include fractional seconds if you wish, although any precision beyond milliseconds will be
ignored. Here are example values with sub-seconds:
• 1972-05-20T17:33:18.772Z
• 1972-05-20T17:33:18.77Z
• 1972-05-20T17:33:18.7Z
There must be a leading '-' for dates prior to year 0000, and Solr will format dates with a leading '+' for
years after 9999. Year 0000 is considered year 1 BC; there is no such thing as year 0 AD or BC.
Query escaping may be required
As you can see, the date format includes colon characters separating the hours, minutes,
and seconds. Because the colon is a special character to Solr’s most common query
parsers, escaping is sometimes required, depending on exactly what you are trying to do.



This is normally an invalid query: datefield:1972-05-20T17:33:18.772Z
These are valid queries:

datefield:1972-05-20T17\:33\:18.772Z
datefield:"1972-05-20T17:33:18.772Z"
datefield:[1972-05-20T17:33:18.772Z TO *]
Date Range Formatting
Solr’s DateRangeField supports the same point in time date syntax described above (with date math
described below) and more to express date ranges. One class of examples is truncated dates, which
represent the entire date span to the precision indicated. The other class uses the range syntax ([ TO ]).
Here are some examples:
• 2000-11 – The entire month of November, 2000.
• 2000-11T13 – Likewise but for an hour of the day (1300 to before 1400, i.e., 1pm to 2pm).
• -0009 – The year 10 BC. A 0 in the year position is 0 AD, and is also considered 1 BC.
• [2000-11-01 TO 2014-12-01] – The specified date range at a day resolution.
• [2014 TO 2014-12-01] – From the start of 2014 till the end of the first day of December.

• [* TO 2014-12-01] – From the earliest representable time through to the end of the day on 2014-12-01.
Limitations: The range syntax doesn’t support embedded date math. If you specify a date instance
supported by DatePointField with date math truncating it, like NOW/DAY, you still get the first millisecond of
that day, not the entire day’s range. Exclusive ranges (using { & }) work in queries but not for indexing
ranges.

Date Math
Solr’s date field types also support date math expressions, which make it easy to create times relative to
fixed moments in time, including the current time, which can be represented using the special value of “NOW”.
Date Math Syntax
Date math expressions consist of either adding some quantity of time in a specified unit, or rounding the
current time by a specified unit. Expressions can be chained and are evaluated left to right.
For example: this represents a point in time two months from now:

NOW+2MONTHS
This is one day ago:

NOW-1DAY
A slash is used to indicate rounding. This represents the beginning of the current hour:

NOW/HOUR
The following example computes (with millisecond precision) the point in time six months and three days
into the future and then rounds that time to the beginning of that day:

NOW+6MONTHS+3DAYS/DAY
Note that while date math is most commonly used relative to NOW it can be applied to any fixed moment in
time as well:

1972-05-20T17:33:18.772Z+6MONTHS+3DAYS/DAY
Request Parameters That Affect Date Math
NOW

The NOW parameter is used internally by Solr to ensure consistent date math expression parsing across
multiple nodes in a distributed request. It can also be specified to instruct Solr to use an arbitrary moment in
time (past or future) in all situations where the special value of “NOW” would affect date
math expressions.
It must be specified as a (long-valued) number of milliseconds since the epoch.
Example:

q=solr&fq=start_date:[* TO NOW]&NOW=1384387200000

TZ

By default, all date math expressions are evaluated relative to the UTC TimeZone, but the TZ parameter can
be specified to override this behaviour, by forcing all date based addition and rounding to be relative to the
specified time zone.
For example, the following request will use range faceting to facet over the current month, "per day"
relative to UTC:
http://localhost:8983/solr/my_collection/select?q=*:*&facet.range=my_date_field&facet=true&facet.range.start=NOW/MONTH&facet.range.end=NOW/MONTH%2B1MONTH&facet.range.gap=%2B1DAY&wt=xml

<int name="2013-11-01T00:00:00Z">0</int>
<int name="2013-11-02T00:00:00Z">0</int>
<int name="2013-11-03T00:00:00Z">0</int>
<int name="2013-11-04T00:00:00Z">0</int>
<int name="2013-11-05T00:00:00Z">0</int>
<int name="2013-11-06T00:00:00Z">0</int>
<int name="2013-11-07T00:00:00Z">0</int>

In this example, by contrast, the "days" will be computed relative to the specified time zone, including any
applicable Daylight Saving Time adjustments:
http://localhost:8983/solr/my_collection/select?q=*:*&facet.range=my_date_field&facet=true&facet.range.start=NOW/MONTH&facet.range.end=NOW/MONTH%2B1MONTH&facet.range.gap=%2B1DAY&TZ=America/Los_Angeles&wt=xml

<int name="2013-11-01T07:00:00Z">0</int>
<int name="2013-11-02T07:00:00Z">0</int>
<int name="2013-11-03T07:00:00Z">0</int>
<int name="2013-11-04T08:00:00Z">0</int>
<int name="2013-11-05T08:00:00Z">0</int>
<int name="2013-11-06T08:00:00Z">0</int>
<int name="2013-11-07T08:00:00Z">0</int>

More DateRangeField Details
DateRangeField is almost a drop-in replacement for places where DatePointField is used. The only
difference is that Solr’s XML or SolrJ response formats will expose the stored data as a String instead of a
Date. The underlying index data for this field will be a bit larger. Queries that align to units of time of a second
or longer should be faster than with TrieDateField, especially if they are in UTC.
The main point of DateRangeField, as its name suggests, is to allow indexing date ranges. To do that, simply
supply strings in the format shown above. It also supports specifying 3 different relational predicates
between the indexed data, and the query range:

• Intersects (default)
• Contains
• Within
You can specify the predicate by querying using the op local-params parameter like so:
fq={!field f=dateRange op=Contains}[2013 TO 2018]
Unlike most local parameters, op is not defined by any query parser (field); it is defined by the field
type, in this case DateRangeField. The above example would find documents with indexed ranges that
contain (or equal) the range 2013 through 2018. Multi-valued overlapping indexed ranges in a document are
effectively coalesced.
For a DateRangeField example use-case, see Solr’s community wiki.

Working with Enum Fields
EnumFieldType allows defining a field whose values are a closed set, and the sort order is pre-determined
but is not alphabetic nor numeric. Examples of this are severity lists, or risk definitions.



EnumField has been Deprecated
EnumField has been deprecated in favor of EnumFieldType; all configuration examples
below use EnumFieldType.

Defining an EnumFieldType in schema.xml
The EnumFieldType type definition is quite simple, as in this example defining field types for "priorityLevel"
and "riskLevel" enumerations:
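
Definitions of this kind might look like the following sketch. The file name enumsConfig.xml and the enumeration names priority and risk are assumptions for illustration.

```xml
<!-- schema.xml: file name and enumName values are illustrative -->
<fieldType name="priorityLevel" class="solr.EnumFieldType" docValues="true"
           enumsConfig="enumsConfig.xml" enumName="priority"/>
<fieldType name="riskLevel" class="solr.EnumFieldType" docValues="true"
           enumsConfig="enumsConfig.xml" enumName="risk"/>
```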


Besides the name and the class, which are common to all field types, this type also takes two additional
parameters:

enumsConfig
the name of a configuration file that contains the list of field values and their order that you wish
to use with this field type. If a path to the file is not specified, the file should be in the conf
directory for the collection.

enumName
the name of the specific enumeration in the enumsConfig file to use for this type.
Note that docValues="true" must be specified either in the EnumFieldType fieldType or field specification.

Defining the EnumFieldType Configuration File
The file named with the enumsConfig parameter can contain multiple enumeration value lists with different
names if there are multiple uses for enumerations in your Solr schema.
In this example, there are two value lists defined. Each list is between enum opening and closing tags:



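<?xml version="1.0" ?>
<enumsConfig>
  <enum name="priority">
    <value>Not Available</value>
    <value>Low</value>
    <value>Medium</value>
    <value>High</value>
    <value>Urgent</value>
  </enum>
  <enum name="risk">
    <value>Unknown</value>
    <value>Very Low</value>
    <value>Low</value>
    <value>Medium</value>
    <value>High</value>
    <value>Critical</value>
  </enum>
</enumsConfig>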



Changing Values



You cannot change the order of, or remove, existing values in an enum without reindexing.
You can, however, add new values to the end.

Working with External Files and Processes
The ExternalFileField Type
The ExternalFileField type makes it possible to specify the values for a field in a file outside the Solr index.
For such a field, the file contains mappings from a key field to the field value. Another way to think of this is
that, instead of specifying the field in documents as they are indexed, Solr finds values for this field in the
external file.



External fields are not searchable. They can be used only for function queries or display.
For more information on function queries, see the section on Function Queries.

The ExternalFileField type is handy for cases where you want to update a particular field in many
documents more often than you want to update the rest of the documents. For example, suppose you have
implemented a document rank based on the number of views. You might want to update the rank of all the
documents daily or hourly, while the rest of the contents of the documents might be updated much less
frequently. Without ExternalFileField, you would need to update each document just to change the rank.

Using ExternalFileField is much more efficient because all document values for a particular field are
stored in an external file that can be updated as frequently as you wish.
In schema.xml, the definition of this field type might look like this:
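
One possible definition is sketched here. The type name entryRankFile matches the file names discussed below; the key field name pkId and the default value 0 are assumptions.

```xml
<!-- schema.xml: keyField "pkId" and defVal "0" are illustrative -->
<fieldType name="entryRankFile" class="solr.ExternalFileField"
           keyField="pkId" defVal="0"
           stored="false" indexed="false"/>
```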

The keyField attribute defines the key that will be defined in the external file. It is usually the unique key for
the index, but it doesn’t need to be as long as the keyField can be used to identify documents in the index.
A defVal defines a default value that will be used if there is no entry in the external file for a particular
document.
Format of the External File
The file itself is located in Solr’s index directory, which by default is $SOLR_HOME/data. The name of the file
should be external_fieldname or external_fieldname.*. For the example above, then, the file could be
named external_entryRankFile or external_entryRankFile.txt.



If any files using the name pattern .* (such as .txt) appear, the last (after being sorted by
name) will be used and previous versions will be deleted. This behavior supports
implementations on systems where one may not be able to overwrite a file (for example,
on Windows, if the file is in use).

The file contains entries that map a key field, on the left of the equals sign, to a value, on the right. Here are
a few example entries:
doc33=1.414
doc34=3.14159
doc40=42
The keys listed in this file do not need to be unique. The file does not need to be sorted, but Solr will be able
to perform the lookup faster if it is.
Reloading an External File
It’s possible to define an event listener to reload an external file when either a searcher is reloaded or when
a new searcher is started. See the section Query-Related Listeners for more information, but a sample
definition in solrconfig.xml might look like this:
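
A listener configuration of this kind could be sketched as follows, with one listener per searcher event:

```xml
<!-- solrconfig.xml: reload external files on both searcher events -->
<listener event="newSearcher" class="org.apache.solr.schema.ExternalFileFieldReloader"/>
<listener event="firstSearcher" class="org.apache.solr.schema.ExternalFileFieldReloader"/>
```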



The PreAnalyzedField Type
The PreAnalyzedField type provides a way to send to Solr serialized token streams, optionally with
independent stored values of a field, and have this information stored and indexed without any additional
text processing applied in Solr. This is useful if a user wants to submit field content that was already processed
by some existing external text processing pipeline (e.g., it has been tokenized, annotated, stemmed,
synonyms inserted, etc.), while using all the rich attributes that Lucene’s TokenStream provides (per-token
attributes).
The serialization format is pluggable using implementations of the PreAnalyzedParser interface. There are two
out-of-the-box implementations:
• JsonPreAnalyzedParser: as the name suggests, it parses content that uses JSON to represent a field’s
content. This is the default parser to use if the field type is not configured otherwise.
• SimplePreAnalyzedParser: uses a simple strict plain text format, which in some situations may be easier
to create than JSON.
There is only one configuration parameter, parserImpl. The value of this parameter should be the fully
qualified class name of a class that implements the PreAnalyzedParser interface. The default value of this
parameter is org.apache.solr.schema.JsonPreAnalyzedParser.
By default, the query-time analyzer for fields of this type will be the same as the index-time analyzer, which
expects serialized pre-analyzed text. You must add a query type analyzer to your fieldType in order to
perform analysis on non-pre-analyzed queries. In the example below, the index-time analyzer expects the
default JSON serialization format, and the query-time analyzer will employ
StandardTokenizer/LowerCaseFilter:
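
A field type matching this description might be declared as in the sketch below. The type name pre_with_query_analyzer is an illustrative choice.

```xml
<!-- Index time: default JSON pre-analyzed format. Query time: regular analysis chain. -->
<fieldType name="pre_with_query_analyzer" class="solr.PreAnalyzedField">
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```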






JsonPreAnalyzedParser
This is the default serialization format used by PreAnalyzedField type. It uses a top-level JSON map with the
following keys:
• v (required) – Version key. Currently the supported version is 1.
• str (optional) – Stored string value of a field. You can use at most one of str or bin.
• bin (optional) – Stored binary value of a field. The binary value has to be Base64 encoded.
• tokens (optional) – Serialized token stream. This is a JSON list.

Any other top-level key is silently ignored.
Token Stream Serialization

The token stream is expressed as a JSON list of JSON maps. The map for each token consists of the following
keys and values:
• t (required) – token; Lucene attribute CharTermAttribute; value: UTF-8 string representing the current token
• s (optional) – start offset; Lucene attribute OffsetAttribute; value: non-negative integer
• e (optional) – end offset; Lucene attribute OffsetAttribute; value: non-negative integer
• i (optional) – position increment; Lucene attribute PositionIncrementAttribute; value: non-negative integer, default is 1
• p (optional) – payload; Lucene attribute PayloadAttribute; value: Base64 encoded payload
• y (optional) – lexical type; Lucene attribute TypeAttribute; value: UTF-8 string
• f (optional) – flags; Lucene attribute FlagsAttribute; value: string representing an integer value in hexadecimal format

Any other key is silently ignored.
JsonPreAnalyzedParser Example

{
"v":"1",
"str":"test ąćęłńóśźż",
"tokens": [
{"t":"one","s":123,"e":128,"i":22,"p":"DQ4KDQsODg8=","y":"word"},
{"t":"two","s":5,"e":8,"i":1,"y":"word"},
{"t":"three","s":20,"e":22,"i":1,"y":"foobar"}
]
}
SimplePreAnalyzedParser
The fully qualified class name to use when specifying this format via the parserImpl configuration
parameter is org.apache.solr.schema.SimplePreAnalyzedParser.
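
As a sketch, selecting this parser on a field type could look like the following. The type name preanalyzed_simple is hypothetical.

```xml
<!-- schema.xml: hypothetical type name; parserImpl selects the plain-text format -->
<fieldType name="preanalyzed_simple" class="solr.PreAnalyzedField"
           parserImpl="org.apache.solr.schema.SimplePreAnalyzedParser"/>
```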
SimplePreAnalyzedParser Syntax

The serialization format supported by this parser is as follows:

Serialization format
content ::= version (stored)? tokens
version ::= digit+ " "
; stored field value - any "=" inside must be escaped!
stored ::= "=" text "="
tokens ::= (token ((" ") + token)*)*
token ::= text ("," attrib)*
attrib ::= name '=' value
name ::= text
value ::= text
Special characters in "text" values can be escaped using the escape character \. The following escape
sequences are recognized:
• "\ " (backslash followed by a space) – literal space character
• \, – literal , character
• \= – literal = character
• \\ – literal \ character
• \n – newline
• \r – carriage return
• \t – horizontal tab

Please note that Unicode sequences (e.g., \u0001) are not supported.
Supported Attributes

The following token attributes are supported, and identified with short symbolic names:
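• i – position increment; Lucene attribute PositionIncrementAttribute; value format: integer
• s – start offset; Lucene attribute OffsetAttribute; value format: integer
• e – end offset; Lucene attribute OffsetAttribute; value format: integer
• y – lexical type; Lucene attribute TypeAttribute; value format: string
• f – flags; Lucene attribute FlagsAttribute; value format: hexadecimal integer
• p – payload; Lucene attribute PayloadAttribute; value format: bytes in hexadecimal format; whitespace is ignored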

Token positions are tracked and implicitly added to the token stream - the start and end offsets consider
only the term text and whitespace, and exclude the space taken by token attributes.

Example Token Streams

1 one two three
• version: 1
• stored: null
• token: (term=one,startOffset=0,endOffset=3)
• token: (term=two,startOffset=4,endOffset=7)
• token: (term=three,startOffset=8,endOffset=13)
1 one  two   three
• version: 1
• stored: null
• token: (term=one,startOffset=0,endOffset=3)
• token: (term=two,startOffset=5,endOffset=8)
• token: (term=three,startOffset=11,endOffset=16)
1 one,s=123,e=128,i=22 two three,s=20,e=22
• version: 1
• stored: null
• token: (term=one,positionIncrement=22,startOffset=123,endOffset=128)
• token: (term=two,positionIncrement=1,startOffset=5,endOffset=8)
• token: (term=three,positionIncrement=1,startOffset=20,endOffset=22)
1 \ one\ \,,i=22,a=\, two\=
\n,\ =\ \
• version: 1
• stored: null
• token: (term=one ,,positionIncrement=22,startOffset=0,endOffset=6)
• token: (term=two= ,positionIncrement=1,startOffset=7,endOffset=15)
• token: (term=\,positionIncrement=1,startOffset=17,endOffset=18)
Note that unknown attributes and their values are ignored, so in this example, the “a” attribute on the first
token and the " " (escaped space) attribute on the second token are ignored, along with their values,
because they are not among the supported attribute names.

1 ,i=22 ,i=33,s=2,e=20 ,
• version: 1
• stored: null
• token: (term=,positionIncrement=22,startOffset=0,endOffset=0)
• token: (term=,positionIncrement=33,startOffset=2,endOffset=20)
• token: (term=,positionIncrement=1,startOffset=2,endOffset=2)
1 =This is the stored part with \=
\n \t escapes.=one two three
• version: 1
• stored: This is the stored part with = \t escapes.
• token: (term=one,startOffset=0,endOffset=3)
• token: (term=two,startOffset=4,endOffset=7)
• token: (term=three,startOffset=8,endOffset=13)
Note that the \t in the above stored value is not literal; it’s shown that way to visually indicate the actual tab
char that is in the stored value.
1 ==
• version: 1
• stored: ""
• (no tokens)
1 =this is a test.=
• version: 1
• stored: this is a test.
• (no tokens)

Field Properties by Use Case
Here is a summary of common use cases, and the attributes the fields or field types should have to support
the case. An entry of true or false in the table indicates that the option must be set to the given value for the
use case to function correctly. If no entry is provided, the setting of that attribute has no impact on the case.

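Each use case below lists the attributes that have an entry, chosen from indexed, stored, multiValued,
omitNorms, termVectors, termPositions, and docValues. Bracketed numbers refer to the notes that follow.
• search within field – indexed: true
• retrieve contents – stored: true [8]; docValues: true [8]
• use as unique key – indexed: true; multiValued: false
• sort on field – indexed: true [7]; multiValued: false [9]; omitNorms: true [1]; docValues: true [7]
• highlighting – indexed: true [4]; stored: true; termVectors: true [2]; termPositions: true [3]
• faceting [5] – indexed: true [7]; docValues: true [7]
• add multiple values, maintaining order – multiValued: true
• field length affects doc score – omitNorms: false
• MoreLikeThis [5] – termVectors: true [6]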

Notes:
1. Recommended but not necessary.
2. Will be used if present, but not necessary.
3. (if termVectors=true)
4. A tokenizer must be defined for the field, but it doesn’t need to be indexed.
5. Described in Understanding Analyzers, Tokenizers, and Filters.
6. Term vectors are not mandatory here. If not true, then a stored field is analyzed. So term vectors are
recommended, but only required if stored=false.
7. For most field types, either indexed or docValues must be true, but both are not required. DocValues can
be more efficient in many cases. For [Int/Long/Float/Double/Date]PointFields, docValues=true is
required.
8. Stored content will be used by default, but docValues can alternatively be used. See DocValues.
9. Multi-valued sorting may be performed on docValues-enabled fields using the two-argument field()
function, e.g., field(myfield,min); see the field() function in Function Queries.

Defining Fields
Fields are defined in the fields element of schema.xml. Once you have the field types set up, defining the
fields themselves is simple.

Example Field Definition
The following example defines a field named price with a type named float and a default value of 0.0; the
indexed and stored properties are explicitly set to true, while any other properties specified on the float
field type are inherited.
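
Such a definition could be written as follows, assuming a field type named float exists in the schema:

```xml
<field name="price" type="float" default="0.0" indexed="true" stored="true"/>
```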


Field Properties
Field definitions can have the following properties:

name
The name of the field. Field names should consist of alphanumeric or underscore characters only and not
start with a digit. This is not currently strictly enforced, but other field names will not have first class
support from all components and back compatibility is not guaranteed. Names with both leading and
trailing underscores (e.g., _version_) are reserved. Every field must have a name.

type
The name of the fieldType for this field. This will be found in the name attribute on the fieldType
definition. Every field must have a type.

default
A default value that will be added automatically to any document that does not have a value in this field
when it is indexed. If this property is not specified, there is no default.

Optional Field Type Override Properties
Fields can have many of the same properties as field types. Properties listed below which are
specified on an individual field will override any explicit value for that property specified on the
fieldType of the field, or any implicit default property value provided by the underlying fieldType
implementation. The properties below are reproduced from Field Type Definitions and Properties, which has more
details:
indexed
If true, the value of the field can be used in queries to retrieve matching documents. Values: true or
false. Implicit default: true.

stored
If true, the actual value of the field can be retrieved by queries. Values: true or false. Implicit
default: true.

docValues
If true, the value of the field will be put in a column-oriented DocValues structure. Values: true or
false. Implicit default: false.

sortMissingFirst, sortMissingLast
Control the placement of documents when a sort field is not present. Values: true or false. Implicit
default: false.

multiValued
If true, indicates that a single document might contain multiple values for this field type. Values:
true or false. Implicit default: false.

omitNorms
If true, omits the norms associated with this field (this disables length normalization for the field, and
saves some memory). Defaults to true for all primitive (non-analyzed) field types, such as int, float,
date, bool, and string. Only full-text fields need norms. Values: true or false. Implicit default: *.

omitTermFreqAndPositions
If true, omits term frequency, positions, and payloads from postings for this field. This can be a
performance boost for fields that don’t require that information. It also reduces the storage space
required for the index. Queries that rely on position that are issued on a field with this option will
silently fail to find documents. This property defaults to true for all field types that are not text
fields. Values: true or false. Implicit default: *.

omitPositions
Similar to omitTermFreqAndPositions but preserves term frequency information. Values: true or false.
Implicit default: *.

termVectors, termPositions, termOffsets, termPayloads
These options instruct Solr to maintain full term vectors for each document, optionally including
position, offset and payload information for each term occurrence in those vectors. These can be used to
accelerate highlighting and other ancillary functionality, but impose a substantial cost in terms of
index size. They are not necessary for typical uses of Solr. Values: true or false. Implicit default:
false.

required
Instructs Solr to reject any attempts to add a document which does not have a value for this field.
Values: true or false. Implicit default: false.

useDocValuesAsStored
If the field has docValues enabled, setting this to true would allow the field to be returned as if it
were a stored field (even if it has stored=false) when matching “*” in an fl parameter. Values: true or
false. Implicit default: true.

large
Large fields are always lazy loaded and will only take up space in the document cache if the actual value
is < 512KB. This option requires stored="true" and multiValued="false". It’s intended for fields that
might have very large values so that they don’t get cached in memory. Values: true or false. Implicit
default: false.

Copying Fields
You might want to interpret some document fields in more than one way. Solr has a mechanism for making
copies of fields so that you can apply several distinct field types to a single piece of incoming information.
The name of the field you want to copy is the source, and the name of the copy is the destination. In
schema.xml, you copy a field with a copyField declaration:
<copyField source="cat" dest="text"/>

In this example, we want Solr to copy the cat field to a field named text. Fields are copied before analysis is
done, meaning you can have two fields with identical original content, but which use different analysis
chains and are stored in the index differently.
In the example above, if the text destination field has data of its own in the input documents, the contents
of the cat field will be added as additional values – just as if all of the values had originally been specified by
the client. Remember to configure your fields as multiValued="true" if they will ultimately get multiple
values (either from a multivalued source or from multiple copyField directives).
A common usage for this functionality is to create a single "search" field that will serve as the default query
field when users or clients do not specify a field to query. For example, title, author, keywords, and body
may all be fields that should be searched by default, with a copy field rule for each of them that copies to a
catchall field (the name is arbitrary). Later you can set a rule in solrconfig.xml to search the
catchall field by default. One caveat is that your index will grow when using copy fields. Whether the growth
becomes a problem, and how large the index finally gets, depends on the number of fields being
copied, the number of destination fields being copied to, the analysis in use, and the available disk space.
The maxChars parameter, an int parameter, establishes an upper limit for the number of characters to be
copied from the source value when constructing the value added to the destination field. This limit is useful
for situations in which you want to copy some data from the source field, but also control the size of index
files.
Both the source and the destination of copyField can contain either leading or trailing asterisks, which will
match anything. For example, the following line will copy the contents of all incoming fields that match the
wildcard pattern *_t to the text field:
<copyField source="*_t" dest="text"/>

The copyField command can use a wildcard (*) character in the dest parameter only if the
source parameter contains one as well. copyField uses the matching glob from the source
field for the dest field name into which the source content is copied.
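
The glob substitution described in this note can be illustrated with a short Python sketch. This is a toy model, not Solr's implementation: the function name resolve_dest and the *_copy destination pattern are hypothetical, and only a single leading or trailing asterisk is handled, as described above.

```python
def resolve_dest(source_pattern, dest_pattern, field_name):
    """Return the destination field for field_name, or None if the source does not match.

    Toy model of the wildcard rule for copyField; not Solr's implementation.
    """
    if "*" not in source_pattern:
        if "*" in dest_pattern:
            raise ValueError("dest may contain '*' only if source does too")
        return dest_pattern if field_name == source_pattern else None

    if source_pattern.startswith("*"):
        suffix = source_pattern[1:]
        if not field_name.endswith(suffix):
            return None
        glob = field_name[: len(field_name) - len(suffix)]
    elif source_pattern.endswith("*"):
        prefix = source_pattern[:-1]
        if not field_name.startswith(prefix):
            return None
        glob = field_name[len(prefix):]
    else:
        raise ValueError("only a leading or trailing '*' is modeled here")

    # Substitute the text matched by the source asterisk into the dest pattern.
    return dest_pattern.replace("*", glob, 1)


to_text = resolve_dest("*_t", "text", "title_t")
to_copy = resolve_dest("*_t", "*_copy", "title_t")
no_match = resolve_dest("*_t", "text", "price_i")
```

Here to_text is the literal destination text, while to_copy reuses the part of the field name matched by the source asterisk.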

Copying is done at the stream source level and no copy feeds into another copy. This means that copy fields
cannot be chained; that is, you cannot copy from here to there and then from there to elsewhere. However, the
same source field can be copied to multiple destination fields:
<copyField source="here" dest="there"/>
<copyField source="here" dest="elsewhere"/>
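
The copy semantics in this section (values appended to the destination, maxChars truncation, and no chaining) can be simulated in a few lines of Python. This is only a sketch under stated assumptions: apply_copy_fields and the rule dictionaries are hypothetical names, documents are modeled as dictionaries of value lists, and wildcard rules are left out.

```python
def apply_copy_fields(doc, rules):
    """Return a new document with copy field rules applied (toy model, not Solr code)."""
    result = {name: list(values) for name, values in doc.items()}
    for rule in rules:
        # Always read from the ORIGINAL document, so one copy never feeds another.
        for value in doc.get(rule["source"], []):
            max_chars = rule.get("maxChars")
            if max_chars is not None:
                value = value[:max_chars]  # upper limit on copied characters
            result.setdefault(rule["dest"], []).append(value)
    return result


doc = {"cat": ["electronics"], "text": ["a short description"]}
rules = [
    {"source": "cat", "dest": "text"},
    {"source": "cat", "dest": "cat_prefix", "maxChars": 4},
    {"source": "cat_prefix", "dest": "elsewhere"},  # chained rule: has no effect
]
indexed = apply_copy_fields(doc, rules)
```

The text field ends up with two values, which is why a destination that receives copies generally needs multiValued="true". The third rule produces nothing because cat_prefix is not part of the incoming document.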

Dynamic Fields
Dynamic fields allow Solr to index fields that you did not explicitly define in your schema.
This is useful if you discover you have forgotten to define one or more fields. Dynamic fields can make your
application less brittle by providing some flexibility in the documents you can add to Solr.
A dynamic field is just like a regular field except it has a name with a wildcard in it. When you are indexing
documents, a field that does not match any explicitly defined fields can be matched with a dynamic field.
For example, suppose your schema includes a dynamic field with a name of *_i. If you attempt to index a
document with a cost_i field, but no explicit cost_i field is defined in the schema, then the cost_i field will
have the field type and analysis defined for *_i.
Like regular fields, dynamic fields have a name, a field type, and options.
<dynamicField name="*_i" type="int" indexed="true" stored="true"/>
It is recommended that you include basic dynamic field mappings (like that shown above) in your
schema.xml. The mappings can be very useful.
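
The lookup order described in this section (explicit fields first, then dynamic field patterns) can be sketched as follows. The function name resolve_field_type and the dictionaries are hypothetical, the type names are placeholders, and the sketch does not model how Solr chooses among several dynamic patterns that match the same name; it simply returns the first match.

```python
def resolve_field_type(field_name, explicit_fields, dynamic_fields):
    """Return the field type for field_name, or None if nothing matches (toy model)."""
    # An explicitly defined field always wins over a dynamic field rule.
    if field_name in explicit_fields:
        return explicit_fields[field_name]
    for pattern, field_type in dynamic_fields.items():
        if pattern.startswith("*") and field_name.endswith(pattern[1:]):
            return field_type
        if pattern.endswith("*") and field_name.startswith(pattern[:-1]):
            return field_type
    return None


explicit = {"id": "string"}
dynamic = {"*_i": "int"}
cost_type = resolve_field_type("cost_i", explicit, dynamic)
id_type = resolve_field_type("id", explicit, dynamic)
unknown_type = resolve_field_type("cost", explicit, dynamic)
```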

Other Schema Elements
This section describes several other important elements of schema.xml not covered in earlier sections.

Unique Key
The uniqueKey element specifies which field is a unique identifier for documents. Although uniqueKey is not
required, it is nearly always warranted by your application design. For example, uniqueKey should be used if
you will ever update a document in the index.
You can define the unique key field by naming it:
<uniqueKey>id</uniqueKey>
Schema defaults and copyFields cannot be used to populate the uniqueKey field. The fieldType of
uniqueKey must not be analyzed. You can use UUIDUpdateProcessorFactory to have uniqueKey values
generated automatically.
Further, operations that rely on the uniqueKey field will fail if that field is multiValued (or inherits
multiValued from its field type). uniqueKey continues to work as long as the field is used properly, with a
single value per document.

Similarity
Similarity is a Lucene class used to score a document in searching.
Each collection has one "global" Similarity, and by default Solr uses an implicit SchemaSimilarityFactory
which allows individual field types to be configured with a "per-type" specific Similarity and implicitly uses
BM25Similarity for any field type which does not have an explicit Similarity.
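This default behavior can be overridden by declaring a top level <similarity/> element in your schema.xml,
outside of any single field type. This similarity declaration can either refer directly to the name of a class with
a no-argument constructor, such as in this example showing BM25Similarity:
<similarity class="solr.BM25SimilarityFactory"/>
or by referencing a SimilarityFactory implementation, which may take optional initialization parameters:
<similarity class="solr.DFRSimilarityFactory">
  <str name="basicModel">P</str>
  <str name="afterEffect">L</str>
  <str name="normalization">H2</str>
  <float name="c">7</float>
</similarity>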

In most cases, specifying global level similarity like this will cause an error if your schema.xml also includes
field type specific <similarity/> declarations. One key exception to this is that you may explicitly declare a
SchemaSimilarityFactory and use defaultSimFromFieldType to name a field type whose configured similarity
becomes the default for all field types that do not declare an explicit Similarity:
<similarity class="solr.SchemaSimilarityFactory">
  <str name="defaultSimFromFieldType">text_dfr</str>
</similarity>
<fieldType name="text_dfr" class="solr.TextField">
  <analyzer ... />
  <similarity class="solr.DFRSimilarityFactory">
    <str name="basicModel">I(F)</str>
    <str name="afterEffect">B</str>
    <str name="normalization">H3</str>
    <float name="mu">900</float>
  </similarity>
</fieldType>
<fieldType name="text_ib" class="solr.TextField">
  <analyzer ... />
  <similarity class="solr.IBSimilarityFactory">
    <str name="distribution">SPL</str>
    <str name="lambda">DF</str>
    <str name="normalization">H2</str>
  </similarity>
</fieldType>

In the example above IBSimilarityFactory (using the Information-Based model) will be used for any fields
of type text_ib, while DFRSimilarityFactory (divergence from random) will be used for any fields of type
text_dfr, as well as any fields using a type that does not explicitly specify a <similarity/>.
If SchemaSimilarityFactory is explicitly declared without configuring a defaultSimFromFieldType, then
BM25Similarity is implicitly used as the default.
In addition to the various factories mentioned on this page, there are several other similarity
implementations that can be used such as the SweetSpotSimilarityFactory, ClassicSimilarityFactory,
etc. For details, see the Solr Javadocs for the similarity factories.
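
To give a feel for what the implicit default, BM25Similarity, computes, the following Python sketch evaluates the textbook BM25 score for a single term. It is an approximation for illustration only: the function name is hypothetical, k1=1.2 and b=0.75 are the commonly cited defaults, and details of the Lucene implementation such as the lossy encoding of field lengths are ignored.

```python
import math


def bm25_term_score(freq, doc_len, avg_doc_len, doc_count, doc_freq, k1=1.2, b=0.75):
    """Textbook BM25 score of one term in one document (illustrative only)."""
    # Rare terms (low doc_freq relative to doc_count) get a higher weight.
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: documents longer than average are penalized when b > 0.
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return idf * freq * (k1 + 1) / (freq + k1 * length_norm)


short_doc = bm25_term_score(freq=2, doc_len=50, avg_doc_len=100, doc_count=1000, doc_freq=10)
long_doc = bm25_term_score(freq=2, doc_len=400, avg_doc_len=100, doc_count=1000, doc_freq=10)
```

With the same term frequency, the shorter document scores higher. Length normalization of this kind depends on norms, which is one reason omitNorms matters for full-text fields.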

Schema API
The Schema API allows you to use an HTTP API to manage many of the elements of your schema.
The Schema API utilizes the ManagedIndexSchemaFactory class, which is the default schema factory in
modern Solr versions. See the section Schema Factory Definition in SolrConfig for more information about
choosing a schema factory for your index.
This API provides read and write access to the Solr schema for each collection (or core, when using
standalone Solr). Read access to all schema elements is supported. Fields, dynamic fields, field types and
copyField rules may be added, removed or replaced. Future Solr releases will extend write access to allow
more schema elements to be modified.
Why is hand editing of the managed schema discouraged?



The file named "managed-schema" in the example configurations may include a note that
recommends never hand-editing the file. Before the Schema API existed, such edits were
the only way to make changes to the schema, and users may have a strong desire to
continue making changes this way.
This is discouraged because hand-edits of the schema may be lost if the
Schema API described here is later used to make a change, unless the core or collection is
reloaded or Solr is restarted before using the Schema API. If care is taken to always reload
or restart after a manual edit, then there is no problem at all with doing those edits.
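
As a sketch of the reload step mentioned above, the helper below builds the URL for a RELOAD request, using the CoreAdmin API for a standalone core or the Collections API for a SolrCloud collection. The function name reload_url and the base URL are assumptions chosen to match the other examples in this guide; the sketch only builds the URL and does not send the request.

```python
from urllib.parse import urlencode


def reload_url(name, cloud=False, base_url="http://localhost:8983/solr"):
    """Build the URL that reloads a core (standalone) or a collection (SolrCloud)."""
    if cloud:
        # Collections API: the collection is identified by the "name" parameter.
        query = urlencode({"action": "RELOAD", "name": name})
        return f"{base_url}/admin/collections?{query}"
    # CoreAdmin API: the core is identified by the "core" parameter.
    query = urlencode({"action": "RELOAD", "core": name})
    return f"{base_url}/admin/cores?{query}"


core_reload = reload_url("gettingstarted")
collection_reload = reload_url("gettingstarted", cloud=True)
```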

The API allows two output modes for all calls: JSON or XML. When requesting the complete schema, a third
output mode, schema.xml, returns XML modeled after the managed-schema file itself.
When modifying the schema with the API, a core reload will automatically occur in order for the changes to
be available immediately for documents indexed thereafter. Previously indexed documents will not be
automatically updated - they must be re-indexed if existing index data uses schema elements that you
changed.
Re-index after schema modifications!



If you modify your schema, you will likely need to re-index all documents. If you do not, you
may lose access to documents, or not be able to interpret them properly, e.g., after
replacing a field type.
Modifying your schema will never modify any documents that are already indexed. You
must re-index documents in order to apply schema changes to them. Queries and updates
made after the change may encounter errors that were not present before the change.
Completely deleting the index and rebuilding it is usually the only option to fix such errors.

Modify the Schema
To add, remove or replace fields, dynamic field rules or field types, or to add or remove copy field rules, you
can send a POST request to the /collection/schema/ endpoint with a sequence of commands to perform the
requested actions. The following commands are supported:

• add-field: add a new field with parameters you provide.
• delete-field: delete a field.
• replace-field: replace an existing field with one that is differently configured.
• add-dynamic-field: add a new dynamic field rule with parameters you provide.
• delete-dynamic-field: delete a dynamic field rule.
• replace-dynamic-field: replace an existing dynamic field rule with one that is differently configured.
• add-field-type: add a new field type with parameters you provide.
• delete-field-type: delete a field type.
• replace-field-type: replace an existing field type with one that is differently configured.
• add-copy-field: add a new copy field rule.
• delete-copy-field: delete a copy field rule.
These commands can be issued in separate POST requests or in the same POST request. Commands are
executed in the order in which they are specified.
In each case, the response will include the status and the time to process the request, but will not include
the entire schema.
When modifying the schema with the API, a core reload will automatically occur in order for the changes to
be available immediately for documents indexed thereafter. Previously indexed documents will not be
automatically handled - they must be re-indexed if they used schema elements that you changed.

Add a New Field
The add-field command adds a new field definition to your schema. If a field with the same name already
exists, an error is thrown.
All of the properties available when defining a field with manual schema.xml edits can be passed via the API.
These request attributes are described in detail in the section Defining Fields.
For example, to define a new stored field named "sell_by", of type "pdate", you would POST the following
request:
V1 API
curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-field":{
"name":"sell_by",
"type":"pdate",
"stored":true }
}' http://localhost:8983/solr/gettingstarted/schema

V2 API
curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-field":{
"name":"sell_by",
"type":"pdate",
"stored":true }
}' http://localhost:8983/api/cores/gettingstarted/schema
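
The two curl requests above differ only in the endpoint. For clients written in Python, the same request could be built with the standard library roughly as follows. This is a sketch, not an official client: schema_request and send are hypothetical helper names, and the host and collection defaults mirror the examples in this guide.

```python
import json
from urllib.request import Request, urlopen


def schema_request(commands, collection="gettingstarted",
                   host="http://localhost:8983", v2=False):
    """Build (but do not send) a POST request carrying Schema API commands."""
    path = f"/api/cores/{collection}/schema" if v2 else f"/solr/{collection}/schema"
    body = json.dumps(commands).encode("utf-8")
    return Request(host + path, data=body, method="POST",
                   headers={"Content-type": "application/json"})


def send(request):
    """Send the request to a running Solr and return the decoded JSON response."""
    with urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


add_sell_by = {"add-field": {"name": "sell_by", "type": "pdate", "stored": True}}
v1_request = schema_request(add_sell_by)
v2_request = schema_request(add_sell_by, v2=True)
```

Calling send(v1_request) would perform the change against a running Solr instance; it is kept separate so that the request can be inspected or logged first.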

Delete a Field
The delete-field command removes a field definition from your schema. If the field does not exist in the
schema, or if the field is the source or destination of a copy field rule, an error is thrown.
For example, to delete a field named "sell_by", you would POST the following request:
V1 API
curl -X POST -H 'Content-type:application/json' --data-binary '{
"delete-field" : { "name":"sell_by" }
}' http://localhost:8983/solr/gettingstarted/schema

V2 API
curl -X POST -H 'Content-type:application/json' --data-binary '{
"delete-field" : { "name":"sell_by" }
}' http://localhost:8983/api/cores/gettingstarted/schema
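
Because delete-field fails while a copy field rule still refers to the field, a client may want to remove those rules in the same request, ahead of the field itself. The sketch below builds such a command body. The helper name delete_field_commands is hypothetical, the rules are assumed to be in the source/dest shape returned under copyFields when retrieving the schema, passing an array to delete-copy-field is assumed to work the same way as the array form shown for add-field, and rules whose source or dest is a wildcard are not considered.

```python
import json


def delete_field_commands(field, copy_rules):
    """Build a command body that deletes copy field rules touching `field`, then the field."""
    commands = {}
    touching = [
        {"source": rule["source"], "dest": rule["dest"]}
        for rule in copy_rules
        if field in (rule["source"], rule["dest"])
    ]
    if touching:
        # Inserted first: commands are executed in the order they are specified.
        commands["delete-copy-field"] = touching
    commands["delete-field"] = {"name": field}
    return commands


rules = [
    {"source": "shelf", "dest": "location"},
    {"source": "author", "dest": "text"},
]
body = json.dumps(delete_field_commands("shelf", rules))
```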

Replace a Field
The replace-field command replaces a field’s definition. Note that you must supply the full definition for a
field - this command will not partially modify a field’s definition. If the field does not exist in the schema, an
error is thrown.
All of the properties available when defining a field with manual schema.xml edits can be passed via the API.
These request attributes are described in detail in the section Defining Fields.
For example, to replace the definition of an existing field "sell_by", to make it be of type "date" and to not be
stored, you would POST the following request:

V1 API
curl -X POST -H 'Content-type:application/json' --data-binary '{
"replace-field":{
"name":"sell_by",
"type":"date",
"stored":false }
}' http://localhost:8983/solr/gettingstarted/schema

V2 API
curl -X POST -H 'Content-type:application/json' --data-binary '{
"replace-field":{
"name":"sell_by",
"type":"date",
"stored":false }
}' http://localhost:8983/api/cores/gettingstarted/schema
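
Since replace-field does not merge with the existing definition, a client that only wants to change one or two properties can do the merge itself before sending the command. A minimal sketch, assuming the current definition was read beforehand (for example from the List Fields endpoint, which by default returns only explicitly specified properties); build_replace_field is a hypothetical helper name.

```python
def build_replace_field(current, changes):
    """Merge changes into the current definition and wrap it in a replace-field command."""
    definition = dict(current)   # copy, so the caller's dictionary is not modified
    definition.update(changes)   # changed properties override the current ones
    return {"replace-field": definition}


current = {"name": "sell_by", "type": "pdate", "stored": True, "indexed": True}
command = build_replace_field(current, {"type": "date", "stored": False})
```

The resulting command carries the full definition, including the unchanged indexed property, which would otherwise be dropped by the replacement.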

Add a Dynamic Field Rule
The add-dynamic-field command adds a new dynamic field rule to your schema.
All of the properties available when editing schema.xml can be passed with the POST request. The section
Dynamic Fields has details on all of the attributes that can be defined for a dynamic field rule.
For example, to create a new dynamic field rule where all incoming fields ending with "_s" would be stored
and have field type "string", you can POST a request like this:
V1 API
curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-dynamic-field":{
"name":"*_s",
"type":"string",
"stored":true }
}' http://localhost:8983/solr/gettingstarted/schema

V2 API
curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-dynamic-field":{
"name":"*_s",
"type":"string",
"stored":true }
}' http://localhost:8983/api/cores/gettingstarted/schema

Delete a Dynamic Field Rule
The delete-dynamic-field command deletes a dynamic field rule from your schema. If the dynamic field
rule does not exist in the schema, or if the schema contains a copy field rule with a source or destination that
matches only this dynamic field rule, an error is thrown.
For example, to delete a dynamic field rule matching "*_s", you can POST a request like this:
V1 API
curl -X POST -H 'Content-type:application/json' --data-binary '{
"delete-dynamic-field":{ "name":"*_s" }
}' http://localhost:8983/solr/gettingstarted/schema

V2 API
curl -X POST -H 'Content-type:application/json' --data-binary '{
"delete-dynamic-field":{ "name":"*_s" }
}' http://localhost:8983/api/cores/gettingstarted/schema

Replace a Dynamic Field Rule
The replace-dynamic-field command replaces a dynamic field rule in your schema. Note that you must
supply the full definition for a dynamic field rule - this command will not partially modify a dynamic field
rule’s definition. If the dynamic field rule does not exist in the schema, an error is thrown.
All of the properties available when editing schema.xml can be passed with the POST request. The section
Dynamic Fields has details on all of the attributes that can be defined for a dynamic field rule.
For example, to replace the definition of the "*_s" dynamic field rule with one where the field type is
"text_general" and it’s not stored, you can POST a request like this:

V1 API
curl -X POST -H 'Content-type:application/json' --data-binary '{
"replace-dynamic-field":{
"name":"*_s",
"type":"text_general",
"stored":false }
}' http://localhost:8983/solr/gettingstarted/schema

V2 API
curl -X POST -H 'Content-type:application/json' --data-binary '{
"replace-dynamic-field":{
"name":"*_s",
"type":"text_general",
"stored":false }
}' http://localhost:8983/api/cores/gettingstarted/schema

Add a New Field Type
The add-field-type command adds a new field type to your schema.
All of the field type properties available when editing schema.xml by hand are available for use in a POST
request. The structure of the command is a JSON mapping of the standard field type definition, including the
name, class, index and query analyzer definitions, etc. Details of all of the available options are described in
the section Solr Field Types.
For example, to create a new field type named "myNewTxtField", you can POST a request as follows:

V1 API with Single Analysis
curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-field-type" : {
"name":"myNewTxtField",
"class":"solr.TextField",
"positionIncrementGap":"100",
"analyzer" : {
"charFilters":[{
"class":"solr.PatternReplaceCharFilterFactory",
"replacement":"$1$1",
"pattern":"([a-zA-Z])\\\\1+" }],
"tokenizer":{
"class":"solr.WhitespaceTokenizerFactory" },
"filters":[{
"class":"solr.WordDelimiterFilterFactory",
"preserveOriginal":"0" }]}}
}' http://localhost:8983/solr/gettingstarted/schema
Note in this example that we have only defined a single analyzer section that will apply to index analysis
and query analysis.
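
Escaping backslashes by hand inside a shell-quoted JSON body is error-prone. One way to avoid counting backslashes is to let a JSON library produce the body, as in the Python sketch below. The intended regular expression is assumed here to be ([a-zA-Z])\1+ with the replacement $1$1, that is, collapsing a run of the same letter down to two; that reading of the example is an interpretation, not something the example states.

```python
import json
import re

# Regex with a backreference to group 1, written once, unescaped.
pattern = r"([a-zA-Z])\1+"

command = {
    "add-field-type": {
        "name": "myNewTxtField",
        "class": "solr.TextField",
        "positionIncrementGap": "100",
        "analyzer": {
            "charFilters": [{
                "class": "solr.PatternReplaceCharFilterFactory",
                "replacement": "$1$1",
                "pattern": pattern,
            }],
            "tokenizer": {"class": "solr.WhitespaceTokenizerFactory"},
            "filters": [{
                "class": "solr.WordDelimiterFilterFactory",
                "preserveOriginal": "0",
            }],
        },
    }
}

# json.dumps applies the JSON escaping: each backslash becomes two in the body.
body = json.dumps(command)

# Local check of what the pattern does, using Python's equivalent replacement syntax.
collapsed = re.sub(pattern, r"\1\1", "helllo wooorld")
```

The body can then be sent with any HTTP client, for instance with curl --data-binary @file after writing it to a file.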

V1 API with Two Analyzers
If we wanted to define separate analysis, we would replace the analyzer section in the above example
with separate sections for indexAnalyzer and queryAnalyzer, as in this example:
curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-field-type":{
"name":"myNewTextField",
"class":"solr.TextField",
"indexAnalyzer":{
"tokenizer":{
"class":"solr.PathHierarchyTokenizerFactory",
"delimiter":"/" }},
"queryAnalyzer":{
"tokenizer":{
"class":"solr.KeywordTokenizerFactory" }}}
}' http://localhost:8983/solr/gettingstarted/schema

V2 API with Two Analyzers
To define two analyzers with the V2 API, we just use a different endpoint:
curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-field-type":{
"name":"myNewTextField",
"class":"solr.TextField",
"indexAnalyzer":{
"tokenizer":{
"class":"solr.PathHierarchyTokenizerFactory",
"delimiter":"/" }},
"queryAnalyzer":{
"tokenizer":{
"class":"solr.KeywordTokenizerFactory" }}}
}' http://localhost:8983/api/cores/gettingstarted/schema

Delete a Field Type
The delete-field-type command removes a field type from your schema. If the field type does not exist in
the schema, or if any field or dynamic field rule in the schema uses the field type, an error is thrown.
For example, to delete the field type named "myNewTxtField", you can make a POST request as follows:
V1 API
curl -X POST -H 'Content-type:application/json' --data-binary '{
"delete-field-type":{ "name":"myNewTxtField" }
}' http://localhost:8983/solr/gettingstarted/schema

V2 API
curl -X POST -H 'Content-type:application/json' --data-binary '{
"delete-field-type":{ "name":"myNewTxtField" }
}' http://localhost:8983/api/cores/gettingstarted/schema

Replace a Field Type
The replace-field-type command replaces a field type in your schema. Note that you must supply the full
definition for a field type - this command will not partially modify a field type’s definition. If the field type
does not exist in the schema, an error is thrown.
All of the field type properties available when editing schema.xml by hand are available for use in a POST
request. The structure of the command is a JSON mapping of the standard field type definition, including the
name, class, index and query analyzer definitions, etc. Details of all of the available options are described in
the section Solr Field Types.
For example, to replace the definition of a field type named "myNewTxtField", you can make a POST request
as follows:
V1 API
curl -X POST -H 'Content-type:application/json' --data-binary '{
"replace-field-type":{
"name":"myNewTxtField",
"class":"solr.TextField",
"positionIncrementGap":"100",
"analyzer":{
"tokenizer":{
"class":"solr.StandardTokenizerFactory" }}}
}' http://localhost:8983/solr/gettingstarted/schema

V2 API
curl -X POST -H 'Content-type:application/json' --data-binary '{
"replace-field-type":{
"name":"myNewTxtField",
"class":"solr.TextField",
"positionIncrementGap":"100",
"analyzer":{
"tokenizer":{
"class":"solr.StandardTokenizerFactory" }}}
}' http://localhost:8983/api/cores/gettingstarted/schema

Add a New Copy Field Rule
The add-copy-field command adds a new copy field rule to your schema.
The attributes supported by the command are the same as when creating copy field rules by manually
editing the schema.xml, as below:

source
The source field. This parameter is required.

dest
A field or an array of fields to which the source field will be copied. This parameter is required.

maxChars
The upper limit for the number of characters to be copied. The section Copying Fields has more details.
For example, to define a rule to copy the field "shelf" to the "location" and "catchall" fields, you would POST
the following request:
V1 API
curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-copy-field":{
"source":"shelf",
"dest":[ "location", "catchall" ]}
}' http://localhost:8983/solr/gettingstarted/schema

V2 API
curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-copy-field":{
"source":"shelf",
"dest":[ "location", "catchall" ]}
}' http://localhost:8983/api/cores/gettingstarted/schema

Delete a Copy Field Rule
The delete-copy-field command deletes a copy field rule from your schema. If the copy field rule does not
exist in the schema, an error is thrown.
The source and dest attributes are required by this command.
For example, to delete a rule to copy the field "shelf" to the "location" field, you would POST the following
request:
V1 API
curl -X POST -H 'Content-type:application/json' --data-binary '{
"delete-copy-field":{ "source":"shelf", "dest":"location" }
}' http://localhost:8983/solr/gettingstarted/schema

V2 API
curl -X POST -H 'Content-type:application/json' --data-binary '{
"delete-copy-field":{ "source":"shelf", "dest":"location" }
}' http://localhost:8983/api/cores/gettingstarted/schema

Multiple Commands in a Single POST
It is possible to perform more than one command in a single POST request. The API is transactional and all
commands in a single call either succeed or fail together.
The commands are executed in the order in which they are specified. This means that if you want to create a
new field type and in the same request use the field type on a new field, the section of the request that
creates the field type must come before the section that creates the new field. Similarly, since a field must
exist for it to be used in a copy field rule, a request to add a field must come before a request for the field to
be used as either the source or the destination for a copy field rule.
The syntax for making multiple requests supports several approaches. First, the commands can simply be
made serially, as in this request to create a new field type and then a field that uses that type:
curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-field-type":{
"name":"myNewTxtField",
"class":"solr.TextField",
"positionIncrementGap":"100",
"analyzer":{
"charFilters":[{
"class":"solr.PatternReplaceCharFilterFactory",
"replacement":"$1$1",
"pattern":"([a-zA-Z])\\\\1+" }],
"tokenizer":{
"class":"solr.WhitespaceTokenizerFactory" },
"filters":[{
"class":"solr.WordDelimiterFilterFactory",
"preserveOriginal":"0" }]}},
"add-field" : {
"name":"sell_by",
"type":"myNewTxtField",
"stored":true }
}' http://localhost:8983/solr/gettingstarted/schema
Or, the same command can be repeated, as in this example:

curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-field":{
"name":"shelf",
"type":"myNewTxtField",
"stored":true },
"add-field":{
"name":"location",
"type":"myNewTxtField",
"stored":true },
"add-copy-field":{
"source":"shelf",
"dest":[ "location", "catchall" ]}
}' http://localhost:8983/solr/gettingstarted/schema
Finally, repeated commands can be sent as an array:
curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-field":[
{ "name":"shelf",
"type":"myNewTxtField",
"stored":true },
{ "name":"location",
"type":"myNewTxtField",
"stored":true }]
}' http://localhost:8983/solr/gettingstarted/schema
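
When the request body is generated by a program, the array form is usually the practical choice, because many JSON libraries cannot emit the same key twice. The Python sketch below illustrates this: a dictionary holds each command name once, so repeated add-field commands are passed as a list, and the insertion order of the dictionary provides the required ordering (field type before the fields that use it, fields before the copy field rule). The variable names are illustrative only.

```python
import json

commands = {
    # 1. The field type must exist before any field that uses it.
    "add-field-type": {
        "name": "myNewTxtField",
        "class": "solr.TextField",
        "analyzer": {"tokenizer": {"class": "solr.WhitespaceTokenizerFactory"}},
    },
    # 2. Repeated commands go into one array under a single key.
    "add-field": [
        {"name": "shelf", "type": "myNewTxtField", "stored": True},
        {"name": "location", "type": "myNewTxtField", "stored": True},
    ],
    # 3. Both fields must exist before the copy field rule refers to them.
    "add-copy-field": {"source": "shelf", "dest": ["location"]},
}
body = json.dumps(commands)

# Parsing a body with a repeated key in Python keeps only the last occurrence.
parsed = json.loads('{"add-field": {"name": "shelf"}, "add-field": {"name": "location"}}')
```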

Schema Changes among Replicas
When running in SolrCloud mode, changes made to the schema on one node will propagate to all replicas in
the collection.
You can pass the updateTimeoutSecs parameter with your request to set the number of seconds to wait
until all replicas confirm they applied the schema updates. This helps your client application be more robust
in that you can be sure that all replicas have a given schema change within a defined amount of time.
If agreement is not reached by all replicas in the specified time, then the request fails and the error message
will include information about which replicas had trouble. In most cases, the only option is to re-try the
change after waiting a brief amount of time. If the problem persists, then you’ll likely need to investigate the
server logs on the replicas that had trouble applying the changes.
If you do not supply an updateTimeoutSecs parameter, the default behavior is for the receiving node to
return immediately after persisting the updates to ZooKeeper. All other replicas will apply the updates
asynchronously. Consequently, without supplying a timeout, your client application cannot be sure that all
replicas have applied the changes.
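
As an illustration, the updateTimeoutSecs parameter can be appended to the request URL as an ordinary query parameter. The Python sketch below only builds the request; the helper name schema_update_request, the 30 second value, and the host and collection are assumptions made for the example.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def schema_update_request(commands, timeout_secs=None,
                          url="http://localhost:8983/solr/gettingstarted/schema"):
    """Build a Schema API POST, optionally waiting for all replicas to apply the change."""
    if timeout_secs is not None:
        url += "?" + urlencode({"updateTimeoutSecs": timeout_secs})
    return Request(url, data=json.dumps(commands).encode("utf-8"), method="POST",
                   headers={"Content-type": "application/json"})


add_field = {"add-field": {"name": "sell_by", "type": "pdate", "stored": True}}
waiting = schema_update_request(add_field, timeout_secs=30)
fire_and_forget = schema_update_request(add_field)
```

A client using the first form should be prepared for the request to fail when the timeout expires, and retry or inspect the replica logs as described above.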

Retrieve Schema Information
The following endpoints allow you to read how your schema has been defined. You can GET the entire
schema, or only portions of it as needed.

To modify the schema, see the previous section Modify the Schema.

Retrieve the Entire Schema
GET /collection/schema
Retrieve Schema Parameters
Path Parameters

collection
The collection (or core) name.
Query Parameters
The query parameters should be added to the API request after '?'.

wt
Defines the format of the response. The options are json, xml or schema.xml. If not specified, JSON will
be returned by default.
Retrieve Schema Response
Output Content
The output will include all fields, field types, dynamic rules and copy field rules, in the format requested
(JSON or XML). The schema name and version are also included.
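
Once parsed, the JSON form of this output is an ordinary nested structure. As a sketch of how a client might use it, the Python below groups copy field destinations by source. The sample string is a shortened, hand-written response with the same layout as the JSON example in this section, and copy_destinations is a hypothetical helper name.

```python
import json
from collections import defaultdict

# Shortened sample response, trimmed to the parts used below.
sample = """
{
  "responseHeader": {"status": 0, "QTime": 5},
  "schema": {
    "name": "example",
    "version": 1.5,
    "uniqueKey": "id",
    "copyFields": [
      {"source": "author", "dest": "text"},
      {"source": "cat", "dest": "text"},
      {"source": "author", "dest": "author_s"}
    ]
  }
}
"""


def copy_destinations(schema):
    """Map each copy field source to the list of its destinations."""
    destinations = defaultdict(list)
    for rule in schema.get("copyFields", []):
        destinations[rule["source"]].append(rule["dest"])
    return dict(destinations)


schema = json.loads(sample)["schema"]
by_source = copy_destinations(schema)
```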
Retrieve Schema Examples
Get the entire schema in JSON.
curl http://localhost:8983/solr/gettingstarted/schema

{
"responseHeader":{
"status":0,
"QTime":5},
"schema":{
"name":"example",
"version":1.5,
"uniqueKey":"id",
"fieldTypes":[{
"name":"alphaOnlySort",
"class":"solr.TextField",
"sortMissingLast":true,
"omitNorms":true,
"analyzer":{
"tokenizer":{
"class":"solr.KeywordTokenizerFactory"},
"filters":[{

"class":"solr.LowerCaseFilterFactory"},

{
"class":"solr.TrimFilterFactory"},
{
"class":"solr.PatternReplaceFilterFactory",
"replace":"all",
"replacement":"",
"pattern":"([^a-z])"}]}}],
"fields":[{
"name":"_version_",
"type":"long",
"indexed":true,
"stored":true},
{
"name":"author",
"type":"text_general",
"indexed":true,
"stored":true},
{
"name":"cat",
"type":"string",
"multiValued":true,
"indexed":true,
"stored":true}],
"copyFields":[{
"source":"author",
"dest":"text"},
{
"source":"cat",
"dest":"text"},
{
"source":"content",
"dest":"text"},
{
"source":"author",
"dest":"author_s"}]}}
Get the entire schema in XML.
curl http://localhost:8983/solr/gettingstarted/schema?wt=xml

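<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">5</int>
  </lst>
  <lst name="schema">
    <str name="name">example</str>
    <float name="version">1.5</float>
    <str name="uniqueKey">id</str>
    <arr name="fieldTypes">
      <lst>
        <str name="name">alphaOnlySort</str>
        <str name="class">solr.TextField</str>
        <bool name="sortMissingLast">true</bool>
        <bool name="omitNorms">true</bool>
        <lst name="analyzer">
          <lst name="tokenizer">
            <str name="class">solr.KeywordTokenizerFactory</str>
          </lst>
          <arr name="filters">
            <lst>
              <str name="class">solr.LowerCaseFilterFactory</str>
            </lst>
            <lst>
              <str name="class">solr.TrimFilterFactory</str>
            </lst>
            <lst>
              <str name="class">solr.PatternReplaceFilterFactory</str>
              <str name="replace">all</str>
              <str name="replacement"/>
              <str name="pattern">([^a-z])</str>
            </lst>
          </arr>
        </lst>
      </lst>
...
      <lst>
        <str name="source">author</str>
        <str name="dest">author_s</str>
      </lst>
    </arr>
  </lst>
</response>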

Get the entire schema in "schema.xml" format.
curl http://localhost:8983/solr/gettingstarted/schema?wt=schema.xml


<schema name="example" version="1.5">
  <uniqueKey>id</uniqueKey>
...
</schema>





List Fields
GET /collection/schema/fields
GET /collection/schema/fields/fieldname
List Fields Parameters
Path Parameters

collection
The collection (or core) name.

fieldname
The specific fieldname (if limiting the request to a single field).
Query Parameters
The query parameters can be added to the API request after a '?'.

wt
Defines the format of the response. The options are json or xml. If not specified, JSON will be returned by
default.

fl
Comma- or space-separated list of one or more fields to return. If not specified, all fields will be returned
by default.

includeDynamic
If true, and if the fl query parameter is specified or the fieldname path parameter is used, matching
dynamic fields are included in the response and identified with the dynamicBase property.
If neither the fl query parameter nor the fieldname path parameter is specified, the includeDynamic
query parameter is ignored.
If false, the default, matching dynamic fields will not be returned.

showDefaults
If true, all default field properties from each field’s field type will be included in the response (e.g.,
tokenized for solr.TextField). If false, the default, only explicitly specified field properties will be
included.
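
As a rough illustration of how these path and query parameters combine, the following Python sketch assembles a List Fields request URL. The helper name list_fields_url and its base argument are hypothetical; only the endpoint path and the fl, includeDynamic, and showDefaults parameters come from the description above. The sketch builds the URL and sends nothing.

```python
from urllib.parse import urlencode


def list_fields_url(collection, fieldname=None,
                    base="http://localhost:8983/solr", **params):
    """Build a GET URL for /collection/schema/fields[/fieldname].

    Hypothetical helper: it only assembles the URL and sends nothing.
    """
    url = f"{base}/{collection}/schema/fields"
    if fieldname is not None:
        url += f"/{fieldname}"
    # Render Python booleans as the lowercase true/false used in this guide.
    query = {k: str(v).lower() if isinstance(v, bool) else v
             for k, v in params.items()}
    return f"{url}?{urlencode(query)}" if query else url


# All fields, default JSON response.
print(list_fields_url("gettingstarted"))
# Two named fields, plus matching dynamic fields, with type defaults shown.
print(list_fields_url("gettingstarted", fl="author,cat",
                      includeDynamic=True, showDefaults=True))
```

Note that urlencode percent-encodes the comma in the fl value; a space-separated list would be encoded in the same way.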
List Fields Response
The output will include each field and any defined configuration for each field. The defined configuration can
vary for each field, but will minimally include the field name, the type, whether it is indexed, and whether it
is stored. If multiValued is defined as either true or false (most likely true), that will also be shown. See the
section Defining Fields for more information about each parameter.
List Fields Examples
Get a list of all fields.
curl http://localhost:8983/solr/gettingstarted/schema/fields
The sample output below has been truncated to only show a few fields.


{
"fields": [
{
"indexed": true,
"name": "_version_",
"stored": true,
"type": "long"
},
{
"indexed": true,
"name": "author",
"stored": true,
"type": "text_general"
},
{
"indexed": true,
"multiValued": true,
"name": "cat",
"stored": true,
"type": "string"
},
"..."
],
"responseHeader": {
"QTime": 1,
"status": 0
}
}

List Dynamic Fields
GET /collection/schema/dynamicfields
GET /collection/schema/dynamicfields/name
List Dynamic Field Parameters
Path Parameters

collection
The collection (or core) name.

name
The name of the dynamic field rule (if limiting request to a single dynamic field rule).
Query Parameters
The query parameters can be added to the API request after a '?'.

wt
Defines the format of the response. The options are json or xml. If not specified, JSON will be returned by
default.

showDefaults
If true, all default field properties from each dynamic field’s field type will be included in the response
(e.g., tokenized for solr.TextField). If false, the default, only explicitly specified field properties will be
included.
List Dynamic Field Response
The output will include each dynamic field rule and the defined configuration for each rule. The defined
configuration can vary for each rule, but will minimally include the dynamic field name, the type, whether it is
indexed, and whether it is stored. See the section Dynamic Fields for more information about each parameter.
List Dynamic Field Examples
Get a list of all dynamic field declarations:
curl http://localhost:8983/solr/gettingstarted/schema/dynamicfields
The sample output below has been truncated.


{
"dynamicFields": [
{
"indexed": true,
"name": "*_coordinate",
"stored": false,
"type": "tdouble"
},
{
"multiValued": true,
"name": "ignored_*",
"type": "ignored"
},
{
"name": "random_*",
"type": "random"
},
{
"indexed": true,
"multiValued": true,
"name": "attr_*",
"stored": true,
"type": "text_general"
},
{
"indexed": true,
"multiValued": true,
"name": "*_txt",
"stored": true,
"type": "text_general"
},
"..."
],
"responseHeader": {
"QTime": 1,
"status": 0
}
}
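
To make the role of these rules concrete, here is a small Python sketch that matches a field name against the dynamic field patterns from the sample output above. The function name match_dynamic_field is hypothetical, and the tie-breaking rule (longer patterns are tried first) is an assumption made for this sketch rather than a statement of Solr's exact behavior.

```python
from fnmatch import fnmatchcase

# Dynamic field rules taken from the sample output above (pattern and type only).
DYNAMIC_FIELDS = [
    {"name": "*_coordinate", "type": "tdouble"},
    {"name": "ignored_*", "type": "ignored"},
    {"name": "random_*", "type": "random"},
    {"name": "attr_*", "type": "text_general"},
    {"name": "*_txt", "type": "text_general"},
]


def match_dynamic_field(field_name, rules=DYNAMIC_FIELDS):
    """Return the first rule whose glob pattern matches, or None.

    Assumption: longer patterns are tried first, so the more specific
    rule wins when several patterns match the same field name.
    """
    for rule in sorted(rules, key=lambda r: len(r["name"]), reverse=True):
        if fnmatchcase(field_name, rule["name"]):
            return rule
    return None


print(match_dynamic_field("home_coordinate"))  # matched by *_coordinate
print(match_dynamic_field("title"))            # no dynamic rule applies
```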

List Field Types
GET /collection/schema/fieldtypes
GET /collection/schema/fieldtypes/name
List Field Type Parameters
Path Parameters

collection
The collection (or core) name.


name
The name of the field type (if limiting request to a single field type).
Query Parameters
The query parameters can be added to the API request after a '?'.

wt
Defines the format of the response. The options are json or xml. If not specified, JSON will be returned by
default.

showDefaults
If true, all default field properties from each field type will be included in the response
(e.g., tokenized for solr.TextField). If false, the default, only explicitly specified field properties will be
included.
List Field Type Response
The output will include each field type and any defined configuration for the type. The defined configuration
can vary for each type, but will minimally include the field type name and the class. If query or index
analyzers, tokenizers, or filters are defined, those will also be shown with other defined parameters. See the
section Solr Field Types for more information about how to configure various types of fields.
List Field Type Examples
Get a list of all field types.
curl http://localhost:8983/solr/gettingstarted/schema/fieldtypes
The sample output below has been truncated to show a few different field types from different parts of the
list.


{
"fieldTypes": [
{
"analyzer": {
"class": "solr.TokenizerChain",
"filters": [
{
"class": "solr.LowerCaseFilterFactory"
},
{
"class": "solr.TrimFilterFactory"
},
{
"class": "solr.PatternReplaceFilterFactory",
"pattern": "([^a-z])",
"replace": "all",
"replacement": ""
}
],
"tokenizer": {
"class": "solr.KeywordTokenizerFactory"
}
},
"class": "solr.TextField",
"dynamicFields": [],
"fields": [],
"name": "alphaOnlySort",
"omitNorms": true,
"sortMissingLast": true
},
{
"class": "solr.FloatPointField",
"dynamicFields": [
"*_fs",
"*_f"
],
"fields": [
"price",
"weight"
],
"name": "float",
"positionIncrementGap": "0"
}]
}
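
The alphaOnlySort type in this output chains one tokenizer and three filters. The following Python sketch approximates what that chain does to a single field value. It is an illustration only: the function name is hypothetical, and plain string operations stand in for the Solr factory classes named in the comments.

```python
import re


def alpha_only_sort(value):
    """Approximate the alphaOnlySort analyzer chain on one field value."""
    token = value                           # KeywordTokenizerFactory: whole input is one token
    token = token.lower()                   # LowerCaseFilterFactory
    token = token.strip()                   # TrimFilterFactory
    token = re.sub(r"([^a-z])", "", token)  # PatternReplaceFilterFactory, replace="all"
    return token


print(alpha_only_sort("  The Quick-Brown Fox, 2nd Ed.  "))
```

Because everything except the letters a to z is removed, values that differ only in case, spacing, digits, or punctuation produce the same sort key.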

List Copy Fields
GET /collection/schema/copyfields


List Copy Field Parameters
Path Parameters

collection
The collection (or core) name.
Query Parameters
The query parameters can be added to the API request after a '?'.

wt
Defines the format of the response. The options are json or xml. If not specified, JSON will be returned by
default.

source.fl
Comma- or space-separated list of one or more copyField source fields to include in the response.
copyField directives with all other source fields will be excluded from the response. If not specified, all
copyField-s will be included in the response.

dest.fl
Comma- or space-separated list of one or more copyField destination fields to include in the response.
copyField directives with all other dest fields will be excluded. If not specified, all copyField-s will be
included in the response.
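
The effect of source.fl and dest.fl can be sketched in Python as a filter over a list of copyField directives. The helper names below are hypothetical, the sample directives are of the kind shown in this section, and the behavior when both parameters are supplied (a directive must satisfy both) is an assumption of this sketch.

```python
import re

# Sample copyField directives of the kind shown in this section.
COPY_FIELDS = [
    {"source": "author", "dest": "text"},
    {"source": "cat", "dest": "text"},
    {"source": "content", "dest": "text"},
    {"source": "author", "dest": "author_s"},
]


def split_field_list(value):
    """Split a comma- or space-separated parameter value into field names."""
    return [name for name in re.split(r"[,\s]+", value.strip()) if name]


def filter_copy_fields(copy_fields, source_fl=None, dest_fl=None):
    """Keep only the directives allowed by source.fl and dest.fl.

    Assumption: when both parameters are given, a directive must satisfy both.
    """
    sources = set(split_field_list(source_fl)) if source_fl else None
    dests = set(split_field_list(dest_fl)) if dest_fl else None
    return [cf for cf in copy_fields
            if (sources is None or cf["source"] in sources)
            and (dests is None or cf["dest"] in dests)]


print(filter_copy_fields(COPY_FIELDS, source_fl="author"))
print(filter_copy_fields(COPY_FIELDS, dest_fl="text"))
```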
List Copy Field Response
The output will include the source and dest (destination) of each copy field rule defined in schema.xml. For
more information about copying fields, see the section Copying Fields.
List Copy Field Examples
Get a list of all copyFields.
curl http://localhost:8983/solr/gettingstarted/schema/copyfields
The sample output below has been truncated to the first few copy definitions.


{
"copyFields": [
{
"dest": "text",
"source": "author"
},
{
"dest": "text",
"source": "cat"
},
{
"dest": "text",
"source": "content"
},
{
"dest": "text",
"source": "content_type"
}
],
"responseHeader": {
"QTime": 3,
"status": 0
}
}

Show Schema Name
GET /collection/schema/name
Show Schema Parameters
Path Parameters

collection
The collection (or core) name.
Query Parameters
The query parameters can be added to the API request after a '?'.

wt
Defines the format of the response. The options are json or xml. If not specified, JSON will be returned by
default.
Show Schema Response
The output will be simply the name given to the schema.
Show Schema Examples
Get the schema name.


curl http://localhost:8983/solr/gettingstarted/schema/name

{
"responseHeader":{
"status":0,
"QTime":1},
"name":"example"}

Show the Schema Version
GET /collection/schema/version
Show Schema Version Parameters
Path Parameters
collection
The collection (or core) name.
Query Parameters
The query parameters can be added to the API request after a '?'.

wt
Defines the format of the response. The options are json or xml. If not specified, JSON will be returned by
default.
Show Schema Version Response
The output will simply be the schema version in use.
Show Schema Version Example
Get the schema version.
curl http://localhost:8983/solr/gettingstarted/schema/version

{
"responseHeader":{
"status":0,
"QTime":2},
"version":1.5}

List UniqueKey
GET /collection/schema/uniquekey


List UniqueKey Parameters
Path Parameters
collection
The collection (or core) name.
Query Parameters
The query parameters can be added to the API request after a '?'.
wt
Defines the format of the response. The options are json or xml. If not specified, JSON will be returned by
default.
List UniqueKey Response
The output will simply include the name of the field that is defined as the uniqueKey for the index.
List UniqueKey Example
List the uniqueKey.
curl http://localhost:8983/solr/gettingstarted/schema/uniquekey

{
"responseHeader":{
"status":0,
"QTime":2},
"uniqueKey":"id"}

Show Global Similarity
GET /collection/schema/similarity
Show Global Similarity Parameters
Path Parameters

collection
The collection (or core) name.
Query Parameters
The query parameters can be added to the API request after a '?'.

wt
Defines the format of the response. The options are json or xml. If not specified, JSON will be returned by
default.


Show Global Similarity Response
The output will include the class name of the global similarity defined (if any).
Show Global Similarity Example
Get the similarity implementation.
curl http://localhost:8983/solr/gettingstarted/schema/similarity

{
"responseHeader":{
"status":0,
"QTime":1},
"similarity":{
"class":"org.apache.solr.search.similarities.DefaultSimilarityFactory"}}

Manage Resource Data
The Managed Resources REST API provides a mechanism for any Solr plugin to expose resources that should
support CRUD (Create, Read, Update, Delete) operations. Depending on what Field Types and Analyzers are
configured in your Schema, additional /schema/ REST API paths may exist. See the Managed Resources
section for more information and examples.


Putting the Pieces Together
At the highest level, schema.xml is structured as follows.
This example is not real XML, but it gives you an idea of the structure of the file.

<schema>
  <types>
  <fields>
  <uniqueKey>
  <copyField>
</schema>

Obviously, most of the excitement is in types and fields, where the field types and the actual field
definitions live.
These are supplemented by copyFields.
The uniqueKey must always be defined.
Types and fields are optional tags



Note that the types and fields sections are optional, meaning you are free to mix field,
dynamicField, copyField and fieldType definitions on the top level. This allows for a more
logical grouping of related tags in your schema.

Choosing Appropriate Numeric Types
For general numeric needs, consider using one of the IntPointField, LongPointField, FloatPointField, or
DoublePointField classes, depending on the specific values you expect. These "Dimensional Point" based
numeric classes use specially encoded data structures to support efficient range queries regardless of the
size of the ranges used. Enable DocValues on these fields as needed for sorting and/or faceting.
Some Solr features may not yet work with "Dimensional Points", in which case you may want to consider the
equivalent TrieIntField, TrieLongField, TrieFloatField, and TrieDoubleField classes. These field types
are deprecated and are likely to be removed in a future major Solr release, but they can still be used if
necessary. Set precisionStep="0" if you wish to minimize index size. If you expect users to make frequent
range queries on numeric types, keep the default precisionStep by not specifying it, or set it explicitly
to precisionStep="8" (which is the default). The default offers faster range queries at the expense of
a larger index.

Working With Text
Handling text properly will make your users happy by providing them with the best possible results for text
searches.
One technique is using a text field as a catch-all for keyword searching. Most users are not sophisticated
about their searches and the most common search is likely to be a simple keyword search. You can use
copyField to take a variety of fields and funnel them all into a single text field for keyword searches.


In the schema.xml file for the “techproducts” example included with Solr, copyField declarations are used
to dump the contents of cat, name, manu, features, and includes into a single field, text. In addition, it
could be a good idea to copy ID into text in case users want to search for a particular product by passing
its product number to a keyword search.
Another technique is using copyField to use the same field in different ways. Suppose you have a field that
is a list of authors, like this:

Schildt, Herbert; Wolpert, Lewis; Davies, P.
For searching by author, you could tokenize the field, convert to lower case, and strip out punctuation:

schildt / herbert / wolpert / lewis / davies / p
For sorting, just use an untokenized field, converted to lower case, with punctuation stripped:

schildt herbert wolpert lewis davies p
Finally, for faceting, use the primary author only via a StrField:

Schildt, Herbert
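
The three treatments of the author list can be reproduced with a short Python sketch. The function names are hypothetical, and plain string handling stands in for the Solr analysis that would do this work in a real schema.

```python
import re

authors = "Schildt, Herbert; Wolpert, Lewis; Davies, P."


def search_tokens(value):
    """Tokenize, lower-case, and strip punctuation (for searching)."""
    return re.findall(r"[a-z0-9]+", value.lower())


def sort_key(value):
    """A single untokenized, lower-cased value without punctuation (for sorting)."""
    return " ".join(search_tokens(value))


def primary_author(value):
    """The first author only, left exactly as written (for faceting)."""
    return value.split(";")[0].strip()


print(search_tokens(authors))
print(sort_key(authors))
print(primary_author(authors))
```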


DocValues
DocValues are a way of recording field values internally that is more efficient for some purposes, such as
sorting and faceting, than traditional indexing.

Why DocValues?
The standard way that Solr builds the index is with an inverted index. This style builds a list of terms found in
all the documents in the index and next to each term is a list of documents that the term appears in (as well
as how many times the term appears in that document). This makes search very fast - since users search by
terms, having a ready list of term-to-document values makes the query process faster.
For other features that we now commonly associate with search, such as sorting, faceting, and highlighting,
this approach is not very efficient. The faceting engine, for example, must look up each term that appears in
each document that will make up the result set and pull the document IDs in order to build the facet list. In
Solr, this is maintained in memory, and can be slow to load (depending on the number of documents, terms,
etc.).
In Lucene 4.0, a new approach was introduced. DocValue fields are now column-oriented fields with a
document-to-value mapping built at index time. This approach promises to relieve some of the memory
requirements of the fieldCache and make lookups for faceting, sorting, and grouping much faster.
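
The difference between the two layouts can be shown with a toy Python sketch. The documents and names below are illustrative only, and ordinary dictionaries stand in for the on-disk structures that Lucene actually uses.

```python
from collections import defaultdict

# Three toy documents with a single-valued "cat" field (illustrative data).
docs = {1: "book", 2: "music", 3: "book"}

# Inverted index: term -> documents containing it (well suited to searching).
inverted = defaultdict(list)
for doc_id, term in docs.items():
    inverted[term].append(doc_id)

# DocValues-style column: document -> value (well suited to sorting and faceting).
column = dict(docs)


def facet_counts(result_ids):
    """Count field values for a result set with one column lookup per document."""
    counts = defaultdict(int)
    for doc_id in result_ids:
        counts[column[doc_id]] += 1
    return dict(counts)


print(inverted["book"])                   # which documents contain the term?
print(facet_counts([1, 2, 3]))            # which values do these documents have?
print(sorted([1, 2, 3], key=column.get))  # sort documents by field value
```

Searching starts from a term and needs the first structure, whereas faceting and sorting start from a set of documents and are served directly by the second.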

Enabling DocValues
To use docValues, you only need to enable it on the fields that will use it. As with all schema design,
you need to define a field type and then define fields of that type with docValues enabled. All of these
actions are done in schema.xml.
Enabling a field for docValues only requires adding docValues="true" to the field (or field type) definition,
as in this example from the schema.xml of Solr’s sample_techproducts_configs config set:

<field name="manu_exact" type="string" indexed="false" stored="false" docValues="true" />



If you have already indexed data into your Solr index, you will need to completely re-index
your content after changing your field definitions in schema.xml in order to successfully use
docValues.

DocValues are only available for specific field types. The types chosen determine the underlying Lucene
docValue type that will be used. The available Solr field types are:
• StrField and UUIDField:
◦ If the field is single-valued (i.e., multi-valued is false), Lucene will use the SORTED type.
◦ If the field is multi-valued, Lucene will use the SORTED_SET type.
• BoolField:
◦ If the field is single-valued (i.e., multi-valued is false), Lucene will use the SORTED type.
◦ If the field is multi-valued, Lucene will use the SORTED_SET type.


• Any *PointField Numeric or Date fields, EnumFieldType, and CurrencyFieldType:
◦ If the field is single-valued (i.e., multi-valued is false), Lucene will use the NUMERIC type.
◦ If the field is multi-valued, Lucene will use the SORTED_NUMERIC type.
• Any of the deprecated Trie* Numeric or Date fields, EnumField and CurrencyField:
◦ If the field is single-valued (i.e., multi-valued is false), Lucene will use the NUMERIC type.
◦ If the field is multi-valued, Lucene will use the SORTED_SET type.
These Lucene types are related to how the values are sorted and stored.
There is an additional configuration option available, which is to modify the docValuesFormat used by the
field type. The default implementation employs a mixture of loading some things into memory and keeping
some on disk. In some cases, however, you may choose to specify an alternative DocValuesFormat
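implementation. For example, you could choose to keep everything in memory by specifying
docValuesFormat="Memory" on a field type:

<fieldType name="string_in_mem_dv" class="solr.StrField" docValues="true" docValuesFormat="Memory" />
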
Please note that the docValuesFormat option may change in future releases.



Lucene index back-compatibility is only supported for the default codec. If you choose to
customize the docValuesFormat in your schema.xml, upgrading to a future version of Solr
may require you to either switch back to the default codec and optimize your index to
rewrite it into the default codec before upgrading, or re-build your entire index from
scratch after upgrading.

Using DocValues
Sorting, Faceting & Functions
If docValues="true" for a field, then DocValues will automatically be used any time the field is used for
sorting, faceting or function queries.

Retrieving DocValues During Search
Field values retrieved during search queries are typically returned from stored values. However, non-stored
docValues fields will also be returned along with other stored fields when all fields (or pattern matching
globs) are specified to be returned (e.g., “fl=*”) for search queries, depending on the effective value of the
useDocValuesAsStored parameter for each field. For schema versions >= 1.6, the implicit default is
useDocValuesAsStored="true". See Field Type Definitions and Properties & Defining Fields for more details.
When useDocValuesAsStored="false", non-stored DocValues fields can still be explicitly requested by name
in the fl param, but will not match glob patterns ("*"). Note that returning DocValues along with "regular"
stored fields at query time can carry a performance cost that stored fields alone do not: DocValues are
column-oriented, so retrieving them may incur additional cost for each returned document. Also note
that while returning non-stored fields from DocValues, the values of a multi-valued field are returned in
sorted order (and not insertion order). If you require the multi-valued fields to be returned in the original
insertion order, then make your multi-valued field stored (such a change requires re-indexing).
In cases where the query is returning only docValues fields, performance may improve since returning stored
fields requires disk reads and decompression whereas returning docValues fields in the fl list only requires
memory access.
When retrieving fields from their docValues form (using the /export handler, streaming expressions or if the
field is requested in the fl parameter), two important differences between regular stored fields and
docValues fields must be understood:
1. Order is not preserved. For simply retrieving stored fields, the insertion order is the return order. For
docValues, it is the sorted order.
2. For field types using SORTED_SET, multiple identical entries are collapsed into a single value. Thus if I
insert values 4, 5, 2, 4, 1, my return will be 1, 2, 4, 5.
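
Both differences can be seen in a few lines of Python, using the values from the example above. This is only an analogy: a list stands in for stored values and a sorted set for SORTED_SET docValues.

```python
inserted = [4, 5, 2, 4, 1]

# Stored field: values come back in insertion order, duplicates included.
stored_view = list(inserted)

# SORTED_SET docValues: values come back sorted, with duplicates collapsed.
sorted_set_view = sorted(set(inserted))

print(stored_view)      # [4, 5, 2, 4, 1]
print(sorted_set_view)  # [1, 2, 4, 5]
```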


Schemaless Mode
Schemaless Mode is a set of Solr features that, when used together, allow users to rapidly construct an
effective schema by simply indexing sample data, without having to manually edit the schema.
These Solr features, all controlled via solrconfig.xml, are:
1. Managed schema: Schema modifications are made at runtime through Solr APIs, which requires the use
of a schemaFactory that supports these changes. See the section Schema Factory Definition in SolrConfig
for more details.
2. Field value class guessing: Previously unseen fields are run through a cascading set of value-based
parsers, which guess the Java class of field values - parsers for Boolean, Integer, Long, Float, Double, and
Date are currently available.
3. Automatic schema field addition, based on field value class(es): Previously unseen fields are added to the
schema, based on field value Java classes, which are mapped to schema field types - see Solr Field Types.

Using the Schemaless Example
The three features of schemaless mode are pre-configured in the _default config set in the Solr
distribution. To start an example instance of Solr using these configs, run the following command:
bin/solr start -e schemaless
This will launch a single Solr server, and automatically create a collection (named “gettingstarted”) that
contains only three fields in the initial schema: id, _version_, and _text_.
You can use the /schema/fields Schema API to confirm this: curl
http://localhost:8983/solr/gettingstarted/schema/fields will output:


{
"responseHeader":{
"status":0,
"QTime":1},
"fields":[{
"name":"_text_",
"type":"text_general",
"multiValued":true,
"indexed":true,
"stored":false},
{
"name":"_version_",
"type":"long",
"indexed":true,
"stored":true},
{
"name":"id",
"type":"string",
"multiValued":false,
"indexed":true,
"required":true,
"stored":true,
"uniqueKey":true}]}

Configuring Schemaless Mode
As described above, there are three configuration elements that need to be in place to use Solr in
schemaless mode. In the _default config set included with Solr these are already configured. If, however,
you would like to implement schemaless on your own, you should make the following changes.

Enable Managed Schema
As described in the section Schema Factory Definition in SolrConfig, Managed Schema support is enabled by
default, unless your configuration specifies that ClassicIndexSchemaFactory should be used.
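You can configure the ManagedIndexSchemaFactory (and control the resource file used, or disable future
modifications) by adding an explicit <schemaFactory/> like the one below. See Schema Factory
Definition in SolrConfig for more details on the options available.

<schemaFactory class="ManagedIndexSchemaFactory">
  <bool name="mutable">true</bool>
  <str name="managedSchemaResourceName">managed-schema</str>
</schemaFactory>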


Enable Field Class Guessing
In Solr, an UpdateRequestProcessorChain defines a chain of plugins that are applied to documents before or
while they are indexed.


The field guessing aspect of Solr’s schemaless mode uses a specially-defined UpdateRequestProcessorChain
that allows Solr to guess field types. You can also define the default field type classes to use.
To start, you should define it as follows (see the javadoc links below for update processor factory
documentation):


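<updateProcessor class="solr.UUIDUpdateProcessorFactory" name="uuid"/>
<updateProcessor class="solr.RemoveBlankFieldUpdateProcessorFactory" name="remove-blank"/>
<updateProcessor class="solr.FieldNameMutatingUpdateProcessorFactory" name="field-name-mutating"> <!-- ① -->
  <str name="pattern">[^\w-\.]</str>
  <str name="replacement">_</str>
</updateProcessor>
<updateProcessor class="solr.ParseBooleanFieldUpdateProcessorFactory" name="parse-boolean"/> <!-- ② -->
<updateProcessor class="solr.ParseLongFieldUpdateProcessorFactory" name="parse-long"/>
<updateProcessor class="solr.ParseDoubleFieldUpdateProcessorFactory" name="parse-double"/>
<updateProcessor class="solr.ParseDateFieldUpdateProcessorFactory" name="parse-date">
  <arr name="format">
    <str>yyyy-MM-dd'T'HH:mm:ss.SSSZ</str>
    <str>yyyy-MM-dd'T'HH:mm:ss,SSSZ</str>
    <str>yyyy-MM-dd'T'HH:mm:ss.SSS</str>
    <str>yyyy-MM-dd'T'HH:mm:ss,SSS</str>
    <str>yyyy-MM-dd'T'HH:mm:ssZ</str>
    <str>yyyy-MM-dd'T'HH:mm:ss</str>
    <str>yyyy-MM-dd'T'HH:mmZ</str>
    <str>yyyy-MM-dd'T'HH:mm</str>
    <str>yyyy-MM-dd HH:mm:ss.SSSZ</str>
    <str>yyyy-MM-dd HH:mm:ss,SSSZ</str>
    <str>yyyy-MM-dd HH:mm:ss.SSS</str>
    <str>yyyy-MM-dd HH:mm:ss,SSS</str>
    <str>yyyy-MM-dd HH:mm:ssZ</str>
    <str>yyyy-MM-dd HH:mm:ss</str>
    <str>yyyy-MM-dd HH:mmZ</str>
    <str>yyyy-MM-dd HH:mm</str>
    <str>yyyy-MM-dd</str>
  </arr>
</updateProcessor>
<updateProcessor class="solr.AddSchemaFieldsUpdateProcessorFactory" name="add-schema-fields"> <!-- ③ -->
  <lst name="typeMapping">
    <str name="valueClass">java.lang.String</str> <!-- ④ -->
    <str name="fieldType">text_general</str>
    <lst name="copyField"> <!-- ⑤ -->
      <str name="dest">*_str</str>
      <int name="maxChars">256</int>
    </lst>
    <!-- Use as default mapping instead of defaultFieldType -->
    <bool name="default">true</bool>
  </lst>
  <lst name="typeMapping">
    <str name="valueClass">java.lang.Boolean</str>
    <str name="fieldType">booleans</str>
  </lst>
  <lst name="typeMapping">
    <str name="valueClass">java.util.Date</str>
    <str name="fieldType">pdates</str>
  </lst>
  <lst name="typeMapping">
    <str name="valueClass">java.lang.Long</str> <!-- ⑥ -->
    <str name="valueClass">java.lang.Integer</str>
    <str name="fieldType">plongs</str>
  </lst>
  <lst name="typeMapping">
    <str name="valueClass">java.lang.Number</str>
    <str name="fieldType">pdoubles</str>
  </lst>
</updateProcessor>
<!-- The update.autoCreateFields property can be turned to false to disable schemaless mode -->
<updateRequestProcessorChain name="add-unknown-fields-to-the-schema" default="${update.autoCreateFields:true}"
    processor="uuid,remove-blank,field-name-mutating,parse-boolean,parse-long,parse-double,parse-date,add-schema-fields"> <!-- ⑦ -->
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.DistributedUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>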







There are many things defined in this chain. Let’s step through a few of them.

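① First, we’re using the FieldNameMutatingUpdateProcessorFactory to clean up field names: every character
that matches the pattern [^\w-\.] (anything other than a word character, a hyphen, or a period) is replaced
with an underscore. Note that this and every following <updateProcessor> element include a name. These
names will be used in the final chain definition at the end of this example.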

② Next we add several update request processors to parse different field types. Note the
ParseDateFieldUpdateProcessorFactory includes a long list of possible date formats that will be
parsed into valid Solr dates. If you have a custom date format, you can add it to this list (see the link to the
Javadocs below for information on how).

③ Once the fields have been parsed, we define the field types that will be assigned to those fields. You can
modify any of these that you would like to change.

④ In this definition, if the parsing step decides the incoming data in a field is a string, we will put this into a
field in Solr with the field type text_general. This field type by default allows Solr to query on this field.

⑤ After we’ve added the text_general field, we have also defined a copy field rule that will copy all data
from the new text_general field to a field with the same name suffixed with _str. This is done by Solr’s
dynamic fields feature. By defining the target of the copy field rule as a dynamic field in this way, you
can control the field type used in your schema. The default selection allows Solr to facet, highlight, and
sort on these fields.

⑥ This is another example of a mapping rule. In this case we define that when either of the Long or
Integer field parsers identify a field, they should both map their fields to the plongs field type.

⑦ Finally, we add a chain definition that calls the list of plugins. These plugins are each called by the names
we gave to them when we defined them. We can also add other processors to the chain, as shown here.


Note we have also given the entire chain a name ("add-unknown-fields-to-the-schema"). We’ll use this
name in the next section to specify that our update request handler should use this chain definition.
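
The pattern and replacement configured for FieldNameMutatingUpdateProcessorFactory amount to a regular-expression substitution on each field name. The following Python sketch mimics it. The function name is hypothetical, and the pattern is rewritten slightly because Python's re module requires the hyphen inside the character class to be escaped; the re.ASCII flag is an assumption intended to keep \w limited to ASCII word characters.

```python
import re

# The configured pattern [^\w-\.] rewritten for Python's re module.
PATTERN = re.compile(r"[^\w\-.]", re.ASCII)


def mutate_field_name(name, replacement="_"):
    """Replace every character matched by the pattern with the replacement."""
    return PATTERN.sub(replacement, name)


print(mutate_field_name("Release Date (UTC)"))
print(mutate_field_name("price.usd-2018"))
```

Names that already consist only of word characters, hyphens, and periods pass through unchanged, and letter case is preserved.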



This chain definition will cause a number of copy field rules to be created, each copying a text field
to a corresponding string field. If your data causes you to end up with a lot of copy field
rules, indexing may be slowed down noticeably, and your index size will be larger. To
control for these issues, it’s recommended that you review the copy field rules that are
created, and remove any which you do not need for faceting, sorting, highlighting, etc.

If you’re interested in more information about the classes used in this chain, here are links to the Javadocs
for update processor factories mentioned above:
• UUIDUpdateProcessorFactory
• RemoveBlankFieldUpdateProcessorFactory
• FieldNameMutatingUpdateProcessorFactory
• ParseBooleanFieldUpdateProcessorFactory
• ParseLongFieldUpdateProcessorFactory
• ParseDoubleFieldUpdateProcessorFactory
• ParseDateFieldUpdateProcessorFactory
• AddSchemaFieldsUpdateProcessorFactory

Set the Default UpdateRequestProcessorChain
Once the UpdateRequestProcessorChain has been defined, you must instruct your UpdateRequestHandlers
to use it when working with index updates (i.e., adding, removing, replacing documents).
There are two ways to do this. The update chain shown above has a default=true attribute which will use it
for any update handler.
An alternative, more explicit way is to use InitParams to set the defaults on all /update request handlers:


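<initParams path="/update/**">
  <lst name="defaults">
    <str name="update.chain">add-unknown-fields-to-the-schema</str>
  </lst>
</initParams>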





After all of these changes have been made, Solr should be restarted or the cores reloaded.

Disabling Automatic Field Guessing
Automatic field creation can be disabled with the update.autoCreateFields property. To do this, you can
use the Config API with a command such as:


curl http://host:8983/solr/mycollection/config -d '{"set-user-property":
{"update.autoCreateFields":"false"}}'
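
For readers scripting this step, the same request can be prepared with Python's standard library. This is a sketch under assumptions: the function name is hypothetical, the host and collection are the placeholders used in the curl command, and sending a JSON Content-Type header is a choice made here rather than something the curl example specifies. The request is built but not sent.

```python
import json
from urllib.request import Request


def disable_auto_create_fields(collection, base="http://host:8983/solr"):
    """Build (but do not send) a Config API request that sets the user property."""
    body = {"set-user-property": {"update.autoCreateFields": "false"}}
    return Request(
        f"{base}/{collection}/config",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumption, see lead-in
        method="POST",
    )


request = disable_auto_create_fields("mycollection")
print(request.get_method(), request.full_url)
# To send it to a running Solr instance: urllib.request.urlopen(request)
```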

Examples of Indexed Documents
Once the schemaless mode has been enabled (whether you configured it manually or are using the
_default configset), documents that include fields that are not defined in your schema will be indexed,
using the guessed field types which are automatically added to the schema.
For example, adding a CSV document will cause unknown fields to be added, with fieldTypes based on
values:
curl "http://localhost:8983/solr/gettingstarted/update?commit=true&wt=xml" -H "Content-type:application/csv" -d '
id,Artist,Album,Released,Rating,FromDistributor,Sold
44C,Old Shews,Mead for Walking,1988-08-13,0.01,14,0'
Output indicating success:

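<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">106</int>
  </lst>
</response>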

The fields now in the schema (output from curl http://localhost:8983/solr/gettingstarted/schema/fields ):


{
"responseHeader":{
"status":0,
"QTime":2},
"fields":[{
"name":"Album",
"type":"text_general"},
{
"name":"Artist",
"type":"text_general"},
{
"name":"FromDistributor",
"type":"plongs"},
{
"name":"Rating",
"type":"pdoubles"},
{
"name":"Released",
"type":"pdates"},
{
"name":"Sold",
"type":"plongs"},
{
"name":"_root_", ...},
{
"name":"_text_", ...},
{
"name":"_version_", ...},
{
"name":"id", ...}
]}
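
The guessed types in this output can be reproduced with a simplified Python sketch of the cascading parsers. It is an approximation under stated assumptions: the function names are hypothetical, Python's int and float stand in for the Long and Double parsers (and accept a few inputs that a stricter parser would reject), and only three date formats are tried rather than the full list in the chain definition.

```python
from datetime import datetime

# A small subset of date formats, written in Python's strptime notation.
DATE_FORMATS = ["%Y-%m-%dT%H:%M:%S", "%Y-%m-%d %H:%M:%S", "%Y-%m-%d"]


def _parses(parser, value):
    """Return True if parser(value) succeeds without a ValueError."""
    try:
        parser(value)
        return True
    except ValueError:
        return False


def _is_date(value):
    return any(_parses(lambda v, f=fmt: datetime.strptime(v, f), value)
               for fmt in DATE_FORMATS)


def guess_field_type(value):
    """Try the parsers in order: boolean, long, double, date, then string."""
    if value.lower() in ("true", "false"):
        return "booleans"
    if _parses(int, value):
        return "plongs"
    if _parses(float, value):
        return "pdoubles"
    if _is_date(value):
        return "pdates"
    return "text_general"


row = {"Artist": "Old Shews", "Album": "Mead for Walking",
       "Released": "1988-08-13", "Rating": "0.01",
       "FromDistributor": "14", "Sold": "0"}
guessed = {name: guess_field_type(value) for name, value in row.items()}
print(guessed)
```

The order of the checks matters: a value such as 14 would also parse as a double, so the integer parser has to run first for the field to become plongs.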
In addition, string versions of the text fields are indexed, using copyFields to a *_str dynamic field (output
from curl http://localhost:8983/solr/gettingstarted/schema/copyfields ):
{
"responseHeader":{
"status":0,
"QTime":0},
"copyFields":[{
"source":"Artist",
"dest":"Artist_str",
"maxChars":256},
{
"source":"Album",
"dest":"Album_str",
"maxChars":256}]}


You Can Still Be Explicit
Even if you want to use schemaless mode for most fields, you can still use the Schema API
to pre-emptively create some fields, with explicit types, before you index documents that
use them.



Internally, the Schema API and the Schemaless Update Processors both use the same
Managed Schema functionality.
Also, if you do not need the *_str version of a text field, you can simply remove the
copyField definition from the auto-generated schema and it will not be re-added since the
original field is now defined.

Once a field has been added to the schema, its field type is fixed. As a consequence, adding documents with
field value(s) that conflict with the previously guessed field type will fail. For example, after adding the above
document, the “Sold” field has the fieldType plongs, but the document below has a non-integral decimal
value in this field:
curl "http://localhost:8983/solr/gettingstarted/update?commit=true&wt=xml" -H "Content-type:application/csv" -d '
id,Description,Sold
19F,Cassettes by the pound,4.93'
This document will fail, as shown in this output:


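<response>
  <lst name="responseHeader">
    <int name="status">400</int>
    <int name="QTime">7</int>
  </lst>
  <lst name="error">
    <str name="msg">ERROR: [doc=19F] Error adding field 'Sold'='4.93' msg=For input string: "4.93"</str>
    <int name="code">400</int>
  </lst>
</response>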




Understanding Analyzers, Tokenizers, and Filters
The following sections describe how Solr breaks down and works with textual data. There are three main
concepts to understand: analyzers, tokenizers, and filters.
• Field analyzers are used both during ingestion, when a document is indexed, and at query time. An
analyzer examines the text of fields and generates a token stream. Analyzers may be a single class or
they may be composed of a series of tokenizer and filter classes.
• Tokenizers break field data into lexical units, or tokens.
• Filters examine a stream of tokens and keep them, transform or discard them, or create new ones.
Tokenizers and filters may be combined to form pipelines, or chains, where the output of one is input to
the next. Such a sequence of tokenizers and filters is called an analyzer and the resulting output of an
analyzer is used to match query results or build indices.


Using Analyzers, Tokenizers, and Filters
Although the analysis process is used for both indexing and querying, the same analysis process need not
be used for both operations. For indexing, you often want to simplify, or normalize, words, for example by
setting all letters to lowercase, eliminating punctuation and accents, or mapping words to their stems.
Doing so can increase recall because, for example, "ram", "Ram" and "RAM" would all match a query for
"ram". To increase query-time precision, a filter could be employed to narrow the matches by, for example,
ignoring all-cap acronyms if you’re interested in male sheep, but not Random Access Memory.
The tokens output by the analysis process define the values, or terms, of that field and are used either to
build an index of those terms when a new document is added, or to identify which documents contain the
terms you are querying for.

For More Information
These sections show you how to configure field analyzers and also serve as a reference for the details
of configuring each of the available tokenizer and filter classes. They also serve as a guide so that you can
configure your own analysis classes if you have special needs that cannot be met with the included filters or
tokenizers.
For Analyzers, see:
• Analyzers: Detailed conceptual information about Solr analyzers.
• Running Your Analyzer: Detailed information about testing and running your Solr analyzer.
For Tokenizers, see:
• About Tokenizers: Detailed conceptual information about Solr tokenizers.
• Tokenizers: Information about configuring tokenizers, and about the tokenizer factory classes included
in this distribution of Solr.
For Filters, see:
• About Filters: Detailed conceptual information about Solr filters.
• Filter Descriptions: Information about configuring filters, and about the filter factory classes included in
this distribution of Solr.
• CharFilterFactories: Information about filters for pre-processing input characters.
To find out how to use Tokenizers and Filters with various languages, see:
• Language Analysis: Information about tokenizers and filters for character set conversion or for use with
specific languages.


Analyzers
An analyzer examines the text of fields and generates a token stream.
Analyzers are specified as a child of the <fieldType> element in the schema.xml configuration file (in the
same conf/ directory as solrconfig.xml).
In normal usage, only fields of type solr.TextField or solr.SortableTextField will specify an analyzer.
The simplest way to configure an analyzer is with a single <analyzer> element whose class attribute is a fully
qualified Java class name. The named class must derive from org.apache.lucene.analysis.Analyzer. For
example:
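A minimal sketch of such a definition, assuming the field type is called "nametext" (the name used later in this section) and that WhitespaceAnalyzer lives in the org.apache.lucene.analysis.core package:

```xml
<!-- Sketch: one Analyzer class handles the whole field type -->
<fieldType name="nametext" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.core.WhitespaceAnalyzer"/>
</fieldType>
```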



In this case a single class, WhitespaceAnalyzer, is responsible for analyzing the content of the named text
field and emitting the corresponding tokens. For simple cases, such as plain English prose, a single analyzer
class like this may be sufficient. But it’s often necessary to do more complex analysis of the field content.
Even the most complex analysis requirements can usually be decomposed into a series of discrete, relatively
simple processing steps. As you will soon discover, the Solr distribution comes with a large selection of
tokenizers and filters that covers most scenarios you are likely to encounter. Setting up an analyzer chain is
very straightforward; you specify a simple <analyzer> element (no class attribute) with child elements that
name factory classes for the tokenizer and filters to use, in the order you want them to run.
For example:
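A chain of this kind might look like the sketch below. The first and last factory names are the ones this section mentions; the filters in between are assumptions chosen for illustration, and you should confirm that each factory exists in your Solr version.

```xml
<!-- Sketch: tokenizer first, then filters in the order they should run -->
<fieldType name="nametext" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"/>
  </analyzer>
</fieldType>
```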









Note that classes in the org.apache.solr.analysis package may be referred to here with the shorthand
solr. prefix.
In this case, no Analyzer class was specified on the <analyzer> element. Rather, a sequence of more
specialized classes is wired together and collectively acts as the Analyzer for the field. The text of the field is
passed to the first item in the list (solr.StandardTokenizerFactory), and the tokens that emerge from the
last one (solr.EnglishPorterFilterFactory) are the terms that are used for indexing or querying any
fields that use the "nametext" fieldType.


Field Values versus Indexed Terms



The output of an Analyzer affects the terms indexed in a given field (and the terms used
when parsing queries against those fields) but it has no impact on the stored value for the
fields. For example: an analyzer might split "Brown Cow" into two indexed terms "brown"
and "cow", but the stored value will still be a single String: "Brown Cow"

Analysis Phases
Analysis takes place in two contexts. At index time, when a field is being created, the token stream that
results from analysis is added to an index and defines the set of terms (including positions, sizes, and so on)
for the field. At query time, the values being searched for are analyzed and the terms that result are
matched against those that are stored in the field’s index.
In many cases, the same analysis should be applied to both phases. This is desirable when you want to
query for exact string matches, possibly with case-insensitivity, for example. In other cases, you may want to
apply slightly different analysis steps during indexing than those used at query time.
If you provide a simple <analyzer> definition for a field type, as in the examples above, then it will be used
for both indexing and queries. If you want distinct analyzers for each phase, you may include two
<analyzer> definitions distinguished with a type attribute. For example:
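The following sketch shows one way to write the two definitions. The file names keepwords.txt and syns.txt come from the description below; the factory classes are assumptions that match the steps it describes (tokenize, lowercase, keep words, map synonyms).

```xml
<fieldType name="nametext" class="solr.TextField">
  <!-- Used when documents are indexed -->
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeepWordFilterFactory" words="keepwords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="syns.txt"/>
  </analyzer>
  <!-- Used when query text is analyzed -->
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```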












In this theoretical example, at index time the text is tokenized, the tokens are set to lowercase, any that are
not listed in keepwords.txt are discarded and those that remain are mapped to alternate values as defined
by the synonym rules in the file syns.txt. This essentially builds an index from a restricted set of possible
values and then normalizes them to values that may not even occur in the original text.
At query time, the only normalization that happens is to convert the query terms to lowercase. The filtering
and mapping steps that occur at index time are not applied to the query terms. Queries must then, in this
example, be very precise, using only the normalized terms that were stored at index time.

Analysis for Multi-Term Expansion
In some types of queries (e.g., Prefix, Wildcard, or Regex) the input provided by the user is not natural
language intended for analysis. Things like synonyms or stop word filtering do not work in a logical way in
these types of queries.


The analysis factories that can work in these types of queries (such as Lowercasing, or Normalizing
Factories) are known as MultiTermAwareComponents. When Solr needs to perform analysis for a query that
results in Multi-Term expansion, only the MultiTermAwareComponents used in the query analyzer are used;
any factory that is not Multi-Term aware is skipped.
For most use cases, this provides the best possible behavior, but if you wish for absolute control over the
analysis performed on these types of queries, you may explicitly define a multiterm analyzer to use, such as
in the following example:
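As a sketch, a third analyzer with type="multiterm" can sit next to the index and query analyzers. Using KeywordTokenizerFactory here is an assumption: it leaves multi-term query text essentially unanalyzed.

```xml
<fieldType name="nametext" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Applied only to queries that involve multi-term expansion -->
  <analyzer type="multiterm">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>
```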


















About Tokenizers
The job of a tokenizer is to break up a stream of text into tokens, where each token is (usually) a subsequence of the characters in the text. An analyzer is aware of the field it is configured for, but a tokenizer is
not. Tokenizers read from a character stream (a Reader) and produce a sequence of Token objects (a
TokenStream).
Characters in the input stream may be discarded, such as whitespace or other delimiters. They may also be
added to or replaced, such as mapping aliases or abbreviations to normalized forms. A token contains
various metadata in addition to its text value, such as the location at which the token occurs in the field.
Because a tokenizer may produce tokens that diverge from the input text, you should not assume that the
text of the token is the same text that occurs in the field, or that its length is the same as the original text.
It’s also possible for more than one token to have the same position or refer to the same offset in the
original text. Keep this in mind if you use token metadata for things like highlighting search results in the
field text.
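A field type whose analyzer consists of a tokenizer alone could be declared roughly as follows; the field type name "text" is a placeholder.

```xml
<fieldType name="text" class="solr.TextField">
  <analyzer>
    <!-- The class attribute names a factory, not the tokenizer itself -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>
```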





The class named in the tokenizer element is not the actual tokenizer, but rather a class that implements the
TokenizerFactory API. This factory class will be called upon to create new tokenizer instances as needed.
Objects created by the factory must derive from Tokenizer, which indicates that they produce sequences of
tokens. If the tokenizer produces tokens that are usable as is, it may be the only component of the analyzer.
Otherwise, the tokenizer’s output tokens will serve as input to the first filter stage in the pipeline.
A TypeTokenFilterFactory is available that creates a TypeTokenFilter that filters tokens based on their
TypeAttribute, which is set in factory.getStopTypes.
For a complete list of the available TokenFilters, see the section Tokenizers.

When to Use a CharFilter vs. a TokenFilter
There are several pairs of CharFilters and TokenFilters that have related (e.g., MappingCharFilter and
ASCIIFoldingFilter) or nearly identical (e.g., PatternReplaceCharFilterFactory and
PatternReplaceFilterFactory) functionality and it may not always be obvious which is the best choice.
The decision about which to use depends largely on which Tokenizer you are using, and whether you need
to preprocess the stream of characters.
For example, suppose you have a tokenizer such as StandardTokenizer and although you are pretty happy
with how it works overall, you want to customize how some specific characters behave. You could modify the
rules and re-build your own tokenizer with JFlex, but it might be easier to simply map some of the characters
before tokenization with a CharFilter.


About Filters
Like tokenizers, filters consume input and produce a stream of tokens. Filters also derive from
org.apache.lucene.analysis.TokenStream. Unlike tokenizers, a filter’s input is another TokenStream. The
job of a filter is usually easier than that of a tokenizer since in most cases a filter looks at each token in the
stream sequentially and decides whether to pass it along, replace it or discard it.
A filter may also do more complex analysis by looking ahead to consider multiple tokens at once, although
this is less common. One hypothetical use for such a filter might be to normalize state names that would be
tokenized as two words. For example, the single token "california" would be replaced with "CA", while the
token pair "rhode" followed by "island" would become the single token "RI".
Because filters consume one TokenStream and produce a new TokenStream, they can be chained one after
another indefinitely. Each filter in the chain in turn processes the tokens produced by its predecessor. The
order in which you specify the filters is therefore significant. Typically, the most general filtering is done first,
and later filtering stages are more specialized.
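A filter chain matching the walkthrough that follows might be sketched like this. The field type name is a placeholder, and the stemmer class name is taken from earlier in this guide; verify it against your Solr version.

```xml
<fieldType name="text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- General clean-up first, more specialized steps later -->
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"/>
  </analyzer>
</fieldType>
```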








This example starts with Solr’s standard tokenizer, which breaks the field’s text into tokens. Those tokens
then pass through Solr’s standard filter, which removes dots from acronyms, and performs a few other
common operations. All the tokens are then set to lowercase, which will facilitate case-insensitive matching
at query time.
The last filter in the above example is a stemmer filter that uses the Porter stemming algorithm. A stemmer
is basically a set of mapping rules that maps the various forms of a word back to the base, or stem, word
from which they derive. For example, in English the words "hugs", "hugging" and "hugged" are all forms of
the stem word "hug". The stemmer will replace all of these terms with "hug", which is what will be indexed.
This means that a query for "hug" will match the term "hugged", but not "huge".
Conversely, applying a stemmer to your query terms will allow queries containing non stem terms, like
"hugging", to match documents with different variations of the same stem word, such as "hugged". This
works because both the indexer and the query will map to the same stem ("hug").
Word stemming is, obviously, very language specific. Solr includes several language-specific stemmers
created by the Snowball generator that are based on the Porter stemming algorithm. The generic Snowball
Porter Stemmer Filter can be used to configure any of these language stemmers. Solr also includes a
convenience wrapper for the English Snowball stemmer. There are also several purpose-built stemmers for
non-English languages. These stemmers are described in Language Analysis.


Tokenizers
Tokenizers are responsible for breaking field data into lexical units, or tokens.
You configure the tokenizer for a text field type in schema.xml with a <tokenizer> element, as a child of
<analyzer>:
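For illustration, a sketch with a placeholder field type name and an index-time analyzer:

```xml
<fieldType name="text" class="solr.TextField">
  <analyzer type="index">
    <!-- The tokenizer comes first; any filters follow it -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
  </analyzer>
</fieldType>
```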






The class attribute names a factory class that will instantiate a tokenizer object when needed. Tokenizer
factory classes implement the org.apache.solr.analysis.TokenizerFactory. A TokenizerFactory’s
create() method accepts a Reader and returns a TokenStream. When Solr creates the tokenizer it passes a
Reader object that provides the content of the text field.
Arguments may be passed to tokenizer factories by setting attributes on the <tokenizer> element.
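As a sketch, the pattern argument of PatternTokenizerFactory (described later in this section) is passed as an attribute. The field type name and the "; " delimiter are illustrative assumptions.

```xml
<fieldType name="semicolonDelimited" class="solr.TextField">
  <analyzer type="query">
    <!-- pattern is an argument to the factory -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="; "/>
  </analyzer>
</fieldType>
```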





The following sections describe the tokenizer factory classes included in this release of Solr.
For user tips about Solr’s tokenizers, see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters.

Standard Tokenizer
This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Delimiter
characters are discarded, with the following exceptions:
• Periods (dots) that are not followed by whitespace are kept as part of the token, including Internet
domain names.
• The "@" character is among the set of token-splitting punctuation, so email addresses are not preserved
as single tokens.
Note that words are split at hyphens.
The Standard Tokenizer supports Unicode standard annex UAX#29 word boundaries with the following
token types: <ALPHANUM>, <NUM>, <SOUTHEAST_ASIAN>, <IDEOGRAPHIC>, and <HIRAGANA>.
Factory class: solr.StandardTokenizerFactory
Arguments:


maxTokenLength: (integer, default 255) Solr ignores tokens that exceed the number of characters specified
by maxTokenLength.
Example:
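A minimal analyzer using this tokenizer with its default arguments could be sketched as:

```xml
<analyzer>
  <!-- maxTokenLength is left at its default -->
  <tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>
```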



In: "Please, email john.doe@foo.com by 03-09, re: m37-xq."
Out: "Please", "email", "john.doe", "foo.com", "by", "03", "09", "re", "m37", "xq"

Classic Tokenizer
The Classic Tokenizer preserves the same behavior as the Standard Tokenizer of Solr versions 3.1 and
previous. It does not use the Unicode standard annex UAX#29 word boundary rules that the Standard
Tokenizer uses. This tokenizer splits the text field into tokens, treating whitespace and punctuation as
delimiters. Delimiter characters are discarded, with the following exceptions:
• Periods (dots) that are not followed by whitespace are kept as part of the token.
• Words are split at hyphens, unless there is a number in the word, in which case the token is not split and
the numbers and hyphen(s) are preserved.
• Recognizes Internet domain names and email addresses and preserves them as a single token.
Factory class: solr.ClassicTokenizerFactory
Arguments:

maxTokenLength: (integer, default 255) Solr ignores tokens that exceed the number of characters specified
by maxTokenLength.
Example:
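A sketch of the corresponding analyzer, with no arguments set:

```xml
<analyzer>
  <tokenizer class="solr.ClassicTokenizerFactory"/>
</analyzer>
```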



In: "Please, email john.doe@foo.com by 03-09, re: m37-xq."
Out: "Please", "email", "john.doe@foo.com", "by", "03-09", "re", "m37-xq"

Keyword Tokenizer
This tokenizer treats the entire text field as a single token.
Factory class: solr.KeywordTokenizerFactory
Arguments: None


Example:
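Since the factory takes no arguments, the configuration presumably reduces to a single element:

```xml
<analyzer>
  <!-- The whole field value becomes one token -->
  <tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>
```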



In: "Please, email john.doe@foo.com by 03-09, re: m37-xq."
Out: "Please, email john.doe@foo.com by 03-09, re: m37-xq."

Letter Tokenizer
This tokenizer creates tokens from strings of contiguous letters, discarding all non-letter characters.
Factory class: solr.LetterTokenizerFactory
Arguments: None
Example:
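A sketch of an analyzer that uses only this tokenizer:

```xml
<analyzer>
  <tokenizer class="solr.LetterTokenizerFactory"/>
</analyzer>
```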



In: "I can’t."
Out: "I", "can", "t"

Lower Case Tokenizer
Tokenizes the input stream by delimiting at non-letters and then converting all letters to lowercase.
Whitespace and non-letters are discarded.
Factory class: solr.LowerCaseTokenizerFactory
Arguments: None
Example:
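The configuration would look roughly like this:

```xml
<analyzer>
  <!-- Splits at non-letters and lowercases in one step -->
  <tokenizer class="solr.LowerCaseTokenizerFactory"/>
</analyzer>
```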



In: "I just *LOVE* my iPhone!"
Out: "i", "just", "love", "my", "iphone"


N-Gram Tokenizer
Reads the field text and generates n-gram tokens of sizes in the given range.
Factory class: solr.NGramTokenizerFactory
Arguments:

minGramSize: (integer, default 1) The minimum n-gram size, must be > 0.
maxGramSize: (integer, default 2) The maximum n-gram size, must be >= minGramSize.
Example:
Default behavior. Note that this tokenizer operates over the whole field. It does not break the field at
whitespace. As a result, the space character is included in the encoding.
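With no attributes set, the defaults listed above (minGramSize 1, maxGramSize 2) apply. A sketch:

```xml
<analyzer>
  <tokenizer class="solr.NGramTokenizerFactory"/>
</analyzer>
```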



In: "hey man"
Out: "h", "e", "y", " ", "m", "a", "n", "he", "ey", "y ", " m", "ma", "an"
Example:
With an n-gram size range of 4 to 5:
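A sketch that sets both arguments explicitly:

```xml
<analyzer>
  <!-- Emit n-grams of 4 and 5 characters -->
  <tokenizer class="solr.NGramTokenizerFactory" minGramSize="4" maxGramSize="5"/>
</analyzer>
```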



In: "bicycle"
Out: "bicy", "bicyc", "icyc", "icycl", "cycl", "cycle", "ycle"

Edge N-Gram Tokenizer
Reads the field text and generates edge n-gram tokens of sizes in the given range.
Factory class: solr.EdgeNGramTokenizerFactory
Arguments:

minGramSize: (integer, default is 1) The minimum n-gram size, must be > 0.
maxGramSize: (integer, default is 1) The maximum n-gram size, must be >= minGramSize.
Example:


Default behavior (min and max default to 1):
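A sketch with no arguments, so both sizes stay at 1:

```xml
<analyzer>
  <tokenizer class="solr.EdgeNGramTokenizerFactory"/>
</analyzer>
```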



In: "babaloo"
Out: "b"
Example:
Edge n-gram range of 2 to 5
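The same tokenizer with an explicit range, as a sketch:

```xml
<analyzer>
  <!-- Emit leading-edge n-grams of 2 to 5 characters -->
  <tokenizer class="solr.EdgeNGramTokenizerFactory" minGramSize="2" maxGramSize="5"/>
</analyzer>
```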



In: "babaloo"
Out: "ba", "bab", "baba", "babal"

ICU Tokenizer
This tokenizer processes multilingual text and tokenizes it appropriately based on its script attribute.
You can customize this tokenizer’s behavior by specifying per-script rule files. To add per-script rules, add a
rulefiles argument, which should contain a comma-separated list of code:rulefile pairs in the following
format: four-letter ISO 15924 script code, followed by a colon, then a resource path. For example, to specify
rules for Latin (script code "Latn") and Cyrillic (script code "Cyrl"), you would enter
Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi.
The default configuration for solr.ICUTokenizerFactory provides UAX#29 word break rules tokenization
(like solr.StandardTokenizer), but also includes custom tailorings for Hebrew (specializing handling of
double and single quotation marks), for syllable tokenization for Khmer, Lao, and Myanmar, and dictionary-based word segmentation for CJK characters.
Factory class: solr.ICUTokenizerFactory
Arguments:

rulefiles: a comma-separated list of code:rulefile pairs in the following format: four-letter ISO 15924
script code, followed by a colon, then a resource path.
Example:
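Two sketches follow: the default configuration, and one that supplies per-script rules. The rule file names are the hypothetical ones used in the description above.

```xml
<!-- Default UAX#29-based behavior -->
<analyzer>
  <tokenizer class="solr.ICUTokenizerFactory"/>
</analyzer>

<!-- Custom rules for Latin and Cyrillic scripts (file names are placeholders) -->
<analyzer>
  <tokenizer class="solr.ICUTokenizerFactory"
             rulefiles="Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi"/>
</analyzer>
```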













To use this tokenizer, you must add additional .jars to Solr’s classpath (as described in the
section Lib Directives in SolrConfig). See the solr/contrib/analysis-extras/README.txt
for information on which jars you need to add to your SOLR_HOME/lib.

Path Hierarchy Tokenizer
This tokenizer creates synonyms from file path hierarchies.
Factory class: solr.PathHierarchyTokenizerFactory
Arguments:

delimiter: (character, no default) You can specify the file path delimiter and replace it with a delimiter you
provide. This can be useful for working with backslash delimiters.

replace: (character, no default) Specifies the delimiter character Solr uses in the tokenized output.
Example:
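A sketch that reads backslash-delimited paths and emits forward slashes, matching the In/Out lines below. The field type name text_path is a placeholder.

```xml
<fieldType name="text_path" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- delimiter: separator in the input; replace: separator in the output -->
    <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="\" replace="/"/>
  </analyzer>
</fieldType>
```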





In: "c:\usr\local\apache"
Out: "c:", "c:/usr", "c:/usr/local", "c:/usr/local/apache"

Regular Expression Pattern Tokenizer
This tokenizer uses a Java regular expression to break the input text stream into tokens. The expression
provided by the pattern argument can be interpreted either as a delimiter that separates tokens, or to
match patterns that should be extracted from the text as tokens.
See the Javadocs for java.util.regex.Pattern for more information on Java regular expression syntax.
Factory class: solr.PatternTokenizerFactory


Arguments:

pattern: (Required) The regular expression, as defined by java.util.regex.Pattern.
group: (Optional, default -1) Specifies which regex group to extract as the token(s). The value -1 means the
regex should be treated as a delimiter that separates tokens. Non-negative group numbers (>= 0) indicate
that character sequences matching that regex group should be converted to tokens. Group zero refers to
the entire regex, groups greater than zero refer to parenthesized sub-expressions of the regex, counted
from left to right.
Example:
A comma separated list. Tokens are separated by a sequence of zero or more spaces, a comma, and zero or
more spaces.
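One regular expression that fits this description is \s*,\s* used as a delimiter (group left at its default of -1). A sketch:

```xml
<analyzer>
  <!-- Split on a comma surrounded by optional whitespace -->
  <tokenizer class="solr.PatternTokenizerFactory" pattern="\s*,\s*"/>
</analyzer>
```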



In: "fee,fie, foe , fum, foo"
Out: "fee", "fie", "foe", "fum", "foo"
Example:
Extract simple, capitalized words. A sequence of at least one capital letter followed by zero or more letters of
either case is extracted as a token.
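Here the pattern describes the tokens themselves, so group is set to 0 (the entire match). The expression shown is one possible way to write it:

```xml
<analyzer>
  <!-- One capital letter followed by zero or more letters of either case -->
  <tokenizer class="solr.PatternTokenizerFactory" pattern="[A-Z][A-Za-z]*" group="0"/>
</analyzer>
```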



In: "Hello. My name is Inigo Montoya. You killed my father. Prepare to die."
Out: "Hello", "My", "Inigo", "Montoya", "You", "Prepare"
Example:
Extract part numbers which are preceded by "SKU", "Part" or "Part Number", case sensitive, with an
optional colon separator. Part numbers must be all numeric digits, with an optional hyphen. Regex
capture groups are numbered by counting left parentheses from left to right. Group 3 is the subexpression
"[0-9-]+", which matches one or more digits or hyphens.

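A pattern consistent with this description is sketched below. The exact expression is an assumption; what matters is that the third left parenthesis opens the [0-9-]+ group.

```xml
<analyzer>
  <!-- Group 1: SKU or Part (Number); group 2: " Number"; group 3: the part number -->
  <tokenizer class="solr.PatternTokenizerFactory"
             pattern="(SKU|Part(\sNumber)?):?\s([0-9-]+)" group="3"/>
</analyzer>
```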


In: "SKU: 1234, Part Number 5678, Part: 126-987"
Out: "1234", "5678", "126-987"

Simplified Regular Expression Pattern Tokenizer
This tokenizer is similar to the PatternTokenizerFactory described above, but uses Lucene RegExp pattern
matching to construct distinct tokens for the input stream. The syntax is more limited than
PatternTokenizerFactory, but the tokenization is quite a bit faster.
Factory class: solr.SimplePatternTokenizerFactory
Arguments:

pattern: (Required) The regular expression, as defined in the RegExp javadocs, identifying the characters
to include in tokens. The matching is greedy such that the longest token matching at a given point is
created. Empty tokens are never created.

maxDeterminizedStates: (Optional, default 10000) the limit on total state count for the determinized
automaton computed from the regexp.
Example:
To match tokens delimited by simple whitespace characters:
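A sketch under one assumption: the pattern is a negated character class, and tab, carriage return, and line feed are written as XML character references so that the expression does not rely on backslash escapes being understood by the RegExp syntax.

```xml
<analyzer>
  <!-- Tokens are runs of characters that are not space, tab, CR, or LF -->
  <tokenizer class="solr.SimplePatternTokenizerFactory" pattern="[^ &#9;&#13;&#10;]+"/>
</analyzer>
```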




Simplified Regular Expression Pattern Splitting Tokenizer
This tokenizer is similar to the SimplePatternTokenizerFactory described above, but uses Lucene RegExp
pattern matching to identify sequences of characters that should be used to split tokens. The syntax is more
limited than PatternTokenizerFactory, but the tokenization is quite a bit faster.
Factory class: solr.SimplePatternSplitTokenizerFactory
Arguments:

pattern: (Required) The regular expression, as defined in the RegExp javadocs, identifying the characters
that should split tokens. The matching is greedy such that the longest token separator matching at a given
point is matched. Empty tokens are never created.

maxDeterminizedStates: (Optional, default 10000) the limit on total state count for the determinized
automaton computed from the regexp.
Example:
To match tokens delimited by simple whitespace characters:
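A sketch in which the pattern names the separator characters rather than the token characters. Writing tab, carriage return, and line feed as XML character references is an assumption made to avoid depending on backslash escapes.

```xml
<analyzer>
  <!-- Split wherever one or more space, tab, CR, or LF characters occur -->
  <tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ &#9;&#13;&#10;]+"/>
</analyzer>
```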






UAX29 URL Email Tokenizer
This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Delimiter
characters are discarded, with the following exceptions:
• Periods (dots) that are not followed by whitespace are kept as part of the token.
• Words are split at hyphens, unless there is a number in the word, in which case the token is not split and
the numbers and hyphen(s) are preserved.
• Recognizes and preserves as single tokens the following:
◦ Internet domain names containing top-level domains validated against the white list in the IANA Root
Zone Database when the tokenizer was generated
◦ email addresses
◦ file://, http(s)://, and ftp:// URLs
◦ IPv4 and IPv6 addresses
The UAX29 URL Email Tokenizer supports Unicode standard annex UAX#29 word boundaries with the
following token types: <ALPHANUM>, <NUM>, <URL>, <EMAIL>, <SOUTHEAST_ASIAN>, <IDEOGRAPHIC>, and
<HIRAGANA>.
Factory class: solr.UAX29URLEmailTokenizerFactory
Arguments:

maxTokenLength: (integer, default 255) Solr ignores tokens that exceed the number of characters specified
by maxTokenLength.
Example:
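A sketch with default arguments:

```xml
<analyzer>
  <!-- URLs and email addresses are kept as single tokens -->
  <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
</analyzer>
```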



In: "Visit http://accarol.com/contact.htm?from=external&a=10 or e-mail bob.cratchet@accarol.com"
Out: "Visit", "http://accarol.com/contact.htm?from=external&a=10", "or", "e", "mail",
"bob.cratchet@accarol.com"

White Space Tokenizer
Simple tokenizer that splits the text stream on whitespace and returns sequences of non-whitespace
characters as tokens. Note that any punctuation will be included in the tokens.


Factory class: solr.WhitespaceTokenizerFactory
Arguments:

rule
Specifies how to define whitespace for the purpose of tokenization. Valid values:
• java: (Default) Uses Character.isWhitespace(int)
• unicode: Uses Unicode’s WHITESPACE property
Example:
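A sketch that states the default rule explicitly:

```xml
<analyzer>
  <!-- rule="java" is the default; "unicode" is the other valid value -->
  <tokenizer class="solr.WhitespaceTokenizerFactory" rule="java"/>
</analyzer>
```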



In: "To be, or what?"
Out: "To", "be,", "or", "what?"

OpenNLP Tokenizer and OpenNLP Filters
See OpenNLP Integration for information about using the OpenNLP Tokenizer, along with information about
available OpenNLP token filters.


Filter Descriptions
Filters examine a stream of tokens and keep them, transform them or discard them, depending on the filter
type being used.
You configure each filter with a <filter> element in schema.xml as a child of <analyzer>, following the
<tokenizer> element. Filter definitions should follow a tokenizer or another filter definition because they
take a TokenStream as input. For example:
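A sketch of the layout, with a placeholder field type name and one filter placed after the tokenizer:

```xml
<fieldType name="text" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Filters follow the tokenizer, in the order they should run -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```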



...


The class attribute names a factory class that will instantiate a filter object as needed. Filter factory classes
must implement the org.apache.solr.analysis.TokenFilterFactory interface. Like tokenizers, filters are
also instances of TokenStream and thus are producers of tokens. Unlike tokenizers, filters also consume
tokens from a TokenStream. This allows you to mix and match filters, in any order you prefer, downstream
of a tokenizer.
Arguments may be passed to filter factories to modify their behavior by setting attributes on the
<filter> element. For example:
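In the sketch below, min and max are arguments to LengthFilterFactory. The field type name, the delimiter pattern, and the length limits are illustrative assumptions.

```xml
<fieldType name="semicolonDelimited" class="solr.TextField">
  <analyzer type="query">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="; "/>
    <!-- Keep only tokens between 2 and 7 characters long -->
    <filter class="solr.LengthFilterFactory" min="2" max="7"/>
  </analyzer>
</fieldType>
```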






The following sections describe the filter factories that are included in this release of Solr.
For user tips about Solr’s filters, see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters.

ASCII Folding Filter
This filter converts alphabetic, numeric, and symbolic Unicode characters which are not in the Basic Latin
Unicode block (the 128 ASCII characters) to their ASCII equivalents, if one exists. This filter converts
characters from the following Unicode blocks:
• C1 Controls and Latin-1 Supplement (PDF)
• Latin Extended-A (PDF)
• Latin Extended-B (PDF)
• Latin Extended Additional (PDF)
• Latin Extended-C (PDF)
• Latin Extended-D (PDF)
• IPA Extensions (PDF)
• Phonetic Extensions (PDF)
• Phonetic Extensions Supplement (PDF)
• General Punctuation (PDF)
• Superscripts and Subscripts (PDF)
• Enclosed Alphanumerics (PDF)
• Dingbats (PDF)
• Supplemental Punctuation (PDF)
• Alphabetic Presentation Forms (PDF)
• Halfwidth and Fullwidth Forms (PDF)
Factory class: solr.ASCIIFoldingFilterFactory
Arguments:

preserveOriginal
(boolean, default false) If true, the original token is preserved: "thé" -> "the", "thé"
Example:
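A sketch that pairs the filter with a whitespace tokenizer; the choice of tokenizer is an assumption.

```xml
<analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- preserveOriginal="false" emits only the folded form -->
  <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false"/>
</analyzer>
```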




In: "á" (Unicode character 00E1)
Out: "a" (ASCII character 97)

Beider-Morse Filter
Implements the Beider-Morse Phonetic Matching (BMPM) algorithm, which allows identification of similar
names, even if they are spelled differently or in different languages. More information about how this works
is available in the section on Phonetic Matching.



BeiderMorseFilter changed its behavior in Solr 5.0 due to an update to version 3.04 of the
BMPM algorithm. Older versions of Solr implemented BMPM version 3.00 (see
http://stevemorse.org/phoneticinfo.htm). Any index built using this filter with earlier
versions of Solr will need to be rebuilt.

Factory class: solr.BeiderMorseFilterFactory
Arguments:


nameType
Types of names. Valid values are GENERIC, ASHKENAZI, or SEPHARDIC. If not processing Ashkenazi or
Sephardic names, use GENERIC.

ruleType
Types of rules to apply. Valid values are APPROX or EXACT.

concat
Defines if multiple possible matches should be combined with a pipe ("|").

languageSet
The language set to use. The value "auto" will allow the Filter to identify the language, or a comma-separated list can be supplied.
Example:
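A sketch that sets all four arguments described above. The specific values, and the choice of tokenizer, are illustrative.

```xml
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- Generic names, approximate rules, alternatives joined with "|" -->
  <filter class="solr.BeiderMorseFilterFactory"
          nameType="GENERIC" ruleType="APPROX" concat="true" languageSet="auto"/>
</analyzer>
```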






Classic Filter
This filter takes the output of the Classic Tokenizer and strips periods from acronyms and "'s" from
possessives.
Factory class: solr.ClassicFilterFactory
Arguments: None
Example:
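Because this filter works on the output of the Classic Tokenizer, a configuration consistent with the output below would pair the two, roughly as follows.

```xml
<analyzer>
  <!-- The Classic Tokenizer tags acronyms and possessives for the filter -->
  <tokenizer class="solr.ClassicTokenizerFactory"/>
  <filter class="solr.ClassicFilterFactory"/>
</analyzer>
```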




In: "I.B.M. cat’s can’t"
Tokenizer to Filter: "I.B.M", "cat’s", "can’t"
Out: "IBM", "cat", "can’t"

Common Grams Filter
This filter creates word shingles by combining common tokens such as stop words with regular tokens. This
is useful for creating phrase queries containing common words, such as "the cat." Solr normally ignores
stop words in queried phrases, so searching for "the cat" would return all matches for the word "cat."
Factory class: solr.CommonGramsFilterFactory
Arguments:

words
(a common word file in .txt format) Provide the name of a common word file, such as stopwords.txt.

format
(optional) If the stopwords list has been formatted for Snowball, you can specify format="snowball" so
Solr can read the stopwords file.

ignoreCase
(boolean) If true, the filter ignores the case of words when comparing them to the common word file. The
default is false.
Example:




In: "the Cat"
Tokenizer to Filter: "the", "Cat"
Out: "the_cat"

Collation Key Filter
Collation allows sorting of text in a language-sensitive way. It is usually used for sorting, but can also be
used with advanced searches. We’ve covered this in much more detail in the section on Unicode Collation.

Daitch-Mokotoff Soundex Filter
Implements the Daitch-Mokotoff Soundex algorithm, which allows identification of similar names, even if
they are spelled differently. More information about how this works is available in the section on Phonetic
Matching.
Factory class: solr.DaitchMokotoffSoundexFilterFactory
Arguments:

inject
(true/false) If true (the default), then new phonetic tokens are added to the stream. Otherwise, tokens
are replaced with the phonetic equivalent. Setting this to false will enable phonetic matching, but the
exact spelling of the target word may not match.
Example:







Double Metaphone Filter
This filter creates tokens using the DoubleMetaphone encoding algorithm from commons-codec. For more
information, see the Phonetic Matching section.
Factory class: solr.DoubleMetaphoneFilterFactory
Arguments:

inject
(true/false) If true (the default), then new phonetic tokens are added to the stream. Otherwise, tokens
are replaced with the phonetic equivalent. Setting this to false will enable phonetic matching, but the
exact spelling of the target word may not match.

maxCodeLength
(integer) The maximum length of the code to be generated.
Example:
Default behavior for inject (true): keep the original token and add phonetic token(s) at the same position.




In: "four score and Kuczewski"
Tokenizer to Filter: "four"(1), "score"(2), "and"(3), "Kuczewski"(4)
Out: "four"(1), "FR"(1), "score"(2), "SKR"(2), "and"(3), "ANT"(3), "Kuczewski"(4), "KSSK"(4), "KXFS"(4)
The phonetic tokens have a position increment of 0, which indicates that they are at the same position as the
token they were derived from (immediately preceding). Note that "Kuczewski" has two encodings, which are
added at the same position.
Example:
Discard original token (inject="false").
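A configuration of roughly this shape would discard the original tokens. The tokenizer is an assumption; the inject argument is the one described above.

```xml
<analyzer>
  <!-- Tokenizer choice is an assumption -->
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- inject="false" replaces each token with its phonetic encoding(s) -->
  <filter class="solr.DoubleMetaphoneFilterFactory" inject="false"/>
</analyzer>
```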






In: "four score and Kuczewski"
Tokenizer to Filter: "four"(1), "score"(2), "and"(3), "Kuczewski"(4)
Out: "FR"(1), "SKR"(2), "ANT"(3), "KSSK"(4), "KXFS"(4)
Note that "Kuczewski" has two encodings, which are added at the same position.

Edge N-Gram Filter
This filter generates edge n-gram tokens of sizes within the given range.
Factory class: solr.EdgeNGramFilterFactory
Arguments:

minGramSize
(integer, default 1) The minimum gram size.

maxGramSize
(integer, default 1) The maximum gram size.
Example:
Default behavior.




In: "four score and twenty"
Tokenizer to Filter: "four", "score", "and", "twenty"
Out: "f", "s", "a", "t"
Example:
A range of 1 to 4.
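A sketch of a configuration for this range, assuming a standard tokenizer in front of the filter:

```xml
<analyzer>
  <!-- Tokenizer choice is an assumption -->
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- Emit leading-edge grams of 1 to 4 characters for each token -->
  <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="4"/>
</analyzer>
```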




In: "four score"
Tokenizer to Filter: "four", "score"
Out: "f", "fo", "fou", "four", "s", "sc", "sco", "scor"


Example:
A range of 4 to 6.




In: "four score and twenty"
Tokenizer to Filter: "four", "score", "and", "twenty"
Out: "four", "scor", "score", "twen", "twent", "twenty"

English Minimal Stem Filter
This filter stems plural English words to their singular form.
Factory class: solr.EnglishMinimalStemFilterFactory
Arguments: None
Example:




In: "dogs cats"
Tokenizer to Filter: "dogs", "cats"
Out: "dog", "cat"

English Possessive Filter
This filter removes singular possessives (trailing 's) from words. Note that plural possessives, e.g., the s' in
"divers' snorkels", are not removed by this filter.
Factory class: solr.EnglishPossessiveFilterFactory
Arguments: None
Example:






In: "Man’s dog bites dogs' man"
Tokenizer to Filter: "Man’s", "dog", "bites", "dogs'", "man"
Out: "Man", "dog", "bites", "dogs'", "man"

Fingerprint Filter
This filter outputs a single token which is a concatenation of the sorted and de-duplicated set of input
tokens. This can be useful for clustering/linking use cases.
Factory class: solr.FingerprintFilterFactory
Arguments:

separator
The character used to separate tokens combined into the single output token. Defaults to " " (a space
character).

maxOutputTokenSize
The maximum length of the summarized output token. If exceeded, no output token is emitted. Defaults
to 1024.
Example:
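The output below joins tokens with an underscore, so a matching configuration would set separator to "_". The whitespace tokenizer is an assumption.

```xml
<analyzer>
  <!-- Tokenizer choice is an assumption -->
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- Sort, de-duplicate, and join all tokens with "_" into one output token -->
  <filter class="solr.FingerprintFilterFactory" separator="_"/>
</analyzer>
```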




In: "the quick brown fox jumped over the lazy dog"
Tokenizer to Filter: "the", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"
Out: "brown_dog_fox_jumped_lazy_over_quick_the"

Flatten Graph Filter
This filter must be included on index-time analyzer specifications that include at least one graph-aware filter,
including Synonym Graph Filter and Word Delimiter Graph Filter.
Factory class: solr.FlattenGraphFilterFactory
Arguments: None


See the examples below for Synonym Graph Filter and Word Delimiter Graph Filter.

Hunspell Stem Filter
The Hunspell Stem Filter provides support for several languages. You must provide the dictionary (.dic)
and rules (.aff) files for each language you wish to use with the Hunspell Stem Filter. You can download
those language files here.
Be aware that your results will vary widely based on the quality of the provided dictionary and rules files. For
example, some languages have only a minimal word list with no morphological information. On the other
hand, for languages that have no stemmer but do have an extensive dictionary file, the Hunspell stemmer
may be a good choice.
Factory class: solr.HunspellStemFilterFactory
Arguments:

dictionary
(required) The path of a dictionary file.

affix
(required) The path of a rules file.

ignoreCase
(boolean) controls whether matching is case sensitive or not. The default is false.

strictAffixParsing
(boolean) controls whether the affix parsing is strict or not. If true, an error while reading an affix rule
causes a ParseException; otherwise it is ignored. The default is true.
Example:




In: "jump jumping jumped"
Tokenizer to Filter: "jump", "jumping", "jumped"
Out: "jump", "jump", "jump"

Hyphenated Words Filter
This filter reconstructs hyphenated words that have been tokenized as two tokens because of a line break or
other intervening whitespace in the field text. If a token ends with a hyphen, it is joined with the following
token and the hyphen is discarded.
Note that for this filter to work properly, the upstream tokenizer must not remove trailing hyphen
characters. This filter is generally only useful at index time.
Factory class: solr.HyphenatedWordsFilterFactory
Arguments: None
Example:




In: "A hyphen- ated word"
Tokenizer to Filter: "A", "hyphen-", "ated", "word"
Out: "A", "hyphenated", "word"

ICU Folding Filter
This filter is a custom Unicode normalization form that applies the foldings specified in Unicode Technical
Report 30 in addition to the NFKC_Casefold normalization form as described in ICU Normalizer 2 Filter. This
filter is a better substitute for the combined behavior of the ASCII Folding Filter, Lower Case Filter, and ICU
Normalizer 2 Filter.
To use this filter, see solr/contrib/analysis-extras/README.txt for instructions on which jars you need
to add to your solr_home/lib. For more information about adding jars, see the section Lib Directives in
Solrconfig.
Factory class: solr.ICUFoldingFilterFactory
Arguments: None
Example:




For detailed information on this normalization form, see http://www.unicode.org/reports/tr30/tr30-4.html.

ICU Normalizer 2 Filter
This filter factory normalizes text according to one of five Unicode Normalization Forms as described in
Unicode Standard Annex #15:
• NFC: (name="nfc" mode="compose") Normalization Form C, canonical decomposition, followed by
canonical composition
• NFD: (name="nfc" mode="decompose") Normalization Form D, canonical decomposition
• NFKC: (name="nfkc" mode="compose") Normalization Form KC, compatibility decomposition, followed by
canonical composition
• NFKD: (name="nfkc" mode="decompose") Normalization Form KD, compatibility decomposition
• NFKC_Casefold: (name="nfkc_cf" mode="compose") Normalization Form KC, with additional Unicode
case folding. Using the ICU Normalizer 2 Filter is a better-performing substitution for the Lower Case
Filter and NFKC normalization.
Factory class: solr.ICUNormalizer2FilterFactory
Arguments:

name
(string) The name of the normalization form; nfc, nfd, nfkc, nfkd, nfkc_cf

mode
(string) The mode of Unicode character composition and decomposition; compose or decompose
Example:
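A minimal sketch selecting the NFKC_Casefold form with the two arguments described above. The tokenizer is an assumption, and the ICU jars mentioned below must be on the classpath.

```xml
<analyzer>
  <!-- Tokenizer choice is an assumption -->
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- NFKC normalization with Unicode case folding -->
  <filter class="solr.ICUNormalizer2FilterFactory" name="nfkc_cf" mode="compose"/>
</analyzer>
```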




For detailed information about these Unicode Normalization Forms, see http://unicode.org/reports/tr15/.
To use this filter, see solr/contrib/analysis-extras/README.txt for instructions on which jars you need
to add to your solr_home/lib.

ICU Transform Filter
This filter applies ICU Transforms to text. This filter supports only ICU System Transforms. Custom rule sets
are not supported.
Factory class: solr.ICUTransformFilterFactory
Arguments:

id
(string) The identifier for the ICU System Transform you wish to apply with this filter. For a full list of ICU
System Transforms, see http://demo.icu-project.org/icu-bin/translit?TEMPLATE_FILE=data/
translit_rule_main.html.
Example:






For detailed information about ICU Transforms, see http://userguide.icu-project.org/transforms/general.
To use this filter, see solr/contrib/analysis-extras/README.txt for instructions on which jars you need
to add to your solr_home/lib.

Keep Word Filter
This filter discards all tokens except those that are listed in the given word list. This is the inverse of the Stop
Words Filter. This filter can be useful for building specialized indices for a constrained set of terms.
Factory class: solr.KeepWordFilterFactory
Arguments:

words
(required) Path of a text file containing the list of keep words, one per line. Blank lines and lines that
begin with "#" are ignored. This may be an absolute path, or a simple filename in the Solr conf directory.

ignoreCase
(true/false) If true then comparisons are done case-insensitively. If this argument is true, then the words
file is assumed to contain only lowercase words. The default is false.

enablePositionIncrements
If luceneMatchVersion is 4.3 or earlier and enablePositionIncrements="false", no position holes will
be left by this filter when it removes tokens. This argument is invalid if luceneMatchVersion is 5.0 or
later.
Example:
Where keepwords.txt contains:

happy
funny
silly




In: "Happy, sad or funny"
Tokenizer to Filter: "Happy", "sad", "or", "funny"
Out: "funny"
Example:


Same keepwords.txt, case insensitive:
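A configuration of roughly this shape gives case-insensitive matching against keepwords.txt. The tokenizer is an assumption.

```xml
<analyzer>
  <!-- Tokenizer choice is an assumption -->
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- ignoreCase="true": "Happy" matches the lowercase entry "happy" -->
  <filter class="solr.KeepWordFilterFactory" words="keepwords.txt" ignoreCase="true"/>
</analyzer>
```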




In: "Happy, sad or funny"
Tokenizer to Filter: "Happy", "sad", "or", "funny"
Out: "Happy", "funny"
Example:
Using LowerCaseFilterFactory before filtering for keep words, no ignoreCase flag.





In: "Happy, sad or funny"
Tokenizer to Filter: "Happy", "sad", "or", "funny"
Filter to Filter: "happy", "sad", "or", "funny"
Out: "happy", "funny"

KStem Filter
KStem is an alternative to the Porter Stem Filter for developers looking for a less aggressive stemmer. KStem
was written by Bob Krovetz and ported to Lucene by Sergio Guzman-Lara (UMASS Amherst). This stemmer is
only appropriate for English language text.
Factory class: solr.KStemFilterFactory
Arguments: None
Example:




In: "jump jumping jumped"
Tokenizer to Filter: "jump", "jumping", "jumped"
Out: "jump", "jump", "jump"

Length Filter
This filter passes tokens whose length falls within the min/max limit specified. All other tokens are
discarded.
Factory class: solr.LengthFilterFactory
Arguments:

min
(integer, required) Minimum token length. Tokens shorter than this are discarded.

max
(integer, required, must be >= min) Maximum token length. Tokens longer than this are discarded.

enablePositionIncrements
If luceneMatchVersion is 4.3 or earlier and enablePositionIncrements="false", no position holes will
be left by this filter when it removes tokens. This argument is invalid if luceneMatchVersion is 5.0 or
later.
Example:
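The output below keeps tokens of four and five characters and drops those of two and eleven, so limits such as min="3" and max="7" would be consistent with it. Those values and the tokenizer are assumptions.

```xml
<analyzer>
  <!-- Tokenizer choice is an assumption -->
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- Keep only tokens whose length is between 3 and 7 characters, inclusive -->
  <filter class="solr.LengthFilterFactory" min="3" max="7"/>
</analyzer>
```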




In: "turn right at Albuquerque"
Tokenizer to Filter: "turn", "right", "at", "Albuquerque"
Out: "turn", "right"

Limit Token Count Filter
This filter limits the number of accepted tokens, typically useful for index analysis.
By default, this filter ignores any tokens in the wrapped TokenStream once the limit has been reached, which
can result in reset() being called prior to incrementToken() returning false. For most TokenStream
implementations this should be acceptable, and faster than consuming the full stream. If you are wrapping
a TokenStream which requires that the full stream of tokens be exhausted in order to function properly, use
the consumeAllTokens="true" option.
Factory class: solr.LimitTokenCountFilterFactory
Arguments:


maxTokenCount
(integer, required) Maximum token count. After this limit has been reached, tokens are discarded.

consumeAllTokens
(boolean, defaults to false) Whether to consume (and discard) previous token filters' tokens after the
maximum token count has been reached. See description above.
Example:
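A sketch of a configuration that keeps the first ten tokens, matching the output below. The whitespace tokenizer and the index-time analyzer type are assumptions.

```xml
<analyzer type="index">
  <!-- Tokenizer choice is an assumption -->
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- Pass the first 10 tokens; stop reading the wrapped stream after that -->
  <filter class="solr.LimitTokenCountFilterFactory"
          maxTokenCount="10" consumeAllTokens="false"/>
</analyzer>
```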




In: "1 2 3 4 5 6 7 8 9 10 11 12"
Tokenizer to Filter: "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"
Out: "1", "2", "3", "4", "5", "6", "7", "8", "9", "10"

Limit Token Offset Filter
This filter limits tokens to those before a configured maximum start character offset. This can be useful to
limit highlighting, for example.
By default, this filter ignores any tokens in the wrapped TokenStream once the limit has been reached, which
can result in reset() being called prior to incrementToken() returning false. For most TokenStream
implementations this should be acceptable, and faster than consuming the full stream. If you are wrapping
a TokenStream which requires that the full stream of tokens be exhausted in order to function properly, use
the consumeAllTokens="true" option.
Factory class: solr.LimitTokenOffsetFilterFactory
Arguments:

maxStartOffset
(integer, required) Maximum token start character offset. After this limit has been reached, tokens are
discarded.

consumeAllTokens
(boolean, defaults to false) Whether to consume (and discard) previous token filters' tokens after the
maximum start offset has been reached. See description above.
Example:
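In the input below, the single-character tokens start at offsets 0, 2, 4, and so on, and the last token kept ("A") starts at offset 10. A matching configuration would therefore set maxStartOffset to 10. The whitespace tokenizer is an assumption.

```xml
<analyzer>
  <!-- Tokenizer choice is an assumption -->
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- Keep tokens whose start offset is at most 10 -->
  <filter class="solr.LimitTokenOffsetFilterFactory"
          maxStartOffset="10" consumeAllTokens="false"/>
</analyzer>
```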






In: "0 2 4 6 8 A C E"
Tokenizer to Filter: "0", "2", "4", "6", "8", "A", "C", "E"
Out: "0", "2", "4", "6", "8", "A"

Limit Token Position Filter
This filter limits tokens to those before a configured maximum token position.
By default, this filter ignores any tokens in the wrapped TokenStream once the limit has been reached, which
can result in reset() being called prior to incrementToken() returning false. For most TokenStream
implementations this should be acceptable, and faster than consuming the full stream. If you are wrapping
a TokenStream which requires that the full stream of tokens be exhausted in order to function properly, use
the consumeAllTokens="true" option.
Factory class: solr.LimitTokenPositionFilterFactory
Arguments:

maxTokenPosition
(integer, required) Maximum token position. After this limit has been reached, tokens are discarded.

consumeAllTokens
(boolean, defaults to false) Whether to consume (and discard) previous token filters' tokens after the
maximum token position has been reached. See description above.
Example:
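A configuration along these lines keeps tokens at positions 1 through 3, as in the output below. The whitespace tokenizer is an assumption.

```xml
<analyzer>
  <!-- Tokenizer choice is an assumption -->
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- Discard every token whose position is greater than 3 -->
  <filter class="solr.LimitTokenPositionFilterFactory"
          maxTokenPosition="3" consumeAllTokens="false"/>
</analyzer>
```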




In: "1 2 3 4 5"
Tokenizer to Filter: "1", "2", "3", "4", "5"
Out: "1", "2", "3"

Lower Case Filter
Converts any uppercase letters in a token to the equivalent lowercase token. All other characters are left
unchanged.
Factory class: solr.LowerCaseFilterFactory
Arguments: None
Example:




In: "Down With CamelCase"
Tokenizer to Filter: "Down", "With", "CamelCase"
Out: "down", "with", "camelcase"

Managed Stop Filter
This is a specialized version of the Stop Words Filter Factory that uses a set of stop words that are managed
from a REST API.
Arguments:

managed
The name that should be used for this set of stop words in the managed REST API.
Example: With this configuration the set of words is named "english" and can be managed via

/solr/collection_name/schema/analysis/stopwords/english
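A filter declaration consistent with that endpoint might look like the following sketch. The surrounding analyzer and the tokenizer are assumptions; the managed argument carries the name "english".

```xml
<analyzer>
  <!-- Tokenizer choice is an assumption -->
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- The stop word set named "english" is maintained through the REST API -->
  <filter class="solr.ManagedStopFilterFactory" managed="english"/>
</analyzer>
```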




See Stop Filter for example input/output.

Managed Synonym Filter
This is a specialized version of the Synonym Filter that uses a mapping on synonyms that is managed from a
REST API.



Managed Synonym Filter has been Deprecated
Managed Synonym Filter has been deprecated in favor of Managed Synonym Graph Filter,
which is required for multi-term synonym support.

Factory class: solr.ManagedSynonymFilterFactory


For arguments and examples, see the Synonym Graph Filter below.

Managed Synonym Graph Filter
This is a specialized version of the Synonym Graph Filter that uses a mapping on synonyms that is managed
from a REST API.
This filter maps single- or multi-token synonyms, producing a fully correct graph output. This filter is a
replacement for the Managed Synonym Filter, which produces incorrect graphs for multi-token synonyms.



Although this filter produces correct token graphs, it cannot consume an input token graph
correctly.

Arguments:

managed
The name that should be used for this mapping on synonyms in the managed REST API.
Example: With this configuration the set of mappings is named "english" and can be managed via

/solr/collection_name/schema/analysis/synonyms/english



 





See Synonym Graph Filter below for example input/output.

N-Gram Filter
Generates n-gram tokens of sizes in the given range. Note that tokens are ordered by position and then by
gram size.
Factory class: solr.NGramFilterFactory
Arguments:

minGramSize
(integer, default 1) The minimum gram size.

maxGramSize
(integer, default 2) The maximum gram size.
Example:


Default behavior.




In: "four score"
Tokenizer to Filter: "four", "score"
Out: "f", "o", "u", "r", "fo", "ou", "ur", "s", "c", "o", "r", "e", "sc", "co", "or", "re"
Example:
A range of 1 to 4.




In: "four score"
Tokenizer to Filter: "four", "score"
Out: "f", "fo", "fou", "four", "o", "ou", "our", "u", "ur", "r", "s", "sc", "sco", "scor", "c", "co", "cor", "core",
"o", "or", "ore", "r", "re", "e"
Example:
A range of 3 to 5.
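A sketch of a configuration for this range, assuming a standard tokenizer in front of the filter:

```xml
<analyzer>
  <!-- Tokenizer choice is an assumption -->
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- Emit every substring of 3 to 5 characters from each token -->
  <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="5"/>
</analyzer>
```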




In: "four score"
Tokenizer to Filter: "four", "score"
Out: "fou", "four", "our", "sco", "scor", "score", "cor", "core", "ore"

Numeric Payload Token Filter
This filter adds a numeric floating point payload value to tokens that match a given type. Refer to the
Javadoc for the org.apache.lucene.analysis.Token class for more information about token types and
payloads.


Factory class: solr.NumericPayloadTokenFilterFactory
Arguments:

payload
(required) A floating point value that will be added to all matching tokens.

typeMatch
(required) A token type name string. Tokens with a matching type name will have their payload set to the
above floating point value.
Example:
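The output below attaches a payload of 0.75 to every token. A configuration along these lines would do that, assuming a whitespace tokenizer, which emits tokens of type "word".

```xml
<analyzer>
  <!-- Assumption: the whitespace tokenizer assigns the token type "word" -->
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- Attach the payload 0.75 to every token whose type is "word" -->
  <filter class="solr.NumericPayloadTokenFilterFactory"
          payload="0.75" typeMatch="word"/>
</analyzer>
```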




In: "bing bang boom"
Tokenizer to Filter: "bing", "bang", "boom"
Out: "bing"[0.75], "bang"[0.75], "boom"[0.75]

Pattern Replace Filter
This filter applies a regular expression to each token and, for those that match, substitutes the given
replacement string in place of the matched pattern. Tokens which do not match are passed through
unchanged.
Factory class: solr.PatternReplaceFilterFactory
Arguments:

pattern
(required) The regular expression to test against each token, as per java.util.regex.Pattern.

replacement
(required) A string to substitute in place of the matched pattern. This string may contain references to
capture groups in the regex pattern. See the Javadoc for java.util.regex.Matcher.

replace
("all" or "first", default "all") Indicates whether all occurrences of the pattern in the token should be
replaced, or only the first.
Example:
Simple string replace:






In: "cat concatenate catycat"
Tokenizer to Filter: "cat", "concatenate", "catycat"
Out: "dog", "condogenate", "dogydog"
Example:
String replacement, first occurrence only:




In: "cat concatenate catycat"
Tokenizer to Filter: "cat", "concatenate", "catycat"
Out: "dog", "condogenate", "dogycat"
Example:
More complex pattern with capture group reference in the replacement. Tokens that start with non-numeric
characters and end with digits will have an underscore inserted before the numbers. Otherwise the token is
passed through.
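One pattern that behaves as described is sketched here. The regular expression and the tokenizer are assumptions; the replacement uses the $1 and $2 capture group references of java.util.regex.Matcher.

```xml
<analyzer>
  <!-- Tokenizer choice is an assumption -->
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- Group 1: a run of non-digits; group 2: the digits that end the token -->
  <filter class="solr.PatternReplaceFilterFactory"
          pattern="(\D+)(\d+)$" replacement="$1_$2"/>
</analyzer>
```

With this pattern, "foo1234" becomes "foo_1234", while "cat", "9987", and "blah1234foo" do not match and pass through unchanged.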




In: "cat foo1234 9987 blah1234foo"
Tokenizer to Filter: "cat", "foo1234", "9987", "blah1234foo"
Out: "cat", "foo_1234", "9987", "blah1234foo"

Phonetic Filter
This filter creates tokens using one of the phonetic encoding algorithms in the
org.apache.commons.codec.language package. For more information, see the section on Phonetic
Matching.


Factory class: solr.PhoneticFilterFactory
Arguments:

encoder
(required) The name of the encoder to use. The encoder name must be one of the following (case
insensitive): DoubleMetaphone, Metaphone, Soundex, RefinedSoundex, Caverphone (v2.0),
ColognePhonetic, or Nysiis.

inject
(true/false) If true (the default), then new phonetic tokens are added to the stream. Otherwise, tokens
are replaced with the phonetic equivalent. Setting this to false will enable phonetic matching, but the
exact spelling of the target word may not match.

maxCodeLength
(integer) The maximum length of the code to be generated by the Metaphone or Double Metaphone
encoders.
Example:
Default behavior for DoubleMetaphone encoding.




In: "four score and twenty"
Tokenizer to Filter: "four"(1), "score"(2), "and"(3), "twenty"(4)
Out: "four"(1), "FR"(1), "score"(2), "SKR"(2), "and"(3), "ANT"(3), "twenty"(4), "TNT"(4)
The phonetic tokens have a position increment of 0, which indicates that they are at the same position as the
token they were derived from (immediately preceding).
Example:
Discard original token.




In: "four score and twenty"
Tokenizer to Filter: "four"(1), "score"(2), "and"(3), "twenty"(4)
Out: "FR"(1), "SKR"(2), "ANT"(3), "TNT"(4)


Example:
Default Soundex encoder.
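A configuration of roughly this shape selects the Soundex encoder and leaves inject at its default of true. The tokenizer is an assumption.

```xml
<analyzer>
  <!-- Tokenizer choice is an assumption -->
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- Original tokens are kept; Soundex codes are added at the same positions -->
  <filter class="solr.PhoneticFilterFactory" encoder="Soundex"/>
</analyzer>
```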




In: "four score and twenty"
Tokenizer to Filter: "four"(1), "score"(2), "and"(3), "twenty"(4)
Out: "four"(1), "F600"(1), "score"(2), "S600"(2), "and"(3), "A530"(3), "twenty"(4), "T530"(4)

Porter Stem Filter
This filter applies the Porter Stemming Algorithm for English. The results are similar to using the Snowball
Porter Stemmer with the language="English" argument. But this stemmer is coded directly in Java and is
not based on Snowball. It does not accept a list of protected words and is only appropriate for English
language text. However, it has been benchmarked as four times faster than the English Snowball stemmer,
so can provide a performance enhancement.
Factory class: solr.PorterStemFilterFactory
Arguments: None
Example:




In: "jump jumping jumped"
Tokenizer to Filter: "jump", "jumping", "jumped"
Out: "jump", "jump", "jump"

Remove Duplicates Token Filter
The filter removes duplicate tokens in the stream. Tokens are considered to be duplicates ONLY if they have
the same text and position values.
Because positions must be the same, this filter might not do what a user expects it to do based on its name.
It is a very specialized filter that is only useful in very specific circumstances. It has been so named for
brevity, even though it is potentially misleading.
Factory class: solr.RemoveDuplicatesTokenFilterFactory


Arguments: None
Example:
One example of where RemoveDuplicatesTokenFilterFactory is useful is in situations where a synonym file is
being used in conjunction with a stemmer. In these situations, both the stemmer and the synonym filter can
cause completely identical terms with the same positions to end up in the stream, increasing index size with
no benefit.
Consider the following entry from a synonyms.txt file:
Television, Televisions, TV, TVs
When used in the following configuration:
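A configuration matching the stages traced below might look like this sketch. The specific synonym and stemming filter factories, and the query-time analyzer type, are assumptions chosen to reproduce the behavior described.

```xml
<analyzer type="query">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- Expands "TV" into every entry on its synonyms.txt line -->
  <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"/>
  <!-- Plural stemming turns "Televisions" and "TVs" into duplicates -->
  <filter class="solr.EnglishMinimalStemFilterFactory"/>
  <!-- Drops tokens that repeat the same text at the same position -->
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
```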






In: "Watch TV"
Tokenizer to Synonym Filter: "Watch"(1) "TV"(2)
Synonym Filter to Stem Filter: "Watch"(1) "Television"(2) "Televisions"(2) "TV"(2) "TVs"(2)
Stem Filter to Remove Dups Filter: "Watch"(1) "Television"(2) "Television"(2) "TV"(2) "TV"(2)
Out: "Watch"(1) "Television"(2) "TV"(2)

Reversed Wildcard Filter
This filter reverses tokens to provide faster leading wildcard and prefix queries. Tokens without wildcards
are not reversed.
Factory class: solr.ReversedWildcardFilterFactory
Arguments:

withOriginal
(boolean) If true, the filter produces both original and reversed tokens at the same positions. If false,
produces only reversed tokens.

maxPosAsterisk
(integer, default = 2) The maximum position of the asterisk wildcard ('*') that triggers the reversal of the
query term. Terms with asterisks at positions above this value are not reversed.

maxPosQuestion
(integer, default = 1) The maximum position of the question mark wildcard ('?') that triggers the reversal
of the query term. To reverse only pure suffix queries (queries with a single leading asterisk), set this to 0 and
maxPosAsterisk to 1.

maxFractionAsterisk
(float, default = 0.0) An additional parameter that triggers the reversal if asterisk ('*') position is less than
this fraction of the query token length.

minTrailing
(integer, default = 2) The minimum number of trailing characters in a query token after the last wildcard
character. For good performance this should be set to a value larger than 1.
Example:




In: "*foo *bar"
Tokenizer to Filter: "*foo", "*bar"
Out: "oof*", "rab*"

Shingle Filter
This filter constructs shingles, which are token n-grams, from the token stream. It combines runs of tokens
into a single token.
Factory class: solr.ShingleFilterFactory
Arguments:

minShingleSize
(integer, must be >= 2, default 2) The minimum number of tokens per shingle.

maxShingleSize
(integer, must be >= minShingleSize, default 2) The maximum number of tokens per shingle.

outputUnigrams
(boolean, default true) If true, then each individual token is also included at its original position.

outputUnigramsIfNoShingles
(boolean, default false) If true, then individual tokens will be output if no shingles are possible.

tokenSeparator
(string, default is " ") The string to use when joining adjacent tokens to form a shingle.
Example:


Default behavior.




In: "To be, or what?"
Tokenizer to Filter: "To"(1), "be"(2), "or"(3), "what"(4)
Out: "To"(1), "To be"(1), "be"(2), "be or"(2), "or"(3), "or what"(3), "what"(4)
Example:
A shingle size of four, do not include original token.
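The output below contains shingles of two, three, and four tokens and no single tokens. A configuration consistent with that leaves minShingleSize at its default of 2, sets maxShingleSize to 4, and turns off unigrams. The tokenizer is an assumption.

```xml
<analyzer>
  <!-- Tokenizer choice is an assumption -->
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- Shingles of 2 to 4 tokens; the individual tokens are not emitted -->
  <filter class="solr.ShingleFilterFactory" maxShingleSize="4" outputUnigrams="false"/>
</analyzer>
```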




In: "To be, or not to be."
Tokenizer to Filter: "To"(1), "be"(2), "or"(3), "not"(4), "to"(5), "be"(6)
Out: "To be"(1), "To be or"(1), "To be or not"(1), "be or"(2), "be or not"(2), "be or not to"(2), "or not"(3), "or
not to"(3), "or not to be"(3), "not to"(4), "not to be"(4), "to be"(5)

Snowball Porter Stemmer Filter
This filter factory instantiates a language-specific stemmer generated by Snowball. Snowball is a software
package that generates pattern-based word stemmers. This type of stemmer is not as accurate as a table-based
stemmer, but is faster and less complex. Table-driven stemmers are labor intensive to create and
maintain and so are typically commercial products.
Solr contains Snowball stemmers for Armenian, Basque, Catalan, Danish, Dutch, English, Finnish, French,
German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish and Turkish. For
more information on Snowball, visit http://snowball.tartarus.org/.

StopFilterFactory, CommonGramsFilterFactory, and CommonGramsQueryFilterFactory can optionally read
stopwords in Snowball format (specify format="snowball" in the configuration of those FilterFactories).
Factory class: solr.SnowballPorterFilterFactory
Arguments:

language
(default "English") The name of a language, used to select the appropriate Porter stemmer to use. Case is
significant. This string is used to select a package name in the org.tartarus.snowball.ext class
hierarchy.

protected
Path of a text file containing a list of protected words, one per line. Protected words will not be stemmed.
Blank lines and lines that begin with "#" are ignored. This may be an absolute path, or a simple file name
in the Solr conf directory.
Example:
Default behavior:




In: "flip flipped flipping"
Tokenizer to Filter: "flip", "flipped", "flipping"
Out: "flip", "flip", "flip"
Example:
French stemmer, English words:
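A sketch of selecting a non-default stemmer, again assuming the Standard Tokenizer. The language value is case-sensitive, as noted in the argument description.

```xml
<analyzer>
  <!-- Assumption: the Standard Tokenizer supplies the input tokens -->
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- Selects the French Snowball stemmer -->
  <filter class="solr.SnowballPorterFilterFactory" language="French"/>
</analyzer>
```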




In: "flip flipped flipping"
Tokenizer to Filter: "flip", "flipped", "flipping"
Out: "flip", "flipped", "flipping"
Example:
Spanish stemmer, Spanish words:




In: "cante canta"
Tokenizer to Filter: "cante", "canta"
Out: "cant", "cant"


Standard Filter
This filter removes dots from acronyms and the substring "'s" from the end of tokens. This filter depends on
the tokens being tagged with the appropriate term-type to recognize acronyms and words with
apostrophes.
Factory class: solr.StandardFilterFactory
Arguments: None



This filter is no longer operational in Solr when the luceneMatchVersion (in
solrconfig.xml) is higher than "3.1".

Stop Filter
This filter discards, or stops analysis of, tokens that are on the given stop words list. A standard stop words
list is included in the Solr conf directory, named stopwords.txt, which is appropriate for typical English
language text.
Factory class: solr.StopFilterFactory
Arguments:

words
(optional) The path to a file that contains a list of stop words, one per line. Blank lines and lines that begin
with "#" are ignored. This may be an absolute path, or path relative to the Solr conf directory.

format
(optional) If the stopwords list has been formatted for Snowball, you can specify format="snowball" so
Solr can read the stopwords file.

ignoreCase
(true/false, default false) Ignore case when testing for stop words. If true, the stop list should contain
lowercase words.

enablePositionIncrements
If luceneMatchVersion is 4.4 or earlier and enablePositionIncrements="false", no position holes will
be left by this filter when it removes tokens. This argument is invalid if luceneMatchVersion is 5.0 or
later.
Example:
Case-sensitive matching, capitalized words not stopped. Token positions skip stopped words.
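A minimal sketch of such a configuration, assuming the Standard Tokenizer and the stopwords.txt file from the Solr conf directory. ignoreCase is left at its default of false.

```xml
<analyzer>
  <!-- Assumption: the Standard Tokenizer supplies the input tokens -->
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- Case-sensitive stop word matching against conf/stopwords.txt -->
  <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
</analyzer>
```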




In: "To be or what?"
Tokenizer to Filter: "To"(1), "be"(2), "or"(3), "what"(4)
Out: "To"(1), "what"(4)
Example:
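The output of this example, in which the capitalized "To" is also removed, is consistent with case-insensitive matching. A sketch under that assumption, with the tokenizer choice also assumed:

```xml
<analyzer>
  <!-- Assumption: the Standard Tokenizer supplies the input tokens -->
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- ignoreCase="true" lets the lowercase entry "to" match the token "To" -->
  <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
</analyzer>
```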




In: "To be or what?"
Tokenizer to Filter: "To"(1), "be"(2), "or"(3), "what"(4)
Out: "what"(4)

Suggest Stop Filter
Like Stop Filter, this filter discards, or stops analysis of, tokens that are on the given stop words list.
Suggest Stop Filter differs from Stop Filter in that it will not remove the last token unless it is followed by a
token separator. For example, a query "find the" would preserve the 'the' since it was not followed by a
space, punctuation etc., and mark it as a KEYWORD so that following filters will not change or remove it.
By contrast, a query like “find the popsicle” would remove ‘the’ as a stopword, since it’s followed by a space.
When using one of the analyzing suggesters, you would normally use the ordinary StopFilterFactory in
your index analyzer and then SuggestStopFilter in your query analyzer.
Factory class: solr.SuggestStopFilterFactory
Arguments:

words
(optional; default: StopAnalyzer#ENGLISH_STOP_WORDS_SET) The name of a stopwords file to parse.

format
(optional; default: wordset) Defines how the words file will be parsed. If words is not specified, then
format must not be specified. The valid values for the format option are:

wordset
This is the default format, which supports one word per line (including any intra-word whitespace) and
allows whole line comments beginning with the # character. Blank lines are ignored.

snowball
This format allows for multiple words specified on each line, and trailing comments may be specified
using the vertical line (|). Blank lines are ignored.

ignoreCase
(optional; default: false) If true, matching is case-insensitive.


Example:
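A sketch of a query analyzer using this filter. The Whitespace Tokenizer and the Lower Case Filter are assumptions; the lowercased tokens in the example are consistent with them.

```xml
<analyzer type="query">
  <!-- Assumption: whitespace tokenization followed by lowercasing -->
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- The last token is kept unless it is followed by a token separator -->
  <filter class="solr.SuggestStopFilterFactory" ignoreCase="true"
          words="stopwords.txt" format="wordset"/>
</analyzer>
```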





In: "The The"
Tokenizer to Filter: "the"(1), "the"(2)
Out: "the"(2)

Synonym Filter
This filter does synonym mapping. Each token is looked up in the list of synonyms and if a match is found,
then the synonym is emitted in place of the token. The position values of the new tokens are set such that
they all occur at the same position as the original token.



Synonym Filter has been Deprecated
Synonym Filter has been deprecated in favor of Synonym Graph Filter, which is required for
multi-term synonym support.

Factory class: solr.SynonymFilterFactory
For arguments and examples, see the Synonym Graph Filter below.

Synonym Graph Filter
This filter maps single- or multi-token synonyms, producing a fully correct graph output. This filter is a
replacement for the Synonym Filter, which produces incorrect graphs for multi-token synonyms.
If you use this filter during indexing, you must follow it with a Flatten Graph Filter to squash tokens on top of
one another like the Synonym Filter, because the indexer can’t directly consume a graph. To get fully correct
positional queries when your synonym replacements are multiple tokens, you should instead apply
synonyms using this filter at query time.



Although this filter produces correct token graphs, it cannot consume an input token graph
correctly.

Factory class: solr.SynonymGraphFilterFactory
Arguments:

synonyms
(required) The path of a file that contains a list of synonyms, one per line. In the (default) solr format (see
the format argument below for alternatives), blank lines and lines that begin with "#" are ignored.
This may be a comma-separated list of absolute paths, or paths relative to the Solr config directory.
There are two ways to specify synonym mappings:
• A comma-separated list of words. If the token matches any of the words, then all the words in the list
are substituted, which will include the original token.
• Two comma-separated lists of words with the symbol "=>" between them. If the token matches any
word on the left, then the list on the right is substituted. The original token will not be included unless
it is also in the list on the right.

ignoreCase
(optional; default: false) If true, synonyms will be matched case-insensitively.

expand
(optional; default: true) If true, a synonym will be expanded to all equivalent synonyms. If false, all
equivalent synonyms will be reduced to the first in the list.

format
(optional; default: solr) Controls how the synonyms will be parsed. The short names solr (for
SolrSynonymParser) and wordnet (for WordnetSynonymParser) are supported, or you may alternatively
supply the name of your own SynonymMap.Builder subclass.

tokenizerFactory
(optional; default: WhitespaceTokenizerFactory) The name of the tokenizer factory to use when parsing
the synonyms file. Arguments with the name prefix tokenizerFactory.* will be supplied as init params
to the specified tokenizer factory.
Any arguments not consumed by the synonym filter factory, including those without the
tokenizerFactory.* prefix, will also be supplied as init params to the tokenizer factory.
If tokenizerFactory is specified, then analyzer may not be, and vice versa.

analyzer
(optional; default: WhitespaceTokenizerFactory) The name of the analyzer class to use when parsing the
synonyms file. If analyzer is specified, then tokenizerFactory may not be, and vice versa.
For the following examples, assume a synonyms file named mysynonyms.txt:
couch,sofa,divan
teh => the
huge,ginormous,humungous => large
small => tiny,teeny,weeny
Example:
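A sketch of index-time and query-time analyzers that use mysynonyms.txt. The Standard Tokenizer is an assumption; the Flatten Graph Filter follows the synonym filter at index time, as described above.

```xml
<analyzer type="index">
  <!-- Assumption: the Standard Tokenizer supplies the input tokens -->
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.SynonymGraphFilterFactory" synonyms="mysynonyms.txt"/>
  <!-- Required at index time because the indexer cannot consume a graph -->
  <filter class="solr.FlattenGraphFilterFactory"/>
</analyzer>
<analyzer type="query">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.SynonymGraphFilterFactory" synonyms="mysynonyms.txt"/>
</analyzer>
```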


In: "teh small couch"
Tokenizer to Filter: "teh"(1), "small"(2), "couch"(3)
Out: "the"(1), "tiny"(2), "teeny"(2), "weeny"(2), "couch"(3), "sofa"(3), "divan"(3)
Example:
In: "teh ginormous, humungous sofa"
Tokenizer to Filter: "teh"(1), "ginormous"(2), "humungous"(3), "sofa"(4)
Out: "the"(1), "large"(2), "large"(3), "couch"(4), "sofa"(4), "divan"(4)

Token Offset Payload Filter
This filter adds the numeric character offsets of the token as a payload value for that token.
Factory class: solr.TokenOffsetPayloadTokenFilterFactory
Arguments: None
Example:
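A minimal sketch, assuming the Whitespace Tokenizer. The filter takes no arguments.

```xml
<analyzer>
  <!-- Assumption: the Whitespace Tokenizer supplies the input tokens -->
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- Stores each token's start and end offsets as its payload -->
  <filter class="solr.TokenOffsetPayloadTokenFilterFactory"/>
</analyzer>
```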




In: "bing bang boom"
Tokenizer to Filter: "bing", "bang", "boom"
Out: "bing"[0,4], "bang"[5,9], "boom"[10,14]

Trim Filter
This filter trims leading and/or trailing whitespace from tokens. Most tokenizers break tokens at whitespace,
so this filter is most often used for special situations.
Factory class: solr.TrimFilterFactory
Arguments:

updateOffsets
If luceneMatchVersion is 4.3 or earlier and updateOffsets="true", trimmed tokens' start and end
offsets will be updated to those of the first and last characters (plus one) remaining in the token. This
argument is invalid if luceneMatchVersion is 5.0 or later.
Example:
The PatternTokenizerFactory configuration used here splits the input on simple commas; it does not remove
whitespace.
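A sketch of such a configuration. The comma pattern follows from the description above; other settings are left at their defaults.

```xml
<analyzer>
  <!-- Splits on commas only, so surrounding whitespace stays in the tokens -->
  <tokenizer class="solr.PatternTokenizerFactory" pattern=","/>
  <!-- Removes leading and trailing whitespace from each token -->
  <filter class="solr.TrimFilterFactory"/>
</analyzer>
```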




In: "one, two , three ,four "
Tokenizer to Filter: "one", " two ", " three ", "four "
Out: "one", "two", "three", "four"

Type As Payload Filter
This filter adds the token’s type, as an encoded byte sequence, as its payload.
Factory class: solr.TypeAsPayloadTokenFilterFactory
Arguments: None
Example:
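A minimal sketch, assuming the Whitespace Tokenizer, which keeps tokens such as "I.O.U." intact. The payload is whatever type the tokenizer assigned to each token.

```xml
<analyzer>
  <!-- Assumption: the Whitespace Tokenizer supplies the input tokens -->
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- Copies each token's type into its payload -->
  <filter class="solr.TypeAsPayloadTokenFilterFactory"/>
</analyzer>
```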




In: "Pay Bob’s I.O.U."
Tokenizer to Filter: "Pay", "Bob’s", "I.O.U."
Out: "Pay"[], "Bob’s"[], "I.O.U."[]

Type As Synonym Filter
This filter adds the token’s type, as a token at the same position as the token, optionally with a configurable
prefix prepended.
Factory class: solr.TypeAsSynonymFilterFactory
Arguments:

prefix
(optional) The prefix to prepend to the token’s type.
Examples:
With the example below, each token’s type will be emitted verbatim at the same position:
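A sketch of the configuration without a prefix. The UAX29 URL Email Tokenizer is an assumption, chosen because it assigns distinct token types.

```xml
<analyzer>
  <!-- Assumption: a tokenizer that assigns meaningful token types -->
  <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
  <!-- Emits each token's type as an extra token at the same position -->
  <filter class="solr.TypeAsSynonymFilterFactory"/>
</analyzer>
```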




With the example below, for a token "example.com" with type <URL>, the token emitted at the same position
will be "_type_<URL>":
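A sketch of the prefixed variant, assuming the UAX29 URL Email Tokenizer, which assigns the <URL> type to tokens such as example.com.

```xml
<analyzer>
  <!-- Assumption: a tokenizer that assigns meaningful token types -->
  <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
  <!-- The prefix is prepended to the type before it is emitted as a token -->
  <filter class="solr.TypeAsSynonymFilterFactory" prefix="_type_"/>
</analyzer>
```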





Type Token Filter
This filter blacklists or whitelists a specified list of token types, assuming the tokens have type metadata
associated with them. For example, the UAX29 URL Email Tokenizer emits "<URL>" and "<EMAIL>" typed
tokens, as well as other types. This filter would allow you to pull out only e-mail addresses from text as
tokens, if you wish.
Factory class: solr.TypeTokenFilterFactory
Arguments:

types
Defines the location of a file of types to filter.

useWhitelist
If true, the file defined in types is used as an include list. If false, or undefined, the file defined in
types is used as a blacklist.

enablePositionIncrements
If luceneMatchVersion is 4.3 or earlier and enablePositionIncrements="false", no position holes will
be left by this filter when it removes tokens. This argument is invalid if luceneMatchVersion is 5.0 or
later.
Example:
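A sketch of a whitelist configuration. The file name stoptypes.txt is hypothetical; it would list one token type per line.

```xml
<analyzer>
  <!-- Hypothetical file name: stoptypes.txt lists the token types to keep -->
  <filter class="solr.TypeTokenFilterFactory" types="stoptypes.txt" useWhitelist="true"/>
</analyzer>
```

A tokenizer that assigns token types, such as the UAX29 URL Email Tokenizer, would precede this filter in a complete analyzer.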






Word Delimiter Filter
This filter splits tokens at word delimiters.



Word Delimiter Filter has been Deprecated
Word Delimiter Filter has been deprecated in favor of Word Delimiter Graph Filter, which is
required to produce a correct token graph so that e.g., phrase queries can work correctly.

Factory class: solr.WordDelimiterFilterFactory
For a full description, including arguments and examples, see the Word Delimiter Graph Filter below.

Word Delimiter Graph Filter
This filter splits tokens at word delimiters.
If you use this filter during indexing, you must follow it with a Flatten Graph Filter to squash tokens on top of
one another like the Word Delimiter Filter, because the indexer can’t directly consume a graph. To get fully
correct positional queries when tokens are split, you should instead use this filter at query time.
Note: although this filter produces correct token graphs, it cannot consume an input token graph correctly.
Delimiters are determined as follows:
• A change in case within a word: "CamelCase" -> "Camel", "Case". This can be disabled by setting
splitOnCaseChange="0".
• A transition from alpha to numeric characters or vice versa: "Gonzo5000" -> "Gonzo", "5000"; "4500XL" ->
"4500", "XL". This can be disabled by setting splitOnNumerics="0".
• Non-alphanumeric characters (discarded): "hot-spot" -> "hot", "spot"
• A trailing "'s" is removed: "O’Reilly’s" -> "O", "Reilly"
• Any leading or trailing delimiters are discarded: "--hot-spot--" -> "hot", "spot"
Factory class: solr.WordDelimiterGraphFilterFactory
Arguments:

generateWordParts
(integer, default 1) If non-zero, splits words at delimiters. For example: "CamelCase", "hot-spot" ->
"Camel", "Case", "hot", "spot"

generateNumberParts
(integer, default 1) If non-zero, splits numeric strings at delimiters: "1947-32" -> "1947", "32"

splitOnCaseChange
(integer, default 1) If 0, words are not split on camel-case changes: "BugBlaster-XL" -> "BugBlaster", "XL".
Example 1 below illustrates the default (non-zero) splitting behavior.

splitOnNumerics
(integer, default 1) If 0, don’t split words on transitions from alpha to numeric: "FemBot3000" -> "Fem",
"Bot3000"

catenateWords
(integer, default 0) If non-zero, maximal runs of word parts will be joined: "hot-spot-sensor’s" ->
"hotspotsensor"

catenateNumbers
(integer, default 0) If non-zero, maximal runs of number parts will be joined: "1947-32" -> "194732"

catenateAll
(0/1, default 0) If non-zero, runs of word and number parts will be joined: "Zap-Master-9000" ->
"ZapMaster9000"

preserveOriginal
(integer, default 0) If non-zero, the original token is preserved: "Zap-Master-9000" -> "Zap-Master-9000",
"Zap", "Master", "9000"

protected
(optional) The pathname of a file that contains a list of protected words that should be passed through
without splitting.

stemEnglishPossessive
(integer, default 1) If 1, strips the possessive 's from each subword.

types
(optional) The pathname of a file that contains character => type mappings, which enable customization
of this filter’s splitting behavior. Recognized character types: LOWER, UPPER, ALPHA, DIGIT, ALPHANUM, and
SUBWORD_DELIM.
The default for any character without a customized mapping is computed from Unicode character
properties. Blank lines and comment lines starting with '#' are ignored. An example file:
# Don't split numbers at '$', '.' or ','
$ => DIGIT
. => DIGIT
\u002C => DIGIT
# Don't split on ZWJ: http://en.wikipedia.org/wiki/Zero-width_joiner
\u200D => ALPHANUM
Example:
Default behavior. The whitespace tokenizer is used here to preserve non-alphanumeric characters.
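A sketch of the default configuration for both analyzers. The Flatten Graph Filter appears only at index time, as described above.

```xml
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- All arguments left at their defaults -->
  <filter class="solr.WordDelimiterGraphFilterFactory"/>
  <!-- Required at index time because the indexer cannot consume a graph -->
  <filter class="solr.FlattenGraphFilterFactory"/>
</analyzer>
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterGraphFilterFactory"/>
</analyzer>
```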


In: "hot-spot RoboBlaster/9000 100XL"
Tokenizer to Filter: "hot-spot", "RoboBlaster/9000", "100XL"
Out: "hot", "spot", "Robo", "Blaster", "9000", "100", "XL"
Example:
Do not split on case changes, and do not generate number parts. Note that by not generating number parts,
tokens containing only numeric parts are ultimately discarded.
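A sketch of a query analyzer matching this description, assuming the Whitespace Tokenizer. An index analyzer would add solr.FlattenGraphFilterFactory after the filter.

```xml
<analyzer type="query">
  <!-- Assumption: the Whitespace Tokenizer supplies the input tokens -->
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- No number parts, and no splitting on case changes -->
  <filter class="solr.WordDelimiterGraphFilterFactory"
          generateNumberParts="0" splitOnCaseChange="0"/>
</analyzer>
```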




In: "hot-spot RoboBlaster/9000 100-42"
Tokenizer to Filter: "hot-spot", "RoboBlaster/9000", "100-42"
Out: "hot", "spot", "RoboBlaster", "9000"
Example:
Concatenate word parts and number parts, but not word and number parts that occur in the same token.
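A sketch of a query analyzer matching this description, assuming the Whitespace Tokenizer. An index analyzer would add solr.FlattenGraphFilterFactory after the filter.

```xml
<analyzer type="query">
  <!-- Assumption: the Whitespace Tokenizer supplies the input tokens -->
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- Joins runs of word parts and runs of number parts, each separately -->
  <filter class="solr.WordDelimiterGraphFilterFactory"
          catenateWords="1" catenateNumbers="1"/>
</analyzer>
```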




In: "hot-spot 100+42 XL40"
Tokenizer to Filter: "hot-spot"(1), "100+42"(2), "XL40"(3)
Out: "hot"(1), "spot"(2), "hotspot"(2), "100"(3), "42"(4), "10042"(4), "XL"(5), "40"(6)


Example:
Concatenate all. Word and/or number parts are joined together.
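A sketch of a query analyzer matching this description, assuming the Whitespace Tokenizer. An index analyzer would add solr.FlattenGraphFilterFactory after the filter.

```xml
<analyzer type="query">
  <!-- Assumption: the Whitespace Tokenizer supplies the input tokens -->
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- Joins all word and number parts of a token into one extra token -->
  <filter class="solr.WordDelimiterGraphFilterFactory" catenateAll="1"/>
</analyzer>
```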




In: "XL-4000/ES"
Tokenizer to Filter: "XL-4000/ES"(1)
Out: "XL"(1), "4000"(2), "ES"(3), "XL4000ES"(3)
Example:
Using a protected words list that contains "AstroBlaster" and "XL-5000" (among others).
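A sketch of a query analyzer with a protected words list. The file name protwords.txt and the tokenizer choice are assumptions. An index analyzer would add solr.FlattenGraphFilterFactory after the filter.

```xml
<analyzer type="query">
  <!-- Assumption: the Whitespace Tokenizer supplies the input tokens -->
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- Assumed file name: words listed in protwords.txt pass through unsplit -->
  <filter class="solr.WordDelimiterGraphFilterFactory" protected="protwords.txt"/>
</analyzer>
```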




In: "FooBar AstroBlaster XL-5000 ==ES-34-"
Tokenizer to Filter: "FooBar", "AstroBlaster", "XL-5000", "==ES-34-"
Out: "FooBar", "FooBar", "AstroBlaster", "XL-5000", "ES", "34"


CharFilterFactories
CharFilter is a component that pre-processes input characters.
CharFilters can be chained like Token Filters and placed in front of a Tokenizer. CharFilters can add, change,
or remove characters while preserving the original character offsets to support features like highlighting.

solr.MappingCharFilterFactory
This filter creates org.apache.lucene.analysis.MappingCharFilter, which can be used for changing one
string to another (for example, for normalizing é to e).
This filter requires specifying a mapping argument, which is the path and name of a file containing the
mappings to perform.
Example:



[...]
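A fuller sketch of such a configuration. The mapping file name is an assumption (Solr's sample configsets include a file with this name), as is the tokenizer.

```xml
<analyzer>
  <!-- Assumed file name: any mapping file in the syntax described below works -->
  <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
  <!-- CharFilters run before the tokenizer sees the text -->
  <tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>
```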

Mapping file syntax:
• Comment lines beginning with a hash mark (#), as well as blank lines, are ignored.
• Each non-comment, non-blank line consists of a mapping of the form: "source" => "target"
◦ Double-quoted source string, optional whitespace, an arrow (=>), optional whitespace, double-quoted
target string.
• Trailing comments on mapping lines are not allowed.
• The source string must contain at least one character, but the target string may be empty.
• The following character escape sequences are recognized within source and target strings:
Escape Sequence   Resulting Character (ECMA-48 alias)           Unicode Character   Example Mapping Line
\\                \                                             U+005C              "\\" => "/"
\"                "                                             U+0022              "\"and\"" => "'and'"
\b                backspace (BS)                                U+0008              "\b" => " "
\t                tab (HT)                                      U+0009              "\t" => ","
\n                newline (LF)                                  U+000A              "\n" => "<br>"
\f                form feed (FF)                                U+000C              "\f" => "\n"
\r                carriage return (CR)                          U+000D              "\r" => "/carriagereturn/"
\uXXXX            Unicode char referenced by the 4 hex digits   U+XXXX              "\uFEFF" => ""

◦ A backslash followed by any other character is interpreted as if the character were present without the
backslash.

solr.HTMLStripCharFilterFactory
This filter creates org.apache.solr.analysis.HTMLStripCharFilter. This CharFilter strips HTML from the input
stream and passes the result to another CharFilter or a Tokenizer.
This filter:
• Removes HTML/XML tags while preserving other content.
• Removes attributes within tags and supports optional attribute quoting.
• Removes XML processing instructions, such as: <?foo bar?>
• Removes XML comments.
• Removes XML elements starting with <!>.
• Removes contents of <script> and <style> elements.
[...]

solr.ICUNormalizer2CharFilterFactory
This filter performs pre-tokenization Unicode normalization using ICU4J.
Arguments:

name
A Unicode Normalization Form, one of nfc, nfkc, nfkc_cf. Default is nfkc_cf.

mode
Either compose or decompose. Default is compose. Use decompose with name="nfc" or name="nfkc" to get
NFD or NFKD, respectively.

filter
A UnicodeSet pattern. Codepoints outside the set are always left unchanged. Default is [] (the null set, no
filtering - all codepoints are subject to normalization).
Example:
[...]

solr.PatternReplaceCharFilterFactory
This filter uses regular expressions to replace or change character patterns.
Arguments:

pattern
The regular expression pattern to apply to the incoming text.

replacement
The text to use to replace matching patterns.
You can configure this filter in schema.xml like this:
[...]
The table below presents examples of regex-based pattern replacement:

Input              Pattern               Replacement   Output          Description
see-ing looking    (\w+)(ing)            $1            see-ing look    Removes "ing" from the end of word.
see-ing looking    (\w+)ing              $1            see-ing look    Same as above. 2nd parentheses can be omitted.
No.1 NO. no. 543   [nN][oO]\.\s*(\d+)    #$1           #1 NO. #543     Replace some string literals.
abc=1234=5678      (\w+)=(\d+)=(\d+)     $3=$1=$2      5678=abc=1234   Change the order of the groups.

Language Analysis
This section contains information about tokenizers and filters related to character set conversion or for use
with specific languages.
For the European languages, tokenization is fairly straightforward. Tokens are delimited by white space
and/or a relatively small set of punctuation characters.
In other languages the tokenization rules are often not so simple.
Some European languages may also require special tokenization rules, such as rules for decompounding
German words.
For information about language detection at index time, see Detecting Languages During Indexing.

KeywordMarkerFilterFactory
Protects words from being modified by stemmers. A customized protected word list may be specified with the
"protected" attribute in the schema. Any words in the protected word list will not be modified by any
stemmer in Solr.
A sample Solr protwords.txt with comments can be found in the sample_techproducts_configs config set
directory:

KeywordRepeatFilterFactory
Emits each token twice, once with the KEYWORD attribute and once without.
If placed before a stemmer, the result will be that you will get the unstemmed token preserved on the same
position as the stemmed one. Queries matching the original exact term will get a better score while still
maintaining the recall benefit of stemming. Another advantage of keeping the original token is that wildcard
truncation will work as expected.
To configure, add the KeywordRepeatFilterFactory early in the analysis chain. It is recommended to also
include RemoveDuplicatesTokenFilterFactory to avoid duplicates when tokens are not stemmed.
A sample fieldType configuration could look like this:

When adding the same token twice, it will also score twice (double), so you may have to retune your
ranking rules.

StemmerOverrideFilterFactory
Overrides stemming algorithms by applying a custom mapping, then protecting these terms from being
modified by stemmers.
A customized mapping of words to stems, in a tab-separated file, can be specified with the "dictionary"
attribute in the schema. Words in this mapping will be stemmed to the stems from the file, and will not be
further changed by any stemmer.
A sample stemdict.txt with comments can be found in the Source Repository.

Dictionary Compound Word Token Filter
This filter splits, or decompounds, compound words into individual words using a dictionary of the
component words. Each input token is passed through unchanged. If it can also be decompounded into
subwords, each subword is also added to the stream at the same logical position.
Compound words are most commonly found in Germanic languages.
Factory class: solr.DictionaryCompoundWordTokenFilterFactory
Arguments:

dictionary
(required) The path of a file that contains a list of simple words, one per line. Blank lines and lines that
begin with "#" are ignored. This path may be an absolute path, or path relative to the Solr config directory.

minWordSize
(integer, default 5) Any token shorter than this is not decompounded.

minSubwordSize
(integer, default 2) Subwords shorter than this are not emitted as tokens.

maxSubwordSize
(integer, default 15) Subwords longer than this are not emitted as tokens.

onlyLongestMatch
(true/false) If true (the default), only the longest matching subwords will generate new tokens.
Example:
Assume that germanwords.txt contains at least the following words: dumm kopf donau dampf schiff

In: "Donaudampfschiff dummkopf"
Tokenizer to Filter: "Donaudampfschiff"(1), "dummkopf"(2)
Out: "Donaudampfschiff"(1), "Donau"(1), "dampf"(1), "schiff"(1), "dummkopf"(2), "dumm"(2), "kopf"(2)

Unicode Collation
Unicode Collation is a language-sensitive method of sorting text that can also be used for advanced search
purposes.
Unicode Collation in Solr is fast, because all the work is done at index time.
Rather than specifying an analyzer within <fieldtype ... class="solr.TextField">, the solr.CollationField and
solr.ICUCollationField field type classes provide this functionality.
solr.ICUCollationField, which is backed by the ICU4J library, provides more flexible configuration, has more
locales, is significantly faster, and requires less memory and less index space, since its keys are smaller than
those produced by the JDK implementation that backs solr.CollationField.
solr.ICUCollationField is included in the Solr analysis-extras contrib - see
solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your
SOLR_HOME/lib in order to use it.
solr.ICUCollationField and solr.CollationField fields can be created in two ways:
• Based upon a system collator associated with a Locale.
• Based upon a tailored RuleBasedCollator ruleset.
Arguments for solr.ICUCollationField, specified as attributes within the <fieldType> element:
Using a System collator:

locale
(required) RFC 3066 locale ID. See the ICU locale explorer for a list of supported locales.

strength
Valid values are primary, secondary, tertiary, quaternary, or identical. See Comparison Levels in ICU
Collation Concepts for more information.

decomposition
Valid values are no or canonical. See Normalization in ICU Collation Concepts for more information.
Using a Tailored ruleset:

custom
(required) Path to a UTF-8 text file containing rules supported by the ICU RuleBasedCollator

strength
Valid values are primary, secondary, tertiary, quaternary, or identical. See Comparison Levels in ICU
Collation Concepts for more information.

decomposition
Valid values are no or canonical. See Normalization in ICU Collation Concepts for more information.
Expert options:

alternate
Valid values are shifted or non-ignorable. Can be used to ignore punctuation/whitespace.

caseLevel
(true/false) If true, in combination with strength="primary", accents are ignored but case is taken into
account. The default is false.
See CaseLevel in ICU Collation Concepts for more information.

caseFirst
Valid values are lower or upper. Useful to control which is sorted first when case is not ignored.

numeric
(true/false) If true, digits are sorted according to numeric value, e.g., foobar-9 sorts before foobar-10. The
default is false.

variableTop
Single character or contraction. Controls what is variable for alternate.

Sorting Text for a Specific Language
In this example, text is sorted according to the default German rules provided by ICU4J.
Locales are typically defined as a combination of language and country, but you can specify just the
language if you want. For example, if you specify "de" as the language, you will get sorting that works well
for the German language. If you specify "de" as the language and "CH" as the country, you will get German
sorting specifically tailored for Switzerland.
...
...
In the example above, we defined the strength as "primary". The strength of the collation determines how
strict the sort order will be, but it also depends upon the language. For example, in English, "primary"
strength ignores differences in case and accents.
Another example:
...
...
...
The type will be used for the fields where the data contains Polish text. The "secondary" strength will ignore
case differences, but, unlike "primary" strength, a letter with diacritic(s) will be sorted differently from the
same base letter without diacritics.
An example using the "city_sort" field to sort:
q=*:*&fl=city&sort=city_sort+asc

Sorting Text for Multiple Languages
There are two approaches to supporting multiple languages: if there is a small list of languages you wish to
support, consider defining collated fields for each language and using copyField. However, adding a large
number of sort fields can increase disk and indexing costs.
An alternative approach is to use the Unicode default collator.
The Unicode default or ROOT locale has rules that are designed to work well for most languages. To use the
default locale, simply define the locale as the empty string. This Unicode default sort is still significantly
more advanced than the standard Solr sort.

Sorting Text with Custom Rules
You can define your own set of sorting rules. It’s easiest to take existing rules that are close to what you
want and customize them.
In the example below, we create a custom rule set for German called DIN 5007-2. This rule set treats umlauts
in German differently: it treats ö as equivalent to oe, ä as equivalent to ae, and ü as equivalent to ue. For
more information, see the ICU RuleBasedCollator javadocs.
This example shows how to create a custom rule set for solr.ICUCollationField and dump it to a file:
// imports needed by this fragment (ICU4J, Apache Commons IO, java.io)
import com.ibm.icu.text.Collator;
import com.ibm.icu.text.RuleBasedCollator;
import com.ibm.icu.util.ULocale;
import org.apache.commons.io.IOUtils;
import java.io.File;
import java.io.FileOutputStream;

// get the default rules for Germany
// these are called DIN 5007-1 sorting
RuleBasedCollator baseCollator = (RuleBasedCollator) Collator.getInstance(new ULocale("de", "DE"));

// define some tailorings, to make it DIN 5007-2 sorting.
// For example, this makes ö equivalent to oe
String DIN5007_2_tailorings =
    "& ae , a\u0308 & AE , A\u0308"+
    "& oe , o\u0308 & OE , O\u0308"+
    "& ue , u\u0308 & UE , U\u0308";

// concatenate the default rules to the tailorings, and dump it to a String
RuleBasedCollator tailoredCollator = new RuleBasedCollator(baseCollator.getRules() + DIN5007_2_tailorings);
String tailoredRules = tailoredCollator.getRules();

// write these to a file, be sure to use UTF-8 encoding!!!
FileOutputStream os = new FileOutputStream(new File("/solr_home/conf/customRules.dat"));
IOUtils.write(tailoredRules, os, "UTF-8");
This rule set can now be used for custom collation in Solr:

JDK Collation
As mentioned above, ICU Unicode Collation is better in several ways than JDK Collation, but if you cannot use
ICU4J for some reason, you can use solr.CollationField.
The principles of JDK Collation are the same as those of ICU Collation; you just specify language, country
and variant arguments instead of the combined locale argument.
Arguments for solr.CollationField, specified as attributes within the <fieldType> element:
Using a System collator (see Oracle’s list of locales supported in Java 8):

language
(required) ISO-639 language code

country
ISO-3166 country code

variant
Vendor or browser-specific code

strength
Valid values are primary, secondary, tertiary or identical. See Oracle Java 8 Collator javadocs for more
information.

decomposition
Valid values are no, canonical, or full. See Oracle Java 8 Collator javadocs for more information.
Using a Tailored ruleset:

custom
(required) Path to a UTF-8 text file containing rules supported by the JDK RuleBasedCollator

strength
Valid values are primary, secondary, tertiary or identical. See Oracle Java 8 Collator javadocs for more
information.

decomposition
Valid values are no, canonical, or full. See Oracle Java 8 Collator javadocs for more information.
A solr.CollationField example:
...
...

ASCII & Decimal Folding Filters
ASCII Folding
This filter converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127
ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if one exists.
Only those characters with reasonable ASCII alternatives are converted.

This can increase recall by causing more matches. On the other hand, it can reduce precision because language-specific character differences may be lost.

Factory class: solr.ASCIIFoldingFilterFactory

Arguments: None

Example:

In: "Björn Ångström"

Tokenizer to Filter: "Björn", "Ångström"

Out: "Bjorn", "Angstrom"

Decimal Digit Folding

This filter converts any character in the Unicode "Decimal Number" general category (Nd) into its equivalent Basic Latin digit (0-9).

This can increase recall by causing more matches. On the other hand, it can reduce precision because language-specific character differences may be lost.

Factory class: solr.DecimalDigitFilterFactory

Arguments: None

Example:

OpenNLP Integration

The lucene/analysis/opennlp module provides OpenNLP integration via several analysis components: a tokenizer, a part-of-speech tagging filter, a phrase chunking filter, and a lemmatization filter. In addition to these analysis components, Solr also provides an update request processor to extract named entities - see Update Processor Factories That Can Be Loaded as Plugins.

Note: The OpenNLP Tokenizer must be used with all other OpenNLP analysis components, for two reasons: first, the OpenNLP Tokenizer detects and marks the sentence boundaries required by all the OpenNLP filters; and second, since the pre-trained OpenNLP models used by these filters were trained using the corresponding language-specific sentence-detection/tokenization models, the same tokenization, using the same models, must be used at runtime for optimal performance.

See solr/contrib/analysis-extras/README.txt for information on which jars you need to add to your SOLR_HOME/lib.
OpenNLP Tokenizer

The OpenNLP Tokenizer takes two language-specific binary model files as parameters: a sentence detector model and a tokenizer model. The last token in each sentence is flagged, so that following OpenNLP-based filters can use this information to apply operations to tokens one sentence at a time. See the OpenNLP website for information on downloading pre-trained models.

Factory class: solr.OpenNLPTokenizerFactory

Arguments:

sentenceModel (required)
The path of a language-specific OpenNLP sentence detection model file. This path may be an absolute path, or a path relative to the Solr config directory.

tokenizerModel (required)
The path of a language-specific OpenNLP tokenization model file. This path may be an absolute path, or a path relative to the Solr config directory.

Example:

OpenNLP Part-Of-Speech Filter

This filter sets each token’s type attribute to the part of speech (POS) assigned by the configured model. See the OpenNLP website for information on downloading pre-trained models.

Note: Lucene currently does not index token types, so if you want to keep this information, you have to preserve it either in a payload or as a synonym; see the examples below.

Factory class: solr.OpenNLPPOSFilterFactory

Arguments:

posTaggerModel (required)
The path of a language-specific OpenNLP POS tagger model file. This path may be an absolute path, or a path relative to the Solr config directory.

Examples:

The OpenNLP tokenizer will tokenize punctuation, which is useful for following token filters, but ordinarily you don’t want to include punctuation in your index, so the TypeTokenFilter is included in the examples below, with stop.pos.txt containing the following:

stop.pos.txt

#
$
''
``
,
-LRB-
-RRB-
:
.
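The punctuation-dropping setup described above can be sketched as the analyzer below. It is a minimal illustration under stated assumptions: the model file names (en-sent.bin, en-token.bin, en-pos-maxent.bin) are placeholders for whichever pre-trained OpenNLP models you download, and both the models and stop.pos.txt are assumed to sit in the Solr config directory.

```xml
<analyzer>
  <!-- model file names are placeholders, not files shipped with Solr -->
  <tokenizer class="solr.OpenNLPTokenizerFactory"
             sentenceModel="en-sent.bin"
             tokenizerModel="en-token.bin"/>
  <!-- sets each token's type attribute to its POS tag -->
  <filter class="solr.OpenNLPPOSFilterFactory" posTaggerModel="en-pos-maxent.bin"/>
  <!-- drops every token whose type is listed in stop.pos.txt (punctuation tags) -->
  <filter class="solr.TypeTokenFilterFactory" types="stop.pos.txt"/>
</analyzer>
```

Because the POS tags live only in the token type, this chain on its own removes punctuation but does not store the tags in the index; a payload or synonym filter is still needed for that.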
Index the POS for each token as a payload:

Index the POS for each token as a synonym, after prefixing the POS with "@" (see the TypeAsSynonymFilter description):

Only index nouns - the keep.pos.txt file contains lines NN, NNS, NNP and NNPS:

OpenNLP Phrase Chunking Filter

This filter sets each token’s type attribute based on the output of an OpenNLP phrase chunking model. The chunk labels replace the POS tags that previously were in each token’s type attribute. See the OpenNLP website for information on downloading pre-trained models.

Prerequisite: the OpenNLP Tokenizer and the OpenNLP Part-Of-Speech Filter must precede this filter.

Note: Lucene currently does not index token types, so if you want to keep this information, you have to preserve it either in a payload or as a synonym; see the examples below.

Factory class: solr.OpenNLPChunkerFilterFactory

Arguments:

chunkerModel (required)
The path of a language-specific OpenNLP phrase chunker model file. This path may be an absolute path, or a path relative to the Solr config directory.

Examples:

Index the phrase chunk label for each token as a payload:

Index the phrase chunk label for each token as a synonym, after prefixing it with "#" (see the TypeAsSynonymFilter description):

OpenNLP Lemmatizer Filter

This filter replaces the text of each token with its lemma. Both a dictionary-based lemmatizer and a model-based lemmatizer are supported. If both are configured, the dictionary-based lemmatizer is tried first, and then the model-based lemmatizer is consulted for out-of-vocabulary tokens. See the OpenNLP website for information on downloading pre-trained models.
Factory class: solr.OpenNLPLemmatizerFilterFactory

Arguments:

Either dictionary or lemmatizerModel must be provided, and both may be provided - see the examples below:

dictionary (optional)
The path of a lemmatization dictionary file. This path may be an absolute path, or a path relative to the Solr config directory. The dictionary file must be encoded as UTF-8, with one entry per line, in the form word[tab]lemma[tab]part-of-speech, e.g., wrote[tab]write[tab]VBD.

lemmatizerModel (optional)
The path of a language-specific OpenNLP lemmatizer model file. This path may be an absolute path, or a path relative to the Solr config directory.

Examples:

Perform dictionary-based lemmatization, and fall back to model-based lemmatization for out-of-vocabulary tokens (see the OpenNLP Part-Of-Speech Filter section above for information about using TypeTokenFilter to avoid indexing punctuation):

Perform dictionary-based lemmatization only:

Perform model-based lemmatization only, preserving the original token and emitting the lemma as a synonym (see the KeywordRepeatFilterFactory description):

Language-Specific Factories

These factories are each designed to work with specific languages. The languages covered here are:

• Arabic
• Brazilian Portuguese
• Bulgarian
• Catalan
• Traditional Chinese
• Simplified Chinese
• Czech
• Danish
• Dutch
• Finnish
• French
• Galician
• German
• Greek
• Hebrew, Lao, Myanmar, Khmer
• Hindi
• Indonesian
• Italian
• Irish
• Japanese
• Latvian
• Norwegian
• Persian
• Polish
• Portuguese
• Romanian
• Russian
• Scandinavian
• Serbian
• Spanish
• Swedish
• Thai
• Turkish
• Ukrainian

Arabic

Solr provides support for the Light-10 (PDF) stemming algorithm, and Lucene includes an example stopword list.
This algorithm defines both character normalization and stemming, so these are split into two filters to provide more flexibility.

Factory classes: solr.ArabicStemFilterFactory, solr.ArabicNormalizationFilterFactory

Arguments: None

Example:

Brazilian Portuguese

This is a Java filter written specifically for stemming the Brazilian dialect of the Portuguese language. It uses the Lucene class org.apache.lucene.analysis.br.BrazilianStemmer. Although that stemmer can be configured to use a list of protected words (which should not be stemmed), this factory does not accept any arguments to specify such a list.

Factory class: solr.BrazilianStemFilterFactory

Arguments: None

Example:

In: "praia praias"

Tokenizer to Filter: "praia", "praias"

Out: "pra", "pra"

Bulgarian

Solr includes a light stemmer for Bulgarian, following this algorithm (PDF), and Lucene includes an example stopword list.

Factory class: solr.BulgarianStemFilterFactory

Arguments: None

Example:

Catalan

Solr can stem Catalan using the Snowball Porter Stemmer with an argument of language="Catalan". Solr includes a set of contractions for Catalan, which can be stripped using solr.ElisionFilterFactory.

Factory class: solr.SnowballPorterFilterFactory

Arguments:

language (required)
stemmer language, "Catalan" in this case

Example:

In: "llengües llengua"

Tokenizer to Filter: "llengües"(1), "llengua"(2)

Out: "llengu"(1), "llengu"(2)

Traditional Chinese

The default configuration of the ICU Tokenizer is suitable for Traditional Chinese text. It follows the Word Break rules from the Unicode Text Segmentation algorithm for non-Chinese text, and uses a dictionary to segment Chinese words.
To use this tokenizer, you must add additional .jars to Solr’s classpath (as described in the section Lib Directives in SolrConfig). See the solr/contrib/analysis-extras/README.txt for information on which jars you need to add to your SOLR_HOME/lib.

Standard Tokenizer can also be used to tokenize Traditional Chinese text. Following the Word Break rules from the Unicode Text Segmentation algorithm, it produces one token per Chinese character. When combined with CJK Bigram Filter, overlapping bigrams of Chinese characters are formed.

CJK Width Filter folds fullwidth ASCII variants into the equivalent Basic Latin forms.

Examples:

CJK Bigram Filter

Forms bigrams (overlapping 2-character sequences) of CJK characters that are generated from Standard Tokenizer or ICU Tokenizer.

By default, all CJK characters produce bigrams, but finer grained control is available by specifying orthographic type arguments han, hiragana, katakana, and hangul. When set to false, characters of the corresponding type will be passed through as unigrams, and will not be included in any bigrams.

When a CJK character has no adjacent characters to form a bigram, it is output in unigram form. If you want to always output both unigrams and bigrams, set the outputUnigrams argument to true.

In all cases, all non-CJK input is passed through unmodified.

Arguments:

han (true/false)
If false, Han (Chinese) characters will not form bigrams. Default is true.

hiragana (true/false)
If false, Hiragana (Japanese) characters will not form bigrams. Default is true.

katakana (true/false)
If false, Katakana (Japanese) characters will not form bigrams. Default is true.

hangul (true/false)
If false, Hangul (Korean) characters will not form bigrams. Default is true.

outputUnigrams (true/false)
If true, in addition to forming bigrams, all characters are also passed through as unigrams.
Default is false.

See the example under Traditional Chinese.

Simplified Chinese

For Simplified Chinese, Solr provides support for Chinese sentence and word segmentation with the HMM Chinese Tokenizer. This component includes a large dictionary and segments Chinese text into words with the Hidden Markov Model. To use this tokenizer, you must add additional .jars to Solr’s classpath (as described in the section Lib Directives in SolrConfig). See the solr/contrib/analysis-extras/README.txt for information on which jars you need to add to your SOLR_HOME/lib.

The default configuration of the ICU Tokenizer is also suitable for Simplified Chinese text. It follows the Word Break rules from the Unicode Text Segmentation algorithm for non-Chinese text, and uses a dictionary to segment Chinese words. To use this tokenizer, you must add additional .jars to Solr’s classpath (as described in the section Lib Directives in SolrConfig). See the solr/contrib/analysis-extras/README.txt for information on which jars you need to add to your SOLR_HOME/lib.

Also useful for Chinese analysis:

CJK Width Filter folds fullwidth ASCII variants into the equivalent Basic Latin forms, and folds halfwidth Katakana variants into their equivalent fullwidth forms.

Examples:

HMM Chinese Tokenizer

For Simplified Chinese, Solr provides support for Chinese sentence and word segmentation with the solr.HMMChineseTokenizerFactory in the analysis-extras contrib module. This component includes a large dictionary and segments Chinese text into words with the Hidden Markov Model. To use this tokenizer, see solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your solr_home/lib.
Factory class: solr.HMMChineseTokenizerFactory

Arguments: None

Examples:

To use the default setup with fallback to English Porter stemmer for English words, use:

Or to configure your own analysis setup, use the solr.HMMChineseTokenizerFactory along with your custom filter setup. See an example of this in the Simplified Chinese section.

Czech

Solr includes a light stemmer for Czech, following this algorithm, and Lucene includes an example stopword list.

Factory class: solr.CzechStemFilterFactory

Arguments: None

Example:

In: "prezidenští, prezidenta, prezidentského"

Tokenizer to Filter: "prezidenští", "prezidenta", "prezidentského"

Out: "preziden", "preziden", "preziden"

Danish

Solr can stem Danish using the Snowball Porter Stemmer with an argument of language="Danish".

Also relevant are the Scandinavian normalization filters.

Factory class: solr.SnowballPorterFilterFactory

Arguments:

language (required)
stemmer language, "Danish" in this case

Example:

In: "undersøg undersøgelse"

Tokenizer to Filter: "undersøg"(1), "undersøgelse"(2)

Out: "undersøg"(1), "undersøg"(2)

Dutch

Solr can stem Dutch using the Snowball Porter Stemmer with an argument of language="Dutch".

Factory class: solr.SnowballPorterFilterFactory

Arguments:

language (required)
stemmer language, "Dutch" in this case

Example:

In: "kanaal kanalen"

Tokenizer to Filter: "kanaal", "kanalen"

Out: "kanal", "kanal"

Finnish

Solr includes support for stemming Finnish, and Lucene includes an example stopword list.

Factory class: solr.FinnishLightStemFilterFactory

Arguments: None

Example:

In: "kala kalat"

Tokenizer to Filter: "kala", "kalat"

Out: "kala", "kala"

French

Elision Filter

Removes article elisions from a token stream.
This filter can be useful for languages such as French, Catalan, Italian, and Irish.

Factory class: solr.ElisionFilterFactory

Arguments:

articles
The pathname of a file that contains a list of articles, one per line, to be stripped. Articles are words such as "le", which are commonly abbreviated, such as in l’avion (the plane). This file should include the abbreviated form, which precedes the apostrophe; in this case, simply "l". If no articles attribute is specified, a default set of French articles is used.

ignoreCase (boolean)
If true, the filter ignores the case of words when comparing them to the list of articles. Defaults to false.

Example:

In: "L’histoire d’art"

Tokenizer to Filter: "L’histoire", "d’art"

Out: "histoire", "art"

French Light Stem Filter

Solr includes three stemmers for French: one in the solr.SnowballPorterFilterFactory, a lighter stemmer called solr.FrenchLightStemFilterFactory, and an even less aggressive stemmer called solr.FrenchMinimalStemFilterFactory. Lucene includes an example stopword list.

Factory classes: solr.FrenchLightStemFilterFactory, solr.FrenchMinimalStemFilterFactory

Arguments: None

Examples:

In: "le chat, les chats"

Tokenizer to Filter: "le", "chat", "les", "chats"

Out: "le", "chat", "le", "chat"

Galician

Solr includes a stemmer for Galician following this algorithm, and Lucene includes an example stopword list.
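A Galician analyzer chain might look like the sketch below. The choice of the standard tokenizer and a lowercase filter ahead of the stemmer is an assumption made for illustration; only solr.GalicianStemFilterFactory comes from this section.

```xml
<analyzer>
  <!-- tokenizer and lowercase filter are illustrative choices -->
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- the Galician stemmer described in this section; it takes no arguments -->
  <filter class="solr.GalicianStemFilterFactory"/>
</analyzer>
```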
Factory class: solr.GalicianStemFilterFactory

Arguments: None

Example:

In: "felizmente Luzes"

Tokenizer to Filter: "felizmente", "luzes"

Out: "feliz", "luz"

German

Solr includes four stemmers for German: one in the solr.SnowballPorterFilterFactory language="German", a stemmer called solr.GermanStemFilterFactory, a lighter stemmer called solr.GermanLightStemFilterFactory, and an even less aggressive stemmer called solr.GermanMinimalStemFilterFactory. Lucene includes an example stopword list.

Factory classes: solr.GermanStemFilterFactory, solr.GermanLightStemFilterFactory, solr.GermanMinimalStemFilterFactory

Arguments: None

Examples:

In: "haus häuser"

Tokenizer to Filter: "haus", "häuser"

Out: "haus", "haus"

Greek

This filter converts uppercase letters in the Greek character set to the equivalent lowercase character.

Factory class: solr.GreekLowerCaseFilterFactory

Arguments: None

Note: Use of custom charsets is no longer supported as of Solr 3.1. If you need to index text in these encodings, please use Java’s character set conversion facilities (InputStreamReader, etc.) during I/O, so that Lucene can analyze this text as Unicode instead.

Example:

Hindi

Solr includes support for stemming Hindi following this algorithm (PDF), support for common spelling differences through the solr.HindiNormalizationFilterFactory, support for encoding differences through the solr.IndicNormalizationFilterFactory following this algorithm, and Lucene includes an example stopword list.
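The three Hindi components named above are typically chained in one analyzer. The following is a minimal sketch, assuming the standard tokenizer; the filter order (encoding normalization, then spelling normalization, then stemming) is a reasonable reading of the description rather than a requirement stated here.

```xml
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- normalize encoding differences across Indic scripts first -->
  <filter class="solr.IndicNormalizationFilterFactory"/>
  <!-- then normalize common Hindi spelling variations -->
  <filter class="solr.HindiNormalizationFilterFactory"/>
  <!-- finally stem the normalized tokens -->
  <filter class="solr.HindiStemFilterFactory"/>
</analyzer>
```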
Factory classes: solr.IndicNormalizationFilterFactory, solr.HindiNormalizationFilterFactory, solr.HindiStemFilterFactory

Arguments: None

Example:

Indonesian

Solr includes support for stemming Indonesian (Bahasa Indonesia) following this algorithm (PDF), and Lucene includes an example stopword list.

Factory class: solr.IndonesianStemFilterFactory

Arguments: None

Example:

In: "sebagai sebagainya"

Tokenizer to Filter: "sebagai", "sebagainya"

Out: "bagai", "bagai"

Italian

Solr includes two stemmers for Italian: one in the solr.SnowballPorterFilterFactory language="Italian", and a lighter stemmer called solr.ItalianLightStemFilterFactory. Lucene includes an example stopword list.

Factory class: solr.ItalianLightStemFilterFactory

Arguments: None

Example:

In: "propaga propagare propagamento"

Tokenizer to Filter: "propaga", "propagare", "propagamento"

Out: "propag", "propag", "propag"

Irish

Solr can stem Irish using the Snowball Porter Stemmer with an argument of language="Irish". Solr includes solr.IrishLowerCaseFilterFactory, which can handle Irish-specific constructs. Solr also includes a set of contractions for Irish which can be stripped using solr.ElisionFilterFactory.
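The three Irish components can be combined as in the sketch below. The articles path lang/contractions_ga.txt is an assumption about where a contractions file lives in your configset, so adjust it to your own layout; the standard tokenizer is likewise an illustrative choice.

```xml
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- strips Irish contractions such as b' and m'; the file path is an assumption -->
  <filter class="solr.ElisionFilterFactory"
          articles="lang/contractions_ga.txt" ignoreCase="true"/>
  <!-- lowercasing that handles Irish-specific constructs -->
  <filter class="solr.IrishLowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Irish"/>
</analyzer>
```

Placing the elision filter before the stemmer means the stemmer only ever sees the word that follows the apostrophe.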
Factory class: solr.SnowballPorterFilterFactory

Arguments:

language (required)
stemmer language, "Irish" in this case

Example:

In: "siopadóireacht síceapatacha b’fhearr m’athair"

Tokenizer to Filter: "siopadóireacht", "síceapatacha", "b’fhearr", "m’athair"

Out: "siopadóir", "síceapaite", "fearr", "athair"

Japanese

Solr includes support for analyzing Japanese, via the Lucene Kuromoji morphological analyzer, which includes several analysis components - more details on each below:

• JapaneseIterationMarkCharFilter normalizes Japanese horizontal iteration marks (odoriji) to their expanded form.
• JapaneseTokenizer tokenizes Japanese using morphological analysis, and annotates each term with part-of-speech, base form (a.k.a. lemma), reading and pronunciation.
• JapaneseBaseFormFilter replaces original terms with their base forms (a.k.a. lemmas).
• JapanesePartOfSpeechStopFilter removes terms that have one of the configured parts-of-speech.
• JapaneseKatakanaStemFilter normalizes common katakana spelling variations ending in a long sound character (U+30FC) by removing the long sound character.

Also useful for Japanese analysis, from lucene-analyzers-common:

• CJKWidthFilter folds fullwidth ASCII variants into the equivalent Basic Latin forms, and folds halfwidth Katakana variants into their equivalent fullwidth forms.

Japanese Iteration Mark CharFilter

Normalizes horizontal Japanese iteration marks (odoriji) to their expanded form. Vertical iteration marks are not supported.

Factory class: JapaneseIterationMarkCharFilterFactory

Arguments:

normalizeKanji
set to false to not normalize kanji iteration marks (default is true)

normalizeKana
set to false to not normalize kana iteration marks (default is true)

Japanese Tokenizer

Tokenizer for Japanese that uses morphological analysis, and annotates each term with part-of-speech, base form (a.k.a.
lemma), reading and pronunciation.

JapaneseTokenizer has a search mode (the default) that does segmentation useful for search: a heuristic is used to segment compound terms into their constituent parts while also keeping the original compound terms as synonyms.

Factory class: solr.JapaneseTokenizerFactory

Arguments:

mode
Use search mode to get a noun-decompounding effect useful for search. search mode improves segmentation for search at the expense of part-of-speech accuracy. Valid values for mode are:

• normal: ordinary segmentation, with no extra compound splitting
• search: segmentation useful for search (extra compound splitting); this is the default
• extended: search mode plus unigramming of unknown words (experimental)

For some applications it might be good to use search mode for indexing and normal mode for queries to increase precision and prevent parts of compounds from being matched and highlighted.

userDictionary
filename for a user dictionary, which allows overriding the statistical model with your own entries for segmentation, part-of-speech tags and readings without a need to specify weights. See lang/userdict_ja.txt for a sample user dictionary file.

userDictionaryEncoding
user dictionary encoding (default is UTF-8)

discardPunctuation
set to false to keep punctuation, true to discard (the default)

Japanese Base Form Filter

Replaces original terms' text with the corresponding base form (lemma). (JapaneseTokenizer annotates each term with its base form.)

Factory class: JapaneseBaseFormFilterFactory (no arguments)

Japanese Part Of Speech Stop Filter

Removes terms with one of the configured parts-of-speech. JapaneseTokenizer annotates terms with parts-of-speech.

Factory class: JapanesePartOfSpeechStopFilterFactory

Arguments:

tags
filename for a list of parts-of-speech for which to remove terms; see conf/lang/stoptags_ja.txt in the sample_techproducts_config config set for an example.
enablePositionIncrements
if luceneMatchVersion is 4.3 or earlier and enablePositionIncrements="false", no position holes will be left by this filter when it removes tokens. This argument is invalid if luceneMatchVersion is 5.0 or later.

Japanese Katakana Stem Filter

Normalizes common katakana spelling variations ending in a long sound character (U+30FC) by removing the long sound character.

solr.CJKWidthFilterFactory should be specified prior to this filter to normalize half-width katakana to fullwidth.

Factory class: JapaneseKatakanaStemFilterFactory

Arguments:

minimumLength
terms below this length will not be stemmed. Default is 4, value must be 2 or more.

CJK Width Filter

Folds fullwidth ASCII variants into the equivalent Basic Latin forms, and folds halfwidth Katakana variants into their equivalent fullwidth forms.

Factory class: CJKWidthFilterFactory (no arguments)

Example:

Hebrew, Lao, Myanmar, Khmer

Lucene provides support, in addition to UAX#29 word break rules, for Hebrew’s use of the double and single quote characters, and for segmenting Lao, Myanmar, and Khmer into syllables with the solr.ICUTokenizerFactory in the analysis-extras contrib module. To use this tokenizer, see solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your solr_home/lib.

See the ICUTokenizer for more information.

Latvian

Solr includes support for stemming Latvian, and Lucene includes an example stopword list.

Factory class: solr.LatvianStemFilterFactory

Arguments: None

Example:

In: "tirgiem tirgus"

Tokenizer to Filter: "tirgiem", "tirgus"

Out: "tirg", "tirg"

Norwegian

Solr includes two classes for stemming Norwegian, NorwegianLightStemFilterFactory and NorwegianMinimalStemFilterFactory.
Lucene includes an example stopword list.

Another option is to use the Snowball Porter Stemmer with an argument of language="Norwegian".

Also relevant are the Scandinavian normalization filters.

Norwegian Light Stemmer

The NorwegianLightStemFilterFactory requires a "two-pass" sort for the -dom and -het endings. This means that in the first pass the word "kristendom" is stemmed to "kristen", and then all the general rules apply so it will be further stemmed to "krist". The effect of this is that "kristen," "kristendom," "kristendommen," and "kristendommens" will all be stemmed to "krist."

The second pass is to pick up -dom and -het endings. Consider this example:

One pass                          Two passes
Before          After             Before          After
forlegen        forleg            forlegen        forleg
forlegenhet     forlegen          forlegenhet     forleg
forlegenheten   forlegen          forlegenheten   forleg
forlegenhetens  forlegen          forlegenhetens  forleg
firkantet       firkant           firkantet       firkant
firkantethet    firkantet         firkantethet    firkant
firkantetheten  firkantet         firkantetheten  firkant

Factory class: solr.NorwegianLightStemFilterFactory

Arguments:

variant
Choose the Norwegian language variant to use. Valid values are:

• nb: Bokmål (default)
• nn: Nynorsk
• no: both

Example:

In: "Forelskelsen"

Tokenizer to Filter: "forelskelsen"

Out: "forelske"

Norwegian Minimal Stemmer

The NorwegianMinimalStemFilterFactory stems plural forms of Norwegian nouns only.

Factory class: solr.NorwegianMinimalStemFilterFactory

Arguments:

variant
Choose the Norwegian language variant to use.
Valid values are:

• nb: Bokmål (default)
• nn: Nynorsk
• no: both

Example:

In: "Bilens"

Tokenizer to Filter: "bilens"

Out: "bil"

Persian

Persian Filter Factories

Solr includes support for normalizing Persian, and Lucene includes an example stopword list.

Factory class: solr.PersianNormalizationFilterFactory

Arguments: None

Example:

Polish

Solr provides support for Polish stemming with the solr.StempelPolishStemFilterFactory, and solr.MorfologikFilterFactory for lemmatization, in the contrib/analysis-extras module. The solr.StempelPolishStemFilterFactory component includes an algorithmic stemmer with tables for Polish. To use either of these filters, see solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your solr_home/lib.

Factory classes: solr.StempelPolishStemFilterFactory and solr.MorfologikFilterFactory

Arguments: None

Example:

In: "studenta studenci"

Tokenizer to Filter: "studenta", "studenci"

Out: "student", "student"

More information about the Stempel stemmer is available in the Lucene javadocs.

Note that the lower case filter is applied after the Morfologik stemmer. This is because the Polish dictionary contains proper names, so the original case of a term may be needed to resolve ambiguities, or even to look up the correct lemma at all.

The Morfologik dictionary parameter value is a constant specifying which dictionary to choose. The dictionary resource must be named path/to/language.dict and have an associated .info metadata file. See the Morfologik project for details. If the dictionary attribute is not provided, the Polish dictionary is loaded and used by default.
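The two Polish options can be sketched as the analyzers below. Both are illustrations that assume the analysis-extras jars are already on the classpath and that the standard tokenizer is acceptable; the Morfologik variant omits the dictionary attribute so that the default Polish dictionary is used, and it lowercases after lemmatization as the note above recommends.

```xml
<!-- Option 1: algorithmic stemming with Stempel -->
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.StempelPolishStemFilterFactory"/>
</analyzer>

<!-- Option 2: dictionary-based lemmatization with Morfologik -->
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- no dictionary attribute: the default Polish dictionary is loaded -->
  <filter class="solr.MorfologikFilterFactory"/>
  <!-- lowercase only after lemmatization, so proper names are looked up in original case -->
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```

A field type would use one of these analyzers, not both.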
Portuguese

Solr includes four stemmers for Portuguese: one in the solr.SnowballPorterFilterFactory, an alternative stemmer called solr.PortugueseStemFilterFactory, a lighter stemmer called solr.PortugueseLightStemFilterFactory, and an even less aggressive stemmer called solr.PortugueseMinimalStemFilterFactory. Lucene includes an example stopword list.

Factory classes: solr.PortugueseStemFilterFactory, solr.PortugueseLightStemFilterFactory, solr.PortugueseMinimalStemFilterFactory

Arguments: None

Example:

In: "praia praias"

Tokenizer to Filter: "praia", "praias"

Out: "pra", "pra"

Romanian

Solr can stem Romanian using the Snowball Porter Stemmer with an argument of language="Romanian".

Factory class: solr.SnowballPorterFilterFactory

Arguments:

language (required)
stemmer language, "Romanian" in this case

Example:

Russian

Russian Stem Filter

Solr includes two stemmers for Russian: one in the solr.SnowballPorterFilterFactory language="Russian", and a lighter stemmer called solr.RussianLightStemFilterFactory. Lucene includes an example stopword list.

Factory class: solr.RussianLightStemFilterFactory

Arguments: None

Example:

Scandinavian

Scandinavian is a language group spanning three very similar languages: Norwegian, Swedish and Danish.

Swedish å, ä, ö are in fact the same letters as Norwegian and Danish å, æ, ø and thus interchangeable when used between these languages. They are however folded differently when people type them on a keyboard lacking these characters.

In that situation almost all Swedish people use a, a, o instead of å, ä, ö.

Norwegians and Danes on the other hand usually type aa, ae and oe instead of å, æ and ø. Some do however use a, a, o, oo, ao and sometimes permutations of everything above.
There are two filters that help with normalization between Scandinavian languages: solr.ScandinavianNormalizationFilterFactory, which tries to preserve the special characters (æäöå), and solr.ScandinavianFoldingFilterFactory, which folds them to their broader base letters (ø/ö->o etc.).

See also each language section for other relevant filters.

Scandinavian Normalization Filter

This filter normalizes use of the interchangeable Scandinavian characters æÆäÄöÖøØ and folded variants (aa, ao, ae, oe and oo) by transforming them to åÅæÆøØ.

It’s a semantically less destructive solution than ScandinavianFoldingFilter, most useful when a person with a Norwegian or Danish keyboard queries a Swedish index and vice versa. This filter does not perform the common Swedish folds of å and ä to a nor ö to o.

Factory class: solr.ScandinavianNormalizationFilterFactory

Arguments: None

Example:

In: "blåbærsyltetøj blåbärsyltetöj blaabaersyltetoej blabarsyltetoj"

Tokenizer to Filter: "blåbærsyltetøj", "blåbärsyltetöj", "blaabaersyltetoej", "blabarsyltetoj"

Out: "blåbærsyltetøj", "blåbærsyltetøj", "blåbærsyltetøj", "blabarsyltetoj"

Scandinavian Folding Filter

This filter folds Scandinavian characters åÅäæÄÆ->a and öÖøØ->o. It also discriminates against use of double vowels aa, ae, ao, oe and oo, leaving just the first one.

It’s a semantically more destructive solution than ScandinavianNormalizationFilter, but can in addition help with matching raksmorgas as räksmörgås.
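A folding analyzer of the kind just described might be configured as in this sketch. The standard tokenizer and lowercase filter are assumptions chosen for illustration; only the folding filter itself is taken from this section. Applying the same analyzer at index and query time is what lets a query for raksmorgas match an indexed räksmörgås.

```xml
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- folds å, ä, æ to a and ö, ø to o, and collapses aa, ae, ao, oe, oo to the first vowel -->
  <filter class="solr.ScandinavianFoldingFilterFactory"/>
</analyzer>
```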
Factory class: solr.ScandinavianFoldingFilterFactory

Arguments: None

Example:

In: "blåbærsyltetøj blåbärsyltetöj blaabaersyltetoej blabarsyltetoj"

Tokenizer to Filter: "blåbærsyltetøj", "blåbärsyltetöj", "blaabaersyltetoej", "blabarsyltetoj"

Out: "blabarsyltetoj", "blabarsyltetoj", "blabarsyltetoj", "blabarsyltetoj"

Serbian

Serbian Normalization Filter

Solr includes a filter that normalizes Serbian Cyrillic and Latin characters. Note that this filter only works with lowercased input.

See the Solr wiki for tips and advice on using this filter: https://wiki.apache.org/solr/SerbianLanguageSupport

Factory class: solr.SerbianNormalizationFilterFactory

Arguments:

haircut: Selects the extent of normalization. Valid values are:

• bald: (Default behavior) Cyrillic characters are first converted to Latin; then, Latin characters have their diacritics removed, with the exception of LATIN SMALL LETTER D WITH STROKE (U+0111), which is converted to “dj”
• regular: Only Cyrillic to Latin normalization will be applied, preserving the Latin diacritics

Example:

Spanish

Solr includes two stemmers for Spanish: one in the solr.SnowballPorterFilterFactory language="Spanish", and a lighter stemmer called solr.SpanishLightStemFilterFactory. Lucene includes an example stopword list.

Factory class: solr.SpanishLightStemFilterFactory

Arguments: None

Example:

In: "torear toreara torearlo"

Tokenizer to Filter: "torear", "toreara", "torearlo"

Out: "tor", "tor", "tor"

Swedish

Swedish Stem Filter

Solr includes two stemmers for Swedish: one in the solr.SnowballPorterFilterFactory language="Swedish", and a lighter stemmer called solr.SwedishLightStemFilterFactory. Lucene includes an example stopword list.

Also relevant are the Scandinavian normalization filters.
Factory class: solr.SwedishLightStemFilterFactory

Arguments: None

Example:

In: "kloke klokhet klokheten"

Tokenizer to Filter: "kloke", "klokhet", "klokheten"

Out: "klok", "klok", "klok"

Thai

This tokenizer converts sequences of Thai characters into individual Thai words. Unlike European languages, Thai does not use whitespace to delimit words.

Factory class: solr.ThaiTokenizerFactory

Arguments: None

Example:

Turkish

Solr includes support for stemming Turkish with the solr.SnowballPorterFilterFactory; support for case-insensitive search with the solr.TurkishLowerCaseFilterFactory; support for stripping apostrophes and following suffixes with solr.ApostropheFilterFactory (see Role of Apostrophes in Turkish Information Retrieval); and support for a form of stemming that truncates tokens at a configurable maximum length through the solr.TruncateTokenFilterFactory (see Information Retrieval on Turkish Texts). Lucene also includes an example stopword list.

Factory class: solr.TurkishLowerCaseFilterFactory

Arguments: None

Example:

Another example, illustrating diacritics-insensitive search:

Ukrainian

Solr provides support for Ukrainian lemmatization with the solr.MorfologikFilterFactory, in the contrib/analysis-extras module. To use this filter, see solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your solr_home/lib.

Lucene also includes an example Ukrainian stopword list, in the lucene-analyzers-morfologik jar.

Factory class: solr.MorfologikFilterFactory

Arguments:

dictionary: (required) lemmatizer dictionary - the lucene-analyzers-morfologik jar contains a Ukrainian dictionary at org/apache/lucene/analysis/uk/ukrainian.dict.
Example:

The Morfologik dictionary parameter value is a constant specifying which dictionary to choose. The dictionary resource must be named path/to/language.dict and have an associated .info metadata file. See the Morfologik project for details. If the dictionary attribute is not provided, the Polish dictionary is loaded and used by default.

Phonetic Matching

Phonetic matching algorithms may be used to encode tokens so that two different spellings that are pronounced similarly will match.

For overviews of and comparisons between algorithms, see http://en.wikipedia.org/wiki/Phonetic_algorithm and http://ntz-develop.blogspot.com/2011/03/phonetic-algorithms.html

Beider-Morse Phonetic Matching (BMPM)

For examples of how to use this encoding in your analyzer, see Beider Morse Filter in the Filter Descriptions section.

Beider-Morse Phonetic Matching (BMPM) is a "soundalike" tool that lets you search using a new phonetic matching system. BMPM helps you search for personal names (or just surnames) in a Solr/Lucene index, and is far superior to the existing phonetic codecs, such as regular soundex, metaphone, caverphone, etc.

In general, phonetic matching lets you search a name list for names that are phonetically equivalent to the desired name. BMPM is similar to a soundex search in that an exact spelling is not required. Unlike soundex, it does not generate a large quantity of false hits.

From the spelling of the name, BMPM attempts to determine the language. It then applies phonetic rules for that particular language to transliterate the name into a phonetic alphabet. If it is not possible to determine the language with a fair degree of certainty, it uses generic phonetics instead. Finally, it applies language-independent rules regarding such things as voiced and unvoiced consonants and vowels to further ensure the reliability of the matches.
For example, assume that the matches found when searching for Stephen in a database are "Stefan", "Steph", "Stephen", "Steve", "Steven", "Stove", and "Stuffin". "Stefan", "Stephen", and "Steven" are probably relevant, and are names that you want to see. "Stuffin", however, is probably not relevant. Also rejected were "Steph", "Steve", and "Stove". Of those, "Stove" is probably not one that we would have wanted. But "Steph" and "Steve" are possibly ones that you might be interested in.

For Solr, BMPM searching is available for the following languages:

• English
• French
• German
• Greek
• Hebrew written in Hebrew letters
• Hungarian
• Italian
• Polish
• Romanian
• Russian written in Cyrillic letters
• Russian transliterated into English letters
• Spanish
• Turkish

The name matching is also applicable to non-Jewish surnames from the countries in which those languages are spoken.

For more information, see here: http://stevemorse.org/phoneticinfo.htm and http://stevemorse.org/phonetics/bmpm.htm.

Daitch-Mokotoff Soundex

To use this encoding in your analyzer, see Daitch-Mokotoff Soundex Filter in the Filter Descriptions section.

The Daitch-Mokotoff Soundex algorithm is a refinement of the Russell and American Soundex algorithms, yielding greater accuracy in matching especially Slavic and Yiddish surnames with similar pronunciation but differences in spelling.

The main differences compared to the other soundex variants are:

• coded names are 6 digits long
• the initial character of the name is coded
• rules to encode multi-character n-grams
• multiple possible encodings for the same name (branching)

Note: the implementation used by Solr (commons-codec’s DaitchMokotoffSoundex) has additional branching rules compared to the original description of the algorithm.
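For contrast with the differences listed above, here is a minimal Python sketch of classic American Soundex, which leaves the first letter uncoded and produces one letter plus three digits with no branching. It illustrates the general algorithm only; it is not the commons-codec code that Solr uses, and the function name soundex and its handling of non-ASCII input (such characters are simply skipped) are assumptions made for this example.

```python
# Letter groups of American Soundex; vowels, H, W and Y have no code.
_GROUPS = {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R"}
_CODES = {letter: digit for digit, letters in _GROUPS.items() for letter in letters}


def soundex(name: str) -> str:
    """Return the 4-character American Soundex code (illustrative sketch)."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    encoded = letters[0]                  # the first letter is kept as-is
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        # Emit a digit only when it differs from the preceding code.
        if code and code != previous:
            encoded += code
        # H and W do not separate letters that share a code; vowels do.
        if ch not in "HW":
            previous = code
    return (encoded + "000")[:4]          # pad or truncate to 4 characters


for name in ("Stephen", "Steven", "Stefan", "Stove"):
    print(name, soundex(name))
```

In this sketch "Stephen", "Steven" and "Stefan" share a code, while "Stove" does not; Daitch-Mokotoff differs by coding the first letter as well, using six digits, and allowing several codes per name.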
For more information, see http://en.wikipedia.org/wiki/Daitch%E2%80%93Mokotoff_Soundex and http://www.avotaynu.com/soundex.htm

Double Metaphone

To use this encoding in your analyzer, see Double Metaphone Filter in the Filter Descriptions section. Alternatively, you may specify encoder="DoubleMetaphone" with the Phonetic Filter, but note that the Phonetic Filter version will not provide the second ("alternate") encoding that is generated by the Double Metaphone Filter for some tokens.

Encodes tokens using the double metaphone algorithm by Lawrence Philips. See the original article at http://www.drdobbs.com/the-double-metaphone-search-algorithm/184401251?pgno=2

Metaphone

To use this encoding in your analyzer, specify encoder="Metaphone" with the Phonetic Filter.

Encodes tokens using the Metaphone algorithm by Lawrence Philips, described in "Hanging on the Metaphone" in Computer Language, Dec. 1990.

Another reference for more information is Double Metaphone Search Algorithm, by Lawrence Philips.

Soundex

To use this encoding in your analyzer, specify encoder="Soundex" with the Phonetic Filter.

Encodes tokens using the Soundex algorithm, which is used to relate similar names, but can also be used as a general purpose scheme to find words with similar phonemes.

See also http://en.wikipedia.org/wiki/Soundex.

Refined Soundex

To use this encoding in your analyzer, specify encoder="RefinedSoundex" with the Phonetic Filter.

Encodes tokens using an improved version of the Soundex algorithm.

See http://en.wikipedia.org/wiki/Soundex.

Caverphone

To use this encoding in your analyzer, specify encoder="Caverphone" with the Phonetic Filter.

Caverphone is an algorithm created by the Caversham Project at the University of Otago. The algorithm is optimised for accents present in the southern part of the city of Dunedin, New Zealand.
See http://en.wikipedia.org/wiki/Caverphone and the Caverphone 2.0 specification at http://caversham.otago.ac.nz/files/working/ctp150804.pdf

Kölner Phonetik a.k.a. Cologne Phonetic

To use this encoding in your analyzer, specify encoder="ColognePhonetic" with the Phonetic Filter.

The Kölner Phonetik, an algorithm published by Hans Joachim Postel in 1969, is optimized for the German language.

See http://de.wikipedia.org/wiki/K%C3%B6lner_Phonetik

NYSIIS

To use this encoding in your analyzer, specify encoder="Nysiis" with the Phonetic Filter.

NYSIIS is an encoding used to relate similar names, but can also be used as a general purpose scheme to find words with similar phonemes.

See http://en.wikipedia.org/wiki/NYSIIS and http://www.dropby.com/NYSIIS.html

Running Your Analyzer

Once you’ve defined a field type in your Schema, and specified the analysis steps that you want applied to it, you should test it out to make sure that it behaves the way you expect it to.

Luckily, there is a very handy page in the Solr admin interface that lets you do just that. You can invoke the analyzer for any text field, provide sample input, and display the resulting token stream.

For example, let’s look at some of the "Text" field types available in the bin/solr -e techproducts example configuration, and use the Analysis Screen (http://localhost:8983/solr/#/techproducts/analysis) to compare how the tokens produced at index time for the sentence "Running an Analyzer" match up with a slightly different query text of "run my analyzer".

We can begin with “text_ws” - one of the most simplified Text field types available:

By looking at the start and end positions for each term, we can see that the only thing this field type does is tokenize text on whitespace.
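The start and end positions reported for a whitespace-only field type can be reproduced outside Solr with a few lines of Python. This is only a sketch of the idea, not Solr's tokenizer: the helper name whitespace_tokens is hypothetical, and it simply treats every run of non-whitespace characters as one term.

```python
import re


def whitespace_tokens(text: str):
    """Return (term, start, end) for each run of non-whitespace characters."""
    return [(m.group(0), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


# Index-time sentence and query text used in this section.
for sample in ("Running an Analyzer", "run my analyzer"):
    for term, start, end in whitespace_tokens(sample):
        print(f"{term!r}: start={start}, end={end}")
```

For "Running an Analyzer" the sketch yields offsets 0-7, 8-10 and 11-19, and none of the resulting terms equals any term produced from "run my analyzer", since no lowercasing, stemming or stopword removal has been applied.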
Notice in this image that the term "Running" has a start position of 0 and an end position of 7, while "an" has a start position of 8 and an end position of 10, and "Analyzer" starts at 11 and ends at 19. The one-character gaps between the end of one term and the start of the next (7 to 8, and 10 to 11) are the whitespace characters of the 19-character input, so we know that the whitespace was used only to split the text and is not part of any term.

Note also that the indexed terms and the query terms are still very different. "Running" doesn’t match "run", "Analyzer" doesn’t match "analyzer" (to a computer), and obviously "an" and "my" are totally different words. If our objective is to allow queries like "run my analyzer" to match indexed text like "Running an Analyzer", then we will evidently need to pick a different field type with index and query time text analysis that does more processing of the inputs.

In particular we will want:

• Case insensitivity, so "Analyzer" and "analyzer" match.
• Stemming, so words like "Run" and "Running" are considered equivalent terms.
• Stop Word Pruning, so small words like "an" and "my" don’t affect the query.

For our next attempt, let’s try the “text_general” field type:

With the verbose output enabled, we can see how each stage of our new analyzers modifies the tokens it receives before passing them on to the next stage. As we scroll down to the final output, we can see that we do start to get a match on "analyzer" from each input string, thanks to the "LCF" stage, which, if you hover over it with your mouse, you’ll see is the “LowerCaseFilter”:

The “text_general” field type is designed to be generally useful for any language, and it has definitely gotten us closer to our objective than “text_ws” from our first example by solving the problem of case sensitivity. It’s still not quite what we are looking for because we don’t see stemming or stopword rules being applied.
So now let us try the “text_en” field type:

Now we can see the "SF" (StopFilter) stage of the analyzers solving the problem of removing Stop Words ("an"), and as we scroll down, we also see the "PSF" (PorterStemFilter) stage apply stemming rules suitable for our English language input, such that the terms produced by our "index analyzer" and the terms produced by our "query analyzer" match the way we expect.

At this point, we can continue to experiment with additional inputs, verifying that our analyzers produce matching tokens when we expect them to match, and disparate tokens when we do not expect them to match, as we iterate and tweak our field type configuration.

Indexing and Basic Data Operations

This section describes how Solr adds data to its index. It covers the following topics:

• Introduction to Solr Indexing: An overview of Solr’s indexing process.
• Post Tool: Information about using post.jar to quickly upload some content to your system.
• Uploading Data with Index Handlers: Information about using Solr’s Index Handlers to upload XML/XSLT, JSON and CSV data.
• Transforming and Indexing Custom JSON: Index any JSON of your choice.
• Uploading Data with Solr Cell using Apache Tika: Information about using the Solr Cell framework to upload data for indexing.
• Uploading Structured Data Store Data with the Data Import Handler: Information about uploading and indexing data from a structured data store.
• Updating Parts of Documents: Information about how to use atomic updates and optimistic concurrency with Solr.
• Detecting Languages During Indexing: Information about using language identification during the indexing process.
• De-Duplication: Information about configuring Solr to mark duplicate documents as they are indexed.
• Content Streams: Information about streaming content to Solr Request Handlers.
• UIMA Integration: Information about integrating Solr with Apache’s Unstructured Information Management Architecture (UIMA). UIMA lets you define custom pipelines of Analysis Engines that incrementally add metadata to your documents as annotations.

Indexing Using Client APIs

Using client APIs, such as SolrJ, from your applications is an important option for updating Solr indexes. See the Client APIs section for more information.

Introduction to Solr Indexing

This section describes the process of indexing: adding content to a Solr index and, if necessary, modifying that content or deleting it.

By adding content to an index, we make it searchable by Solr.

A Solr index can accept data from many different sources, including XML files, comma-separated value (CSV) files, data extracted from tables in a database, and files in common file formats such as Microsoft Word or PDF.

Here are the three most common ways of loading data into a Solr index:

• Using the Solr Cell framework built on Apache Tika for ingesting binary files or structured files such as Office, Word, PDF, and other proprietary formats.
• Uploading XML files by sending HTTP requests to the Solr server from any environment where such requests can be generated.
• Writing a custom Java application to ingest data through Solr’s Java Client API (which is described in more detail in Client APIs). Using the Java API may be the best choice if you’re working with an application, such as a Content Management System (CMS), that offers a Java API.
Regardless of the method used to ingest data, there is a common basic data structure for data being fed into a Solr index: a document containing multiple fields, each with a name and containing content, which may be empty. One of the fields is usually designated as a unique ID field (analogous to a primary key in a database), although the use of a unique ID field is not strictly required by Solr.

If the field name is defined in the Schema that is associated with the index, then the analysis steps associated with that field will be applied to its content when the content is tokenized. Fields that are not explicitly defined in the Schema will either be ignored or mapped to a dynamic field definition (see Documents, Fields, and Schema Design), if one matching the field name exists.

For more information on indexing in Solr, see the Solr Wiki.

The Solr Example Directory

When starting Solr with the "-e" option, the example/ directory will be used as base directory for the example Solr instances that are created. This directory also includes an example/exampledocs/ subdirectory containing sample documents in a variety of formats that you can use to experiment with indexing into the various examples.

The curl Utility for Transferring Files

Many of the instructions and examples in this section make use of the curl utility for transferring content through a URL. curl posts and retrieves data over HTTP, FTP, and many other protocols. Most Linux distributions include a copy of curl. You’ll find curl downloads for Linux, Windows, and many other operating systems at http://curl.haxx.se/download.html. Documentation for curl is available here: http://curl.haxx.se/docs/manpage.html.
Using curl or other command line tools for posting data is just fine for examples or tests, but it’s not the recommended method for achieving the best performance for updates in production environments. You will achieve better performance with Solr Cell or the other methods described in this section.

Instead of curl, you can use utilities such as GNU wget (http://www.gnu.org/software/wget/) or manage GETs and POSTs with Perl, although the command line options will differ.

Post Tool

Solr includes a simple command line tool for POSTing various types of content to a Solr server.

The tool is bin/post. The bin/post tool is a Unix shell script; for Windows (non-Cygwin) usage, see the section Post Tool Windows Support below.

To run it, open a window and enter:

bin/post -c gettingstarted example/films/films.json

This will contact the server at localhost:8983. Specifying the collection/core name is mandatory. The -help (or simply -h) option will output information on its usage (i.e., bin/post -help).

Using the bin/post Tool

Specifying either the collection/core name or the full update url is mandatory when using bin/post.
The basic usage of bin/post is:

$ bin/post -h
Usage: post -c <collection> [OPTIONS] <files|directories|urls|-d ["...",...]>
    or post -help

   collection name defaults to DEFAULT_SOLR_COLLECTION if not specified

OPTIONS
=======
  Solr options:
    -url <base Solr update URL> (overrides collection, host, and port)
    -host <host> (default: localhost)
    -p or -port <port> (default: 8983)
    -commit yes|no (default: yes)
    -u or -user <user:pass> (sets BasicAuth credentials)

  Web crawl options:
    -recursive <depth> (default: 1)
    -delay <seconds> (default: 10)

  Directory crawl options:
    -delay <seconds> (default: 0)

  stdin/args options:
    -type <content/type> (default: application/xml)

  Other options:
    -filetypes <type>[,<type>,...] (default: xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log)
    -params "<key>=<value>[&<key>=<value>...]" (values must be URL-encoded; these pass through to Solr update request)
    -out yes|no (default: no; yes outputs Solr response to console)
...

Examples Using bin/post

There are several ways to use bin/post. This section presents several examples.

Indexing XML

Add all documents with file extension .xml to the collection or core named gettingstarted.

bin/post -c gettingstarted *.xml

Add all documents with file extension .xml to the gettingstarted collection/core on Solr running on port 8984.

bin/post -c gettingstarted -p 8984 *.xml

Send XML arguments to delete a document from gettingstarted.

bin/post -c gettingstarted -d '<delete><id>42</id></delete>'

Indexing CSV

Index all CSV files into gettingstarted:

bin/post -c gettingstarted *.csv

Index a tab-separated file into gettingstarted:

bin/post -c gettingstarted -params "separator=%09" -type text/csv data.tsv

The content type (-type) parameter is required so that the file is treated as CSV; otherwise the file is ignored and a WARNING is logged, because the tool does not know what type of content a .tsv file is.
The CSV handler supports the separator parameter, which is passed through using the -params setting.

Indexing JSON

Index all JSON files into gettingstarted.

bin/post -c gettingstarted *.json

Indexing Rich Documents (PDF, Word, HTML, etc.)

Index a PDF file into gettingstarted.

bin/post -c gettingstarted a.pdf

Automatically detect content types in a folder, and recursively scan it for documents for indexing into gettingstarted.

bin/post -c gettingstarted afolder/

Automatically detect content types in a folder, but limit it to PPT and HTML files and index into gettingstarted.

bin/post -c gettingstarted -filetypes ppt,html afolder/

Indexing to a Password Protected Solr (Basic Auth)

Index a PDF as the user "solr" with password "SolrRocks":

bin/post -u solr:SolrRocks -c gettingstarted a.pdf

Post Tool Windows Support

bin/post currently exists only as a Unix shell script; however, it delegates its work to a cross-platform capable Java program. The SimplePostTool can be run directly in supported environments, including Windows.

SimplePostTool

The bin/post script currently delegates to a standalone Java program called SimplePostTool.

This tool, bundled into an executable JAR, can be run directly using java -jar example/exampledocs/post.jar. See the help output and take it from there to post files, recurse a website or file system folder, or send direct commands to a Solr server.

$ java -jar example/exampledocs/post.jar -h
SimplePostTool version 5.0.0
Usage: java [SystemProperties] -jar post.jar [-h|-] [<file|folder|url|arg> [<file|folder|url|arg>...]]
. . .

Uploading Data with Index Handlers

Index Handlers are Request Handlers designed to add, delete and update documents to the index.
In addition to having plugins for importing rich documents using Tika or from structured data sources using the Data Import Handler, Solr natively supports indexing structured documents in XML, CSV and JSON.

The recommended way to configure and use request handlers is with path based names that map to paths in the request url. However, request handlers can also be specified with the qt (query type) parameter if the requestDispatcher is appropriately configured. It is possible to access the same handler using more than one name, which can be useful if you wish to specify different sets of default options.

A single unified update request handler supports XML, CSV, JSON, and javabin update requests, delegating to the appropriate ContentStreamLoader based on the Content-Type of the ContentStream.

UpdateRequestHandler Configuration

The default configuration file has the update request handler configured by default.

<requestHandler name="/update" class="solr.UpdateRequestHandler" />

XML Formatted Index Updates

Index update commands can be sent as XML messages to the update handler using Content-type: application/xml or Content-type: text/xml.

Adding Documents

The XML schema recognized by the update handler for adding documents is very straightforward:

• The <add> element introduces one or more documents to be added.
• The <doc> element introduces the fields making up a document.
• The <field> element presents the content for a specific field.

For example:

<add>
  <doc>
    <field name="authors">Patrick Eagar</field>
    <field name="subject">Sports</field>
    <field name="dd">796.35</field>
    <field name="numpages">128</field>
    <field name="desc"></field>
    <field name="price">12.40</field>
    <field name="title">Summer of the all-rounder: Test and championship cricket in England 1982</field>
    <field name="isbn">0002166313</field>
    <field name="yearpub">1982</field>
    <field name="publisher">Collins</field>
  </doc>
  <doc>
  ...
  </doc>
</add>

The add command supports some optional attributes which may be specified.

commitWithin: Add the document within the specified number of milliseconds.

overwrite: Default is true. Indicates if the unique key constraints should be checked to overwrite previous versions of the same document (see below).
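To make the add, doc and field nesting and the two optional attributes concrete, here is a small Python sketch that assembles such a message with the standard library. The helper name build_add_message and the sample field names (isbn, publisher) are assumptions for illustration; the sketch only builds the XML string and does not send it to Solr.

```python
import xml.etree.ElementTree as ET


def build_add_message(docs, commit_within_ms=None, overwrite=None):
    """Build an XML add message from a list of {field name: value} dicts."""
    add = ET.Element("add")
    # Optional attributes of the add command.
    if commit_within_ms is not None:
        add.set("commitWithin", str(commit_within_ms))
    if overwrite is not None:
        add.set("overwrite", "true" if overwrite else "false")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            # A list or tuple becomes several field elements with the same name.
            values = value if isinstance(value, (list, tuple)) else [value]
            for item in values:
                field = ET.SubElement(doc_el, "field", name=name)
                field.text = str(item)
    return ET.tostring(add, encoding="unicode")


# Hypothetical sample document; field names must exist in your own schema.
message = build_add_message(
    [{"isbn": "0002166313", "publisher": "Collins"}],
    commit_within_ms=5000,
    overwrite=False,
)
print(message)
```

Building the message with an XML library rather than string concatenation means characters such as "&" and "<" in field values are escaped automatically.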
If the document schema defines a unique key, then by default an /update operation to add a document will overwrite (i.e., replace) any document in the index with the same unique key. If no unique key has been defined, indexing performance is somewhat faster, as no check has to be made for existing documents to replace.

If you have a unique key field, but you feel confident that you can safely bypass the uniqueness check (e.g., you build your indexes in batch, and your indexing code guarantees it never adds the same document more than once), you can specify the overwrite="false" option when adding your documents.

XML Update Commands

Commit and Optimize During Updates

The <commit> operation writes all documents loaded since the last commit to one or more segment files on the disk. Before a commit has been issued, newly indexed content is not visible to searches. The commit operation opens a new searcher, and triggers any event listeners that have been configured.

Commits may be issued explicitly with a <commit/> message, and can also be triggered from <autocommit> parameters in solrconfig.xml.

The <optimize> operation requests Solr to merge internal data structures. For a large index, optimization will take some time to complete, but by merging many small segment files into a larger one, search performance may improve. If you are using Solr’s replication mechanism to distribute searches across many systems, be aware that after an optimize, a complete index will need to be transferred.

You should only consider using optimize on static indexes, i.e., indexes that can be optimized as part of the regular update process (say once-a-day updates). Applications requiring NRT functionality are discouraged from using optimize.

The <commit> and <optimize> elements accept these optional attributes:

waitSearcher: Default is true.
Blocks until a new searcher is opened and registered as the main query searcher, making the changes visible.

expungeDeletes: (commit only) Default is false. Merges segments that have more than 10% deleted docs, expunging them in the process.

expungeDeletes is "less expensive" than optimize, but the same warnings apply.

maxSegments: (optimize only) Default is 1. Merges the segments down to no more than this number of segments.

Here are examples of <commit> and <optimize> using optional attributes:

<commit waitSearcher="false"/>
<commit waitSearcher="false" expungeDeletes="true"/>
<optimize waitSearcher="false"/>

Delete Operations

Documents can be deleted from the index in two ways. "Delete by ID" deletes the document with the specified ID, and can be used only if a UniqueID field has been defined in the schema. "Delete by Query" deletes all documents matching a specified query, although commitWithin is ignored for a Delete by Query. A single delete message can contain multiple delete operations.

<delete>
  <id>0002166313</id>
  <id>0031745983</id>
  <query>subject:sport</query>
  <query>publisher:penguin</query>
</delete>

When using the Join query parser in a Delete By Query, you should use the score parameter with a value of "none" to avoid a ClassCastException. See the section on the Join Query Parser for more details on the score parameter.

Rollback Operations

The rollback command rolls back all adds and deletes made to the index since the last commit. It neither calls any event listeners nor creates a new searcher. Its syntax is simple: <rollback/>.

Grouping Operations

You can post several commands in a single XML file by grouping them with the surrounding <update> element.

<update>
  <add>
    <doc><!-- doc 1 content --></doc>
  </add>
  <add>
    <doc><!-- doc 2 content --></doc>
  </add>
  <delete>
    <id>0002166313</id>
  </delete>
</update>

Using curl to Perform Updates

You can use the curl utility to perform any of the above commands, using its --data-binary option to append the XML message to the curl command, and generating an HTTP POST request.
For example:

curl http://localhost:8983/solr/my_collection/update -H "Content-Type: text/xml" --data-binary '
<add>
  <doc>
    <field name="authors">Patrick Eagar</field>
    <field name="subject">Sports</field>
    <field name="dd">796.35</field>
    <field name="isbn">0002166313</field>
    <field name="yearpub">1982</field>
    <field name="publisher">Collins</field>
  </doc>
</add>'

For posting XML messages contained in a file, you can use the alternative form:

curl http://localhost:8983/solr/my_collection/update -H "Content-Type: text/xml" --data-binary @myfile.xml

The approach above works well, but using the --data-binary option causes curl to load the whole myfile.xml into memory before posting it to the server. This may be problematic when dealing with multi-gigabyte files. This alternative curl command performs equivalent operations but with minimal curl memory usage:

curl http://localhost:8983/solr/my_collection/update -H "Content-Type: text/xml" -T "myfile.xml" -X POST

Short requests can also be sent using an HTTP GET command, if enabled in RequestDispatcher in SolrConfig element, URL-encoding the request, as in the following. Note the escaping of "<" and ">", and the quotes around the URL, which stop the shell from interpreting the "&":

curl "http://localhost:8983/solr/my_collection/update?stream.body=%3Ccommit/%3E&wt=xml"

Responses from Solr take the form shown here:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">127</int>
  </lst>
</response>

The status field will be non-zero in case of failure.

Using XSLT to Transform XML Index Updates

The UpdateRequestHandler allows you to index any arbitrary XML using the <tr> parameter to apply an XSL transformation. You must have an XSLT stylesheet in the conf/xslt directory of your config set that can transform the incoming data to the expected <add><doc/></add> format, and use the tr parameter to specify the name of that stylesheet.

Here is an example XSLT stylesheet:

This stylesheet transforms Solr’s XML search result format into Solr’s Update XML syntax.
One example usage would be to copy a Solr 1.3 index (which does not have a CSV response writer) into a format which can be indexed into another Solr instance (provided that all fields are stored):

http://localhost:8983/solr/my_collection/select?q=*:*&wt=xslt&tr=updateXml.xsl&rows=1000

You can also use the stylesheet in XsltUpdateRequestHandler to transform an index when updating:

curl "http://localhost:8983/solr/my_collection/update?commit=true&tr=updateXml.xsl" -H "Content-Type: text/xml" --data-binary @myexporteddata.xml

JSON Formatted Index Updates

Solr can accept JSON that conforms to a defined structure, or can accept arbitrary JSON-formatted documents. If sending arbitrarily formatted JSON, there are some additional parameters that need to be sent with the update request, described below in the section Transforming and Indexing Custom JSON.

Solr-Style JSON

JSON formatted update requests may be sent to Solr’s /update handler using Content-Type: application/json or Content-Type: text/json.

JSON formatted updates can take 3 basic forms, described in depth below:

• A single document to add, expressed as a top level JSON Object. To differentiate this from a set of commands, the json.command=false request parameter is required.
• A list of documents to add, expressed as a top level JSON Array containing a JSON Object per document.
• A sequence of update commands, expressed as a top level JSON Object (aka: Map).
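The three forms listed above differ only in the shape of the top-level JSON value and in the json.command parameter. The following Python sketch illustrates that distinction; the function name update_form is hypothetical, and the code only mirrors the rules stated in the list rather than reproducing Solr's actual parsing logic.

```python
import json


def update_form(body: str, json_command: bool = True) -> str:
    """Classify a JSON update body by its top-level shape (illustrative only)."""
    parsed = json.loads(body)
    if isinstance(parsed, list):
        # A top-level array is a list of documents to add.
        return "document list"
    if isinstance(parsed, dict):
        # A top-level object is a set of commands unless json.command=false.
        return "commands" if json_command else "single document"
    raise ValueError("a JSON update body must be an object or an array")


single_doc = json.dumps({"id": "1", "title": "Doc 1"})
doc_list = json.dumps([{"id": "1", "title": "Doc 1"}, {"id": "2", "title": "Doc 2"}])
commands = json.dumps({"delete": {"id": "1"}, "commit": {}})

print(update_form(single_doc, json_command=False))
print(update_form(doc_list))
print(update_form(commands))
```

The same object body is read as a document or as a command set depending only on json.command, which is why the single-document form needs the parameter (or a path that sets it by default).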
Adding a Single JSON Document

The simplest way to add Documents via JSON is to send each document individually as a JSON Object, using the /update/json/docs path:

curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/my_collection/update/json/docs' --data-binary '
{
  "id": "1",
  "title": "Doc 1"
}'

Adding Multiple JSON Documents

Adding multiple documents at one time via JSON can be done via a JSON Array of JSON Objects, where each object represents a document:

curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/my_collection/update' --data-binary '
[
  {
    "id": "1",
    "title": "Doc 1"
  },
  {
    "id": "2",
    "title": "Doc 2"
  }
]'

A sample JSON file is provided at example/exampledocs/books.json and contains an array of objects that you can add to the Solr techproducts example:

curl 'http://localhost:8983/solr/techproducts/update?commit=true' --data-binary @example/exampledocs/books.json -H 'Content-type:application/json'

Sending JSON Update Commands

In general, the JSON update syntax supports all of the update commands that the XML update handler supports, through a straightforward mapping.
Multiple commands, adding and deleting documents, may be contained in one message:

curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/my_collection/update' --data-binary '
{
  "add": {
    "doc": {
      "id": "DOC1",
      "my_field": 2.3,
      "my_multivalued_field": [ "aaa", "bbb" ] ①
    }
  },
  "add": {
    "commitWithin": 5000, ②
    "overwrite": false, ③
    "doc": {
      "f1": "v1", ④
      "f1": "v2"
    }
  },
  "commit": {},
  "optimize": { "waitSearcher":false },
  "delete": { "id":"ID" }, ⑤
  "delete": { "query":"QUERY" } ⑥
}'

① Can use an array for a multi-valued field
② Commit this document within 5 seconds
③ Don’t check for existing documents with the same uniqueKey
④ Can use repeated keys for a multi-valued field
⑤ Delete by ID (uniqueKey field)
⑥ Delete by Query

As with other update handlers, parameters such as commit, commitWithin, optimize, and overwrite may be specified in the URL instead of in the body of the message.

The JSON update format allows for a simple delete-by-id. The value of a delete can be an array which contains a list of zero or more specific document IDs to be deleted. It is not a range (start and end).

For example, a single document:

{ "delete":"myid" }

Or a list of document IDs:

{ "delete":["id1","id2"] }

You can also specify _version_ with each "delete":

{ "delete": { "id":50, "_version_":12345 } }

You can specify the version of deletes in the body of the update request as well.
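The delete forms described above can be generated programmatically. The Python sketch below builds each variant with the standard json module; the helper names delete_by_id and delete_by_query are hypothetical and not part of any Solr client library, and the versioned form assumes the object syntax with id and _version_ keys.

```python
import json


def delete_by_id(ids, version=None):
    """Build a JSON delete-by-id command (hypothetical helper)."""
    if version is not None:
        # A version applies to one document, so only a single ID is accepted.
        if isinstance(ids, (list, tuple)):
            raise ValueError("_version_ can only accompany a single id")
        return {"delete": {"id": ids, "_version_": version}}
    if isinstance(ids, (list, tuple)):
        # A list of specific IDs, not a start/end range.
        return {"delete": list(ids)}
    return {"delete": ids}


def delete_by_query(query):
    """Build a JSON delete-by-query command (hypothetical helper)."""
    return {"delete": {"query": query}}


print(json.dumps(delete_by_id("myid")))
print(json.dumps(delete_by_id(["id1", "id2"])))
print(json.dumps(delete_by_id(50, version=12345)))
print(json.dumps(delete_by_query("subject:sport")))
```

Each printed string can be used as the body of a POST to the /update handler with Content-Type: application/json.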
JSON Update Convenience Paths

In addition to the /update handler, there are a few additional JSON specific request handler paths available by default in Solr, that implicitly override the behavior of some request parameters:

Path                 Default Parameters
/update/json         stream.contentType=application/json
/update/json/docs    stream.contentType=application/json
                     json.command=false

The /update/json path may be useful for clients sending in JSON formatted update commands from applications where setting the Content-Type proves difficult, while the /update/json/docs path can be particularly convenient for clients that always want to send in documents (either individually or as a list) without needing to worry about the full JSON command syntax.

Custom JSON Documents

Solr can support custom JSON. This is covered in the section Transforming and Indexing Custom JSON.

CSV Formatted Index Updates

CSV formatted update requests may be sent to Solr’s /update handler using Content-Type: application/csv or Content-Type: text/csv.

A sample CSV file is provided at example/exampledocs/books.csv that you can use to add some documents to the Solr techproducts example:

curl 'http://localhost:8983/solr/my_collection/update?commit=true' --data-binary @example/exampledocs/books.csv -H 'Content-type:application/csv'

CSV Update Parameters

The CSV handler allows the specification of many parameters in the URL in the form: f.parameter.optional_fieldname=value.

The table below describes the parameters for the update handler.

separator
Character used as field separator; default is ",". This parameter is global; for per-field usage, see the split parameter.
Example: separator=%09

trim
If true, remove leading and trailing whitespace from values. The default is false. This parameter can be either global or per-field.
Examples: f.isbn.trim=true or trim=false

header
Set to true if first line of input contains field names. These will be used if the fieldnames parameter is absent. This parameter is global.

fieldnames
Comma-separated list of field names to use when adding documents. This parameter is global.
Example: fieldnames=isbn,price,title

literal.field_name
A literal value for a specified field name. This parameter is global.
Example: literal.color=red

skip
Comma-separated list of field names to skip. This parameter is global.
Example: skip=uninteresting,shoesize

skipLines
Number of lines to discard in the input stream before the CSV data starts, including the header, if present. Default=0. This parameter is global.
Example: skipLines=5

encapsulator
The character optionally used to surround values to preserve characters such as the CSV separator or whitespace. This standard CSV format handles the encapsulator itself appearing in an encapsulated value by doubling the encapsulator. This parameter is global; for per-field usage, see split.
Example: encapsulator="

escape
The character used for escaping CSV separators or other reserved characters. If an escape is specified, the encapsulator is not used unless also explicitly specified since most formats use either encapsulation or escaping, not both.
Example: escape=\

keepEmpty
Keep and index zero length (empty) fields. The default is false. This parameter can be global or per-field.
Example: f.price.keepEmpty=true

map
Map one value to another. Format is value:replacement (which can be empty). This parameter can be global or per-field.
Example: map=left:right or f.subject.map=history:bunk

split
If true, split a field into multiple values by a separate parser. This parameter is used on a per-field basis.
overwrite
If true (the default), check for and overwrite duplicate documents, based on the uniqueKey field declared in the Solr schema. If you know the documents you are indexing do not contain any duplicates then you may see a considerable speed up setting this to false. This parameter is global.

commit
Issues a commit after the data has been ingested. This parameter is global.

commitWithin
Add the document within the specified number of milliseconds. This parameter is global.
Example: commitWithin=10000

rowid
Map the rowid (line number) to a field specified by the value of the parameter, for instance if your CSV doesn’t have a unique key and you want to use the row id as such. This parameter is global.
Example: rowid=id

rowidOffset
Add the given offset (as an integer) to the rowid before adding it to the document. Default is 0. This parameter is global.
Example: rowidOffset=10

Indexing Tab-Delimited files

The same feature used to index CSV documents can also be easily used to index tab-delimited files (TSV files) and even handle backslash escaping rather than CSV encapsulation.

For example, one can dump a MySQL table to a tab delimited file with:

SELECT * INTO OUTFILE '/tmp/result.txt' FROM mytable;

This file could then be imported into Solr by setting the separator to tab (%09) and the escape to backslash (%5c).
curl 'http://localhost:8983/solr/my_collection/update/csv?commit=true&separator=%09&escape=%5c' --data-binary @/tmp/result.txt

CSV Update Convenience Paths

In addition to the /update handler, there is an additional CSV specific request handler path available by default in Solr, that implicitly overrides the behavior of some request parameters:

Path           Default Parameters
/update/csv    stream.contentType=application/csv

The /update/csv path may be useful for clients sending in CSV formatted update commands from applications where setting the Content-Type proves difficult.

Nested Child Documents

Solr indexes nested documents in blocks as a way to model documents containing other documents, such as a blog post parent document and comments as child documents, or products as parent documents and sizes, colors, or other variations as child documents. At query time, the Block Join Query Parsers can search these relationships. In terms of performance, indexing the relationships between documents may be more efficient than attempting to do joins only at query time, since the relationships are already stored in the index and do not need to be computed.

Nested documents may be indexed via either the XML or JSON data syntax (or using SolrJ), but regardless of syntax, you must include a field that identifies the parent document as a parent; it can be any field that suits this purpose, and it will be used as input for the block join query parsers.

To support nested documents, the schema must include an indexed/non-stored field _root_. The value of that field is populated automatically and is the same for all documents in the block, regardless of the inheritance depth.

XML Examples

For example, here are two documents and their child documents:

<add>
  <doc>
    <field name="id">1</field>
    <field name="title">Solr adds block join support</field>
    <field name="content_type">parentDocument</field>
    <doc>
      <field name="id">2</field>
      <field name="comments">SolrCloud supports it too!</field>
    </doc>
  </doc>
  <doc>
    <field name="id">3</field>
    <field name="title">New Lucene and Solr release is out</field>
    <field name="content_type">parentDocument</field>
    <doc>
      <field name="id">4</field>
      <field name="comments">Lots of new features</field>
    </doc>
  </doc>
</add>

In this example, we have indexed the parent documents with the field content_type, which has the value "parentDocument". We could have also used a boolean field, such as isParent, with a value of "true", or any other similar approach.

JSON Examples

This example is equivalent to the XML example above. Note the special _childDocuments_ key needed to indicate the nested documents in JSON.

[
  {
    "id": "1",
    "title": "Solr adds block join support",
    "content_type": "parentDocument",
    "_childDocuments_": [
      {
        "id": "2",
        "comments": "SolrCloud supports it too!"
      }
    ]
  },
  {
    "id": "3",
    "title": "New Lucene and Solr release is out",
    "content_type": "parentDocument",
    "_childDocuments_": [
      {
        "id": "4",
        "comments": "Lots of new features"
      }
    ]
  }
]

Note: One limitation of indexing nested documents is that the whole block of parent-children documents must be updated together whenever any changes are required. In other words, even if a single child document or the parent document is changed, the whole block of parent-child documents must be indexed together.

Transforming and Indexing Custom JSON

If you have JSON documents that you would like to index without transforming them into Solr’s structure, you can add them to Solr by including some parameters with the update request. These parameters provide information on how to split a single JSON file into multiple Solr documents and how to map fields to Solr’s schema. One or more valid JSON documents can be sent to the /update/json/docs path with the configuration params.

Mapping Parameters

These parameters allow you to define how a JSON file should be read for multiple Solr documents.

split
Defines the path at which to split the input JSON into multiple Solr documents and is required if you have multiple documents in a single JSON file.
If the entire JSON makes a single Solr document, the path must be "/". It is possible to pass multiple split paths by separating them with a pipe (|), for example: split=/|/foo|/foo/bar. If one path is a child of another, the documents at the child path automatically become child documents.

f
Provides multivalued mapping to map document field names to Solr field names. The format of the parameter is target-field-name:json-path, as in f=first:/first. The json-path is required. The target-field-name is the Solr document field name, and is optional. If not specified, it is automatically derived from the input JSON. The default target field name is the fully qualified name of the field.
Wildcards can be used here, see Using Wildcards for Field Names below for more information.

mapUniqueKeyOnly
(boolean) This parameter is particularly convenient when the fields in the input JSON are not available in the schema and schemaless mode is not enabled. This will index all the fields into the default search field (using the df parameter, below) and only the uniqueKey field is mapped to the corresponding field in the schema. If the input JSON does not have a value for the uniqueKey field then a UUID is generated for it.

df
If the mapUniqueKeyOnly flag is used, the update handler needs a field where the data should be indexed to. This is the same field that other handlers use as a default search field.

srcField
This is the name of the field in which the JSON source will be stored. This can only be used if split=/ (i.e., you want your JSON input file to be indexed as a single Solr document). Note that atomic updates will cause the field to be out-of-sync with the document.

echo
This is for debugging purposes only. Set it to true if you want the docs to be returned as a response. Nothing will be indexed.
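Because f is multivalued, it appears in the query string once per mapping, alongside a single split value. As a minimal sketch, the Python snippet below assembles such a URL with the standard library; the function name custom_json_url is hypothetical, and the host, collection name, and field mappings are placeholder assumptions.

```python
from urllib.parse import urlencode


def custom_json_url(base, split, field_mappings, echo=False):
    """Build an /update/json/docs URL with split, repeated f, and optional echo."""
    params = [("split", split)]
    # One ("f", mapping) pair per mapping yields a repeated f= parameter.
    params += [("f", mapping) for mapping in field_mappings]
    if echo:
        # echo=true returns the resulting documents without indexing them.
        params.append(("echo", "true"))
    # Keep "/" and ":" readable; other reserved characters are percent-encoded.
    return base + "?" + urlencode(params, safe="/:")


url = custom_json_url(
    "http://localhost:8983/solr/techproducts/update/json/docs",
    "/exams",
    ["first:/first", "subject:/exams/subject"],
    echo=True,
)
print(url)
```

Adding echo=true while developing the mappings is a convenient way to inspect the documents Solr would build before anything is written to the index.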
For example, if we have a JSON file that includes two documents, we could define an update request like this:

V1 API

curl 'http://localhost:8983/solr/techproducts/update/json/docs'\
'?split=/exams'\
'&f=first:/first'\
'&f=last:/last'\
'&f=grade:/grade'\
'&f=subject:/exams/subject'\
'&f=test:/exams/test'\
'&f=marks:/exams/marks'\
 -H 'Content-type:application/json' -d '
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    { "subject": "Maths", "test" : "term1", "marks" : 90},
    { "subject": "Biology", "test" : "term1", "marks" : 86}
  ]
}'

V2 API Standalone Solr

curl 'http://localhost:8983/api/cores/techproducts/update/json/docs'\
'?split=/exams'\
'&f=first:/first'\
'&f=last:/last'\
'&f=grade:/grade'\
'&f=subject:/exams/subject'\
'&f=test:/exams/test'\
'&f=marks:/exams/marks'\
 -H 'Content-type:application/json' -d '
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    { "subject": "Maths", "test" : "term1", "marks" : 90},
    { "subject": "Biology", "test" : "term1", "marks" : 86}
  ]
}'

V2 API SolrCloud

curl 'http://localhost:8983/api/collections/techproducts/update/json/docs'\
'?split=/exams'\
'&f=first:/first'\
'&f=last:/last'\
'&f=grade:/grade'\
'&f=subject:/exams/subject'\
'&f=test:/exams/test'\
'&f=marks:/exams/marks'\
 -H 'Content-type:application/json' -d '
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    { "subject": "Maths", "test" : "term1", "marks" : 90},
    { "subject": "Biology", "test" : "term1", "marks" : 86}
  ]
}'

With this request, we have defined that "exams" contains multiple documents. In addition, we have mapped several fields from the input document to Solr fields.
When the update request is complete, the following two documents will be added to the index:

{
  "first":"John",
  "last":"Doe",
  "marks":90,
  "test":"term1",
  "subject":"Maths",
  "grade":8
}
{
  "first":"John",
  "last":"Doe",
  "marks":86,
  "test":"term1",
  "subject":"Biology",
  "grade":8
}

In the prior example, all of the fields we wanted to use in Solr had the same names as they did in the input JSON. When that is the case, we can simplify the request by only specifying the json-path portion of the f parameter, as in this example:

V1 API

curl 'http://localhost:8983/solr/techproducts/update/json/docs'\
'?split=/exams'\
'&f=/first'\
'&f=/last'\
'&f=/grade'\
'&f=/exams/subject'\
'&f=/exams/test'\
'&f=/exams/marks'\
 -H 'Content-type:application/json' -d '
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    { "subject": "Maths", "test" : "term1", "marks" : 90},
    { "subject": "Biology", "test" : "term1", "marks" : 86}
  ]
}'

V2 API Standalone Solr

curl 'http://localhost:8983/api/cores/techproducts/update/json/docs'\
'?split=/exams'\
'&f=/first'\
'&f=/last'\
'&f=/grade'\
'&f=/exams/subject'\
'&f=/exams/test'\
'&f=/exams/marks'\
 -H 'Content-type:application/json' -d '
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    { "subject": "Maths", "test" : "term1", "marks" : 90},
    { "subject": "Biology", "test" : "term1", "marks" : 86}
  ]
}'

V2 API SolrCloud

curl 'http://localhost:8983/api/collections/techproducts/update/json/docs'\
'?split=/exams'\
'&f=/first'\
'&f=/last'\
'&f=/grade'\
'&f=/exams/subject'\
'&f=/exams/test'\
'&f=/exams/marks'\
 -H 'Content-type:application/json' -d '
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    { "subject": "Maths", "test" : "term1", "marks" : 90},
    { "subject": "Biology", "test" : "term1", "marks" : 86}
  ]
}'

In this example, we simply named the field paths (such as /exams/test). Solr will automatically attempt to add the content of the field from the JSON input to the index in a field with the same name.

Documents will be rejected during indexing if the fields do not exist in the schema before indexing. So, if you are NOT using schemaless mode, you must pre-create all fields. If you are working in Schemaless Mode, however, fields that don’t exist will be created on the fly with Solr’s best guess for the field type.

Reusing Parameters in Multiple Requests

You can store and re-use parameters with Solr’s Request Parameters API.

Say we wanted to define parameters to split documents at the exams field, and map several other fields. We could make an API request such as:

V1 API

curl http://localhost:8983/solr/techproducts/config/params -H 'Content-type:application/json' -d '{
  "set": {
    "my_params": {
      "split": "/exams",
      "f": ["first:/first","last:/last","grade:/grade","subject:/exams/subject","test:/exams/test"]
}}}'

V2 API Standalone Solr

curl http://localhost:8983/api/cores/techproducts/config/params -H 'Content-type:application/json' -d '{
  "set": {
    "my_params": {
      "split": "/exams",
      "f": ["first:/first","last:/last","grade:/grade","subject:/exams/subject","test:/exams/test"]
}}}'

V2 API SolrCloud

curl http://localhost:8983/api/collections/techproducts/config/params -H 'Content-type:application/json' -d '{
  "set": {
    "my_params": {
      "split": "/exams",
      "f": ["first:/first","last:/last","grade:/grade","subject:/exams/subject","test:/exams/test"]
}}}'

When we send the documents, we’d use the useParams parameter with the name of the parameter set we defined:

V1 API

curl 'http://localhost:8983/solr/techproducts/update/json/docs?useParams=my_params' -H 'Content-type:application/json' -d '{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    { "subject": "Maths", "test": "term1", "marks": 90 },
    { "subject": "Biology", "test": "term1", "marks": 86 }
  ]
}'

V2 API Standalone Solr

curl 'http://localhost:8983/api/cores/techproducts/update/json?useParams=my_params' -H 'Content-type:application/json' -d '{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    { "subject": "Maths", "test": "term1", "marks": 90 },
    { "subject": "Biology", "test": "term1", "marks": 86 }
  ]
}'

V2 API SolrCloud

curl 'http://localhost:8983/api/collections/techproducts/update/json?useParams=my_params' -H 'Content-type:application/json' -d '{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    { "subject": "Maths", "test": "term1", "marks": 90 },
    { "subject": "Biology", "test": "term1", "marks": 86 }
  ]
}'

Using Wildcards for Field Names

Instead of specifying all the field names explicitly, it is possible to specify wildcards to map fields automatically.

There are two restrictions: wildcards can only be used at the end of the json-path, and the split path cannot use wildcards.

A single asterisk * maps only to direct children, and a double asterisk ** maps recursively to all descendants. The following are example wildcard path mappings:

• f=$FQN:/**: maps all fields to the fully qualified name ($FQN) of the JSON field. The fully qualified name is obtained by concatenating all the keys in the hierarchy with a period (.) as a delimiter. This is the default behavior if no f path mappings are specified.
• f=/docs/*: maps all the fields under docs, using the names given in the JSON
• f=/docs/**: maps all the fields under docs and its children, using the names given in the JSON
• f=searchField:/docs/*: maps all fields under /docs to a single field called ‘searchField’
• f=searchField:/docs/**: maps all fields under /docs and its children to searchField

With wildcards we can further simplify our previous example as follows:

V1 API

curl 'http://localhost:8983/solr/techproducts/update/json/docs'\
'?split=/exams'\
'&f=/**'\
 -H 'Content-type:application/json' -d '
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    { "subject": "Maths", "test" : "term1", "marks" : 90},
    { "subject": "Biology", "test" : "term1", "marks" : 86}
  ]
}'

V2 API Standalone Solr

curl 'http://localhost:8983/api/cores/techproducts/update/json'\
'?split=/exams'\
'&f=/**'\
 -H 'Content-type:application/json' -d '
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    { "subject": "Maths", "test" : "term1", "marks" : 90},
    { "subject": "Biology", "test" : "term1", "marks" : 86}
  ]
}'

V2 API SolrCloud

curl 'http://localhost:8983/api/collections/techproducts/update/json'\
'?split=/exams'\
'&f=/**'\
 -H 'Content-type:application/json' -d '
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    { "subject": "Maths", "test" : "term1", "marks" : 90},
    { "subject": "Biology", "test" : "term1", "marks" : 86}
  ]
}'

Because we want the fields to be indexed with the field names as they are found in the JSON input, the double wildcard in f=/** will map all fields and their descendants to the same fields in Solr.

It is also possible to send all the values to a single field and do a full text search on that.
This is a good option to blindly index and query JSON documents without worrying about fields and schema.

V1 API

curl 'http://localhost:8983/solr/techproducts/update/json/docs'\
'?split=/'\
'&f=txt:/**'\
 -H 'Content-type:application/json' -d '
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    { "subject": "Maths", "test" : "term1", "marks" : 90},
    { "subject": "Biology", "test" : "term1", "marks" : 86}
  ]
}'

V2 API Standalone Solr

curl 'http://localhost:8983/api/cores/techproducts/update/json'\
'?split=/'\
'&f=txt:/**'\
 -H 'Content-type:application/json' -d '
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    { "subject": "Maths", "test" : "term1", "marks" : 90},
    { "subject": "Biology", "test" : "term1", "marks" : 86}
  ]
}'

V2 API SolrCloud

curl 'http://localhost:8983/api/collections/techproducts/update/json'\
'?split=/'\
'&f=txt:/**'\
 -H 'Content-type:application/json' -d '
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    { "subject": "Maths", "test" : "term1", "marks" : 90},
    { "subject": "Biology", "test" : "term1", "marks" : 86}
  ]
}'

In the above example, we’ve said all of the fields should be added to a field in Solr named 'txt'. This will add multiple fields to a single field, so whatever field you choose should be multi-valued.

The default behavior is to use the fully qualified name (FQN) of the node.
So, if we don’t define any field mappings, like this:

V1 API

curl 'http://localhost:8983/solr/techproducts/update/json/docs?split=/exams'\
 -H 'Content-type:application/json' -d '
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    { "subject": "Maths", "test" : "term1", "marks" : 90},
    { "subject": "Biology", "test" : "term1", "marks" : 86}
  ]
}'

V2 API Standalone Solr

curl 'http://localhost:8983/api/cores/techproducts/update/json?split=/exams'\
 -H 'Content-type:application/json' -d '
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    { "subject": "Maths", "test" : "term1", "marks" : 90},
    { "subject": "Biology", "test" : "term1", "marks" : 86}
  ]
}'

V2 API SolrCloud

curl 'http://localhost:8983/api/collections/techproducts/update/json?split=/exams'\
 -H 'Content-type:application/json' -d '
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    { "subject": "Maths", "test" : "term1", "marks" : 90},
    { "subject": "Biology", "test" : "term1", "marks" : 86}
  ]
}'

The indexed documents would be added to the index with fields that look like this:

{
  "first":"John",
  "last":"Doe",
  "grade":8,
  "exams.subject":"Maths",
  "exams.test":"term1",
  "exams.marks":90},
{
  "first":"John",
  "last":"Doe",
  "grade":8,
  "exams.subject":"Biology",
  "exams.test":"term1",
  "exams.marks":86}

Multiple Documents in a Single Payload

This functionality supports documents in the JSON Lines format (.jsonl), which specifies one document per line.
For example:

V1 API

    curl 'http://localhost:8983/solr/techproducts/update/json/docs' -H 'Content-type:application/json' -d '
    { "first":"Steve", "last":"Jobs", "grade":1, "subject":"Social Science", "test":"term1", "marks":90}
    { "first":"Steve", "last":"Woz", "grade":1, "subject":"Political Science", "test":"term1", "marks":86}'

V2 API Standalone Solr

    curl 'http://localhost:8983/api/cores/techproducts/update/json' -H 'Content-type:application/json' -d '
    { "first":"Steve", "last":"Jobs", "grade":1, "subject":"Social Science", "test":"term1", "marks":90}
    { "first":"Steve", "last":"Woz", "grade":1, "subject":"Political Science", "test":"term1", "marks":86}'

V2 API SolrCloud

    curl 'http://localhost:8983/api/collections/techproducts/update/json' -H 'Content-type:application/json' -d '
    { "first":"Steve", "last":"Jobs", "grade":1, "subject":"Social Science", "test":"term1", "marks":90}
    { "first":"Steve", "last":"Woz", "grade":1, "subject":"Political Science", "test":"term1", "marks":86}'

Or even an array of documents, as in this example:

V1 API

    curl 'http://localhost:8983/solr/techproducts/update/json/docs' -H 'Content-type:application/json' -d '[
    {"first":"Steve", "last":"Jobs", "grade":1, "subject":"Computer Science", "test":"term1", "marks":90},
    {"first":"Steve", "last":"Woz", "grade":1, "subject":"Calculus", "test":"term1", "marks":86}]'

V2 API Standalone Solr

    curl 'http://localhost:8983/api/cores/techproducts/update/json' -H 'Content-type:application/json' -d '[
    {"first":"Steve", "last":"Jobs", "grade":1, "subject":"Computer Science", "test":"term1", "marks":90},
    {"first":"Steve", "last":"Woz", "grade":1, "subject":"Calculus", "test":"term1", "marks":86}]'

V2 API SolrCloud

    curl 'http://localhost:8983/api/collections/techproducts/update/json' -H 'Content-type:application/json' -d '[
    {"first":"Steve", "last":"Jobs", "grade":1, "subject":"Computer Science", "test":"term1", "marks":90},
    {"first":"Steve", "last":"Woz", "grade":1, "subject":"Calculus", "test":"term1", "marks":86}]'

Indexing Nested Documents

The following is an example of indexing nested documents:

V1 API

    curl 'http://localhost:8983/solr/techproducts/update/json/docs?split=/|/orgs' \
      -H 'Content-type:application/json' -d '{
      "name": "Joe Smith",
      "phone": 876876687,
      "orgs": [
        { "name": "Microsoft", "city": "Seattle", "zip": 98052 },
        { "name": "Apple", "city": "Cupertino", "zip": 95014 }
      ]
    }'

V2 API Standalone Solr

    curl 'http://localhost:8983/api/cores/techproducts/update/json?split=/|/orgs' \
      -H 'Content-type:application/json' -d '{
      "name": "Joe Smith",
      "phone": 876876687,
      "orgs": [
        { "name": "Microsoft", "city": "Seattle", "zip": 98052 },
        { "name": "Apple", "city": "Cupertino", "zip": 95014 }
      ]
    }'

V2 API SolrCloud

    curl 'http://localhost:8983/api/collections/techproducts/update/json?split=/|/orgs' \
      -H 'Content-type:application/json' -d '{
      "name": "Joe Smith",
      "phone": 876876687,
      "orgs": [
        { "name": "Microsoft", "city": "Seattle", "zip": 98052 },
        { "name": "Apple", "city": "Cupertino", "zip": 95014 }
      ]
    }'

With this example, the documents indexed would be as follows:

    {
      "name":"Joe Smith",
      "phone":876876687,
      "_childDocuments_":[
        { "name":"Microsoft", "city":"Seattle", "zip":98052},
        { "name":"Apple", "city":"Cupertino", "zip":95014}]}

Tips for Custom JSON Indexing

1. Schemaless mode: This handles field creation automatically. The field guessing may not be exactly as you expect, but it works. The best approach is to set up a local server in schemaless mode, index a few sample documents, and then create those fields in your real setup with proper field types before indexing.
2. Pre-created schema: Post your documents to the /update/json/docs endpoint with echo=true. This gives you the list of field names you need to create. Create the fields before you actually index.
3. No schema, only full-text search: If all you need is full-text search on your JSON, set the configuration as given in the Setting JSON Defaults section.

Setting JSON Defaults

It is possible to send any JSON to the /update/json/docs endpoint. The default configuration of the component is as follows:

    srcField: _src_
    mapUniqueKeyOnly: true
    df: text

So, if no parameters are passed, the entire JSON file is indexed to the _src_ field and all the values in the input JSON go to a field named text. If the input JSON contains a value for the uniqueKey, it is stored; if no value can be obtained from the input JSON, a UUID is created and used as the uniqueKey field value.

Alternately, use the Request Parameters feature to set these parameters, as shown earlier in the section Reusing Parameters in Multiple Requests.

V1 API

    curl http://localhost:8983/solr/techproducts/config/params -H 'Content-type:application/json' -d '{
      "set": {
        "full_txt": {
          "srcField": "_src_",
          "mapUniqueKeyOnly" : true,
          "df": "text"
        }
      }
    }'

V2 API Standalone Solr

    curl http://localhost:8983/api/cores/techproducts/config/params -H 'Content-type:application/json' -d '{
      "set": {
        "full_txt": {
          "srcField": "_src_",
          "mapUniqueKeyOnly" : true,
          "df": "text"
        }
      }
    }'

V2 API SolrCloud

    curl http://localhost:8983/api/collections/techproducts/config/params -H 'Content-type:application/json' -d '{
      "set": {
        "full_txt": {
          "srcField": "_src_",
          "mapUniqueKeyOnly" : true,
          "df": "text"
        }
      }
    }'

To use these parameters, send the parameter useParams=full_txt with each request.
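The two steps described above (storing a named parameter set, then referencing it with useParams) could be scripted roughly as follows. This is only a sketch: it assumes a local techproducts core at the V1 URL used in this section, the helper names are hypothetical, and the requests are built but never sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Assumed base URL, taken from the V1 examples in this section.
SOLR = "http://localhost:8983/solr/techproducts"


def build_paramset_request(name, params):
    """Build (but do not send) the POST that stores a named parameter set."""
    body = json.dumps({"set": {name: params}}).encode("utf-8")
    return Request(
        SOLR + "/config/params",
        data=body,
        headers={"Content-type": "application/json"},
        method="POST",
    )


def build_docs_request(docs, paramset):
    """Build (but do not send) a /update/json/docs POST that uses the set."""
    url = SOLR + "/update/json/docs?" + urlencode({"useParams": paramset})
    body = json.dumps(docs).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-type": "application/json"},
        method="POST",
    )


paramset_req = build_paramset_request(
    "full_txt", {"srcField": "_src_", "mapUniqueKeyOnly": True, "df": "text"}
)
docs_req = build_docs_request([{"first": "Steve", "last": "Jobs"}], "full_txt")
print(paramset_req.full_url)
print(docs_req.full_url)
```

To actually send either request against a running server, pass it to urllib.request.urlopen.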
Uploading Data with Solr Cell using Apache Tika

Solr uses code from the Apache Tika project to provide a framework for incorporating many different file-format parsers, such as Apache PDFBox and Apache POI, into Solr itself. Working with this framework, Solr’s ExtractingRequestHandler can use Tika to support uploading binary files, including files in popular formats such as Word and PDF, for data extraction and indexing.

When this framework was under development, it was called the Solr Content Extraction Library or CEL; from that abbreviation came this framework’s name: Solr Cell.

If you want to supply your own ContentHandler for Solr to use, you can extend the ExtractingRequestHandler and override the createFactory() method. This factory is responsible for constructing the SolrContentHandler that interacts with Tika, and allows literals to override Tika-parsed values. Set the parameter literalsOverride, which normally defaults to true, to false to append Tika-parsed values to literal values.

Key Solr Cell Concepts

When using the Solr Cell framework, it is helpful to keep the following in mind:

• Tika will automatically attempt to determine the input document type (Word, PDF, HTML) and extract the content appropriately. If you like, you can explicitly specify a MIME type for Tika with the stream.type parameter.
• Tika works by producing an XHTML stream that it feeds to a SAX ContentHandler. SAX is a common interface implemented for many different XML parsers. For more information, see http://www.saxproject.org/quickstart.html.
• Solr then responds to Tika’s SAX events and creates the fields to index.
• Tika produces metadata such as Title, Subject, and Author according to specifications such as Dublin Core. See http://tika.apache.org/1.16/formats.html for the file types supported.
• Tika adds all the extracted text to the content field.
• You can map Tika’s metadata fields to Solr fields.
• You can pass in literals for field values. Literals will override Tika-parsed values, including fields in the Tika metadata object, the Tika content field, and any "captured content" fields.
• You can apply an XPath expression to the Tika XHTML to restrict the content that is produced.

While Apache Tika is quite powerful, it is not perfect and fails on some files. PDF files are particularly problematic, mostly due to the PDF format itself. If processing a file fails, the ExtractingRequestHandler has no secondary mechanism to try to extract some text from the file; it will throw an exception and fail.

Trying out Tika with the Solr techproducts Example

You can try out the Tika framework using the techproducts example included in Solr. Start the example:

    bin/solr -e techproducts

You can now use curl to send a sample PDF file via HTTP POST:

    curl 'http://localhost:8983/solr/techproducts/update/extract?literal.id=doc1&commit=true' -F "myfile=@example/exampledocs/solr-word.pdf"

The URL above calls the Extracting Request Handler, uploads the file solr-word.pdf and assigns it the unique ID doc1. Here’s a closer look at the components of this command:

• The literal.id=doc1 parameter provides the necessary unique ID for the document being indexed.
• The commit=true parameter causes Solr to perform a commit after indexing the document, making it immediately searchable. For optimum performance when loading many documents, don’t call the commit command until you are done.
• The -F flag instructs curl to POST data using the Content-Type multipart/form-data and supports the uploading of binary files. The @ symbol instructs curl to upload the attached file.
• The argument myfile=@example/exampledocs/solr-word.pdf needs a valid path, which can be absolute or relative.
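For readers who want to see what the -F flag amounts to on the wire, the following sketch assembles an equivalent multipart/form-data request by hand. It is illustrative only: the helper name and the placeholder file bytes are assumptions, the part is labelled application/octet-stream rather than a detected type, and the request is built but not sent.

```python
import uuid
from urllib.parse import urlencode
from urllib.request import Request

# URL of the Extracting Request Handler, as used in the curl example above.
EXTRACT_URL = "http://localhost:8983/solr/techproducts/update/extract"


def build_extract_request(doc_id, filename, file_bytes, commit=True):
    """Build (but do not send) a multipart/form-data POST for Solr Cell."""
    params = {"literal.id": doc_id, "commit": "true" if commit else "false"}
    boundary = uuid.uuid4().hex  # random marker separating the form parts
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="myfile"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return Request(
        EXTRACT_URL + "?" + urlencode(params),
        data=head + file_bytes + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


# Placeholder bytes stand in for the contents of solr-word.pdf.
req = build_extract_request("doc1", "solr-word.pdf", b"%PDF-1.4 placeholder bytes")
print(req.full_url)
```

In practice the bytes would come from reading the file in binary mode, and the request would be sent with urllib.request.urlopen.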
You can also use bin/post to send a PDF file into Solr (without the params, the literal.id parameter would be set to the absolute path to the file):

    bin/post -c techproducts example/exampledocs/solr-word.pdf -params "literal.id=a"

Now you should be able to execute a query and find that document. You can make a request like http://localhost:8983/solr/techproducts/select?q=pdf.

You may notice that although the content of the sample document has been indexed and stored, there are not a lot of metadata fields associated with this document. This is because unknown fields are ignored according to the default parameters configured for the /update/extract handler in solrconfig.xml, and this behavior can be easily changed or overridden. For example, to store and see all metadata and content, execute the following:

    bin/post -c techproducts example/exampledocs/solr-word.pdf -params "literal.id=doc1&uprefix=attr_"

In this command, the uprefix=attr_ parameter causes all generated fields that aren’t defined in the schema to be prefixed with attr_, which is a dynamic field that is stored and indexed.

This command allows you to query the document using an attribute, as in: http://localhost:8983/solr/techproducts/select?q=attr_meta:microsoft.

Solr Cell Input Parameters

The table below describes the parameters accepted by the Extracting Request Handler.

capture
Captures XHTML elements with the specified name for a supplementary addition to the Solr document. This parameter can be useful for copying chunks of the XHTML into a separate field. For instance, it could be used to grab paragraphs (<p>) and index them into a separate field. Note that content is still also captured into the overall "content" field.

captureAttr
Indexes attributes of the Tika XHTML elements into separate fields, named after the element. If set to true, for example, when extracting from HTML, Tika can return the href attributes in <a> tags as fields named "a". See the examples below.

commitWithin
Add the document within the specified number of milliseconds.

date.formats
Defines the date format patterns to identify in the documents.

defaultField
If the uprefix parameter (see below) is not specified and a field cannot be determined, the default field will be used.

extractOnly
Default is false. If true, returns the extracted content from Tika without indexing the document. This literally includes the extracted XHTML as a string in the response. When viewing manually, it may be useful to use a response format other than XML to aid in viewing the embedded XHTML tags. For an example, see http://wiki.apache.org/solr/TikaExtractOnlyExampleOutput.

extractFormat
The default is xml, but the other option is text. Controls the serialization format of the extracted content. The xml format is actually XHTML, the same format that results from passing the -x command to the Tika command line application, while the text format is like that produced by Tika’s -t command. This parameter is valid only if extractOnly is set to true.

fmap.source_field
Maps (moves) one field name to another. The source_field must be a field in incoming documents, and the value is the Solr field to map to. Example: fmap.content=text causes the data in the content field generated by Tika to be moved to Solr’s text field.

ignoreTikaException
If true, exceptions found during processing will be skipped. Any metadata available, however, will be indexed.

literal.fieldname
Populates a field with the name supplied with the specified value for each document. The data can be multivalued if the field is multivalued.

literalsOverride
If true (the default), literal field values will override other values with the same field name. If false, literal values defined with literal.fieldname will be appended to data already in the fields extracted from Tika. If setting literalsOverride to false, the field must be multivalued.

lowernames
Values are true or false. If true, all field names will be mapped to lowercase with underscores, if needed. For example, "Content-Type" would be mapped to "content_type".

multipartUploadLimitInKB
Useful if uploading very large documents, this defines the KB size of documents to allow.

passwordsFile
Defines a file path and name for a file of file name to password mappings.

resource.name
Specifies the optional name of the file. Tika can use it as a hint for detecting a file’s MIME type.

resource.password
Defines a password to use for a password-protected PDF or OOXML file.

tika.config
Defines a file path and name to a customized Tika configuration file. This is only required if you have customized your Tika implementation.

uprefix
Prefixes all fields that are not defined in the schema with the given prefix. This is very useful when combined with dynamic field definitions. Example: uprefix=ignored_ would effectively ignore all unknown fields generated by Tika, given that the example schema contains a dynamic field definition for ignored_*.

xpath
When extracting, only return Tika XHTML content that satisfies the given XPath expression. See http://tika.apache.org/1.16/index.html for details on the format of Tika XHTML. See also http://wiki.apache.org/solr/TikaExtractOnlyExampleOutput.

Order of Operations

Here is the order in which the Solr Cell framework, using the Extracting Request Handler and Tika, processes its input.

1. Tika generates fields or passes them in as literals specified by literal.<fieldname>=<value>.
If literalsOverride=false, literals will be appended as multi-value to the Tika-generated field.
2. If lowernames=true, Tika maps fields to lowercase.
3. Tika applies the mapping rules specified by fmap.source=target parameters.
4. If uprefix is specified, any unknown field names are prefixed with that value; otherwise, if defaultField is specified, any unknown fields are copied to the default field.

Configuring the Solr ExtractingRequestHandler

If you are not working with the supplied sample_techproducts_configs or _default config set, you must configure your own solrconfig.xml to know about the JARs containing the ExtractingRequestHandler and its dependencies.

You can then configure the ExtractingRequestHandler in solrconfig.xml, with these values:

    fmap.Last-Modified: last_modified
    uprefix: ignored_
    tika.config: /my/path/to/tika.config
    date.formats: yyyy-MM-dd
    parseContext.config: parseContext.xml

In the defaults section, we are mapping Tika’s Last-Modified Metadata attribute to a field named last_modified. We are also telling it to ignore undeclared fields. These are all overridden parameters.

The tika.config entry points to a file containing a Tika configuration. The date.formats entry allows you to specify various java.text.SimpleDateFormat date formats to use when transforming extracted input to a Date. Solr comes configured with the following date formats (see the DateUtil in Solr):

• yyyy-MM-dd'T'HH:mm:ss'Z'
• yyyy-MM-dd'T'HH:mm:ss
• yyyy-MM-dd
• yyyy-MM-dd hh:mm:ss
• yyyy-MM-dd HH:mm:ss
• EEE MMM d hh:mm:ss z yyyy
• EEE, dd MMM yyyy HH:mm:ss zzz
• EEEE, dd-MMM-yy HH:mm:ss zzz
• EEE MMM d HH:mm:ss yyyy

Parser-Specific Properties

Parsers used by Tika may have specific properties to govern how data is extracted. For instance, when using the Tika library from a Java program, the PDFParserConfig class has a method setSortByPosition(boolean) that can extract vertically oriented text.
To access that method via configuration with the ExtractingRequestHandler, one can add the parseContext.config property to the solrconfig.xml file (see above) and then set properties in Tika’s PDFParserConfig. Consult the Tika Java API documentation for configuration parameters that can be set for any particular parsers that require this level of control.

Multi-Core Configuration

For a multi-core configuration, you can specify sharedLib='lib' in the <solr/> section of solr.xml and place the necessary jar files there.

For more information about Solr cores, see The Well-Configured Solr Instance.

Indexing Encrypted Documents with the ExtractingRequestHandler

The ExtractingRequestHandler will decrypt encrypted files and index their content if you supply a password in either resource.password on the request, or in a passwordsFile file.

In the case of passwordsFile, the file supplied must be formatted so there is one line per rule. Each rule contains a file name regular expression, followed by "=", then the password in clear-text. Because the passwords are in clear-text, the file should have strict access restrictions.

    # This is a comment
    myFileName = myPassword
    .*\.docx$ = myWordPassword
    .*\.pdf$ = myPdfPassword

Solr Cell Examples

Metadata Created by Tika

As mentioned before, Tika produces metadata about the document. Metadata describes different aspects of a document, such as the author’s name, the number of pages, the file size, and so on. The metadata produced depends on the type of document submitted. For instance, PDFs have different metadata than Word documents do.

In addition to Tika’s metadata, Solr adds the following metadata (defined in ExtractingMetadataConstants):

stream_name
The name of the Content Stream as uploaded to Solr.
Depending on how the file is uploaded, this may or may not be set.

stream_source_info
Any source info about the stream. (See the section on Content Streams later in this section.)

stream_size
The size of the stream in bytes.

stream_content_type
The content type of the stream, if available.

We recommend that you try using the extractOnly option to discover which values Solr is setting for these metadata elements.

Examples of Uploads Using the Extracting Request Handler

Capture and Mapping

The command below captures <div> tags separately, and then maps all the instances of that field to a dynamic field named foo_t.

    bin/post -c techproducts example/exampledocs/sample.html -params "literal.id=doc2&captureAttr=true&defaultField=_text_&fmap.div=foo_t&capture=div"

Using Literals to Define Your Own Metadata

To add in your own metadata, pass in the literal parameter along with the file:

    bin/post -c techproducts -params "literal.id=doc4&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&literal.blah_s=Bah" example/exampledocs/sample.html

XPath Expressions

The example below passes in an XPath expression to restrict the XHTML returned by Tika:

    bin/post -c techproducts -params "literal.id=doc5&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&xpath=/xhtml:html/xhtml:body/xhtml:div//node()" example/exampledocs/sample.html

Extracting Data without Indexing It

Solr allows you to extract data without indexing. You might want to do this if you’re using Solr solely as an extraction server or if you’re interested in testing Solr extraction.

The example below sets the extractOnly=true parameter to extract data without indexing it.

    curl "http://localhost:8983/solr/techproducts/update/extract?extractOnly=true" --data-binary @example/exampledocs/sample.html -H 'Content-type:text/html'

The output includes XML generated by Tika (and further escaped by Solr’s XML response). The command below uses a different output format to make it more readable (-out yes instructs the tool to echo Solr’s output to the console):

    bin/post -c techproducts -params "extractOnly=true&wt=ruby&indent=true" -out yes example/exampledocs/sample.html

Sending Documents to Solr with a POST

The example below streams the file as the body of the POST, which therefore does not give Solr any information about the name of the file.

    curl "http://localhost:8983/solr/techproducts/update/extract?literal.id=doc6&defaultField=text&commit=true" --data-binary @example/exampledocs/sample.html -H 'Content-type:text/html'

Sending Documents to Solr with Solr Cell and SolrJ

SolrJ is a Java client that you can use to add documents to the index, update the index, or query the index. You’ll find more information on SolrJ in Client APIs.

Here’s an example of using Solr Cell and SolrJ to add documents to a Solr index.

First, let’s use SolrJ to create a new SolrClient, then we’ll construct a request containing a ContentStream (essentially a wrapper around a file) and send it to Solr:

    import java.io.File;
    import java.io.IOException;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
    import org.apache.solr.common.util.NamedList;
    import org.apache.solr.handler.extraction.ExtractingParams;

    public class SolrCellRequestDemo {
      public static void main (String[] args) throws IOException, SolrServerException {
        SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/my_collection").build();
        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        // The second argument is the content type of the file being sent.
        req.addFile(new File("my-file.pdf"), "application/pdf");
        req.setParam(ExtractingParams.EXTRACT_ONLY, "true");
        NamedList<Object> result = client.request(req);
        System.out.println("Result: " + result);
      }
    }

This operation streams the file my-file.pdf to the /update/extract handler of my_collection. Because the sample sets extractOnly to true, Solr returns the extracted content rather than indexing it; remove that parameter to index the file.

The sample code above calls the extract command, but you can easily substitute other commands that are supported by Solr Cell. The key class to use is the ContentStreamUpdateRequest, which makes sure the ContentStreams are set properly. SolrJ takes care of the rest.

Note that the ContentStreamUpdateRequest is not just specific to Solr Cell. You can send CSV to the CSV Update handler and to any other Request Handler that works with Content Streams for updates.
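The Order of Operations described earlier in this chapter (lowernames, then fmap, then uprefix or defaultField) can be easier to reason about as a small function. The sketch below is a simplified model, not Solr's code: the function name is hypothetical, the schema is reduced to a plain set of known field names (dynamic fields are not modelled), and it assumes that lowernames replaces every non-alphanumeric character with an underscore.

```python
import re


def resolve_field_name(name, schema_fields, lowernames=False, fmap=None,
                       uprefix=None, default_field=None):
    """Return the Solr field a Tika-generated field name would end up in."""
    fmap = fmap or {}
    if lowernames:
        # Assumption: lowercase, with non-alphanumeric characters as underscores.
        name = re.sub(r"[^a-z0-9]", "_", name.lower())
    # Apply the fmap.source=target mapping rules.
    name = fmap.get(name, name)
    if name in schema_fields:
        return name
    # Unknown field: uprefix takes precedence over defaultField.
    if uprefix is not None:
        return uprefix + name
    if default_field is not None:
        return default_field
    return name


print(resolve_field_name("Content-Type", {"content_type"}, lowernames=True))
print(resolve_field_name("content", {"text"}, fmap={"content": "text"}))
print(resolve_field_name("Author", set(), lowernames=True, uprefix="attr_"))
```

Keeping the steps in one function makes the precedence explicit: a field is only prefixed or redirected to the default field after lowercasing and mapping have failed to produce a name the schema knows.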

Uploading Structured Data Store Data with the Data Import Handler

Many search applications store the content to be indexed in a structured data store, such as a relational database. The Data Import Handler (DIH) provides a mechanism for importing content from a data store and indexing it.

In addition to relational databases, DIH can index content from HTTP-based data sources such as RSS and ATOM feeds, e-mail repositories, and structured XML where an XPath processor is used to generate fields.

DIH Concepts and Terminology

Descriptions of the Data Import Handler use several familiar terms, such as entity and processor, in specific ways, as explained in the table below.

Datasource
As its name suggests, a datasource defines the location of the data of interest. For a database, it’s a DSN. For an HTTP datasource, it’s the base URL.

Entity
Conceptually, an entity is processed to generate a set of documents, containing multiple fields, which (after optionally being transformed in various ways) are sent to Solr for indexing. For an RDBMS data source, an entity is a view or table, which would be processed by one or more SQL statements to generate a set of rows (documents) with one or more columns (fields).

Processor
An entity processor does the work of extracting content from a data source, transforming it, and adding it to the index. Custom entity processors can be written to extend or replace the ones supplied.

Transformer
Each set of fields fetched by the entity may optionally be transformed. This process can modify the fields, create new fields, or generate multiple rows/documents from a single row. There are several built-in transformers in the DIH, which perform functions such as modifying dates and stripping HTML. It is possible to write custom transformers using the publicly available interface.

Solr’s DIH Examples

The example/example-DIH directory contains several collections to demonstrate many of the features of the data import handler. These are available with the dih example from the Solr Control Script:

    bin/solr -e dih

This launches a standalone Solr instance with several collections that correspond to detailed examples. The available examples are atom, db, mail, solr, and tika.

All examples in this section assume you are running the DIH example server.

Configuring DIH

Configuring solrconfig.xml for DIH

The Data Import Handler has to be registered in solrconfig.xml. For example:

    config: /path/to/my/DIHconfigfile.xml

The only required parameter is the config parameter, which specifies the location of the DIH configuration file that contains specifications for the data source, how to fetch data, what data to fetch, and how to process it to generate the Solr documents to be posted to the index.

You can have multiple DIH configuration files. Each file would require a separate definition in the solrconfig.xml file, specifying a path to the file.

Configuring the DIH Configuration File

An annotated configuration file, based on the db collection in the dih example server, is shown below (this file is located in example/example-DIH/solr/db/conf/db-data-config.xml).

This example shows how to extract fields from four tables defining a simple product database. More information about the parameters and options shown here will be described in the sections following.

① The first element is the dataSource, in this case an HSQLDB database. The path to the JDBC driver and the JDBC URL and login credentials are all specified here.
Other permissible attributes include whether or not to autocommit to Solr, the batchsize used in the JDBC connection, and a readOnly flag.

② The password attribute is optional if there is no password set for the DB. Alternately, the password can be encrypted; the section Encrypting a Database Password below describes how to do this.

③ A document element follows, containing multiple entity elements. Note that entity elements can be nested, and this allows the entity relationships in the sample database to be mirrored here, so that we can generate a denormalized Solr record which may include multiple features for one item, for instance.

④ The possible attributes for the entity element are described in later sections. Entity elements may contain one or more field elements, which map the data source field names to Solr fields, and optionally specify per-field transformations. This entity is the root entity.

⑤ This entity is nested and reflects the one-to-many relationship between an item and its multiple features. Note the use of variables; ${item.ID} is the value of the column 'ID' for the current item (item referring to the entity name).

Datasources can still be specified in solrconfig.xml. These must be specified in the defaults section of the handler in solrconfig.xml. However, these are not parsed until the main configuration is loaded.

The entire configuration itself can be passed as a request parameter using the dataConfig parameter rather than using a file. When configuration errors are encountered, the error message is returned in XML format.

A reload-config command is also supported, which is useful for validating a new configuration file, or if you want to specify a file, load it, and not have it reloaded again on import. If there is an XML mistake in the configuration, a user-friendly message is returned in XML format.
You can then fix the problem and do a reload-config.

You can also view the DIH configuration in the Solr Admin UI from the Dataimport Screen. It includes an interface to import content.

DIH Request Parameters

Request parameters can be substituted in configuration with the placeholder ${dataimporter.request.paramname}; for example, a request parameter named jdbcurl is referenced as ${dataimporter.request.jdbcurl}.

These parameters can then be passed to the full-import command or defined in the <defaults> section in solrconfig.xml. This example shows the parameters with the full-import command:

    http://localhost:8983/solr/dih/dataimport?command=full-import&jdbcurl=jdbc:hsqldb:./example-DIH/hsqldb/ex&jdbcuser=sa&jdbcpassword=secret

Encrypting a Database Password

The database password can be encrypted if necessary to avoid plaintext passwords being exposed in unsecured files. To do this, follow these steps:

1. In a terminal window, run the command openssl enc -aes-128-cbc -a -salt -in pwd.txt. This assumes the password is in a file named pwd.txt. If you don’t have the password in this file yet, you can create it with echo "mypassword" > pwd.txt.
   a. The openssl session will ask for a password to use for the decryption. You will later store this password in the file referenced by the encryptKeyFile parameter in data-config.xml.
   b. The output of the process will be a long string such as U2FsdGVkX18QMjY0yfCqlfBMvAB4d3XkwY96L7gfO2o=. This will be the password you put in your data-config.xml file.
2. Save the password you used as the decryption password in the previous step to a file, and determine the location of the file on the Solr server. You could use a command such as echo myencrypfilepwd > /location/of/encryptionkey. Replace "myencrypfilepwd" with the password you used while generating the key.
3. If the file is not yet on the Solr server, move it there. Also make sure the encryption key file permissions do not allow it to be read by unauthorized users.
The chmod 0600 command should set the permissions sufficiently.
4. In your data-config.xml, you’ll add the password and encryptKeyFile parameters to the configuration, as in this example:

The parameters available are:

dateFormat
A java.text.SimpleDateFormat to use when converting the date to text. The default is yyyy-MM-dd HH:mm:ss.

type
The implementation class. Use SimplePropertiesWriter for non-SolrCloud installations. If using SolrCloud, use ZKPropertiesWriter. If this is not specified, it will default to the appropriate class depending on whether SolrCloud mode is enabled.

directory
Used with the SimplePropertiesWriter only. The directory for the properties file. If not specified, the default is conf.

filename
Used with the SimplePropertiesWriter only. The name of the properties file. If not specified, the default is the requestHandler name (as defined in solrconfig.xml) with ".properties" appended (such as dataimport.properties).

locale
The locale. If not defined, the ROOT locale is used. It must be specified as language-country (BCP 47 language tag). For example, en-US.

Data Sources

A data source specifies the origin of data and its type. Somewhat confusingly, some data sources are configured within the associated entity processor. Data sources can also be specified in solrconfig.xml, which is useful when you have multiple environments (for example, development, QA, and production) differing only in their data sources.

You can create a custom data source by writing a class that extends org.apache.solr.handler.dataimport.DataSource.

The mandatory attributes for a data source definition are its name and type. The name identifies the data source to an Entity element.

The types of data sources available are described below.

ContentStreamDataSource

This takes the POST data as the data source.
This can be used with any EntityProcessor that uses a DataSource.

FieldReaderDataSource

This can be used where a database field contains XML which you wish to process using the XPathEntityProcessor. You would set up a configuration with both JDBC and FieldReader data sources, and two entities.

The FieldReaderDataSource can take an encoding parameter, which will default to "UTF-8" if not specified.

FileDataSource

This can be used like a URLDataSource, but is used to fetch content from files on disk. The only difference from URLDataSource, when accessing disk files, is how a pathname is specified.

This data source accepts these optional attributes.

basePath
The base path relative to which the value is evaluated if it is not absolute.

encoding
Defines the character encoding to use. If not defined, UTF-8 is used.

JdbcDataSource

This is the default datasource. It’s used with the SqlEntityProcessor. See the example in the FieldReaderDataSource section for details on configuration. JdbcDataSource supports at least the following attributes:

driver, url, user, password, encryptKeyFile
Usual JDBC connection properties.

batchSize
Passed to Statement#setFetchSize; the default value is 500. The MySQL driver doesn’t honor fetchSize and pulls the whole resultSet, which often leads to an OutOfMemoryError. In this case, set batchSize=-1, which passes setFetchSize(Integer.MIN_VALUE) and switches the result set to pull rows one by one.

All of these attributes substitute properties via ${placeholders}.

URLDataSource

This data source is often used with XPathEntityProcessor to fetch content from an underlying file:// or http:// location.
Here’s an example:

The URLDataSource type accepts these optional parameters:

baseURL
Specifies a new baseURL for pathnames. You can use this to specify host/port changes between Dev/QA/Prod environments. Using this attribute isolates the changes to be made to the solrconfig.xml.

connectionTimeout
Specifies the length of time in milliseconds after which the connection should time out. The default value is 5000ms.

encoding
By default the encoding in the response header is used. You can use this property to override the default encoding.

readTimeout
Specifies the length of time in milliseconds after which a read operation should time out. The default value is 10000ms.

Entity Processors

Entity processors extract data, transform it, and add it to a Solr index. Examples of entities include views or tables in a data store.

Each processor has its own set of attributes, described in its own section below. In addition, there are several attributes common to all entities which may be specified:

dataSource
The name of a data source. If there are multiple data sources defined, use this attribute with the name of the data source for this entity.

name
Required. The unique name used to identify an entity.

pk
The primary key for the entity. It is optional, and required only when using delta-imports. It has no relation to the uniqueKey defined in schema.xml but they can both be the same. This attribute is mandatory if you do delta-imports and then refer to the column name in ${dataimporter.delta.<column-name>} which is used as the primary key.

processor
Default is SqlEntityProcessor. Required only if the datasource is not RDBMS.

onError
Defines what to do if an error is encountered. Permissible values are:

abort
Stops the import.

skip
Skips the current document.

continue
Ignores the error and processing continues.
preImportDeleteQuery
Before a full-import command, use this query to clean up the index instead of using *:*. This is honored only on an entity that is an immediate sub-child of <document>.

postImportDeleteQuery
Similar to preImportDeleteQuery, but it executes after the import has completed.

rootEntity
By default the entities immediately under <document> are root entities. If this attribute is set to false, the entity directly falling under that entity will be treated as the root entity (and so on). For every row returned by the root entity, a document is created in Solr.

transformer
Optional. One or more transformers to be applied on this entity.

cacheImpl
Optional. A class (which must implement DIHCache) to use for caching this entity when doing lookups from an entity which wraps it. The provided implementation is SortedMapBackedCache.

cacheKey
The name of a property of this entity to use as a cache key if cacheImpl is specified.

cacheLookup
An entity + property name that will be used to look up cached instances of this entity if cacheImpl is specified.

where
An alternative way to specify cacheKey and cacheLookup concatenated with '='. For example, where="CODE=People.COUNTRY_CODE" is equivalent to cacheKey="CODE" cacheLookup="People.COUNTRY_CODE"

child="true"
Enables indexing document blocks, aka Nested Child Documents, for searching with Block Join Query Parsers. It can only be specified on the <entity> element under another root entity. It switches from the default behavior (merging field values) to nesting documents as children documents.

Note: the parent <entity> should add a field which is used as a parent filter at query time.

join="zipper"
Enables merge join, aka the "zipper" algorithm, for joining parent and child entities without a cache. It should be specified on the child (nested) <entity>. It implies that parent and child queries return results ordered by keys, otherwise it throws an exception.
Keys should be specified either with the where attribute or with cacheKey and cacheLookup.

Entity Caching

Caching of entities in DIH is provided to avoid repeated lookups for the same entities. The default SortedMapBackedCache is a HashMap where a key is a field in the row and the value is the group of rows that share that key.

In the example below, each manufacturer entity is cached using the id property as a cache key. Cache lookups will be performed for each product entity based on the product’s manu property. When the cache has no data for a particular key, the query is run and the cache is populated.

The SQL Entity Processor

The SqlEntityProcessor is the default processor. The associated JdbcDataSource should be a JDBC URL.

The entity attributes specific to this processor are shown in the table below. These are in addition to the attributes common to all entity processors described above.

query
Required. The SQL query used to select rows.

deltaQuery
SQL query used if the operation is delta-import. This query selects the primary keys of the rows which will be part of the delta-update. The pks will be available to the deltaImportQuery through the variable ${dataimporter.delta.<column-name>}.

parentDeltaQuery
SQL query used if the operation is delta-import.

deletedPkQuery
SQL query used if the operation is delta-import.

deltaImportQuery
SQL query used if the operation is delta-import. If this is not present, DIH tries to construct the import query by (after identifying the delta) modifying the 'query' (this is error prone). There is a namespace ${dataimporter.delta.<column-name>} which can be used in this query. For example, select * from tbl where id=${dataimporter.delta.id}.

The XPathEntityProcessor

This processor is used when indexing XML formatted data. The data source is typically URLDataSource or FileDataSource.
XPath can also be used with the FileListEntityProcessor described below, to generate a document from each file.

The entity attributes unique to this processor are shown below. These are in addition to the attributes common to all entity processors described above.

processor
Required. Must be set to XPathEntityProcessor.

url
Required. The HTTP URL or file location.

stream
Optional: Set to true for a large file or download.

forEach
Required unless you define useSolrAddSchema. The XPath expression which demarcates each record. This will be used to set up the processing loop.

xsl
Optional: Its value (a URL or filesystem path) is the name of a resource used as a preprocessor for applying the XSL transformation.

useSolrAddSchema
Set this to true if the content is in the form of the standard Solr update XML schema.

Each <field> element in the entity can have the following attributes as well as the default ones.

xpath
Required. The XPath expression which will extract the content from the record for this field. Only a subset of XPath syntax is supported.

commonField
Optional. If true, then when this field is encountered in a record it will be copied to future records when creating a Solr document.

flatten
Optional. If set to true, then any children text nodes are collected to form the value of a field. The default value is false, meaning that if there are any sub-elements of the node pointed to by the XPath expression, they will be quietly omitted.
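The difference that flatten makes can be pictured with a small Python sketch. This is only an analogy built on the standard library's xml.etree.ElementTree, not DIH code; the sample XML and the helper name field_value are made up for illustration.

```python
import xml.etree.ElementTree as ET


def field_value(node, flatten=False):
    """Hypothetical helper: text of `node`, with or without descendant text."""
    if flatten:
        # Collect every text node below `node`, in document order.
        return "".join(node.itertext())
    # Keep only text that sits directly inside `node`; sub-elements are skipped.
    own_text = [node.text or ""] + [child.tail or "" for child in node]
    return "".join(own_text)


record = ET.fromstring("<summary>Solr <b>7.3</b> released</summary>")
print(repr(field_value(record)))                # 'Solr  released'
print(repr(field_value(record, flatten=True)))  # 'Solr 7.3 released'
```

With flatten left at its default, the text inside the nested element is dropped, which is easy to miss when the source XML contains inline markup.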
Here is an example from the atom collection in the dih example (data-config file found at example/exampleDIH/solr/atom/conf/atom-data-config.xml):

xpath="/feed/entry/author/name"/>
xpath="/feed/entry/category/@term"/>
xpath="/feed/entry/link[@rel='alternate']/@href"/>

The MailEntityProcessor

The MailEntityProcessor uses the Java Mail API to index email messages using the IMAP protocol.

The MailEntityProcessor works by connecting to a specified mailbox using a username and password, fetching the email headers for each message, and then fetching the full email contents to construct a document (one document for each mail message).

The entity attributes unique to the MailEntityProcessor are shown below. These are in addition to the attributes common to all entity processors described above.

processor
Required. Must be set to MailEntityProcessor.

user
Required. Username for authenticating to the IMAP server; this is typically the email address of the mailbox owner.

password
Required. Password for authenticating to the IMAP server.

host
Required. The IMAP server to connect to.

protocol
Required. The IMAP protocol to use. Valid values are: imap, imaps, gimap, and gimaps.

fetchMailsSince
Optional. Date/time used to set a filter to import messages that occur after the specified date; expected format is: yyyy-MM-dd HH:mm:ss.

folders
Required. Comma-delimited list of folder names to pull messages from, such as "inbox".

recurse
Optional. Default is true. Flag to indicate if the processor should recurse all child folders when looking for messages to import.

include
Optional. Comma-delimited list of folder patterns to include when processing folders (can be a literal value or regular expression).

exclude
Optional. Comma-delimited list of folder patterns to exclude when processing folders (can be a literal value or regular expression).
Excluded folder patterns take precedence over include folder patterns.

processAttachement or processAttachments
Optional. Default is true. Use Tika to process message attachments.

includeContent
Optional. Default is true. Include the message body when constructing Solr documents for indexing.

Here is an example from the mail collection of the dih example (data-config file found at example/exampleDIH/mail/conf/mail-data-config.xml):

Importing New Emails Only

After running a full import, the MailEntityProcessor keeps track of the timestamp of the previous import so that subsequent imports can use the fetchMailsSince filter to only pull new messages from the mail server. This occurs automatically using the DataImportHandler dataimport.properties file (stored in conf).

For instance, if you set fetchMailsSince="2014-08-22 00:00:00" in your mail-data-config.xml, then all mail messages that occur after this date will be imported on the first run of the importer. Subsequent imports will use the date of the previous import as the fetchMailsSince filter, so that only new emails since the last import are indexed each time.

GMail Extensions

When connecting to a GMail account, you can improve the efficiency of the MailEntityProcessor by setting the protocol to gimap or gimaps. This allows the processor to send the fetchMailsSince filter to the GMail server to have the date filter applied on the server, which means the processor only receives new messages from the server. However, GMail only supports date granularity, so the server-side filter may return previously seen messages if run more than once a day.

The TikaEntityProcessor

The TikaEntityProcessor uses Apache Tika to process incoming documents. This is similar to Uploading Data with Solr Cell using Apache Tika, but using DataImportHandler options instead.
The parameters for this processor are described in the table below. These are in addition to the attributes common to all entity processors described above.

dataSource
This parameter defines the data source and an optional name which can be referred to in later parts of the configuration if needed. This is the same dataSource explained in the description of general entity processor attributes above.

The available data source types for this processor are:

• BinURLDataSource: used for HTTP resources, but can also be used for files.
• BinContentStreamDataSource: used for uploading content as a stream.
• BinFileDataSource: used for content on the local filesystem.

url
Required. The path to the source file(s), as a file path or a traditional internet URL.

htmlMapper
Optional. Allows control of how Tika parses HTML. If this parameter is defined, it must be either default or identity; if it is absent, "default" is assumed.

The "default" mapper strips much of the HTML from documents while the "identity" mapper passes all HTML as-is with no modifications.

format
The output format. The options are text, xml, html or none. The default is "text" if not defined. The format "none" can be used if metadata only should be indexed and not the body of the documents.

parser
Optional. The default parser is org.apache.tika.parser.AutoDetectParser. If a custom or other parser should be used, it should be entered as a fully-qualified name of the class and path.

fields
The list of fields from the input documents and how they should be mapped to Solr fields. If the attribute meta is defined as "true", the field will be obtained from the metadata of the document and not parsed from the body of the main text.

extractEmbedded
Instructs the TikaEntityProcessor to extract embedded documents or attachments when true.
If false, embedded documents and attachments will be ignored.

onError
By default, the TikaEntityProcessor will stop processing documents if it finds one that generates an error. If you define onError to "skip", the TikaEntityProcessor will instead skip documents that fail processing and log a message that the document was skipped.

Here is an example from the tika collection of the dih example (data-config file found in example/exampleDIH/tika/conf/tika-data-config.xml):

The FileListEntityProcessor

This processor is basically a wrapper, and is designed to generate a set of files satisfying conditions specified in the attributes which can then be passed to another processor, such as the XPathEntityProcessor.

The entity information for this processor would be nested within the FileListEntity entry. It generates five implicit fields: fileAbsolutePath, fileDir, fileSize, fileLastModified, and file, which can be used in the nested processor. This processor does not use a data source.

The attributes specific to this processor are described in the table below:

fileName
Required. A regular expression pattern to identify files to be included.

basedir
Required. The base directory (absolute path).

recursive
Whether to search directories recursively. Default is 'false'.

excludes
A regular expression pattern to identify files which will be excluded.

newerThan
A date in the format yyyy-MM-dd HH:mm:ss or a date math expression (NOW - 2YEARS).

olderThan
A date, using the same formats as newerThan.

rootEntity
This should be set to false. This ensures that each row (filepath) emitted by this processor is considered to be a document.

dataSource
Must be set to null.
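To make the attribute list above more concrete, here is a rough Python model of what such a file-listing step produces. It is a sketch rather than the processor's implementation: the function name list_files is hypothetical, newerThan and olderThan are left out, and it assumes the patterns are searched for within the bare file name.

```python
import os
import re
import tempfile


def list_files(basedir, file_name, excludes=None, recursive=False):
    """Yield one row per matching file, keyed like the five implicit fields."""
    include = re.compile(file_name)
    exclude = re.compile(excludes) if excludes else None
    for dirpath, dirnames, filenames in os.walk(basedir):
        if not recursive:
            dirnames[:] = []  # do not descend into subdirectories
        for name in sorted(filenames):
            # Assumption: patterns are searched for within the bare file name.
            if not include.search(name):
                continue
            if exclude and exclude.search(name):
                continue
            path = os.path.join(dirpath, name)
            yield {
                "fileAbsolutePath": os.path.abspath(path),
                "fileDir": os.path.abspath(dirpath),
                "fileSize": os.path.getsize(path),
                "fileLastModified": os.path.getmtime(path),
                "file": name,
            }


# Build a throwaway directory tree and list it both ways.
with tempfile.TemporaryDirectory() as base:
    os.mkdir(os.path.join(base, "sub"))
    for rel in ("a.xml", "b.txt", "a.bak.xml", os.path.join("sub", "c.xml")):
        with open(os.path.join(base, rel), "w") as fh:
            fh.write("<doc/>")
    flat = [row["file"] for row in list_files(base, r"\.xml$", excludes=r"\.bak\.")]
    deep = [row["file"] for row in
            list_files(base, r"\.xml$", excludes=r"\.bak\.", recursive=True)]

print(flat)  # ['a.xml']
print(deep)  # ['a.xml', 'c.xml']
```

Each yielded row plays the role of one file handed to the nested processor, which can then read a field such as fileAbsolutePath.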
The example below shows the combination of the FileListEntityProcessor with another processor which will generate a set of fields from each file found.

LineEntityProcessor

This EntityProcessor reads all content from the data source on a line by line basis and returns a field called rawLine for each line read. The content is not parsed in any way; however, you may add transformers to manipulate the data within the rawLine field, or to create other additional fields.

The lines read can be filtered by two regular expressions specified with the acceptLineRegex and omitLineRegex attributes.

The LineEntityProcessor has the following attributes:

url
A required attribute that specifies the location of the input file in a way that is compatible with the configured data source. If this value is relative and you are using FileDataSource or URLDataSource, it is assumed to be relative to baseLoc.

acceptLineRegex
An optional attribute that, if present, discards any line which does not match the regular expression.

omitLineRegex
An optional attribute that is applied after any acceptLineRegex and that discards any line which matches this regular expression.

For example:

While there are use cases where you might need to create a Solr document for each line read from a file, it is expected that in most cases the lines read by this processor will consist of a pathname, which in turn will be consumed by another entity processor, such as the XPathEntityProcessor.

PlainTextEntityProcessor

This EntityProcessor reads all content from the data source into a single implicit field called plainText. The content is not parsed in any way; however, you may add transformers to manipulate the data within the plainText field as needed, or to create other additional fields.

For example:

Ensure that the dataSource is of type DataSource<Reader> (FileDataSource, URLDataSource).
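As a side note on the LineEntityProcessor filters described earlier in this section, the order in which acceptLineRegex and omitLineRegex are applied can be modelled in a few lines of Python. This is an illustrative sketch, not the processor's code: filter_lines is a hypothetical name, and it assumes a pattern counts as matching when it is found anywhere in the line.

```python
import re


def filter_lines(lines, accept_line_regex=None, omit_line_regex=None):
    """Yield a {'rawLine': ...} row for every line that survives both filters."""
    accept = re.compile(accept_line_regex) if accept_line_regex else None
    omit = re.compile(omit_line_regex) if omit_line_regex else None
    for line in lines:
        # Step 1: the accept filter discards lines that do not match.
        if accept and not accept.search(line):
            continue
        # Step 2: the omit filter runs afterwards and discards lines that match.
        if omit and omit.search(line):
            continue
        yield {"rawLine": line}


paths = ["/data/a.xml", "/data/b.txt", "/data/skip/c.xml"]
rows = list(filter_lines(paths, accept_line_regex=r"\.xml$", omit_line_regex=r"/skip/"))
print(rows)  # [{'rawLine': '/data/a.xml'}]
```

Because the omit filter runs second, a line has to pass the accept pattern and avoid the omit pattern to be emitted.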
SolrEntityProcessor

This EntityProcessor imports data from different Solr instances and cores. The data is retrieved based on a specified filter query. This EntityProcessor is useful in cases where you want to copy your Solr index and modify the data in the target index.

The SolrEntityProcessor can only copy fields that are stored in the source index.

The SolrEntityProcessor supports the following parameters:

url
Required. The URL of the source Solr instance and/or core.

query
Required. The main query to execute on the source index.

fq
Any filter queries to execute on the source index. If more than one filter query is defined, they must be separated by a comma.

rows
The number of rows to return for each iteration. The default is 50 rows.

fl
A comma-separated list of fields to fetch from the source index. Note, these fields must be stored in the source Solr instance.

qt
The search handler to use, if not the default.

wt
The response format to use, either javabin or xml.

timeout
The query timeout in seconds. The default is 5 minutes (300 seconds).

cursorMark="true"
Use this to enable cursor for efficient result set scrolling.

sort="id asc"
This should be used to specify a sort parameter referencing the uniqueKey field of the source Solr instance. See Pagination of Results for details.

Here is a simple example of a SolrEntityProcessor:

Transformers

Transformers manipulate the fields in a document returned by an entity. A transformer can create new fields or modify existing ones. You must tell the entity which transformers your import operation will be using, by adding an attribute containing a comma separated list to the <entity> element.
Specific transformation rules are then added to the attributes of a <field> element, as shown in the examples below. The transformers are applied in the order in which they are specified in the transformer attribute.

The DataImportHandler contains several built-in transformers. You can also write your own custom transformers, as described in the DIHCustomTransformer section of the Solr Wiki. The ScriptTransformer (described below) offers an alternative method for writing your own transformers.

ClobTransformer

You can use the ClobTransformer to create a string out of a CLOB in a database. A CLOB is a character large object: a collection of character data typically stored in a separate location that is referenced in the database.

The ClobTransformer accepts these attributes:

clob
Boolean value to signal if ClobTransformer should process this field or not. If this attribute is omitted, then the corresponding field is not transformed.

sourceColName
The source column to be used as input. If this is absent, the source and target are the same.

Here’s an example of invoking the ClobTransformer.

...

The DateFormatTransformer

This transformer converts dates from one format to another. This would be useful, for example, in a situation where you wanted to convert a field with a fully specified date/time into a less precise date format, for use in faceting.

DateFormatTransformer applies only to fields with a dateTimeFormat attribute. Other fields are not modified.

This transformer recognizes the following attributes:

dateTimeFormat
The format used for parsing this field. This must comply with the syntax of the Java SimpleDateFormat class.

sourceColName
The column on which the dateFormat is to be applied. If this is absent, the source and target are the same.

locale
The locale to use for date transformations. If not defined, the ROOT locale is used.
It must be specified as language-country (BCP 47 language tag). For example, en-US.

Here is example code that returns the date rounded up to the month "2007-JUL":

...

The HTMLStripTransformer

You can use this transformer to strip HTML out of a field.

There is one attribute for this transformer, stripHTML, which is a boolean value (true or false) to signal if the HTMLStripTransformer should process the field or not.

For example:

...

The LogTransformer

You can use this transformer to log data to the console or log files. For example:

....

Unlike other transformers, the LogTransformer does not apply to any field, so the attributes are applied on the entity itself.

The NumberFormatTransformer

Use this transformer to parse a number from a string, converting it into the specified format, and optionally using a different locale.

NumberFormatTransformer will be applied only to fields with an attribute formatStyle.

This transformer recognizes the following attributes:

formatStyle
The format used for parsing this field. The value of the attribute must be one of number, percent, integer, or currency. This uses the semantics of the Java NumberFormat class.

sourceColName
The column on which the NumberFormat is to be applied. If this attribute is absent, the source column and the target column are the same.

locale
The locale to be used for parsing the strings. If not defined, the ROOT locale is used. It must be specified as language-country (BCP 47 language tag). For example, en-US.

For example:

...

The RegexTransformer

The regex transformer helps in extracting or manipulating values from fields (from the source) using Regular Expressions. The actual class name is org.apache.solr.handler.dataimport.RegexTransformer, but as it belongs to the default package the package name can be omitted.
The table below describes the attributes recognized by the regex transformer.

regex
The regular expression that is used to match against the column or sourceColName’s value(s). If replaceWith is absent, each regex group is taken as a value and a list of values is returned.

sourceColName
The column on which the regex is to be applied. If not present, then the source and target are identical.

splitBy
Used to split a string. It returns a list of values. Note, this is a regular expression so it may need to be escaped (e.g., via back-slashes).

groupNames
A comma separated list of field column names, used where the regex contains groups and each group is to be saved to a different field. If some groups are not to be named, leave a space between commas.

replaceWith
Used along with regex. It is equivalent to the method new String(<sourceColVal>).replaceAll(<regex>, <replaceWith>).

Here is an example of configuring the regex transformer:

① In this example, regex and sourceColName are custom attributes used by the transformer.
② The transformer reads the field full_name from the result set and transforms it to two new target fields, firstName and lastName. Even though the query returned only one column, full_name, in the result set, the Solr document gets two extra fields firstName and lastName which are "derived" fields. These new fields are only created if the regexp matches.
③ The emailids field in the table can be a comma-separated value. It ends up producing one or more email IDs, and we expect the mailId to be a multivalued field in Solr.

Note that this transformer can be used to either split a string into tokens based on a splitBy pattern, or to perform a string substitution as per replaceWith, or it can assign groups within a pattern to a list of groupNames.
It decides what to do based upon the above attributes splitBy, replaceWith and groupNames, which are looked for in that order. The first one found is acted upon and other unrelated attributes are ignored.

The ScriptTransformer

The script transformer allows arbitrary transformer functions to be written in any scripting language supported by Java, such as JavaScript, JRuby, Jython, Groovy, or BeanShell. JavaScript is integrated into Java 8; you’ll need to integrate other languages yourself.

Each function you write must accept a row variable (which corresponds to a Java Map, thus permitting get, put, and remove operations). Thus you can modify the value of an existing field or add new fields. The return value of the function is the returned object.

The script is inserted into the DIH configuration file at the top level and is called once for each row.

Here is a simple example.

....

The TemplateTransformer

You can use the template transformer to construct or modify a field value, perhaps using the value of other fields. You can insert extra text into the template.

...

Special Commands for DIH

You can pass special commands to the DIH by adding any of the variables listed below to any row returned by any component:

$skipDoc
Skip the current document; that is, do not add it to Solr. The value can be the string true or false.

$skipRow
Skip the current row. The document will be added with rows from other entities. The value can be the string true or false.

$deleteDocById
Delete a document from Solr with this ID. The value has to be the uniqueKey value of the document.

$deleteDocByQuery
Delete documents from Solr using this query. The value must be a Solr Query.
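The idea of attaching special commands to a row can be illustrated with a toy model in Python. Nothing here is DIH code: mark_row and run_import are hypothetical names, the status column is invented for the demo, and the loop only imitates $skipDoc and $deleteDocById in the simplest possible way.

```python
def mark_row(row):
    """Hypothetical transformer: flag rows with special command variables."""
    if row.get("status") == "draft":
        row["$skipDoc"] = "true"
    elif row.get("status") == "deleted":
        row["$deleteDocById"] = row["id"]
        # Toy-model choice: also skip, so the deleted row is not indexed again.
        row["$skipDoc"] = "true"
    return row


def run_import(rows):
    """Toy import loop that honours the two commands set by mark_row."""
    added, deleted_ids = [], []
    for row in map(mark_row, rows):
        if "$deleteDocById" in row:
            deleted_ids.append(row["$deleteDocById"])
        if row.get("$skipDoc") == "true":
            continue
        # Command variables are instructions, not fields, so they are dropped.
        added.append({k: v for k, v in row.items() if not k.startswith("$")})
    return added, deleted_ids


source_rows = [
    {"id": "1", "status": "live"},
    {"id": "2", "status": "draft"},
    {"id": "3", "status": "deleted"},
]
added, deleted_ids = run_import(source_rows)
print(added)        # [{'id': '1', 'status': 'live'}]
print(deleted_ids)  # ['3']
```

In a real configuration the flags would typically be set by a transformer such as the ScriptTransformer, with the import process acting on them.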
Updating Parts of Documents

Once you have indexed the content you need in your Solr index, you will want to start thinking about your strategy for dealing with changes to those documents. Solr supports three approaches to updating documents that have only partially changed.

The first is atomic updates. This approach allows changing only one or more fields of a document without having to re-index the entire document.

The second approach is known as in-place updates. This approach is similar to atomic updates (it is a subset of atomic updates in some sense), but can be used only for updating single valued, non-indexed and non-stored docValue-based numeric fields.

The third approach is known as optimistic concurrency or optimistic locking. It is a feature of many NoSQL databases, and allows conditionally updating a document based on its version. This approach includes semantics and rules for how to deal with version matches or mismatches.

Atomic Updates (and in-place updates) and Optimistic Concurrency may be used as independent strategies for managing changes to documents, or they may be combined: you can use optimistic concurrency to conditionally apply an atomic update.

Atomic Updates

Solr supports several modifiers that atomically update values of a document. This allows updating only specific fields, which can help speed indexing processes in an environment where speed of index additions is critical to the application.

To use atomic updates, add a modifier to the field that needs to be updated. The content can be updated, added to, or incrementally increased if the field has a numeric type.

set
Set or replace the field value(s) with the specified value(s), or remove the values if 'null' or an empty list is specified as the new value. May be specified as a single value, or as a list for multiValued fields.
add
Adds the specified values to a multiValued field. May be specified as a single value, or as a list.

add-distinct
Adds the specified values to a multiValued field, only if not already present. May be specified as a single value, or as a list.

remove
Removes (all occurrences of) the specified values from a multiValued field. May be specified as a single value, or as a list.

removeregex
Removes all occurrences of the specified regex from a multiValued field. May be specified as a single value, or as a list.

inc
Increments a numeric value by a specific amount. Must be specified as a single numeric value.

Field Storage

The core functionality of atomically updating a document requires that all fields in your schema must be configured as stored (stored="true") or docValues (docValues="true") except for fields which are <copyField/> destinations, which must be configured as stored="false". Atomic updates are applied to the document represented by the existing stored field values. All data in copyField destination fields must originate from ONLY copyField sources.

If <copyField/> destinations are configured as stored, then Solr will attempt to index both the current value of the field as well as an additional copy from any source fields. If such fields contain some information that comes from the indexing program and some information that comes from copyField, then the information which originally came from the indexing program will be lost when an atomic update is made.

There are other kinds of derived fields that must also be set so they aren’t stored. Some spatial field types use derived fields. Examples of this are solr.BBoxField and solr.LatLonType. CurrencyFieldType also uses derived fields. These types create additional fields which are normally specified by a dynamic field definition. That dynamic field definition must not be stored, or indexing will fail.
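The modifiers listed under Atomic Updates can be modelled on plain Python dictionaries. The sketch below is a toy, not Solr's implementation: apply_atomic_update is a hypothetical helper, it assumes that a field left without values disappears from the document, and it assumes removeregex drops a value only when a pattern matches it entirely. The sample document and update command are the ones used in this section's worked example.

```python
import re


def apply_atomic_update(doc, update):
    """Toy model: return a copy of `doc` with the update's modifiers applied."""
    result = {k: list(v) if isinstance(v, list) else v for k, v in doc.items()}
    for field, change in update.items():
        if not isinstance(change, dict):
            continue  # a plain value such as "id" only identifies the document
        for op, arg in change.items():
            if op == "set":
                if arg is None or arg == []:
                    result.pop(field, None)  # null or empty list removes values
                else:
                    result[field] = arg
                continue
            if op == "inc":
                # Assumption: a missing numeric field counts as 0.
                result[field] = result.get(field, 0) + arg
                continue
            args = arg if isinstance(arg, list) else [arg]
            old = result.get(field, [])
            values = old if isinstance(old, list) else [old]
            if op == "add":
                values = values + args
            elif op == "add-distinct":
                values = values + [a for a in args if a not in values]
            elif op == "remove":
                values = [v for v in values if v not in args]
            elif op == "removeregex":
                # Assumption: a value is removed when a pattern matches it fully.
                values = [v for v in values
                          if not any(re.fullmatch(p, v) for p in args)]
            else:
                raise ValueError("unknown modifier: " + op)
            if values:
                result[field] = values
            else:
                result.pop(field, None)  # assumption: emptied fields disappear
    return result


doc = {"id": "mydoc", "price": 10, "popularity": 42, "categories": ["kids"],
       "sub_categories": ["under_5", "under_10"], "promo_ids": ["a123x"],
       "tags": ["free_to_try", "buy_now", "clearance", "on_sale"]}
update = {"id": "mydoc", "price": {"set": 99}, "popularity": {"inc": 20},
          "categories": {"add": ["toys", "games"]},
          "sub_categories": {"add-distinct": "under_10"},
          "promo_ids": {"remove": "a123x"},
          "tags": {"remove": ["free_to_try", "on_sale"]}}
updated = apply_atomic_update(doc, update)
print(updated)
```

Running it yields the same field values as the resulting document shown in the worked example, with promo_ids gone because its only value was removed.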
Example Updating Part of a Document

If the following document exists in our collection:

{"id":"mydoc",
 "price":10,
 "popularity":42,
 "categories":["kids"],
 "sub_categories":["under_5","under_10"],
 "promo_ids":["a123x"],
 "tags":["free_to_try","buy_now","clearance","on_sale"]
}

And we apply the following update command:

{"id":"mydoc",
 "price":{"set":99},
 "popularity":{"inc":20},
 "categories":{"add":["toys","games"]},
 "sub_categories":{"add-distinct":"under_10"},
 "promo_ids":{"remove":"a123x"},
 "tags":{"remove":["free_to_try","on_sale"]}
}

The resulting document in our collection will be:

{"id":"mydoc",
 "price":99,
 "popularity":62,
 "categories":["kids","toys","games"],
 "sub_categories":["under_5","under_10"],
 "tags":["buy_now","clearance"]
}

In-Place Updates

In-place updates are very similar to atomic updates; in some sense, this is a subset of atomic updates. In regular atomic updates, the entire document is re-indexed internally during the application of the update. However, in this approach, only the fields to be updated are affected and the rest of the document is not re-indexed internally. Hence, the efficiency of updating in-place is unaffected by the size of the documents that are updated (i.e., number of fields, size of fields, etc.). Apart from these internal differences, there is no functional difference between atomic updates and in-place updates.

An atomic update operation is performed using this approach only when the fields to be updated meet these three conditions:

• are non-indexed (indexed="false"), non-stored (stored="false"), single valued (multiValued="false") numeric docValues (docValues="true") fields;
• the _version_ field is also a non-indexed, non-stored single valued docValues field; and,
• copy targets of updated fields, if any, are also non-indexed, non-stored single valued numeric docValues fields.
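The three conditions above can be written down as a small eligibility check. The Python below is only a sketch: FieldDef is a hypothetical stand-in for a schema field definition, and the function names are invented for illustration, so this is not how Solr itself decides.

```python
from dataclasses import dataclass


@dataclass
class FieldDef:
    """Hypothetical stand-in for a schema field definition."""
    name: str
    numeric: bool
    indexed: bool = True
    stored: bool = True
    doc_values: bool = False
    multi_valued: bool = False


def is_doc_values_only(field, require_numeric=True):
    """True for a non-indexed, non-stored, single valued docValues field."""
    return (not field.indexed and not field.stored and field.doc_values
            and not field.multi_valued
            and (field.numeric or not require_numeric))


def can_update_in_place(updated_fields, version_field, copy_targets=()):
    """Apply the three conditions from the list above."""
    return (all(is_doc_values_only(f) for f in updated_fields)
            # The text does not say the _version_ field must be numeric.
            and is_doc_values_only(version_field, require_numeric=False)
            and all(is_doc_values_only(f) for f in copy_targets))


price = FieldDef("price", numeric=True, indexed=False, stored=False, doc_values=True)
title = FieldDef("title", numeric=False)
version = FieldDef("_version_", numeric=True, indexed=False, stored=False,
                   doc_values=True)

print(can_update_in_place([price], version))         # True
print(can_update_in_place([price, title], version))  # False
```

If any updated field fails the check, the update is still valid but is handled as a regular atomic update, which re-indexes the whole document.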
To use in-place updates, add a modifier to the field that needs to be updated. The content can be updated or incrementally increased. set Set or replace the field value(s) with the specified value(s). May be specified as a single value. inc Increments a numeric value by a specific amount. Must be specified as a single numeric value. In-Place Update Example If the price and popularity fields are defined in the schema as: If the following document exists in our collection: Guide Version 7.3 - Published: 2018-03-27 © 2018, Apache Software Foundation Apache Solr Reference Guide 7.3 Page 397 of 1195 { "id":"mydoc", "price":10, "popularity":42, "categories":["kids"], "promo_ids":["a123x"], "tags":["free_to_try","buy_now","clearance","on_sale"] } And we apply the following update command: { "id":"mydoc", "price":{"set":99}, "popularity":{"inc":20} } The resulting document in our collection will be: { "id":"mydoc", "price":99, "popularity":62, "categories":["kids"], "promo_ids":["a123x"], "tags":["free_to_try","buy_now","clearance","on_sale"] } Optimistic Concurrency Optimistic Concurrency is a feature of Solr that can be used by client applications which update/replace documents to ensure that the document they are replacing/updating has not been concurrently modified by another client application. This feature works by requiring a _version_ field on all documents in the index, and comparing that to a _version_ specified as part of the update command. By default, Solr’s Schema includes a _version_ field, and this field is automatically added to each new document. In general, using optimistic concurrency involves the following work flow: 1. A client reads a document. In Solr, one might retrieve the document with the /get handler to be sure to have the latest version. 2. A client changes the document locally. 3. The client resubmits the changed document to Solr, for example, perhaps with the /update handler. 4. 
If there is a version conflict (HTTP error code 409), the client starts the process over.

When the client resubmits a changed document to Solr, the _version_ can be included with the update to invoke optimistic concurrency control. Specific semantics are used to define when the document should be updated or when to report a conflict.

• If the content in the _version_ field is greater than '1' (e.g., '12345'), then the _version_ in the document must match the _version_ in the index.
• If the content in the _version_ field is equal to '1', then the document must simply exist. In this case, no version matching occurs, but if the document does not exist, the updates will be rejected.
• If the content in the _version_ field is less than '0' (e.g., '-1'), then the document must not exist. In this case, no version matching occurs, but if the document exists, the updates will be rejected.
• If the content in the _version_ field is equal to '0', then it doesn’t matter if the versions match or if the document exists or not. If it exists, it will be overwritten; if it does not exist, it will be added.

If the document being updated does not include the _version_ field, and atomic updates are not being used, the document will be treated by normal Solr rules, which is usually to discard the previous version.

When using Optimistic Concurrency, clients can include an optional versions=true request parameter to indicate that the new versions of the documents being added should be included in the response. This allows clients to immediately know what the _version_ is of every document added without needing to make a redundant /get request.
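The four _version_ rules listed above amount to a small decision table. The Python sketch below restates them for illustration only; the function name and its "ok"/"conflict" return values are hypothetical and do not reflect Solr's internal implementation.

```python
def check_version(requested, existing):
    """Decide whether an update carrying _version_=requested is accepted.

    requested: the integer _version_ sent with the update.
    existing:  the _version_ currently in the index, or None if the
               document does not exist.
    Returns "ok" or "conflict" (a conflict corresponds to HTTP 409).
    """
    if requested > 1:
        # Versions must match exactly.
        return "ok" if existing == requested else "conflict"
    if requested == 1:
        # The document must simply exist; no version matching.
        return "ok" if existing is not None else "conflict"
    if requested < 0:
        # The document must not exist; no version matching.
        return "ok" if existing is None else "conflict"
    # requested == 0: no constraint; overwrite or add.
    return "ok"


print(check_version(999999, 1498562471222312960))               # conflict
print(check_version(1498562471222312960, 1498562471222312960))  # ok
print(check_version(-1, None))                                  # ok
```

This sketch covers only the accept-or-reject decision. The versions=true parameter described above is separate: it asks Solr to report the newly assigned _version_ values in the response.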
For example:

$ curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/techproducts/update?versions=true' --data-binary '
[ { "id" : "aaa" },
  { "id" : "bbb" } ]'
{"responseHeader":{"status":0,"QTime":6},
 "adds":["aaa",1498562471222312960,
         "bbb",1498562471225458688]}

$ curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/techproducts/update?_version_=999999&versions=true' --data-binary '
[{ "id" : "aaa",
   "foo_s" : "update attempt with wrong existing version" }]'
{"responseHeader":{"status":409,"QTime":3},
 "error":{"msg":"version conflict for aaa expected=999999 actual=1498562471222312960",
          "code":409}}

$ curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/techproducts/update?_version_=1498562471222312960&versions=true&commit=true' --data-binary '
[{ "id" : "aaa",
   "foo_s" : "update attempt with correct existing version" }]'
{"responseHeader":{"status":0,"QTime":5},
 "adds":["aaa",1498562624496861184]}

$ curl 'http://localhost:8983/solr/techproducts/query?q=*:*&fl=id,_version_'
{
  "responseHeader":{
    "status":0,
    "QTime":5,
    "params":{
      "fl":"id,_version_",
      "q":"*:*"}},
  "response":{"numFound":2,"start":0,"docs":[
      { "id":"bbb",
        "_version_":1498562471225458688},
      { "id":"aaa",
        "_version_":1498562624496861184}]
  }}

For more information, please also see Yonik Seeley’s presentation on NoSQL features in Solr 4 from Apache Lucene EuroCon 2012.

Document Centric Versioning Constraints

Optimistic Concurrency is extremely powerful, and works very efficiently because it uses internally assigned, globally unique values for the _version_ field.
However, in some situations users may want to configure their own document specific version field, where the version values are assigned on a per-document basis by an external system, and have Solr reject updates that attempt to replace a document with an "older" version. In situations like this the DocBasedVersionConstraintsProcessorFactory can be useful.

The basic usage of DocBasedVersionConstraintsProcessorFactory is to configure it in solrconfig.xml as part of the UpdateRequestProcessorChain and specify the name of your custom versionField in your schema that should be checked when validating updates:

<processor class="solr.DocBasedVersionConstraintsProcessorFactory">
  <str name="versionField">my_version_l</str>
</processor>

Once configured, this update processor will reject (HTTP error code 409) any attempt to update an existing document where the value of the my_version_l field in the "new" document is not greater than the value of that field in the existing document.

versionField vs _version_
The _version_ field used by Solr for its normal optimistic concurrency also has important semantics in how updates are distributed to replicas in SolrCloud, and MUST be assigned internally by Solr. Users cannot re-purpose that field and specify it as the versionField for use in the DocBasedVersionConstraintsProcessorFactory configuration.

DocBasedVersionConstraintsProcessorFactory supports two additional configuration params which are optional:

• ignoreOldUpdates - A boolean option which defaults to false. If set to true then instead of rejecting updates where the versionField is too low, the update will be silently ignored (and return a status 200 to the client).
• deleteVersionParam - A String parameter that can be specified to indicate that this processor should also inspect Delete By Id commands.
The value of this configuration option should be the name of a request parameter that the processor will now consider mandatory for all attempts to Delete By Id, and must be used by clients to specify a value for the versionField which is greater than the existing value of the document to be deleted.

When using this request param, any Delete By Id command with a high enough document version number to succeed will be internally converted into an Add Document command that replaces the existing document with a new one which is empty except for the Unique Key and versionField, keeping a record of the deleted version so future Add Document commands will fail if their "new" version is not high enough.

Please consult the DocBasedVersionConstraintsProcessorFactory javadocs and test solrconfig.xml file for additional information and example usages.

Detecting Languages During Indexing

Solr can identify languages and map text to language-specific fields during indexing using the langid UpdateRequestProcessor.

Solr supports three implementations of this feature:

• Tika’s language detection feature: https://tika.apache.org/1.17/detection.html
• LangDetect language detection: https://github.com/shuyo/language-detection
• OpenNLP language detection: http://opennlp.apache.org/docs/1.8.4/manual/opennlp.html#tools.langdetect

You can see a comparison between the Tika and LangDetect implementations here: http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html. In general, the LangDetect implementation supports more languages with higher performance.

For specific information on each of these language identification implementations, including a list of supported languages for each, see the relevant project websites.

For more information about language analysis in Solr, see Language Analysis.
Configuring Language Detection

You can configure the langid UpdateRequestProcessor in solrconfig.xml. All three implementations take the same parameters, which are described in the following section; the OpenNLP implementation additionally requires langid.model. At a minimum, you must specify the fields for language identification and a field for the resulting language code.

Configuring Tika Language Detection

Here is an example of a minimal Tika langid configuration in solrconfig.xml:

<processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
  <lst name="defaults">
    <str name="langid.fl">title,subject,text,keywords</str>
    <str name="langid.langField">language_s</str>
  </lst>
</processor>

Configuring LangDetect Language Detection

Here is an example of a minimal LangDetect langid configuration in solrconfig.xml:

<processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
  <lst name="defaults">
    <str name="langid.fl">title,subject,text,keywords</str>
    <str name="langid.langField">language_s</str>
  </lst>
</processor>

Configuring OpenNLP Language Detection

Here is an example of a minimal OpenNLP langid configuration in solrconfig.xml:

<processor class="org.apache.solr.update.processor.OpenNLPLangDetectUpdateProcessorFactory">
  <lst name="defaults">
    <str name="langid.fl">title,subject,text,keywords</str>
    <str name="langid.langField">language_s</str>
    <str name="langid.model">langdetect-183.bin</str>
  </lst>
</processor>

OpenNLP-specific Parameters

langid.model
An OpenNLP language detection model. The OpenNLP project provides a pre-trained 103 language model on the OpenNLP site’s model download page. Model training instructions are provided on the OpenNLP website. This parameter is required.

OpenNLP Language Codes

OpenNLPLangDetectUpdateProcessor automatically converts the 3-letter ISO 639-3 codes detected by the OpenNLP model into 2-letter ISO 639-1 codes.

langid Parameters

As previously mentioned, all implementations of the langid UpdateRequestProcessor take the same parameters.

langid
When true, the default, enables language detection.

langid.fl
A comma- or space-delimited list of fields to be processed by langid. This parameter is required.

langid.langField
Specifies the field for the returned language code. This parameter is required.

langid.langsField
Specifies the field for a list of returned language codes. If you use langid.map.individual, each detected language will be added to this field.
langid.overwrite
Specifies whether the content of the langField and langsField fields will be overwritten if they already contain values. The default is false.

langid.lcmap
A space-separated list specifying colon delimited language code mappings to apply to the detected languages. For example, you might use this to map Chinese, Japanese, and Korean to a common cjk code, and map both American and British English to a single en code by using langid.lcmap=ja:cjk zh:cjk ko:cjk en_GB:en en_US:en. This affects both the values put into the langField and langsField fields, as well as the field suffixes when using langid.map, unless overridden by langid.map.lcmap.

langid.threshold
Specifies a threshold value between 0 and 1 that the language identification score must reach before langid accepts it. With longer text fields, a high threshold such as 0.8 will give good results. For shorter text fields, you may need to lower the threshold for language identification, though you will be risking somewhat lower quality results. We recommend experimenting with your data to tune your results. The default is 0.5.

langid.whitelist
Specifies a list of allowed language identification codes. Use this in combination with langid.map to ensure that you only index documents into fields that are in your schema.

langid.map
Enables field name mapping. If true, Solr will map field names for all fields listed in langid.fl. The default is false.

langid.map.fl
A comma-separated list of fields for langid.map that is different than the fields specified in langid.fl.

langid.map.keepOrig
If true, Solr will copy the field during the field name mapping process, leaving the original field in place. The default is false.

langid.map.individual
If true, Solr will detect and map languages for each field individually. The default is false.
langid.map.individual.fl
A comma-separated list of fields for use with langid.map.individual that is different than the fields specified in langid.fl.

langid.fallback
Specifies a language code to use if no language is detected or specified in langid.fallbackFields.

langid.fallbackFields
If no language is detected that meets the langid.threshold score, or if the detected language is not on the langid.whitelist, this field specifies language codes to be used as fallback values. If no appropriate fallback languages are found, Solr will use the language code specified in langid.fallback.

langid.map.lcmap
A space-separated list specifying colon-delimited language code mappings to use when mapping field names. For example, you might use this to make Chinese, Japanese, and Korean language fields use a common *_cjk suffix, and map both American and British English fields to a single *_en by using langid.map.lcmap=ja:cjk zh:cjk ko:cjk en_GB:en en_US:en. A list defined with this parameter will override any configuration set with langid.lcmap.

langid.map.pattern
By default, fields are mapped as <field>_<language>. To change this pattern, you can specify a Java regular expression in this parameter.

langid.map.replace
By default, fields are mapped as <field>_<language>. To change this pattern, you can specify a Java replace in this parameter.

langid.enforceSchema
If false, the langid processor does not validate field names against your schema. This may be useful if you plan to rename or delete fields later in the UpdateChain. The default is true.

De-Duplication

If duplicate, or near-duplicate documents are a concern in your index, de-duplication may be worth implementing.
Preventing duplicate or near duplicate documents from entering an index or tagging documents with a signature/fingerprint for duplicate field collapsing can be efficiently achieved with a low collision or fuzzy hash algorithm. Solr natively supports de-duplication techniques of this type via the Signature class and allows for the easy addition of new hash/signature implementations. A Signature can be implemented in a few ways:

• MD5Signature: 128-bit hash used for exact duplicate detection.
• Lookup3Signature: 64-bit hash used for exact duplicate detection. This is much faster than MD5 and smaller to index.
• TextProfileSignature: Fuzzy hashing implementation from Apache Nutch for near duplicate detection. It’s tunable but works best on longer text.

Other, more sophisticated algorithms for fuzzy/near hashing can be added later.

Adding in the de-duplication process will change the allowDups setting so that it applies to an update term (with signatureField in this case) rather than the unique field Term.

Of course the signatureField could be the unique field, but generally you want the unique field to be unique. When a document is added, a signature will automatically be generated and attached to the document in the specified signatureField.

Configuration Options

There are two places in Solr to configure de-duplication: in solrconfig.xml and in schema.xml.

In solrconfig.xml

The SignatureUpdateProcessorFactory has to be registered in solrconfig.xml as part of an Update Request Processor Chain, as in this example:

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">id</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">name,features,cat</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

The SignatureUpdateProcessorFactory takes several properties:

signatureClass
A Signature implementation for generating a signature hash. The default is org.apache.solr.update.processor.Lookup3Signature. The fully qualified class name of the implementation must be specified.
The available options are described above; the associated class names to use are:

• org.apache.solr.update.processor.Lookup3Signature
• org.apache.solr.update.processor.MD5Signature
• org.apache.solr.update.processor.TextProfileSignature

fields
The fields to use to generate the signature hash in a comma separated list. By default, all fields on the document will be used.

signatureField
The name of the field used to hold the fingerprint/signature. The field should be defined in schema.xml. The default is signatureField.

enabled
Set to false to disable de-duplication processing. The default is true.

overwriteDupes
If true, the default, when a document exists that already matches this signature, it will be overwritten.

In schema.xml

If you are using a separate field for storing the signature, you must have it indexed:

<field name="signatureField" type="string" stored="true" indexed="true" multiValued="false" />

Be sure to change your update handlers to use the defined chain, as below:

<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">dedupe</str>
  </lst>
...
</requestHandler>

This example assumes you have other sections of your request handler defined.

The update processor can also be specified per request with a parameter of update.chain=dedupe.

Content Streams

Content streams are bulk data passed with a request to Solr.

When Solr RequestHandlers are accessed using path based URLs, the SolrQueryRequest object containing the parameters of the request may also contain a list of ContentStreams containing bulk data for the request. (The name SolrQueryRequest is a bit misleading: it is involved in all requests, regardless of whether it is a query request or an update request.)

Content Stream Sources

Currently request handlers can get content streams in a variety of ways:

• For multipart file uploads, each file is passed as a stream.
• For POST requests where the content-type is not application/x-www-form-urlencoded, the raw POST body is passed as a stream.
The full POST body is parsed as parameters and included in the Solr parameters.
• The contents of the stream.body parameter are passed as a stream.
• If remote streaming is enabled and URL content is called for during request handling, the contents of each stream.url and stream.file parameter are fetched and passed as a stream.

By default, curl sends a contentType="application/x-www-form-urlencoded" header. If you need to test a SolrContentHeader content stream, you will need to set the content type with curl’s -H flag.

Remote Streaming

Remote streaming lets you send the contents of a URL as a stream to a given Solr RequestHandler. You could use remote streaming to send a remote or local file to an update plugin.

Remote streaming is disabled by default. Enabling it is not recommended in a production situation without additional security between you and untrusted remote clients.

In solrconfig.xml, you can enable it by changing the following enableRemoteStreaming parameter to true:

*** WARNING ***
Before enabling remote streaming, you should make sure your system has authentication enabled.

<requestParsers enableRemoteStreaming="false" />

When enableRemoteStreaming is not specified in solrconfig.xml, the default behavior is to not allow remote streaming (i.e., enableRemoteStreaming="false").

Remote streaming can also be enabled through the Config API as follows:

V1 API

curl -H 'Content-type:application/json' -d '{"set-property": {"requestDispatcher.requestParsers.enableRemoteStreaming":true}}' 'http://localhost:8983/solr/techproducts/config'

V2 API

curl -X POST -H 'Content-type: application/json' -d '{"set-property": {"requestDispatcher.requestParsers.enableRemoteStreaming":true}}' 'http://localhost:8983/api/collections/techproducts/config'

If enableRemoteStreaming="true" is used, be aware that this allows anyone to send a request to any URL or local file.
If the DumpRequestHandler is enabled, it will allow anyone to view any file on your system.

Debugging Requests

The implicit "dump" RequestHandler (see Implicit RequestHandlers) simply outputs the contents of the Solr QueryRequest using the specified writer type wt. This is a useful tool to help understand what streams are available to the RequestHandlers.

UIMA Integration

You can integrate the Apache Unstructured Information Management Architecture (UIMA) with Solr. UIMA lets you define custom pipelines of Analysis Engines that incrementally add metadata to your documents as annotations.

Configuring UIMA

The SolrUIMA UpdateRequestProcessor is a custom update request processor that takes documents being indexed, sends them to a UIMA pipeline, and then returns the documents enriched with the specified metadata. To configure UIMA for Solr, follow these steps:

1. Copy solr-uima-VERSION.jar (under /solr-VERSION/dist/) and its libraries (under contrib/uima/lib) to a Solr libraries directory, or set <lib/> tags in solrconfig.xml appropriately to point to those jar files:

2. Modify schema.xml, adding your desired metadata fields specifying proper values for type, indexed, stored, and multiValued options. For example:

3. Add the following snippet to solrconfig.xml:

VALID_ALCHEMYAPI_KEY VALID_ALCHEMYAPI_KEY VALID_ALCHEMYAPI_KEY VALID_ALCHEMYAPI_KEY VALID_ALCHEMYAPI_KEY VALID_OPENCALAIS_KEY /org/apache/uima/desc/OverridingParamsExtServicesAE.xml true false text org.apache.uima.alchemy.ts.concept.ConceptFS text concept org.apache.uima.alchemy.ts.language.LanguageFS language language org.apache.uima.SentenceAnnotation coveredText sentence

◦ VALID_ALCHEMYAPI_KEY is your AlchemyAPI Access Key. You need to register an AlchemyAPI Access key to use AlchemyAPI services: http://www.alchemyapi.com/api/register.html.
◦ VALID_OPENCALAIS_KEY is your Calais Service Key.
You need to register a Calais Service key to use the Calais services: http://www.opencalais.com/apikey.
◦ analysisEngine must contain an AE descriptor inside the specified path in the classpath.
◦ analyzeFields must contain the input fields that need to be analyzed by UIMA. If merge=true then their content will be merged and analyzed only once.
◦ Field mapping describes which features of which types should go in a field.

4. In your solrconfig.xml replace the existing default UpdateRequestHandler or create a new UpdateRequestHandler:

uima

Once you are done with the configuration your documents will be automatically enriched with the specified fields when you index them.

For more information about Solr UIMA integration, see https://wiki.apache.org/solr/SolrUIMA.

Searching

This section describes how Solr works with search requests. It covers the following topics:

• Overview of Searching in Solr: An introduction to searching with Solr.
• Velocity Search UI: A simple search UI using the VelocityResponseWriter.
• Relevance: Conceptual information about understanding relevance in search results.
• Query Syntax and Parsing: A brief conceptual overview of query syntax and parsing. It also contains the following sub-sections:
  ◦ Common Query Parameters: No matter the query parser, there are several parameters that are common to all of them.
  ◦ The Standard Query Parser: Detailed information about the standard Lucene query parser.
  ◦ The DisMax Query Parser: Detailed information about Solr’s DisMax query parser.
  ◦ The Extended DisMax Query Parser: Detailed information about Solr’s Extended DisMax (eDisMax) Query Parser.
  ◦ Function Queries: Detailed information about parameters for generating relevancy scores using values from one or more numeric fields.
  ◦ Local Parameters in Queries: How to add local arguments to queries.
  ◦ Other Parsers: More parsers designed for use in specific situations.
• JSON Request API: Overview of Solr’s JSON Request API.
  ◦ JSON Query DSL: Detailed information about a simple yet powerful query language for JSON Request API.
• JSON Facet API: Overview of Solr’s JSON Facet API.
• Faceting: Detailed information about categorizing search results based on indexed terms.
• Highlighting: Detailed information about Solr’s highlighting capabilities, including multiple underlying highlighter implementations.
• Spell Checking: Detailed information about Solr’s spelling checker.
• Query Re-Ranking: Detailed information about re-ranking top scoring documents from simple queries using more complex scores.
  ◦ Learning To Rank: How to use LTR to run machine learned ranking models in Solr.
• Transforming Result Documents: Detailed information about using DocTransformers to add computed information to individual documents.
• Suggester: Detailed information about Solr’s powerful autosuggest component.
• MoreLikeThis: Detailed information about Solr’s similar results query component.
• Pagination of Results: Detailed information about fetching paginated results for display in a UI, or for fetching all documents matching a query.
• Result Grouping: Detailed information about grouping results based on common field values.
• Result Clustering: Detailed information about grouping search results based on cluster analysis applied to text fields. A bit like "unsupervised" faceting.
• Spatial Search: How to use Solr’s spatial search capabilities.
• The Terms Component: Detailed information about accessing indexed terms and the documents that include them.
• The Term Vector Component: How to get term information about specific documents.
• The Stats Component: How to return information from numeric fields within a document set.
• The Query Elevation Component: How to force documents to the top of the results for certain queries.
• Response Writers: Detailed information about configuring and using Solr’s response writers.
• Near Real Time Searching: How to include documents in search results nearly immediately after they are indexed.
• RealTime Get: How to get the latest version of a document without opening a searcher.
• Exporting Result Sets: Functionality to export large result sets out of Solr.
• Streaming Expressions: A stream processing language for Solr, with a suite of functions to perform many types of queries and parallel execution tasks.
• Parallel SQL Interface: An interface for sending SQL statements to Solr, and using advanced parallel query processing and relational algebra for complex data analysis.
• The Analytics Component: A framework to compute complex analytics over a result set.

Overview of Searching in Solr

Solr offers a rich, flexible set of features for search. To understand the extent of this flexibility, it’s helpful to begin with an overview of the steps and components involved in a Solr search.

When a user runs a search in Solr, the search query is processed by a request handler. A request handler is a Solr plug-in that defines the logic to be used when Solr processes a request. Solr supports a variety of request handlers. Some are designed for processing search queries, while others manage tasks such as index replication.

Search applications select a particular request handler by default. In addition, applications can be configured to allow users to override the default selection in preference of a different request handler.
To process a search query, a request handler calls a query parser, which interprets the terms and parameters of a query. Different query parsers support different syntax. Solr’s default query parser is known as the Standard Query Parser, or more commonly just the "lucene" query parser. Solr also includes the DisMax query parser, and the Extended DisMax (eDisMax) query parser. The standard query parser’s syntax allows for greater precision in searches, but the DisMax query parser is much more tolerant of errors. The DisMax query parser is designed to provide an experience similar to that of popular search engines such as Google, which rarely display syntax errors to users. The Extended DisMax query parser is an improved version of DisMax that handles the full Lucene query syntax while still tolerating syntax errors. It also includes several additional features.

In addition, there are common query parameters that are accepted by all query parsers.

Input to a query parser can include:

• search strings, that is, terms to search for in the index
• parameters for fine-tuning the query by increasing the importance of particular strings or fields, by applying Boolean logic among the search terms, or by excluding content from the search results
• parameters for controlling the presentation of the query response, such as specifying the order in which results are to be presented or limiting the response to particular fields of the search application’s schema.

Search parameters may also specify a filter query. As part of a search response, a filter query runs a query against the entire index and caches the results. Because Solr allocates a separate cache for filter queries, the strategic use of filter queries can improve search performance. (Despite their similar names, query filters are not related to analysis filters.
Filter queries perform queries at search time against data already in the index, while analysis filters, such as Tokenizers, parse content for indexing, following specified rules).

A search query can request that certain terms be highlighted in the search response; that is, the selected terms will be displayed in colored boxes so that they "jump out" on the screen of search results. Highlighting can make it easier to find relevant passages in long documents returned in a search. Solr supports multi-term highlighting. Solr includes a rich set of search parameters for controlling how terms are highlighted.

Search responses can also be configured to include snippets (document excerpts) featuring highlighted text. Popular search engines such as Google and Yahoo! return snippets in their search results: 3-4 lines of text offering a description of a search result.

To help users zero in on the content they’re looking for, Solr supports two special ways of grouping search results to aid further exploration: faceting and clustering.

Faceting is the arrangement of search results into categories (which are based on indexed terms). Within each category, Solr reports on the number of hits for each relevant term, which is called a facet constraint. Faceting makes it easy for users to explore search results on sites such as movie sites and product review sites, where there are many categories and many items within a category.

The screen shot below shows an example of faceting from the CNET Web site (CBS Interactive Inc.), which was the first site to use Solr.

Faceting makes use of fields defined when the search applications were indexed. In the example above, these fields include categories of information that are useful for describing digital cameras: manufacturer, resolution, and zoom range.
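As a rough illustration of the faceting idea described above, the following Python sketch counts how many results carry each value of a facet field. It is a conceptual model only, not how Solr computes facets; the documents, field values, and the facet_counts function are hypothetical.

```python
from collections import Counter


def facet_counts(results, facet_fields):
    """Count, per facet field, how many result documents have each value."""
    counts = {field: Counter() for field in facet_fields}
    for doc in results:
        for field in facet_fields:
            value = doc.get(field)
            if value is not None:
                counts[field][value] += 1
    return counts


# Hypothetical search results for a digital camera query.
results = [
    {"id": 1, "manufacturer": "BrandA", "resolution": "12 MP"},
    {"id": 2, "manufacturer": "BrandB", "resolution": "12 MP"},
    {"id": 3, "manufacturer": "BrandA", "resolution": "10 MP"},
]

counts = facet_counts(results, ["manufacturer", "resolution"])
print(dict(counts["manufacturer"]))  # {'BrandA': 2, 'BrandB': 1}
print(dict(counts["resolution"]))    # {'12 MP': 2, '10 MP': 1}
```

Each value and its count corresponds to one facet constraint that a user interface could offer as a way to narrow the result set.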
Clustering groups search results by similarities discovered when a search is executed, rather than when content is indexed. The results of clustering often lack the neat hierarchical organization found in faceted search results, but clustering can be useful nonetheless. It can reveal unexpected commonalities among search results, and it can help users rule out content that isn’t pertinent to what they’re really searching for.

Solr also supports a feature called MoreLikeThis, which enables users to submit new queries that focus on particular terms returned in an earlier query. MoreLikeThis queries can make use of faceting or clustering to provide additional aid to users.

A Solr component called a response writer manages the final presentation of the query response. Solr includes a variety of response writers, including an XML Response Writer and a JSON Response Writer.

The diagram below summarizes some key elements of the search process.

Velocity Search UI

Solr includes a sample search UI based on the VelocityResponseWriter (also known as Solritas) that demonstrates several useful features, such as searching, faceting, highlighting, autocomplete, and geospatial searching.

When using the sample_techproducts_configs config set, you can access the Velocity sample Search UI: http://localhost:8983/solr/techproducts/browse

The Velocity Search UI

For more information about the Velocity Response Writer, see the Response Writer page.

Relevance

Relevance is the degree to which a query response satisfies a user who is searching for information.
The relevance of a query response depends on the context in which the query was performed. A single search application may be used in different contexts by users with different needs and expectations. For example, a search engine of climate data might be used by a university researcher studying long-term climate trends, a farmer interested in calculating the likely date of the last frost of spring, a civil engineer interested in rainfall patterns and the frequency of floods, and a college student planning a vacation to a region and wondering what to pack. Because the motivations of these users vary, the relevance of any particular response to a query will vary as well.

How comprehensive should query responses be? Like relevance in general, the answer to this question depends on the context of a search. The cost of not finding a particular document in response to a query is high in some contexts, such as a legal e-discovery search in response to a subpoena, and quite low in others, such as a search for a cake recipe on a Web site with dozens or hundreds of cake recipes. When configuring Solr, you should weigh comprehensiveness against other factors such as timeliness and ease-of-use.

The e-discovery and recipe examples demonstrate the importance of two concepts related to relevance:

• Precision is the percentage of documents in the returned results that are relevant.
• Recall is the percentage of relevant results returned out of all relevant results in the system. Obtaining perfect recall is trivial: simply return every document in the collection for every query.

Returning to the examples above, it’s important for an e-discovery search application to have 100% recall, returning all the documents that are relevant to a subpoena. It’s far less important that a recipe application offer this degree of precision, however. In some cases, returning too many results in casual contexts could overwhelm users.
In some contexts, returning fewer results that have a higher likelihood of relevance may be the best approach.

Using the concepts of precision and recall, it’s possible to quantify relevance across users and queries for a collection of documents. A perfect system would have 100% precision and 100% recall for every user and every query. In other words, it would retrieve all the relevant documents and nothing else. In practical terms, when talking about precision and recall in real systems, it is common to focus on precision and recall at a certain number of results, the most common (and useful) being ten results.

Through faceting, query filters, and other search components, a Solr application can be configured with the flexibility to help users fine-tune their searches in order to return the most relevant results for users. That is, Solr can be configured to balance precision and recall to meet the needs of a particular user community.

The configuration of a Solr application should take into account:

• the needs of the application’s various users (which can include ease of use and speed of response, in addition to strictly informational needs)
• the categories that are meaningful to these users in their various contexts (e.g., dates, product categories, or regions)
• any inherent relevance of documents (e.g., it might make sense to ensure that an official product description or FAQ is always returned near the top of the search results)
• whether or not the age of documents matters significantly (in some contexts, the most recent documents might always be the most important)

Keeping all these factors in mind, it’s often helpful in the planning stages of a Solr deployment to sketch out the types of responses you think the search application should return for sample queries.
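The definitions of precision and recall at a fixed number of results can be made concrete with a small calculation. The sketch below is only an illustration, not part of Solr: the function names, the document ids, and the relevance judgments are all hypothetical, and it assumes you already have a ranked list of returned ids and a hand-judged set of relevant ids for one query.

```python
def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-k returned documents that are relevant."""
    top_k = ranked_ids[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / len(top_k)


def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of all relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


# Hypothetical judged query: ids returned by the search, in rank order,
# and the ids a human reviewer marked as relevant.
ranked = ["d1", "d7", "d3", "d9", "d4"]
relevant = {"d1", "d3", "d5", "d8"}

p = precision_at_k(ranked, relevant, k=5)  # 2 of the 5 returned are relevant
r = recall_at_k(ranked, relevant, k=5)     # 2 of the 4 relevant were returned
print(p, r)
```

Averaging these two numbers over a set of sample queries is one simple way to compare two configurations of the same application.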
Once the application is up and running, you can employ a series of testing methodologies, such as focus groups, in-house testing, TREC tests, and A/B testing, to fine-tune the configuration of the application to best meet the needs of its users.

For more information about relevance, see Grant Ingersoll’s tech article Debugging Search Application Relevance Issues, which is available on SearchHub.org.

Query Syntax and Parsing

Solr supports several query parsers, offering search application designers great flexibility in controlling how queries are parsed.

This section explains how to specify the query parser to be used. It also describes the syntax and features supported by the main query parsers included with Solr and describes some other parsers that may be useful for particular situations. There are some query parameters common to all Solr parsers; these are discussed in the section Common Query Parameters.

The parsers discussed in this Guide are:

• The Standard Query Parser
• The DisMax Query Parser
• The Extended DisMax Query Parser
• Other Parsers

The query parser plugins are all subclasses of QParserPlugin. If you have custom parsing needs, you may want to extend that class to create your own query parser.

Common Query Parameters

Several query parsers share supported query parameters. The following sections describe Solr’s common query parameters, which are supported by the Search RequestHandlers.

defType Parameter

The defType parameter selects the query parser that Solr should use to process the main query parameter (q) in the request. For example:

defType=dismax

If no defType parameter is specified, then by default the Standard Query Parser is used (i.e., defType=lucene).

sort Parameter

The sort parameter arranges search results in either ascending (asc) or descending (desc) order.
The parameter can be used with either numerical or alphabetical content. The directions can be entered in either all lowercase or all uppercase letters (i.e., both asc and ASC are accepted).

Solr can sort query responses according to:

• Document scores
• Function results
• The value of any primitive field (numerics, string, boolean, dates, etc.) which has docValues="true" (or multiValued="false" and indexed="true", in which case the indexed terms will be used to build DocValue-like structures on the fly at runtime)
• A SortableTextField, which implicitly uses docValues="true" by default to allow sorting on the original input string regardless of the analyzers used for searching.
• A single-valued TextField that uses an analyzer (such as the KeywordTokenizer) that produces only a single term per document. TextField does not support docValues="true", but a DocValue-like structure will be built on the fly at runtime.
◦ NOTE: If you want to be able to sort on a field whose contents you want to tokenize to facilitate searching, use a copyField directive in the Schema to clone the field. Then search on the field and sort on its clone.

In the case of primitive fields, or SortableTextFields, that are multiValued="true", the representative value used for each doc when sorting depends on the sort direction: the minimum value in each document is used for ascending (asc) sorting, while the maximal value in each document is used for descending (desc) sorting. This default behavior is equivalent to explicitly sorting using the 2 argument field() function:

sort=field(name,min) asc

and

sort=field(name,max) desc

The table below explains how Solr responds to various settings of the sort parameter.

Example
    Result

(sort parameter omitted)
    If the sort parameter is omitted, sorting is performed as though the parameter were set to score desc.
score desc
    Sorts in descending order from the highest score to the lowest score.

price asc
    Sorts in ascending order of the price field.

div(popularity,price) desc
    Sorts in descending order of the result of the function popularity / price.

inStock desc, price asc
    Sorts by the contents of the inStock field in descending order, then when multiple documents have the same value for the inStock field, those results are sorted in ascending order by the contents of the price field.

categories asc, price asc
    Sorts by the lowest value of the (multivalued) categories field in ascending order, then when multiple documents have the same lowest categories value, those results are sorted in ascending order by the contents of the price field.

Regarding the sort parameter’s arguments:

• A sort ordering must include a field name (or score as a pseudo field), followed by whitespace (escaped as + or %20 in URL strings), followed by a sort direction (asc or desc).
• Multiple sort orderings can be separated by a comma, using this syntax: sort=<field name> <direction>[,<field name> <direction>]…
◦ When more than one sort criterion is provided, the second entry will only be used if the first entry results in a tie. If there is a third entry, it will only be used if the first AND second entries are tied. This pattern continues with further entries.

start Parameter

When specified, the start parameter specifies an offset into a query’s result set and instructs Solr to begin displaying results from this offset.

The default value is 0. In other words, by default, Solr returns results without an offset, beginning where the results themselves begin.

Setting the start parameter to some other number, such as 3, causes Solr to skip over the preceding records and start at the document identified by the offset.

You can use the start parameter this way for paging.
For example, if the rows parameter is set to 10, you could display three successive pages of results by setting start to 0, then re-issuing the same query and setting start to 10, then issuing the query again and setting start to 20.

rows Parameter

You can use the rows parameter to paginate results from a query. The parameter specifies the maximum number of documents from the complete result set that Solr should return to the client at one time.

The default value is 10. That is, by default, Solr returns 10 documents at a time in response to a query.

fq (Filter Query) Parameter

The fq parameter defines a query that can be used to restrict the superset of documents that can be returned, without influencing score. It can be very useful for speeding up complex queries, since the queries specified with fq are cached independently of the main query. When a later query uses the same filter, there’s a cache hit, and filter results are returned quickly from the cache.

When using the fq parameter, keep in mind the following:

• The fq parameter can be specified multiple times in a query. Documents will only be included in the result if they are in the intersection of the document sets resulting from each instance of the parameter. In the example below, only documents which have a popularity of 10 or greater and a section of 0 will match.

fq=popularity:[10 TO *]&fq=section:0

• Filter queries can involve complicated Boolean queries. The above example could also be written as a single fq with two mandatory clauses like so:

fq=+popularity:[10 TO *] +section:0

• The document sets from each filter query are cached independently. Thus, concerning the previous examples: use a single fq containing two mandatory clauses if those clauses appear together often, and use two separate fq parameters if they are relatively independent. (To learn about tuning cache sizes and making sure a filter cache actually exists, see The Well-Configured Solr Instance.)
• It is also possible to use filter(condition) syntax inside the fq to cache clauses individually and, among other things, to achieve union of cached filter queries.
• As with all parameters: special characters in a URL need to be properly escaped and encoded as hex values. Online tools are available to help you with URL-encoding. For example: http://meyerweb.com/eric/tools/dencoder/.

fl (Field List) Parameter

The fl parameter limits the information included in a query response to a specified list of fields. The fields must be either stored="true" or docValues="true".

The field list can be specified as a space-separated or comma-separated list of field names. The string "score" can be used to indicate that the score of each document for the particular query should be returned as a field. The wildcard character * selects all the fields in the document which are either stored="true" or docValues="true" and useDocValuesAsStored="true" (which is the default when docValues are enabled). You can also add pseudo-fields, functions and transformers to the field list request.

This table shows some basic examples of how to use fl:

Field List
    Result

id name price
    Return only the id, name, and price fields.

id,name,price
    Return only the id, name, and price fields.

id name, price
    Return only the id, name, and price fields.

id score
    Return the id field and the score.

*
    Return all the stored fields in each document, as well as any docValues fields that have useDocValuesAsStored="true". This is the default value of the fl parameter.

* score
    Return all the fields in each document, along with each document’s score.
*,dv_field_name
    Return all the stored fields in each document, and any docValues fields that have useDocValuesAsStored="true", and the docValues from dv_field_name even if it has useDocValuesAsStored="false".

Functions with fl

Functions can be computed for each document in the result and returned as a pseudo-field:

fl=id,title,product(price,popularity)

Document Transformers with fl

Document Transformers can be used to modify the information returned about each document in the results of a query:

fl=id,title,[explain]

Field Name Aliases

You can change the key used in the response for a field, function, or transformer by prefixing it with a "displayName:". For example:

fl=id,sales_price:price,secret_sauce:prod(price,popularity),why_score:[explain style=nl]

{
  "response": {
    "numFound": 2,
    "start": 0,
    "docs": [{
      "id": "6H500F0",
      "secret_sauce": 2100.0,
      "sales_price": 350.0,
      "why_score": {
        "match": true,
        "value": 1.052226,
        "description": "weight(features:cache in 2) [DefaultSimilarity], result of:",
        "details": [{
          "..."
        }]}}]}}

debug Parameter

The debug parameter can be specified multiple times and supports the following arguments:

• debug=query: return debug information about the query only.
• debug=timing: return debug information about how long the query took to process.
• debug=results: return debug information about the score results (also known as "explain").
◦ By default, score explanations are returned as large string values, using newlines and tab indenting for structure and readability, but an additional debug.explain.structured=true parameter may be specified to return this information as nested data structures native to the response format requested by wt.
• debug=all: return all available debug information about the request.
(alternative usage: debug=true)

For backwards compatibility with older versions of Solr, debugQuery=true may instead be specified as an alternative way to indicate debug=all.

The default behavior is not to include debugging information.

explainOther Parameter

The explainOther parameter specifies a Lucene query in order to identify a set of documents. If this parameter is included and is set to a non-blank value, the query will return debugging information, along with the "explain info" of each document that matches the Lucene query, relative to the main query (which is specified by the q parameter). For example:

q=supervillians&debugQuery=on&explainOther=id:juggernaut

The query above allows you to examine the scoring explain info of the top matching documents, compare it to the explain info for documents matching id:juggernaut, and determine why the rankings are not as you expect.

The default value of this parameter is blank, which causes no extra "explain info" to be returned.

timeAllowed Parameter

This parameter specifies the amount of time, in milliseconds, allowed for a search to complete. If this time expires before the search is complete, any partial results will be returned, but values such as numFound, facet counts, and result stats may not be accurate for the entire result set.

This value is only checked at the time of:

1. Query Expansion, and
2. Document collection

As this check is periodically performed, the actual time for which a request can be processed before it is aborted would be marginally greater than or equal to the value of timeAllowed. If the request consumes more time in other stages, custom components, etc., this parameter is not expected to abort the request.

segmentTerminateEarly Parameter

This parameter may be set to either true or false.
If set to true, and if the mergePolicyFactory for this collection is a SortingMergePolicyFactory which uses a sort option compatible with the sort parameter specified for this query, then Solr will attempt to use an EarlyTerminatingSortingCollector.

If early termination is used, a segmentTerminatedEarly header will be included in the responseHeader.

Similar to using the timeAllowed Parameter, when early segment termination happens, values such as numFound, facet counts, and result stats may not be accurate for the entire result set.

The default value of this parameter is false.

omitHeader Parameter

This parameter may be set to either true or false.

If set to true, this parameter excludes the header from the returned results. The header contains information about the request, such as the time it took to complete. The default value for this parameter is false.

wt Parameter

The wt parameter selects the Response Writer that Solr should use to format the query’s response. For detailed descriptions of Response Writers, see Response Writers.

If you do not define the wt parameter in your queries, JSON will be returned as the format of the response.

cache Parameter

Solr caches the results of all queries and filter queries by default. To disable result caching, set the cache=false parameter.

You can also use the cost option to control the order in which non-cached filter queries are evaluated. This allows you to order less expensive non-cached filters before expensive non-cached filters.

For very high cost filters, if cache=false and cost>=100 and the query implements the PostFilter interface, a Collector will be requested from that query and used to filter documents after they have matched the main query and all other filter queries. There can be multiple post filters; they are also ordered by cost.
For most queries the default behavior is cost=0, but some types of queries such as {!frange} default to cost=100, because they are most efficient when used as a PostFilter.

For example, the following request uses 3 regular filters, where all matching documents generated by each are computed up front and cached independently:

q=some keywords
fq=quantity_in_stock:[5 TO *]
fq={!frange l=10 u=100}mul(popularity,price)
fq={!frange cost=200 l=0}pow(mul(sum(1, query('tag:smartphone')), div(1,avg_rating)), 2.3)

These are the same filters run without caching. The simple range query on the quantity_in_stock field will be run in parallel with the main query like a traditional Lucene filter, while the 2 frange filters will only be checked against each document that has already matched the main query and the quantity_in_stock range query. First the simpler mul(popularity,price) will be checked (because of its implicit cost=100), and only if it matches will the final, very complex filter (with its higher cost=200) be checked.

q=some keywords
fq={!cache=false}quantity_in_stock:[5 TO *]
fq={!frange cache=false l=10 u=100}mul(popularity,price)
fq={!frange cache=false cost=200 l=0}pow(mul(sum(1, query('tag:smartphone')), div(1,avg_rating)), 2.3)

logParamsList Parameter

By default, Solr logs all parameters of requests. Set this parameter to restrict which parameters of a request are logged. This may help control logging to only those parameters considered important to your organization.

For example, you could define this like:

logParamsList=q,fq

And only the 'q' and 'fq' parameters will be logged.

If no parameters should be logged, you can send logParamsList as empty (i.e., logParamsList=).

This parameter not only applies to query requests, but to any kind of request to Solr.
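Because these filter strings contain braces, brackets, spaces, and quotes, they need URL-encoding before they are sent over HTTP. The following is a minimal sketch, assuming a Python client that uses only the standard library. The frange_filter helper is hypothetical (it is not part of Solr or of any Solr client); it simply assembles the {!frange …} local-params prefix in the same order as the non-cached example above, and the request also sets logParamsList so that only q and fq would be logged.

```python
from urllib.parse import urlencode, parse_qsl


def frange_filter(func, lower=None, upper=None, cache=True, cost=None):
    """Hypothetical helper: build an fq value with {!frange} local params."""
    parts = ["!frange"]
    if not cache:
        parts.append("cache=false")
    if cost is not None:
        parts.append(f"cost={cost}")
    if lower is not None:
        parts.append(f"l={lower}")
    if upper is not None:
        parts.append(f"u={upper}")
    return "{" + " ".join(parts) + "}" + func


# A list of pairs (not a dict) so that fq can be repeated.
params = [
    ("q", "some keywords"),
    ("fq", "{!cache=false}quantity_in_stock:[5 TO *]"),
    ("fq", frange_filter("mul(popularity,price)",
                         lower=10, upper=100, cache=False)),
    ("fq", frange_filter(
        "pow(mul(sum(1, query('tag:smartphone')), div(1,avg_rating)), 2.3)",
        lower=0, cache=False, cost=200)),
    ("logParamsList", "q,fq"),
]

# urlencode percent-encodes the special characters and turns spaces into "+".
query_string = urlencode(params)
print(query_string)
```

The resulting string can be appended after the ? of a select URL. Keeping the parameters as a list of pairs preserves the order and the repeated fq entries.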
echoParams Parameter

The echoParams parameter controls what information about request parameters is included in the response header.

The echoParams parameter accepts the following values:

• explicit: This is the default value. Only parameters included in the actual request, plus the _ parameter (which is a 64-bit numeric timestamp) will be added to the params section of the response header.
• all: Include all request parameters that contributed to the query. This will include everything defined in the request handler definition found in solrconfig.xml as well as parameters included with the request, plus the _ parameter. If a parameter is included in the request handler definition AND the request, it will appear multiple times in the response header.
• none: Entirely removes the "params" section of the response header. No information about the request parameters will be available in the response.

Here is an example of a JSON response where the echoParams parameter was not included, so the default of explicit is active. The request URL that created this response included three parameters - q, wt, and indent:

{
  "responseHeader": {
    "status": 0,
    "QTime": 0,
    "params": {
      "q": "solr",
      "indent": "true",
      "wt": "json",
      "_": "1458227751857"
    }
  },
  "response": {
    "numFound": 0,
    "start": 0,
    "docs": []
  }
}

This is what happens if a similar request is sent that adds echoParams=all to the three parameters used in the previous example:

{
  "responseHeader": {
    "status": 0,
    "QTime": 0,
    "params": {
      "q": "solr",
      "df": "text",
      "preferLocalShards": "false",
      "indent": "true",
      "echoParams": "all",
      "rows": "10",
      "wt": "json",
      "_": "1458228887287"
    }
  },
  "response": {
    "numFound": 0,
    "start": 0,
    "docs": []
  }
}

The Standard Query Parser

Solr’s default Query Parser is also known as the “lucene” parser.
The key advantage of the standard query parser is that it supports a robust and fairly intuitive syntax allowing you to create a variety of structured queries. The largest disadvantage is that it’s very intolerant of syntax errors, as compared with something like the DisMax query parser which is designed to throw as few errors as possible.

Standard Query Parser Parameters

In addition to the Common Query Parameters, Faceting Parameters, Highlighting Parameters, and MoreLikeThis Parameters, the standard query parser supports the parameters described in the table below.

q
    Defines a query using standard query syntax. This parameter is mandatory.

q.op
    Specifies the default operator for query expressions, overriding the default operator specified in the Schema. Possible values are "AND" or "OR".

df
    Specifies a default field, overriding the definition of a default field in the Schema.

sow
    Split on whitespace. If set to true, text analysis is invoked separately for each individual whitespace-separated term. The default is false; whitespace-separated term sequences will be provided to text analysis in one shot, enabling proper function of analysis filters that operate over term sequences, e.g., multi-word synonyms and shingles.

Default parameter values are specified in solrconfig.xml, or overridden by query-time values in the request.

Standard Query Parser Response

By default, the response from the standard query parser contains one <result> block, which is unnamed. If the debug parameter is used, then an additional block will be returned, using the name "debug". This will contain useful debugging info, including the original query string, the parsed query string, and explain info for each document in the <result> block. If the explainOther parameter is also used, then additional explain info will be provided for all the documents matching that query.
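The parameters above can be combined in a single request. The sketch below is only an illustration using the Python standard library: the query text and the df value are made-up examples, the host and collection are the localhost techproducts defaults used elsewhere in this guide, and the response body is a hand-written stand-in (contents of the debug block omitted) rather than real Solr output. It shows how a client might assemble the request URL and check whether a block named "debug" came back.

```python
import json
from urllib.parse import urlencode

# Standard query parser parameters plus debug; values are illustrative only.
params = {
    "q": "hello dolly",   # mandatory: the query in standard syntax
    "q.op": "AND",        # default operator between the two terms
    "df": "text",         # default field (assumed to exist in the schema)
    "sow": "false",       # pass the whole term sequence to analysis at once
    "debug": "query",     # ask for the additional "debug" block
    "wt": "json",
}
url = "http://localhost:8983/solr/techproducts/select?" + urlencode(params)
print(url)

# Hand-written stand-in for a response body; a real response would carry
# the query strings and explain info inside the "debug" block.
sample_body = """
{"response": {"numFound": 0, "start": 0, "docs": []},
 "debug": {}}
"""
parsed = json.loads(sample_body)
has_debug = "debug" in parsed
print(has_debug)
```

Sending the request itself is left out so that the sketch runs without a Solr server. Any HTTP client can fetch the URL that is printed.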
Sample Responses

This section presents examples of responses from the standard query parser.

The URL below submits a simple query and requests the XML Response Writer to use indentation to make the XML response more readable.

http://localhost:8983/solr/techproducts/select?q=id:SP2514N&wt=xml

Results:

01 electronics hard drive 7200RPM, 8MB cache, IDE Ultra ATA-133 NoiseGuard, SilentSeek technology, Fluid Dynamic Bearing (FDB) motor SP2514N true Samsung Electronics Co. Ltd. Samsung SpinPoint P120 SP2514N - hard drive - 250 GB - ATA-133 6 92.0 SP2514N

Here’s an example of a query with a limited field list.

http://localhost:8983/solr/techproducts/select?q=id:SP2514N&fl=id+name&wt=xml

Results:

02 SP2514N Samsung SpinPoint P120 SP2514N - hard drive - 250 GB - ATA-133

Specifying Terms for the Standard Query Parser

A query to the standard query parser is broken up into terms and operators. There are two types of terms: single terms and phrases.

• A single term is a single word such as "test" or "hello"
• A phrase is a group of words surrounded by double quotes such as "hello dolly"

Multiple terms can be combined together with Boolean operators to form more complex queries (as described below).

It is important that the analyzer used for queries parses terms and phrases in a way that is consistent with the way the analyzer used for indexing parses terms and phrases; otherwise, searches may produce unexpected results.

Term Modifiers

Solr supports a variety of term modifiers that add flexibility or precision, as needed, to searches. These modifiers include wildcard characters, characters for making a search "fuzzy" or more general, and so on. The sections below describe these modifiers in detail.

Wildcard Searches

Solr’s standard query parser supports single and multiple character wildcard searches within single terms.
Wildcard characters can be applied to single terms, but not to search phrases.

Wildcard Search Type (Special Character)
    Example

Single character, matches a single character (?)
    The search string te?t would match both test and text.

Multiple characters, matches zero or more sequential characters (*)
    The wildcard search: tes* would match test, testing, and tester. You can also use wildcard characters in the middle of a term. For example: te*t would match test and text. *est would match pest and test.

Fuzzy Searches

Solr’s standard query parser supports fuzzy searches based on the Damerau-Levenshtein Distance or Edit Distance algorithm. Fuzzy searches discover terms that are similar to a specified term without necessarily being an exact match. To perform a fuzzy search, use the tilde ~ symbol at the end of a single-word term. For example, to search for a term similar in spelling to "roam," use the fuzzy search:

roam~

This search will match terms like roams, foam, and foams. It will also match the word "roam" itself.

An optional distance parameter specifies the maximum number of edits allowed, between 0 and 2, defaulting to 2. For example:

roam~1

This will match terms like roams and foam, but not foams, since it has an edit distance of "2".

In many cases, stemming (reducing terms to a common stem) can produce similar effects to fuzzy searches and wildcard searches.

Proximity Searches

A proximity search looks for terms that are within a specific distance from one another.

To perform a proximity search, add the tilde character ~ and a numeric value to the end of a search phrase. For example, to search for "apache" and "jakarta" within 10 words of each other in a document, use the search:

"jakarta apache"~10

The distance referred to here is the number of term movements needed to match the specified phrase.
In the example above, if "apache" and "jakarta" were 10 spaces apart in a field, but "apache" appeared before "jakarta", more than 10 term movements would be required to move the terms together and position "apache" to the right of "jakarta" with a space in between.

Range Searches

A range search specifies a range of values for a field (a range with an upper bound and a lower bound). The query matches documents whose values for the specified field or fields fall within the range. Range queries can be inclusive or exclusive of the upper and lower bounds. Sorting is done lexicographically, except on numeric fields. For example, the range query below matches all documents whose popularity field has a value between 52 and 10,000, inclusive.

popularity:[52 TO 10000]

Range queries are not limited to date fields or even numerical fields. You could also use range queries with non-date fields:

title:{Aida TO Carmen}

This will find all documents whose titles are between Aida and Carmen, but not including Aida and Carmen.

The brackets around a query determine its inclusiveness.

• Square brackets [ & ] denote an inclusive range query that matches values including the upper and lower bound.
• Curly brackets { & } denote an exclusive range query that matches values between the upper and lower bounds, but excluding the upper and lower bounds themselves.
• You can mix these types so one end of the range is inclusive and the other is exclusive. Here’s an example: count:{1 TO 10]

Boosting a Term with "^"

Lucene/Solr provides the relevance level of matching documents based on the terms found. To boost a term, use the caret symbol ^ with a boost factor (a number) at the end of the term you are searching. The higher the boost factor, the more relevant the term will be.

Boosting allows you to control the relevance of a document by boosting its term.
For example, if you are searching for "jakarta apache" and you want the term "jakarta" to be more relevant, you can boost it by adding the ^ symbol along with the boost factor immediately after the term. For example, you could type:

jakarta^4 apache

This will make documents with the term jakarta appear more relevant. You can also boost Phrase Terms as in the example:

"jakarta apache"^4 "Apache Lucene"

By default, the boost factor is 1. Although the boost factor must be positive, it can be less than 1 (for example, it could be 0.2).

Constant Score with "^="

Constant score queries are created with ^=, which sets the entire clause to the specified score for any documents matching that clause. This is desirable when you only care about matches for a particular clause and don’t want other relevancy factors such as term frequency (the number of times the term appears in the field) or inverse document frequency (a measure across the whole index for how rare a term is in a field).

Example:

(description:blue OR color:blue)^=1.0 text:shoes

Querying Specific Fields

Data indexed in Solr is organized in fields, which are defined in the Solr Schema. Searches can take advantage of fields to add precision to queries. For example, you can search for a term only in a specific field, such as a title field.

The Schema defines one field as a default field. If you do not specify a field in a query, Solr searches only the default field. Alternatively, you can specify a different field or a combination of fields in a query.

To specify a field, type the field name followed by a colon ":" and then the term you are searching for within the field.

For example, suppose an index contains two fields, title and text, and that text is the default field.
If you want to find a document called "The Right Way" which contains the text "don’t go this way," you could use either of the following queries:

title:"The Right Way" AND text:go
title:"The Right Way" AND go

Since text is the default field, the field indicator is not required; hence the second query above omits it.

The field is only valid for the term that it directly precedes, so the query title:Do it right will find only "Do" in the title field. It will find "it" and "right" in the default field (in this case the text field).

Boolean Operators Supported by the Standard Query Parser

Boolean operators allow you to apply Boolean logic to queries, requiring the presence or absence of specific terms or conditions in fields in order to match documents. The table below summarizes the Boolean operators supported by the standard query parser.

Boolean Operator   Alternative Symbol   Description
AND                &&                   Requires both terms on either side of the Boolean operator to be present for a match.
NOT                !                    Requires that the following term not be present.
OR                 ||                   Requires that either term (or both terms) be present for a match.
                   +                    Requires that the following term be present.
                   -                    Prohibits the following term (that is, matches on fields or documents that do not include that term). The - operator is functionally similar to the Boolean operator !. Because it’s used by popular search engines such as Google, it may be more familiar to some user communities.

Boolean operators allow terms to be combined through logic operators. Lucene supports AND, “+”, OR, NOT and “-” as Boolean operators.

When specifying Boolean operators with keywords such as AND or NOT, the keywords must appear in all uppercase.

The standard query parser supports all the Boolean operators listed in the table above. The DisMax query parser supports only + and -.

The OR operator is the default conjunction operator. This means that if there is no Boolean operator between two terms, the OR operator is used.
The OR operator links two terms and finds a matching document if either of the terms exists in a document. This is equivalent to a union using sets. The symbol || can be used in place of the word OR.

To search for documents that contain either "jakarta apache" or just "jakarta," use the query:

"jakarta apache" jakarta

or

"jakarta apache" OR jakarta

The Boolean Operator "+"

The + symbol (also known as the "required" operator) requires that the term after the + symbol exist somewhere in a field of a document in order for that document to match.

For example, to search for documents that must contain "jakarta" and that may or may not contain "lucene," use the following query:

+jakarta lucene

This operator is supported by both the standard query parser and the DisMax query parser.

The Boolean Operator AND ("&&")

The AND operator matches documents where both terms exist anywhere in the text of a single document. This is equivalent to an intersection using sets. The symbol && can be used in place of the word AND.

To search for documents that contain "jakarta apache" and "Apache Lucene," use either of the following queries:

"jakarta apache" AND "Apache Lucene"
"jakarta apache" && "Apache Lucene"

The Boolean Operator NOT ("!")

The NOT operator excludes documents that contain the term after NOT. This is equivalent to a difference using sets. The symbol ! can be used in place of the word NOT.

The following queries search for documents that contain the phrase "jakarta apache" but do not contain the phrase "Apache Lucene":

"jakarta apache" NOT "Apache Lucene"
"jakarta apache" ! "Apache Lucene"

The Boolean Operator "-"

The - symbol or "prohibit" operator excludes documents that contain the term after the - symbol.
For example, to search for documents that contain "jakarta apache" but not "Apache Lucene," use the following query:

"jakarta apache" -"Apache Lucene"

Escaping Special Characters

Solr gives the following characters special meaning when they appear in a query:

+ - && || ! ( ) { } [ ] ^ " ~ * ? : /

To make Solr interpret any of these characters literally, rather than as a special character, precede the character with a backslash character \. For example, to search for (1+1):2 without having Solr interpret the plus sign and parentheses as special characters for formulating a sub-query with two terms, escape the characters by preceding each one with a backslash:

\(1\+1\)\:2

Grouping Terms to Form Sub-Queries

Lucene/Solr supports using parentheses to group clauses to form sub-queries. This can be very useful if you want to control the Boolean logic for a query.

The query below searches for either "jakarta" or "apache" and "website":

(jakarta OR apache) AND website

This adds precision to the query, requiring that the term "website" exist, along with either "jakarta" or "apache."

Grouping Clauses within a Field

To apply two or more Boolean operators to a single field in a search, group the Boolean clauses within parentheses. For example, the query below searches for a title field that contains both the word "return" and the phrase "pink panther":

title:(+return +"pink panther")

Comments in Queries

C-Style comments are supported in query strings.

Example:

"jakarta apache" /* this is a comment in the middle of a normal query string */ OR jakarta

Comments may be nested.

Differences between Lucene’s Classic Query Parser and Solr’s Standard Query Parser

Solr’s standard query parser originated as a variation of Lucene’s "classic" QueryParser.
It diverges in the following ways:

• A * may be used for either or both endpoints to specify an open-ended range query
◦ field:[* TO 100] finds all field values less than or equal to 100
◦ field:[100 TO *] finds all field values greater than or equal to 100
◦ field:[* TO *] matches all documents with the field
• Pure negative queries (all clauses prohibited) are allowed (only as a top-level clause)
◦ -inStock:false finds all documents where inStock is not false
◦ -field:[* TO *] finds all documents without a value for field
• Support for embedded Solr queries (sub-queries) using any type of query parser as a nested clause using the local-params syntax.
◦ inStock:true OR {!dismax qf='name manu' v='ipod'}

Gotcha: Be careful not to start your query with {!, because doing so changes how the entire query string is parsed, which may not be what you want if there are additional clauses. Flipping the example above so that the sub-query comes first would therefore fail to work as expected without a leading space.

Sub-queries can also be done with the magic field _query_, and function queries with the magic field _val_, but these should be considered deprecated since they are less clear. Example: _val_:"recip(rord(myfield),1,2,3)"

• Support for a special filter(…) syntax to indicate that some query clauses should be cached in the filter cache (as a constant score boolean query). This allows sub-queries to be cached and re-used in other queries. For example inStock:true will be cached and re-used in all three of the queries below:
◦ q=features:songs OR filter(inStock:true)
◦ q=+manu:Apple +filter(inStock:true)
◦ q=+manu:Apple & fq=inStock:true

This can even be used to cache individual clauses of complex filter queries.
In the first query below, 3 items will be added to the filter cache (the top level fq and both filter(…) clauses) and in the second query, there will be 2 cache hits, and one new cache insertion (for the new top level fq):
◦ q=features:songs & fq=+filter(inStock:true) +filter(price:[* TO 100])
◦ q=manu:Apple & fq=-filter(inStock:true) -filter(price:[* TO 100])
• Range queries ("[a TO z]"), prefix queries ("a*"), and wildcard queries ("a*b") are constant-scoring (all matching documents get an equal score). The scoring factors TF, IDF, index boost, and "coord" are not used. There is no limitation on the number of terms that match (as there was in past versions of Lucene).
• Constant score queries are created with ^=, which sets the entire clause to the specified score for any documents matching that clause:
◦ q=(description:blue color:blue)^=1.0 title:blue^=5.0

Specifying Dates and Times

Queries against date-based fields must use the appropriate date formatting. Queries for exact date values will require quoting or escaping since : is the parser syntax used to denote a field query:

• createdate:1976-03-06T23\:59\:59.999Z
• createdate:"1976-03-06T23:59:59.999Z"
• createdate:[1976-03-06T23:59:59.999Z TO *]
• createdate:[1995-12-31T23:59:59.999Z TO 2007-03-06T00:00:00Z]
• timestamp:[* TO NOW]
• pubdate:[NOW-1YEAR/DAY TO NOW/DAY+1DAY]
• createdate:[1976-03-06T23:59:59.999Z TO 1976-03-06T23:59:59.999Z+1YEAR]
• createdate:[1976-03-06T23:59:59.999Z/YEAR TO 1976-03-06T23:59:59.999Z]

The DisMax Query Parser

The DisMax query parser is designed to process simple phrases (without complex syntax) entered by users and to search for individual terms across several fields using different weighting (boosts) based on the significance of each field.
Additional options enable users to influence the score based on rules specific to each use case (independent of user input).

In general, the DisMax query parser’s interface is more like that of Google than the interface of the 'lucene' Solr query parser. This similarity makes DisMax the appropriate query parser for many consumer applications. It accepts a simple syntax, and it rarely produces error messages.

The DisMax query parser supports an extremely simplified subset of the Lucene QueryParser syntax. As in Lucene, quotes can be used to group phrases, and +/- can be used to denote mandatory and prohibited clauses. All other Lucene query parser special characters (except AND and OR) are escaped to simplify the user experience. The DisMax query parser takes responsibility for building a good query from the user’s input using Boolean clauses containing DisMax queries across fields and boosts specified by the user. It also lets the Solr administrator provide additional boosting queries, boosting functions, and filtering queries to artificially affect the outcome of all searches. These options can all be specified as default parameters for the request handler in the solrconfig.xml file or overridden in the Solr query URL.

Interested in the technical concept behind the DisMax name? DisMax stands for Maximum Disjunction. Here’s a definition of a Maximum Disjunction or "DisMax" query:

A query that generates the union of documents produced by its subqueries, and that scores each document with the maximum score for that document as produced by any subquery, plus a tie breaking increment for any additional matching subqueries.

Whether or not you remember this explanation, do remember that the DisMax Query Parser was primarily designed to be easy to use and to accept almost any input without returning an error.
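The Maximum Disjunction definition above can be illustrated with a small calculation. The following Python sketch is not Solr or Lucene code: the function name dismax_score and the sample scores are hypothetical, and it assumes that the tie-breaking increment is a fraction (called tie here) of the scores of the non-maximum subqueries.

```python
from typing import Sequence


def dismax_score(subquery_scores: Sequence[float], tie: float = 0.0) -> float:
    """Combine per-subquery scores for one document, DisMax style.

    Illustrative only: takes the best subquery score and adds `tie` times
    the scores of the remaining matching subqueries.
    """
    if not subquery_scores:
        return 0.0  # no subquery matched this document
    best = max(subquery_scores)
    others = sum(subquery_scores) - best
    return best + tie * others


# Hypothetical scores of one document against three fields.
scores = [2.0, 1.0, 0.5]
print(dismax_score(scores))           # tie=0.0: only the best score counts
print(dismax_score(scores, tie=1.0))  # tie=1.0: plain sum of all scores
print(dismax_score(scores, tie=0.1))  # small tie: dominated by the best score
```

With tie set to 0 the result is the maximum alone, and with tie set to 1 it becomes the sum of all subquery scores; values in between let lower-scoring fields contribute a little without outweighing the best match.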
DisMax Query Parser Parameters

In addition to the common request parameters, highlighting parameters, and simple facet parameters, the DisMax query parser supports the parameters described below. Like the standard query parser, the DisMax query parser allows default parameter values to be specified in solrconfig.xml, or overridden by query-time values in the request.

The sections below explain these parameters in detail.

q Parameter

The q parameter defines the main "query" constituting the essence of the search. The parameter supports raw input strings provided by users with no special escaping. The + and - characters are treated as "mandatory" and "prohibited" modifiers for terms. Text wrapped in balanced quote characters (for example, "San Jose") is treated as a phrase. Any query containing an odd number of quote characters is evaluated as if there were no quote characters at all.

The q parameter does not support wildcard characters such as *.

q.alt Parameter

If specified, the q.alt parameter defines a query (which by default will be parsed using standard query parsing syntax) to use when the main q parameter is not specified or is blank. The q.alt parameter comes in handy when you need something like a query to match all documents (don’t forget &rows=0 for that one!) in order to get collection-wide faceting counts.

qf (Query Fields) Parameter

The qf parameter introduces a list of fields, each of which is assigned a boost factor to increase or decrease that particular field’s importance in the query. For example, the query below:

qf="fieldOne^2.3 fieldTwo fieldThree^0.4"

assigns fieldOne a boost of 2.3, leaves fieldTwo with the default boost (because no boost factor is specified), and assigns fieldThree a boost of 0.4.
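To make the qf syntax concrete, the following Python sketch splits a qf value into field names and boost factors. The helper parse_qf is hypothetical and is not part of Solr; it assumes that fields are separated by whitespace and that a field without an explicit ^ boost gets the default boost factor of 1.

```python
from typing import Dict


def parse_qf(qf: str, default_boost: float = 1.0) -> Dict[str, float]:
    """Map each field in a qf string to its boost factor (illustrative only)."""
    boosts: Dict[str, float] = {}
    for item in qf.split():
        # "fieldOne^2.3" -> ("fieldOne", "^", "2.3"); "fieldTwo" -> ("fieldTwo", "", "")
        field, caret, boost = item.partition("^")
        boosts[field] = float(boost) if caret else default_boost
    return boosts


print(parse_qf("fieldOne^2.3 fieldTwo fieldThree^0.4"))
# {'fieldOne': 2.3, 'fieldTwo': 1.0, 'fieldThree': 0.4}
```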
These boost factors make matches in fieldOne much more significant than matches in fieldTwo, which in turn are much more significant than matches in fieldThree.

mm (Minimum Should Match) Parameter

When processing queries, Lucene/Solr recognizes three types of clauses: mandatory, prohibited, and "optional" (also known as "should" clauses). By default, all words or phrases specified in the q parameter are treated as "optional" clauses unless they are preceded by a "+" or a "-". When dealing with these "optional" clauses, the mm parameter makes it possible to say that a certain minimum number of those clauses must match. The DisMax query parser offers great flexibility in how the minimum number can be specified.

The table below explains the various ways that mm values can be specified.

Syntax | Example | Description
Positive integer | 3 | Defines the minimum number of clauses that must match, regardless of how many clauses there are in total.
Negative integer | -2 | Sets the minimum number of matching clauses to the total number of optional clauses, minus this value.
Percentage | 75% | Sets the minimum number of matching clauses to this percentage of the total number of optional clauses. The number computed from the percentage is rounded down and used as the minimum.
Negative percentage | -25% | Indicates that this percent of the total number of optional clauses can be missing. The number computed from the percentage is rounded down, before being subtracted from the total to determine the minimum number.
An expression beginning with a positive integer followed by a > or < sign and another value | 3<90% | Defines a conditional expression indicating that if the number of optional clauses is equal to (or less than) the integer, they are all required, but if it’s greater than the integer, the specification applies. In this example: if there are 1 to 3 clauses they are all required, but for 4 or more clauses only 90% are required.
Multiple conditional expressions involving > or < signs | 2<-25% 9<-3 | Defines multiple conditions, each one being valid only for numbers greater than the one before it. In this example, if there are 1 or 2 clauses, then both are required. If there are 3 to 9 clauses, all but 25% are required. If there are more than 9 clauses, all but three are required.

When specifying mm values, keep in mind the following:

• When dealing with percentages, negative values can be used to get different behavior in edge cases. 75% and -25% mean the same thing when dealing with 4 clauses, but when dealing with 5 clauses 75% means 3 are required, but -25% means 4 are required.
• If the calculations based on the parameter arguments determine that no optional clauses are needed, the usual rules about Boolean queries still apply at search time. (That is, a Boolean query containing no required clauses must still match at least one optional clause).
• No matter what number the calculation arrives at, Solr will never use a value greater than the number of optional clauses, or a value less than 1.
• When searching across multiple fields that are configured with different query analyzers, the number of optional clauses may differ between the fields. In such a case, the value specified by mm applies to the maximum number of optional clauses. For example, if a query clause is treated as a stopword for one of the fields, the number of optional clauses for that field will be smaller than for the other fields. A query with such a stopword clause would not return a match in that field if mm is set to 100% because the removed clause does not count as matched.
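The mm rules above can be checked with a short calculation. The following Python sketch follows the rules as this section describes them; it is not Solr's implementation. The function names are hypothetical, and the sketch assumes that conditional expressions use only the < sign with no surrounding spaces and that the result is clamped to the range from 1 to the number of optional clauses.

```python
def _simple_spec(spec: str, total: int) -> int:
    """Evaluate an integer or percentage spec against `total` optional clauses."""
    if spec.endswith("%"):
        percent = int(spec[:-1])
        # Integer division rounds the computed share down, as the table describes.
        share = (total * abs(percent)) // 100
        return total - share if percent < 0 else share
    value = int(spec)
    return total + value if value < 0 else value


def min_should_match(spec: str, total: int) -> int:
    """Return how many of `total` optional clauses must match for an mm spec."""
    required = total  # at or below the first conditional bound, all are required
    for part in spec.split():
        if "<" in part:
            bound, sub_spec = part.split("<", 1)
            if total <= int(bound):
                break  # this condition and any later ones do not apply
            required = _simple_spec(sub_spec, total)
        else:
            required = _simple_spec(part, total)
    # Clamp to the range 1..total (or 0 when there are no optional clauses).
    return max(min(required, total), min(1, total))


for clauses in (2, 5, 10):
    print(clauses, min_should_match("2<-25% 9<-3", clauses))
# 2 2
# 5 4
# 10 7
```

The same function reproduces the edge case noted above: with 5 clauses, min_should_match("75%", 5) gives 3 while min_should_match("-25%", 5) gives 4.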
The default value of mm is 100% (meaning that all clauses must match).

pf (Phrase Fields) Parameter

Once the list of matching documents has been identified using the fq and qf parameters, the pf parameter can be used to "boost" the score of documents in cases where all of the terms in the q parameter appear in close proximity.

The format is the same as that used by the qf parameter: a list of fields and "boosts" to associate with each of them when making phrase queries out of the entire q parameter.

ps (Phrase Slop) Parameter

The ps parameter specifies the amount of "phrase slop" to apply to queries specified with the pf parameter. Phrase slop is the number of positions one token needs to be moved in relation to another token in order to match a phrase specified in a query.

qs (Query Phrase Slop) Parameter

The qs parameter specifies the amount of slop permitted on phrase queries explicitly included in the user’s query string with the qf parameter. As explained above, slop refers to the number of positions one token needs to be moved in relation to another token in order to match a phrase specified in a query.

The tie (Tie Breaker) Parameter

The tie parameter specifies a float value (which should be something much less than 1) to use as a tiebreaker in DisMax queries.

When a term from the user’s input is tested against multiple fields, more than one field may match. If so, each field will generate a different score based on how common that word is in that field (for each document relative to all other documents). The tie parameter lets you control how much the final score of the query will be influenced by the scores of the lower scoring fields compared to the highest scoring field.

A value of "0.0" - the default - makes the query a pure "disjunction max query": that is, only the maximum scoring subquery contributes to the final score.
A value of "1.0" makes the query a pure "disjunction sum query" where it doesn’t matter what the maximum scoring sub query is, because the final score will be the sum of the subquery scores. Typically a low value, such as 0.1, is useful.

bq (Boost Query) Parameter

The bq parameter specifies an additional, optional, query clause that will be added to the user’s main query to influence the score. For example, if you wanted to add a relevancy boost for recent documents:

q=cheese
bq=date:[NOW/DAY-1YEAR TO NOW/DAY]

You can specify multiple bq parameters. If you want your query to be parsed as separate clauses with separate boosts, use multiple bq parameters.

bf (Boost Functions) Parameter

The bf parameter specifies functions (with optional boosts) that will be used to construct FunctionQueries which will be added to the user’s main query as optional clauses that will influence the score. Any function supported natively by Solr can be used, along with a boost value. For example:

recip(rord(myfield),1,2,3)^1.5

Specifying functions with the bf parameter is essentially just shorthand for using the bq parameter combined with the {!func} parser.

For example, if you want to show the most recent documents first, you could use either of the following:

bf=recip(rord(creationDate),1,1000,1000)
...or...
bq={!func}recip(rord(creationDate),1,1000,1000)

Examples of Queries Submitted to the DisMax Query Parser

All of the sample URLs in this section assume you are running Solr’s "techproducts" example:

bin/solr -e techproducts

The following request returns results for the word "video" using the standard query parser, assuming that "df" points to a field to search:

http://localhost:8983/solr/techproducts/select?q=video&fl=name+score

The "dismax" parser is configured to search across the text, features, name, sku, id, manu, and cat fields all with varying boosts designed to ensure that "better" matches appear first, specifically: documents which match on the name and cat fields get higher scores.

http://localhost:8983/solr/techproducts/select?defType=dismax&q=video

Note that this instance is also configured with a default field list, which can be overridden in the URL.

http://localhost:8983/solr/techproducts/select?defType=dismax&q=video&fl=*,score

You can also override which fields are searched on and how much boost each field gets.

http://localhost:8983/solr/techproducts/select?defType=dismax&q=video&qf=features^20.0+text^0.3

You can boost results that have a field that matches a specific value.

http://localhost:8983/solr/techproducts/select?defType=dismax&q=video&bq=cat:electronics^5.0

Another request handler is registered at "/instock" and has slightly different configuration options, notably: a filter for (you guessed it) inStock:true.

http://localhost:8983/solr/techproducts/select?defType=dismax&q=video&fl=name,score,inStock
http://localhost:8983/solr/techproducts/instock?defType=dismax&q=video&fl=name,score,inStock

One of the other really cool features in this parser is robust support for specifying the "BooleanQuery.minimumNumberShouldMatch" you want to be used based on how many terms are in your user’s query.
This allows flexibility for typos and partial matches. For the dismax parser, one and two word queries require that all of the optional clauses match, but for three to five word queries one missing word is allowed.

http://localhost:8983/solr/techproducts/select?defType=dismax&q=belkin+ipod
http://localhost:8983/solr/techproducts/select?defType=dismax&q=belkin+ipod+gibberish
http://localhost:8983/solr/techproducts/select?defType=dismax&q=belkin+ipod+apple

Use the debugQuery option to see the parsed query, and the score explanations for each document.

http://localhost:8983/solr/techproducts/select?defType=dismax&q=belkin+ipod+gibberish&debugQuery=true
http://localhost:8983/solr/techproducts/select?defType=dismax&q=video+card&debugQuery=true

The Extended DisMax (eDismax) Query Parser

The Extended DisMax (eDisMax) query parser is an improved version of the DisMax query parser. In addition to supporting all the DisMax query parser parameters, Extended DisMax:

• supports Solr’s standard query parser syntax such as (non-exhaustive list):
◦ boolean operators such as AND (+, &&), OR (||), NOT (-).
◦ optionally treats lowercase "and" and "or" as "AND" and "OR" in Lucene syntax mode
◦ optionally allows embedded queries using other query parsers or functions
• includes improved smart partial escaping in the case of syntax errors; fielded queries, +/-, and phrase queries are still supported in this mode.
• improves proximity boosting by using word shingles; you do not need the query to match all words in the document before proximity boosting is applied.
• includes advanced stopword handling: stopwords are not required in the mandatory part of the query but are still used in the proximity boosting part. If a query consists of all stopwords, such as "to be or not to be", then all words are required.
• includes improved boost function: in Extended DisMax, the boost function is a multiplier rather than an addend, improving your boost results; the additive boost functions of DisMax (bf and bq) are also supported.
• supports pure negative nested queries: queries such as +foo (-foo) will match all documents.
• lets you specify which fields the end user is allowed to query, and to disallow direct fielded searches.

Extended DisMax Parameters

In addition to all the DisMax parameters, Extended DisMax includes these query parameters:

sow
Split on whitespace. If set to true, text analysis is invoked separately for each individual whitespace-separated term. The default is false; whitespace-separated term sequences will be provided to text analysis in one shot, enabling proper function of analysis filters that operate over term sequences, e.g., multi-word synonyms and shingles.

mm.autoRelax
If true, the number of clauses required (minimum should match) will automatically be relaxed if a clause is removed (by e.g., stopwords filter) from some but not all qf fields. Use this parameter as a workaround if queries return zero hits due to uneven stopword removal between the qf fields.
Note that relaxing mm may cause undesired side effects, such as hurting the precision of the search, depending on the nature of your index content.

boost
A multivalued list of strings parsed as queries with scores multiplied by the score from the main query for all matching documents. This parameter is shorthand for wrapping the query produced by eDisMax using the BoostQParserPlugin.

lowercaseOperators
A Boolean parameter indicating if lowercase "and" and "or" should be treated the same as operators "AND" and "OR". Defaults to false.

ps
Phrase Slop.
The default amount of slop - distance between terms - on phrase queries built with pf, pf2 and/or pf3 fields (affects boosting). See also the section Using 'Slop' below.

pf2
A multivalued list of fields with optional weights. Similar to pf, but based on pairs of word shingles.

ps2
This is similar to ps but overrides the slop factor used for pf2. If not specified, ps is used.

pf3
A multivalued list of fields with optional weights, based on triplets of word shingles. Similar to pf, except that instead of building a phrase per field out of all the words in the input, it builds a set of phrases for each field out of each triplet of word shingles.

ps3
This is similar to ps but overrides the slop factor used for pf3. If not specified, ps is used.

stopwords
A Boolean parameter indicating if the StopFilterFactory configured in the query analyzer should be respected when parsing the query. If this is set to false, then the StopFilterFactory in the query analyzer is ignored.

uf
Specifies which schema fields the end user is allowed to explicitly query and to toggle whether embedded Solr queries are supported. This parameter supports wildcards. Multiple fields must be separated by a space.
The default is to allow all fields and no embedded Solr queries, equivalent to uf=* -_query_.
• To allow only the title field, use uf=title.
• To allow title and all fields ending with '_s', use uf=title *_s.
• To allow all fields except title, use uf=* -title.
• To disallow all fielded searches, use uf=-*.
• To allow embedded Solr queries (e.g. _query_:"…" or _val_:"…" or {!lucene …}), you must expressly enable this by referring to the magic field _query_ in uf.

Field Aliasing using Per-Field qf Overrides

Per-field overrides of the qf parameter may be specified to provide 1-to-many aliasing from field names specified in the query string, to field names used in the underlying query.
By default, no aliasing is used and field names specified in the query string are treated as literal field names in the index.

Examples of eDismax Queries

All of the sample URLs in this section assume you are running Solr’s “techproducts” example:

bin/solr -e techproducts

Boost the result of the query term "hello" based on the document’s popularity:

http://localhost:8983/solr/techproducts/select?defType=edismax&q=hello&pf=text&qf=text&boost=popularity

Search for iPods OR video:

http://localhost:8983/solr/techproducts/select?defType=edismax&q=ipod+OR+video

Search across multiple fields, specifying (via boosts) how important each field is relative to each other:

http://localhost:8983/solr/techproducts/select?q=video&defType=edismax&qf=features^20.0+text^0.3

You can boost results that have a field that matches a specific value:

http://localhost:8983/solr/techproducts/select?q=video&defType=edismax&qf=features^20.0+text^0.3&bq=cat:electronics^5.0

Using the "mm" param with a value of 2, at least two optional clauses must match: 1 and 2 word queries require that all of the optional clauses match, while a three word query allows one missing clause:

http://localhost:8983/solr/techproducts/select?q=belkin+ipod&defType=edismax&mm=2
http://localhost:8983/solr/techproducts/select?q=belkin+ipod+gibberish&defType=edismax&mm=2
http://localhost:8983/solr/techproducts/select?q=belkin+ipod+apple&defType=edismax&mm=2

In the example below, we see a per-field override of the qf parameter being used to alias "name" in the query string to either the “last_name” or “first_name” field:

defType=edismax
q=sysadmin name:Mike
qf=title text last_name first_name
f.name.qf=last_name first_name

Using Negative Boost

Negative query boosts have been supported at the "Query" object level for a long time (resulting in negative scores for matching documents).
Now the QueryParsers have been updated to handle this too.

Using 'Slop'

Dismax and Edismax can run queries against all query fields, and also run a query in the form of a phrase against the phrase fields. (This will work only for boosting documents, not actually for matching.) However, that phrase query can have a 'slop,' which is the distance between the terms of the query while still considering it a phrase match. For example:

q=foo bar
qf=field1^5 field2^10
pf=field1^50 field2^20
defType=dismax

With these parameters, the Dismax Query Parser generates a query that looks something like this:

(+(field1:foo^5 OR field2:foo^10) AND (field1:bar^5 OR field2:bar^10))

But it also generates another query that will only be used for boosting results:

field1:"foo bar"^50 OR field2:"foo bar"^20

Thus, any document that has the terms "foo" and "bar" will match; however, if some of those documents have both of the terms as a phrase, they will score much higher because they are more relevant.

If you add the parameter ps (phrase slop), the second query will instead be:

ps=10
field1:"foo bar"~10^50 OR field2:"foo bar"~10^20

This means that if the terms "foo" and "bar" appear in the document with fewer than 10 terms between each other, the phrase will match. For example the doc that says:

*Foo* term1 term2 term3 *bar*

will match the phrase query.

How does one use phrase slop? Usually it is configured in the request handler (in solrconfig).

With query slop (qs) the concept is similar, but it applies to explicit phrase queries from the user. For example, if you want to search for a name, you could enter:

q="Hans Anderson"

A document that contains "Hans Anderson" will match, but a document that contains the middle name "Christian" or where the name is written with the last name first ("Anderson, Hans") won’t.
For those cases one could configure the qs parameter, so that even if the user searches for an explicit phrase query, a slop is applied.

Finally, in addition to the phrase fields (pf) parameter, edismax also supports the pf2 and pf3 parameters, for fields over which to create bigram and trigram phrase queries. The phrase slop for these parameters' queries can be specified using the ps2 and ps3 parameters, respectively. If you use pf2/pf3 but not ps2/ps3, then the phrase slop for these parameters' queries will be taken from the ps parameter, if any.

Function Queries

Function queries enable you to generate a relevancy score using the actual value of one or more numeric fields. Function queries are supported by the DisMax, Extended DisMax, and standard query parsers.

Function queries use functions. The functions can be a constant (numeric or string literal), a field, another function or a parameter substitution argument. You can use these functions to modify the ranking of results for users. These could be used to change the ranking of results based on a user’s location, or some other calculation.

Using Function Query

Functions must be expressed as function calls (for example, sum(a,b) instead of simply a+b).

There are several ways of using function queries in a Solr query:

• Via an explicit query parser that expects function arguments, such as func or frange. For example:
q={!func}div(popularity,price)&fq={!frange l=1000}customer_ratings
• In a Sort expression. For example:
sort=div(popularity,price) desc, score desc
• Add the results of functions as pseudo-fields to documents in query results. For instance, for:
&fl=sum(x, y),id,a,b,c,score&wt=xml
the output would be:
...
<str name="id">foo</str>
<float name="sum(x,y)">40</float>
<float name="score">0.343</float>
...
• Use in a parameter that is explicitly for specifying functions, such as the eDisMax query parser’s boost param, or DisMax query parser’s bf (boost function) parameter. (Note that the bf parameter actually takes a list of function queries separated by white space and each with an optional boost. Make sure you eliminate any internal white space in single function queries when using bf). For example:
q=dismax&bf="ord(popularity)^0.5 recip(rord(price),1,1000,1000)^0.3"
• Introduce a function query inline in the Lucene query parser with the _val_ keyword. For example:
q=_val_:mynumericfield _val_:"recip(rord(myfield),1,2,3)"
Only functions with fast random access are recommended.

Available Functions
The entries below summarize the functions available for function queries.

abs Function
Returns the absolute value of the specified value or function.
Syntax Examples
• abs(x)
• abs(-5)

childfield(field) Function
Returns the value of the given field for one of the matched child docs when searching by {!parent}. It can be used only in the sort parameter.
Syntax Examples
• sort=childfield(name) asc implies $q as a second argument and therefore it assumes q={!parent ..}..;
• sort=childfield(field,$bjq) asc refers to a separate parameter bjq={!parent ..}..;
• sort=childfield(field,{!parent of=…}…) desc allows inlining the block join parent query

concat Function
Concatenates the given string fields, literals and other functions.
Syntax Example
• concat(name," ",$param,def(opt,"-"))

"constant" Function
Specifies a floating point constant.
Syntax Example
• 1.5

def Function
def is short for default. Returns the value of field "field", or if the field does not exist, returns the default value specified. Yields the first value where exists()==true.
Syntax Examples
• def(rating,5): This def() function returns the rating, or if no rating is specified in the doc, returns 5
• def(myfield, 1.0): equivalent to if(exists(myfield),myfield,1.0)

div Function
Divides one value or function by another. div(x,y) divides x by y.
Syntax Examples
• div(1,y)
• div(sum(x,100),max(y,1))

dist Function
Returns the distance between two vectors (points) in an n-dimensional space. Takes in the power, plus two or more ValueSource instances and calculates the distances between the two vectors. Each ValueSource must be a number.
There must be an even number of ValueSource instances passed in and the method assumes that the first half represent the first vector and the second half represent the second vector.
Syntax Examples
• dist(2, x, y, 0, 0): calculates the Euclidean distance between (0,0) and (x,y) for each document.
• dist(1, x, y, 0, 0): calculates the Manhattan (taxicab) distance between (0,0) and (x,y) for each document.
• dist(2, x,y,z,0,0,0): Euclidean distance between (0,0,0) and (x,y,z) for each document.
• dist(1,x,y,z,e,f,g): Manhattan distance between (x,y,z) and (e,f,g) where each letter is a field name.

docfreq(field,val) Function
Returns the number of documents that contain the term in the field. This is a constant (the same value for all documents in the index).
You can quote the term if it’s more complex, or do parameter substitution for the term value.
Syntax Examples
• docfreq(text,'solr')
• …&defType=func&q=docfreq(text,$myterm)&myterm=solr

field Function
Returns the numeric docValues or indexed value of the field with the specified name.
In its simplest (single argument) form, this function can only be used on single valued fields, and can be called using the name of the field as a string, or for most conventional field names simply use the field name by itself without using the field(…) syntax.
When using docValues, an optional 2nd argument can be specified to select the min or max value of multivalued fields.
0 is returned for documents without a value in the field.
Syntax Examples
These 3 examples are all equivalent:
• myFloatFieldName
• field(myFloatFieldName)
• field("myFloatFieldName")
The last form is convenient when your field name is atypical:
• field("my complex float fieldName")
For multivalued docValues fields:
• field(myMultiValuedFloatField,min)
• field(myMultiValuedFloatField,max)

hsin Function
The Haversine distance calculates the distance between two points on a sphere when traveling along the sphere. The values must be in radians; hsin also takes a Boolean argument that specifies whether the function should first convert its input values from degrees to radians.
Syntax Example
• hsin(2, true, x, y, 0, 0)

idf Function
Inverse document frequency; a measure of whether the term is common or rare across all documents. Obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient. See also tf.
Syntax Example
• idf(fieldName,'solr'): measures the inverse of the frequency of the occurrence of the term 'solr' in fieldName.

if Function
Enables conditional function queries. In if(test,value1,value2):
• test is or refers to a logical value or expression that returns a logical value (TRUE or FALSE).
• value1 is the value that is returned by the function if test yields TRUE.
• value2 is the value that is returned by the function if test yields FALSE.
An expression can be any function which outputs boolean values, or even functions returning numeric values, in which case value 0 will be interpreted as false, or strings, in which case empty string is interpreted as false.
Syntax Example
• if(termfreq(cat,'electronics'),popularity,42): This function checks each document to see if it contains the term "electronics" in the cat field. If it does, then the value of the popularity field is returned, otherwise the value of 42 is returned.

linear Function
Implements m*x+c where m and c are constants and x is an arbitrary function. This is equivalent to sum(product(m,x),c), but slightly more efficient as it is implemented as a single function.
Syntax Examples
• linear(x,m,c)
• linear(x,2,4): returns 2*x+4

log Function
Returns the log base 10 of the specified function.
Syntax Examples
• log(x)
• log(sum(x,100))

map Function
Maps any values of an input function x that fall within min and max inclusive to the specified target. The arguments min and max must be constants. The arguments target and default can be constants or functions.
If the value of x does not fall between min and max, then either the value of x is returned, or a default value is returned if specified as a 5th argument.
Syntax Examples
• map(x,min,max,target)
◦ map(x,0,0,1): Changes any values of 0 to 1. This can be useful in handling default 0 values.
• map(x,min,max,target,default)
◦ map(x,0,100,1,-1): Changes any values between 0 and 100 to 1, and all other values to -1.
◦ map(x,0,100,sum(x,599),docfreq(text,solr)): Changes any values between 0 and 100 to x+599, and all other values to the document frequency of the term 'solr' in the field text.

max Function
Returns the maximum numeric value of multiple nested functions or constants, which are specified as arguments: max(x,y,…).
The max function can also be useful for "bottoming out" another function or field at some specified constant.
Use the field(myfield,max) syntax for selecting the maximum value of a single multivalued field.
Syntax Example
• max(myfield,myotherfield,0)

maxdoc Function
Returns the number of documents in the index, including those that are marked as deleted but have not yet been purged. This is a constant (the same value for all documents in the index).
Syntax Example
• maxdoc()

min Function
Returns the minimum numeric value of multiple nested functions or constants, which are specified as arguments: min(x,y,…). The min function can also be useful for providing an "upper bound" on a function using a constant.
Use the field(myfield,min) syntax for selecting the minimum value of a single multivalued field.
Syntax Example
• min(myfield,myotherfield,0)

ms Function
Returns milliseconds of difference between its arguments. Dates are relative to the Unix or POSIX time epoch, midnight, January 1, 1970 UTC.
Arguments may be the name of a DatePointField, TrieDateField, or date math based on a constant date or NOW.
• ms(): Equivalent to ms(NOW), number of milliseconds since the epoch.
• ms(a): Returns the number of milliseconds since the epoch that the argument represents.
• ms(a,b): Returns the number of milliseconds that b occurs before a (that is, a - b)
Syntax Examples
• ms(NOW/DAY)
• ms(2000-01-01T00:00:00Z)
• ms(mydatefield)
• ms(NOW,mydatefield)
• ms(mydatefield, 2000-01-01T00:00:00Z)
• ms(datefield1, datefield2)

norm(field) Function
Returns the "norm" stored in the index for the specified field. This is the product of the index time boost and the length normalization factor, according to the Similarity for the field.
Syntax Example
• norm(fieldName)

numdocs Function
Returns the number of documents in the index, not including those that are marked as deleted but have not yet been purged. This is a constant (the same value for all documents in the index).
Syntax Example
• numdocs()

ord Function
Returns the ordinal of the indexed field value within the indexed list of terms for that field in Lucene index order (lexicographically ordered by Unicode value), starting at 1.
In other words, for a given field, all values are ordered lexicographically; this function then returns the offset of a particular value in that ordering. The field must have a maximum of one value per document (not multivalued). 0 is returned for documents without a value in the field.
Note: ord() depends on the position in an index and can change when other documents are inserted or deleted.
See also rord below.
Syntax Example
• ord(myIndexedField)
• If there were only three values ("apple","banana","pear") for a particular field X, then ord(X) would be 1 for documents containing "apple", 2 for documents containing "banana", etc.

payload Function
Returns the float value computed from the decoded payloads of the term specified.
The return value is computed using the min, max, or average of the decoded payloads. A special first function can be used instead of the others, to short-circuit term enumeration and return only the decoded payload of the first term.
The field specified must have float or integer payload encoding capability (via DelimitedPayloadTokenFilter or NumericPayloadTokenFilter). If no payload is found for the term, the default value is returned.
• payload(field_name,term): default value is 0.0, average function is used.
• payload(field_name,term,default_value): default value can be a constant, field name, or another float returning function. average function is used.
• payload(field_name,term,default_value,function): function values can be min, max, average, or first.
Syntax Example
• payload(payloaded_field_dpf,term,0.0,first)

pow Function
Raises the specified base to the specified power. pow(x,y) raises x to the power of y.
Syntax Examples
• pow(x,y)
• pow(x,log(y))
• pow(x,0.5): the same as sqrt

product Function
Returns the product of multiple values or functions, which are specified in a comma-separated list. mul(…) may also be used as an alias for this function.
Syntax Examples
• product(x,y,…)
• product(x,2)
• mul(x,y)

query Function
Returns the score for the given subquery, or the default value for documents not matching the query. Any type of subquery is supported through either parameter de-referencing $otherparam or direct specification of the query string in the Local Parameters through the v key.
Syntax Examples
• query(subquery, default)
• q=product(popularity,query({!dismax v='solr rocks'})): returns the product of the popularity and the score of the DisMax query.
• q=product(popularity,query($qq))&qq={!dismax}solr rocks: equivalent to the previous query, using parameter de-referencing.
• q=product(popularity,query($qq,0.1))&qq={!dismax}solr rocks: specifies a default score of 0.1 for documents that don’t match the DisMax query.

recip Function
Performs a reciprocal function with recip(x,m,a,b) implementing a/(m*x+b) where m,a,b are constants, and x is any arbitrarily complex function.
When a and b are equal, and x>=0, this function has a maximum value of 1 that drops as x increases. Increasing the value of a and b together results in a movement of the entire function to a flatter part of the curve. These properties can make this an ideal function for boosting more recent documents when x is rord(datefield).
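The behavior of the recip formula can be checked numerically. The following is a minimal Python sketch, not Solr code: the helper recip is a plain local function written for illustration, and the sample x values are assumptions chosen to show the curve when a and b are equal.

```python
def recip(x, m, a, b):
    """Plain-Python model of the formula a / (m*x + b) described above."""
    return a / (m * x + b)

# With a == b and x >= 0, the value is 1 at x = 0 and falls as x grows.
values = [recip(x, 1, 1000, 1000) for x in (0, 1000, 9000)]
print(values)  # [1.0, 0.5, 0.1]
```

With a = b = 1000 the value is still 0.5 at x = 1000, which illustrates why raising a and b together flattens the curve.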
Syntax Examples
• recip(myfield,m,a,b)
• recip(rord(creationDate),1,1000,1000)

rord Function
Returns the reverse ordering of that returned by ord.
Syntax Example
• rord(myDateField)

scale Function
Scales values of the function x such that they fall between the specified minTarget and maxTarget inclusive. The current implementation traverses all of the function values to obtain the min and max, so it can pick the correct scale.
The current implementation cannot distinguish when documents have been deleted or documents that have no value. It uses 0.0 values for these cases. This means that if values are normally all greater than 0.0, one can still end up with 0.0 as the min value to map from. In these cases, an appropriate map() function could be used as a workaround to change 0.0 to a value in the real range, as shown here:
scale(map(x,0,0,5),1,2)
Syntax Examples
• scale(x, minTarget, maxTarget)
• scale(x,1,2): scales the values of x such that all values will be between 1 and 2 inclusive.

sqedist Function
The Square Euclidean distance calculates the 2-norm (Euclidean distance) but does not take the square root, thus saving a fairly expensive operation. It is often the case that applications that care about Euclidean distance do not need the actual distance, but instead can use the square of the distance. There must be an even number of ValueSource instances passed in and the method assumes that the first half represent the first vector and the second half represent the second vector.
Syntax Example
• sqedist(x_td, y_td, 0, 0)

sqrt Function
Returns the square root of the specified value or function.
Syntax Examples
• sqrt(x)
• sqrt(100)
• sqrt(sum(x,100))

strdist Function
Calculate the distance between two strings.
Uses the Lucene spell checker StringDistance interface and supports all of the implementations available in that package, plus allows applications to plug in their own via Solr’s resource loading capabilities. strdist takes (string1, string2, distance measure).
Possible values for distance measure are:
• jw: Jaro-Winkler
• edit: Levenshtein or Edit distance
• ngram: The NGramDistance, if specified, can optionally pass in the ngram size too. Default is 2.
• FQN: Fully Qualified class Name for an implementation of the StringDistance interface. Must have a no-arg constructor.
Syntax Example
• strdist("SOLR",id,edit)

sub Function
Returns x-y from sub(x,y).
Syntax Examples
• sub(myfield,myfield2)
• sub(100, sqrt(myfield))

sum Function
Returns the sum of multiple values or functions, which are specified in a comma-separated list. add(…) may be used as an alias for this function.
Syntax Examples
• sum(x,y,…)
• sum(x,1)
• sum(sqrt(x),log(y),z,0.5)
• add(x,y)

sumtotaltermfreq Function
Returns the sum of totaltermfreq values for all terms in the field in the entire index (i.e., the number of indexed tokens for that field). (sttf is an alias for sumtotaltermfreq.)
Syntax Example
If doc1:(fieldX:A B C) and doc2:(fieldX:A A A A):
• docFreq(fieldX:A) = 2 (A appears in 2 docs)
• freq(doc2, fieldX:A) = 4 (A appears 4 times in doc2)
• totalTermFreq(fieldX:A) = 5 (A appears 5 times across all docs)
• sumTotalTermFreq(fieldX) = 7 (in fieldX, there are 5 As, 1 B, 1 C)

termfreq Function
Returns the number of times the term appears in the field for that document.
Syntax Example
• termfreq(text,'memory')

tf Function
Term frequency; returns the term frequency factor for the given term, using the Similarity for the field.
The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word across the collection of documents, which helps to control for the fact that some words are generally more common than others. See also idf.
Syntax Examples
• tf(text,'solr')

top Function
Causes the function query argument to derive its values from the top-level IndexReader containing all parts of an index. For example, the ordinal of a value in a single segment will be different from the ordinal of that same value in the complete index.
The ord() and rord() functions implicitly use top(), and hence ord(foo) is equivalent to top(ord(foo)).

totaltermfreq Function
Returns the number of times the term appears in the field in the entire index. (ttf is an alias for totaltermfreq.)
Syntax Example
• ttf(text,'memory')

Boolean Functions
The following functions are boolean: they return true or false. They are mostly useful as the first argument of the if function, and some of these can be combined. If used somewhere else, they will yield a '1' or '0'.

and Function
Returns a value of true if and only if all of its operands evaluate to true.
Syntax Example
• and(not(exists(popularity)),exists(price)): returns true for any document which has a value in the price field, but does not have a value in the popularity field.

or Function
A logical disjunction.
Syntax Example
• or(value1,value2): true if either value1 or value2 is true.

xor Function
Logical exclusive disjunction, or one or the other but not both.
Syntax Example
• xor(field1,field2) returns true if either field1 or field2 is true; false if both are true.

not Function
The logically negated value of the wrapped function.
Syntax Example
• not(exists(author)): true only when exists(author) is false.

exists Function
Returns true if any member of the field exists.
Syntax Example
• exists(author): returns true for any document that has a value in the "author" field.
• exists(query(price:5.00)): returns true if "price" matches "5.00".

Comparison Functions
gt, gte, lt, lte, eq
5 comparison functions: Greater Than, Greater Than or Equal, Less Than, Less Than or Equal, Equal
Syntax Example
• if(lt(ms(mydatefield),315569259747),0.8,1) translates to this pseudocode: if mydatefield < 315569259747 then 0.8 else 1

Example Function Queries
To give you a better understanding of how function queries can be used in Solr, suppose an index stores the dimensions in meters x,y,z of some hypothetical boxes with arbitrary names stored in field boxname. Suppose we want to search for a box matching the name findbox but ranked according to the volumes of the boxes. The query parameters would be:
q=boxname:findbox _val_:"product(x,y,z)"
This query will rank the results based on volumes. In order to get the computed volume, you will need to request the score, which will contain the resultant volume:
&fl=*, score
Suppose that you also have a field storing the weight of the box as weight. To sort by the density of the box and return the value of the density in score, you would submit the following query:
http://localhost:8983/solr/collection_name/select?q=boxname:findbox _val_:"div(weight,product(x,y,z))"&fl=boxname x y z weight score

Sort By Function
You can sort your query results by the output of a function. For example, to sort results by distance, you could enter:
http://localhost:8983/solr/collection_name/select?q=*:*&sort=dist(2, point1, point2) desc
Sort by function also supports pseudo-fields: fields can be generated dynamically and return results as though they were normal fields in the index.
For example,
&fl=id,sum(x, y),score&wt=xml
would return:
foo 40 0.343

Local Parameters in Queries
Local parameters are arguments in a Solr request that are specific to a query parameter. Local parameters provide a way to add meta-data to certain argument types such as query strings. (In Solr documentation, local parameters are sometimes referred to as LocalParams.)
Local parameters are specified as prefixes to arguments. Take the following query argument, for example:
q=solr rocks
We can prefix this query string with local parameters to provide more information to the Standard Query Parser. For example, we can change the default operator type to "AND" and the default field to "title":
q={!q.op=AND df=title}solr rocks
These local parameters would change the query to require a match on both "solr" and "rocks" while searching the "title" field by default.

Basic Syntax of Local Parameters
To specify a local parameter, insert the following before the argument to be modified:
• Begin with {!
• Insert any number of key=value pairs separated by white space
• End with } and immediately follow with the query argument
You may specify only one local parameters prefix per argument. Values in the key-value pairs may be quoted via single or double quotes, and backslash escaping works within quoted strings.

Query Type Short Form
If a local parameter value appears without a name, it is given the implicit name of "type". This allows short-form representation for the type of query parser to use when parsing a query string. Thus
q={!dismax qf=myfield}solr rocks
is equivalent to:
q={!type=dismax qf=myfield}solr rocks
If no "type" is specified (either explicitly or implicitly) then the lucene parser is used by default.
Thus
fq={!df=summary}solr rocks
is equivalent to:
fq={!type=lucene df=summary}solr rocks

Specifying the Parameter Value with the v Key
A special key of v within local parameters is an alternate way to specify the value of that parameter.
q={!dismax qf=myfield}solr rocks
is equivalent to
q={!type=dismax qf=myfield v='solr rocks'}

Parameter Dereferencing
Parameter dereferencing, or indirection, lets you use the value of another argument rather than specifying it directly. This can be used to simplify queries, decouple user input from query parameters, or decouple front-end GUI parameters from defaults set in solrconfig.xml.
q={!dismax qf=myfield}solr rocks
is equivalent to:
q={!type=dismax qf=myfield v=$qq}&qq=solr rocks

Other Parsers
In addition to the main query parsers discussed earlier, there are several other query parsers that can be used instead of or in conjunction with the main parsers for specific purposes. This section details the other parsers, and gives examples for how they might be used.
Many of these parsers are expressed the same way as Local Parameters in Queries.

Block Join Query Parsers
There are two query parsers that support block joins. These parsers allow indexing and searching for relational content that has been indexed as nested documents.
The example usage of the query parsers below assumes these two documents and each of their child documents have been indexed:
• Parent document 1: title "Solr has block join support", content_type "parentDocument"
◦ Child document 2: comments "SolrCloud supports it too!"
• Parent document 3: title "New Lucene and Solr release", content_type "parentDocument"
◦ Child document 4: comments "Lots of new features"

Block Join Children Query Parser
This parser takes a query that matches some parent documents and returns their children.
The syntax for this parser is: q={!child of=<allParents>}<someParents>.
The parameter allParents is a filter that matches only parent documents; here you would define the field and value that you used to identify all parent documents.
The parameter someParents identifies a query that will match some of the parent documents. The output is the children.
Using the example documents above, we can construct a query such as q={!child of="content_type:parentDocument"}title:lucene&wt=xml. We only get one document in response:
4 Lots of new features
The query for someParents should match only parent documents passed by allParents or you may get an exception:
Parent query must not match any docs besides parent filter. Combine them as must (+) and must-not (-) clauses to find a problem doc.
You can search for q=+(someParents) -(allParents) to find a cause if you encounter this error.

Filtering and Tagging
{!child} also supports filters and excludeTags local parameters like the following:
{!child of=<allParents> filters=$parentfq excludeTags=certain}<someParents>&parentfq=BRAND:Foo&parentfq=NAME:Bar&parentfq={!tag=certain}CATEGORY:Baz
This is equivalent to:
{!child of=<allParents>}+<someParents> +BRAND:Foo +NAME:Bar
Notice the "$" syntax in filters for referencing queries. Comma-separated tags in excludeTags allow excluding certain queries by tagging. Overall the idea is similar to excluding fq in facets. Note that filtering is applied to the subordinate clause (<someParents>), and the intersection result is joined to the children.

All Children Syntax
When the subordinate clause (<someParents>) is omitted, it’s parsed as a segmented and cached filter for children documents. More precisely, q={!child of=<allParents>} is equivalent to q=*:* -<allParents>.

Block Join Parent Query Parser
This parser takes a query that matches child documents and returns their parents.
The syntax for this parser is similar: q={!parent which=<allParents>}<someChildren>.
The parameter allParents is a filter that matches only parent documents; here you would define the field and value that you used to identify all parent documents.
The parameter someChildren is a query that matches some or all of the child documents.
The query for someChildren should match only child documents or you may get an exception:
Child query must not match same docs with parent filter. Combine them as must clauses (+) to find a problem doc.
You can search for q=+(parentFilter) +(someChildren) to find a cause.
Again using the example documents above, we can construct a query such as q={!parent which="content_type:parentDocument"}comments:SolrCloud&wt=xml. We get this document in response:
1 Solr has block join support parentDocument

Using which
A common mistake is to try to filter parents with a which filter, as in this bad example:
q={!parent which="title:join"}comments:SolrCloud
Instead, you should use a sibling mandatory clause as a filter:
q= +title:join +{!parent which="content_type:parentDocument"}comments:SolrCloud

Filtering and Tagging
The {!parent} query supports filters and excludeTags local parameters like the following:
{!parent which=<allParents> filters=$childfq excludeTags=certain}<someChildren>&
childfq=COLOR:Red&
childfq=SIZE:XL&
childfq={!tag=certain}PRINT:Hatched
This is equivalent to:
{!parent which=<allParents>}+<someChildren> +COLOR:Red +SIZE:XL
Notice the "$" syntax in filters for referencing queries. Comma-separated tags in excludeTags allow excluding certain queries by tagging. Overall the idea is similar to excluding fq in facets. Note that filtering is applied to the subordinate clause (<someChildren>) first, and the intersection result is joined to the parents.

Scoring with the Block Join Parent Query Parser
You can optionally use the score local parameter to return scores of the subordinate query.
The values to use for this parameter define the type of aggregation: avg (average), max (maximum), min (minimum), or total (sum). The implicit default is none, which returns 0.0.

All Parents Syntax
When the subordinate clause (<someChildren>) is omitted, it’s parsed as a segmented and cached filter for all parent documents, or more precisely q={!parent which=<allParents>} is equivalent to q=<allParents>.

Boolean Query Parser
The BoolQParser creates a Lucene BooleanQuery which is a boolean combination of other queries. Subqueries along with their typed occurrences indicate how documents will be matched and scored.
Parameters
must
A list of queries that must appear in matching documents and contribute to the score.
must_not
A list of queries that must not appear in matching documents.
should
A list of queries that should appear in matching documents. For a BooleanQuery with no must queries, one or more should queries must match a document for the BooleanQuery to match.
filter
A list of queries that must appear in matching documents. However, unlike must, the score of filter queries is ignored.
Examples
{!bool must=foo must=bar}
{!bool filter=foo should=bar}

Boost Query Parser
BoostQParser extends the QParserPlugin and creates a boosted query from the input value. The main value is the query to be boosted. Parameter b is the function query to use as the boost. The query to be boosted may be of any type.
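The boost parser multiplies the score of the main query by the value of the function given in b. The following Python sketch models that multiplication for b=log(popularity), using the base-10 logarithm that the log function is documented to return. It is an illustration only, not Solr code; the document ids, base scores and popularity values are made up.

```python
import math

def boosted_score(base_score, popularity):
    """Model of {!boost b=log(popularity)}: base score times log10(popularity)."""
    return base_score * math.log10(popularity)

# Hypothetical (id, base score, popularity) triples.
docs = [("a", 2.0, 10), ("b", 1.5, 1000)]
ranked = sorted(docs, key=lambda d: boosted_score(d[1], d[2]), reverse=True)
print([doc_id for doc_id, _, _ in ranked])  # ['b', 'a']
```

Document b has the lower base score but ranks first after boosting, because its popularity contributes a larger multiplier.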
Boost Query Parser Examples
Creates a query "foo" which is boosted (scores are multiplied) by the function query log(popularity):
{!boost b=log(popularity)}foo
Creates a query "foo" which is boosted by the date boosting function referenced in ReciprocalFloatFunction:
{!boost b=recip(ms(NOW,mydatefield),3.16e-11,1,1)}foo

Collapsing Query Parser
The CollapsingQParser is really a post filter that provides more performant field collapsing than Solr’s standard approach when the number of distinct groups in the result set is high.
This parser collapses the result set to a single document per group before it forwards the result set to the rest of the search components. So all downstream components (faceting, highlighting, etc.) will work with the collapsed result set.
Details about using the CollapsingQParser can be found in the section Collapse and Expand Results.

Complex Phrase Query Parser
The ComplexPhraseQParser provides support for wildcards, ORs, etc., inside phrase queries using Lucene’s ComplexPhraseQueryParser.
Under the covers, this query parser makes use of the Span group of queries, e.g., spanNear, spanOr, etc., and is subject to the same limitations as that family of parsers.
Parameters
inOrder
Set to true to force phrase queries to match terms in the order specified. The default is true.
df
The default search field.
Examples
{!complexphrase inOrder=true}name:"Jo* Smith"
{!complexphrase inOrder=false}name:"(john jon jonathan~) peters*"
A mix of ordered and unordered complex phrase queries:
+_query_:"{!complexphrase inOrder=true}manu:\"a* c*\"" +_query_:"{!complexphrase inOrder=false df=name}\"bla* pla*\""

Complex Phrase Parser Limitations
Performance is sensitive to the number of unique terms that are associated with a pattern.
For instance, searching for "a*" will form a large OR clause (technically a SpanOr with many terms) for all of the terms in your index for the indicated field that start with the single letter 'a'. It may be prudent to restrict wildcards to at least two or preferably three letters as a prefix. Allowing very short prefixes may result in too many low-quality documents being returned.
Notice that it also supports leading wildcards such as "*a", with consequent performance implications. Applying ReversedWildcardFilterFactory in index-time analysis is usually a good idea.

MaxBooleanClauses with Complex Phrase Parser
You may need to increase MaxBooleanClauses in solrconfig.xml as a result of the term expansion above:
<maxBooleanClauses>4096</maxBooleanClauses>
This property is described in more detail in the section Query Sizing and Warming.

Stopwords with Complex Phrase Parser
It is recommended not to use stopword elimination with this query parser.
Let’s say we add the terms the, up, and to to stopwords.txt for your collection, and index a document containing the text "Stores up to 15,000 songs, 25,000 photos, or 150 hours of video" in a field named "features".
The query below does not use this parser:
q=features:"Stores up to 15,000"
and the document is returned.
The next query does use the Complex Phrase Query Parser:
q=features:"sto* up to 15*"&defType=complexphrase
It does not return that document because SpanNearQuery has no good way to handle stopwords in a way analogous to PhraseQuery.
If you must remove stopwords for your use case, use a custom filter factory or perhaps a customized synonyms filter that reduces given stopwords to some impossible token.
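The term expansion described under Complex Phrase Parser Limitations and MaxBooleanClauses above can be pictured with a small Python model. It is a sketch only: the term set, the helper expand_prefix and the very low clause limit are all hypothetical, chosen so that the effect of prefix length is visible. It does not call Solr or Lucene.

```python
def expand_prefix(terms, prefix):
    """Return the indexed terms that a trailing-wildcard pattern would expand to."""
    return sorted(t for t in terms if t.startswith(prefix))

# Hypothetical terms for one field; a real index would hold far more.
indexed_terms = {"a", "apple", "apply", "apt", "banana", "band"}
CLAUSE_LIMIT = 3  # illustrative only, far below any realistic setting

for prefix in ("a", "ap", "app"):
    expanded = expand_prefix(indexed_terms, prefix)
    status = "ok" if len(expanded) <= CLAUSE_LIMIT else "too many clauses"
    print(prefix + "*", len(expanded), status)
```

In this model the one-letter prefix expands to four terms and exceeds the limit, while the two- and three-letter prefixes stay within it, which is the reasoning behind the advice to require longer prefixes.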
Escaping with Complex Phrase Parser

Special care has to be given when escaping: clauses between double quotes (usually the whole query) are parsed twice, so these parts have to be escaped twice, e.g., "foo\\: bar\\^".

Field Query Parser

The FieldQParser extends the QParserPlugin and creates a field query from the input value, applying text analysis and constructing a phrase query if appropriate. The parameter f is the field to be queried.

Example:

{!field f=myfield}Foo Bar

This example creates a phrase query with "foo" followed by "bar" (assuming myfield is a text field with an analyzer that splits on whitespace and lowercases terms). This is generally equivalent to the Lucene query parser expression myfield:"Foo Bar".

Filters Query Parser

The syntax is:

q={!filters param=$fqs excludeTags=sample}field:text&
fqs=COLOR:Red&
fqs=SIZE:XL&
fqs={!tag=sample}BRAND:Foo

which is equivalent to:

q=+field:text +COLOR:Red +SIZE:XL

The param local parameter uses "$" syntax to refer to several queries, and excludeTags may omit some of them.

Function Query Parser

The FunctionQParser extends the QParserPlugin and creates a function query from the input value. This is only one way to use function queries in Solr; for another, more integrated, approach, see the section on Function Queries.

Example:

{!func}log(foo)

Function Range Query Parser

The FunctionRangeQParser extends the QParserPlugin and creates a range query over a function. This is also referred to as frange, as seen in the examples below.

Parameters

l
The lower bound. This parameter is optional.

u
The upper bound. This parameter is optional.

incl
Include the lower bound. This parameter is optional. The default is true.

incu
Include the upper bound. This parameter is optional. The default is true.
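The four frange parameters above amount to an interval check with optional, inclusive-by-default bounds. The sketch below models that reading in plain Python. The function name in_frange is hypothetical and this is not how Solr implements the parser; it only illustrates how l, u, incl, and incu interact.

```python
def in_frange(value, l=None, u=None, incl=True, incu=True):
    """Return True if `value` falls inside the range described by the frange parameters.

    l / u are the optional lower and upper bounds; incl / incu control whether
    each bound itself is included (both default to True, as documented above).
    """
    if l is not None:
        if value < l or (value == l and not incl):
            return False
    if u is not None:
        if value > u or (value == u and not incu):
            return False
    return True


if __name__ == "__main__":
    # Mirrors {!frange l=0 u=2.2}: both bounds are included by default.
    print(in_frange(2.2, l=0, u=2.2))
    # Same range with incu=false: the upper bound itself is excluded.
    print(in_frange(2.2, l=0, u=2.2, incu=False))
```

Omitting both l and u leaves the range unbounded, so every value passes the check.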
Examples

{!frange l=1000 u=50000}myfield

fq={!frange l=0 u=2.2} sum(user_ranking,editor_ranking)

Both of these examples restrict the results by a range of values found in a declared field or a function query. In the second example, we’re doing a sum calculation, and then specifying that only values between 0 and 2.2 should be returned to the user.

For more information about range queries over functions, see Yonik Seeley’s introductory blog post Ranges over Functions in Solr 1.4.

Graph Query Parser

The graph query parser does a breadth first, cyclic aware, graph traversal of all documents that are "reachable" from a starting set of root documents identified by a wrapped query.

The graph is built according to linkages between documents based on the terms found in from and to fields that you specify as part of the query.

Supported field types are point fields with docValues enabled, or string fields with indexed=true or docValues=true.

Note: For string fields which are indexed=false and docValues=true, please refer to the javadocs for DocValuesTermsQuery for its performance characteristics; indexed=true will perform better for most use-cases.

Graph Query Parameters

to
The field name of matching documents to inspect to identify outgoing edges for graph traversal. Defaults to edge_ids.

from
The field name of candidate documents to inspect to identify incoming graph edges. Defaults to node_id.

traversalFilter
An optional query that can be supplied to limit the scope of documents that are traversed.

maxDepth
Integer specifying how deep the breadth first search of the graph should go beginning with the initial query. Defaults to -1 (unlimited).

returnRoot
Boolean to indicate if the documents that matched the original query (to define the starting points for graph) should be included in the final results. Defaults to true.
returnOnlyLeaf
Boolean that indicates if the results of the query should be filtered so that only documents with no outgoing edges are returned. Defaults to false.

useAutn
Boolean that indicates if Automatons should be compiled for each iteration of the breadth first search, which may be faster for some graphs. Defaults to false.

Graph Query Limitations

The graph parser only works in single node Solr installations, or with SolrCloud collections that use exactly 1 shard.

Graph Query Examples

To understand how the graph parser works, consider the following Directed Cyclic Graph, containing 8 nodes (A to H) and 9 edges (1 to 9):

One way to model this graph as Solr documents would be to create one document per node, with multivalued fields identifying the incoming and outgoing edges for each node:

curl -H 'Content-Type: application/json' 'http://localhost:8983/solr/my_graph/update?commit=true' --data-binary '[
  {"id":"A","foo": 7, "out_edge":["1","9"], "in_edge":["4","2"] },
  {"id":"B","foo": 12, "out_edge":["3","6"], "in_edge":["1"] },
  {"id":"C","foo": 10, "out_edge":["5","2"], "in_edge":["9"] },
  {"id":"D","foo": 20, "out_edge":["4","7"], "in_edge":["3","5"] },
  {"id":"E","foo": 17, "out_edge":[], "in_edge":["6"] },
  {"id":"F","foo": 11, "out_edge":[], "in_edge":["7"] },
  {"id":"G","foo": 7, "out_edge":["8"], "in_edge":[] },
  {"id":"H","foo": 10, "out_edge":[], "in_edge":["8"] }
]'

With the model shown above, the following query demonstrates a simple traversal of all nodes reachable from node A:

http://localhost:8983/solr/my_graph/query?fl=id&q={!graph+from=in_edge+to=out_edge}id:A

"response":{"numFound":6,"start":0,"docs":[
   { "id":"A" },
   { "id":"B" },
   { "id":"C" },
   { "id":"D" },
   { "id":"E" },
   { "id":"F" } ]
}

We can also
use the traversalFilter to limit the graph traversal to only nodes with a maximum value of 15 in the foo field. In this case that means D, E, and F are excluded (F has a value of foo=11, but it is unreachable because the traversal skipped D):

http://localhost:8983/solr/my_graph/query?fl=id&q={!graph+from=in_edge+to=out_edge+traversalFilter='foo:[*+TO+15]'}id:A

...
"response":{"numFound":3,"start":0,"docs":[
   { "id":"A" },
   { "id":"B" },
   { "id":"C" } ]
}

The examples shown so far have all used a query for a single document ("id:A") as the root node for the graph traversal, but any query can be used to identify multiple documents to use as root nodes. The next example demonstrates using the maxDepth parameter to find all nodes that are at most one edge away from a root node with a value in the foo field less than or equal to 10:

http://localhost:8983/solr/my_graph/query?fl=id&q={!graph+from=in_edge+to=out_edge+maxDepth=1}foo:[*+TO+10]

...
"response":{"numFound":6,"start":0,"docs":[
   { "id":"A" },
   { "id":"B" },
   { "id":"C" },
   { "id":"D" },
   { "id":"G" },
   { "id":"H" } ]
}

Simplified Models

The Document & Field modeling used in the above examples enumerated all of the outgoing and incoming edges for each node explicitly, to help demonstrate exactly how the "from" and "to" params work, and to give you an idea of what is possible. With multiple sets of fields like these for identifying incoming and outgoing edges, it’s possible to model many independent Directed Graphs that contain some or all of the documents in your collection.

But in many cases it can also be possible to drastically simplify the model used.
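Before moving to a simpler model, the traversal semantics behind the three example queries above can be checked offline with a short sketch. This is plain Python, not Solr code. The graph_query function and its argument names are hypothetical, and it assumes that the traversal filter is applied only to documents reached during traversal, not to the root documents. The document data is copied from the in_edge/out_edge example.

```python
# Documents from the my_graph example: edge ids in out_edge / in_edge.
DOCS = {
    "A": {"foo": 7,  "out_edge": ["1", "9"], "in_edge": ["4", "2"]},
    "B": {"foo": 12, "out_edge": ["3", "6"], "in_edge": ["1"]},
    "C": {"foo": 10, "out_edge": ["5", "2"], "in_edge": ["9"]},
    "D": {"foo": 20, "out_edge": ["4", "7"], "in_edge": ["3", "5"]},
    "E": {"foo": 17, "out_edge": [],         "in_edge": ["6"]},
    "F": {"foo": 11, "out_edge": [],         "in_edge": ["7"]},
    "G": {"foo": 7,  "out_edge": ["8"],      "in_edge": []},
    "H": {"foo": 10, "out_edge": [],         "in_edge": ["8"]},
}


def graph_query(docs, roots, from_field, to_field,
                traversal_filter=None, max_depth=-1):
    """Breadth-first, cycle-aware traversal modelled on the from/to description above."""
    visited = set(roots)
    frontier = set(roots)
    depth = 0
    while frontier and (max_depth < 0 or depth < max_depth):
        # Values found in the `to` field of the current frontier (outgoing edges).
        edge_values = {v for doc_id in frontier for v in docs[doc_id][to_field]}
        next_frontier = set()
        for doc_id, doc in docs.items():
            if doc_id in visited:
                continue  # cycle-aware: never revisit a document
            if traversal_filter is not None and not traversal_filter(doc):
                continue  # filtered-out documents are not traversed
            if edge_values.intersection(doc[from_field]):
                next_frontier.add(doc_id)
        visited |= next_frontier
        frontier = next_frontier
        depth += 1
    return sorted(visited)


if __name__ == "__main__":
    print(graph_query(DOCS, ["A"], "in_edge", "out_edge"))
    print(graph_query(DOCS, ["A"], "in_edge", "out_edge",
                      traversal_filter=lambda d: d["foo"] <= 15))
    roots = [doc_id for doc_id, d in DOCS.items() if d["foo"] <= 10]
    print(graph_query(DOCS, roots, "in_edge", "out_edge", max_depth=1))
```

Under these assumptions the three calls reproduce the document sets listed in the example responses, including the case where F is never reached because D was filtered out.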
For example, the same graph shown in the diagram above can be modelled by Solr Documents that represent each node and know only the ids of the nodes they link to, without knowing anything about the incoming links:

curl -H 'Content-Type: application/json' 'http://localhost:8983/solr/alt_graph/update?commit=true' --data-binary '[
  {"id":"A","foo": 7, "out_edge":["B","C"] },
  {"id":"B","foo": 12, "out_edge":["E","D"] },
  {"id":"C","foo": 10, "out_edge":["A","D"] },
  {"id":"D","foo": 20, "out_edge":["A","F"] },
  {"id":"E","foo": 17, "out_edge":[] },
  {"id":"F","foo": 11, "out_edge":[] },
  {"id":"G","foo": 7, "out_edge":["H"] },
  {"id":"H","foo": 10, "out_edge":[] }
]'

With this alternative document model, all of the same queries demonstrated above can still be executed, simply by changing the “from” parameter to replace the “in_edge” field with the “id” field:

http://localhost:8983/solr/alt_graph/query?fl=id&q={!graph+from=id+to=out_edge+maxDepth=1}foo:[*+TO+10]

...
"response":{"numFound":6,"start":0,"docs":[
   { "id":"A" },
   { "id":"B" },
   { "id":"C" },
   { "id":"D" },
   { "id":"G" },
   { "id":"H" } ]
}

Join Query Parser

JoinQParser extends the QParserPlugin. It allows normalizing relationships between documents with a join operation. This is different from the concept of a join in a relational database because no information is being truly joined. An appropriate SQL analogy would be an "inner query".
Examples:

Find all products containing the word "ipod", join them against manufacturer docs and return the list of manufacturers:

{!join from=manu_id_s to=id}ipod

Find all manufacturer docs named "belkin", join them against product docs, and filter the list to only products with a price less than $12:

q = {!join from=id to=manu_id_s}compName_s:Belkin
fq = price:[* TO 12]

The join operation is done on a term basis, so the "from" and "to" fields must use compatible field types. For example, joining between a StrField and an IntPointField will not work; likewise, joining between a StrField and a TextField that uses LowerCaseFilterFactory will only work for values that are already lower cased in the string field.

Join Parser Scoring

You can optionally use the score parameter to return scores of the subordinate query. The values to use for this parameter define the type of aggregation, which are avg (average), max (maximum), min (minimum), total, or none.

Score parameter and single value numerics

Specifying the score local parameter switches the join algorithm. This might have performance implications on large indices, but more importantly, this algorithm won’t work for single-valued numeric fields starting from 7.0. Users are encouraged to change field types to string and rebuild indexes during migration.

Joining Across Collections

You can also specify a fromIndex parameter to join with a field from another core or collection. If running in SolrCloud mode, then the collection specified in the fromIndex parameter must have a single shard and a replica on all Solr nodes where the collection you’re joining to has a replica.

Let’s consider an example where you want to use a Solr join query to filter movies by directors that have won an Oscar.
Specifically, imagine we have two collections with the following fields:

movies: id, title, director_id, …
movie_directors: id, name, has_oscar, …

To filter movies by directors that have won an Oscar using a Solr join on the movie_directors collection, you can send the following filter query to the movies collection:

fq={!join from=id fromIndex=movie_directors to=director_id}has_oscar:true

Notice that the query criteria of the filter (has_oscar:true) is based on a field in the collection specified using fromIndex. Keep in mind that you cannot return fields from the fromIndex collection using join queries; you can only use the fields for filtering results in the "to" collection (movies).

Next, let’s understand how these collections need to be deployed in your cluster. Imagine the movies collection is deployed to a four node SolrCloud cluster and has two shards with a replication factor of two. Specifically, the movies collection has replicas on the following four nodes:

node 1: movies_shard1_replica1
node 2: movies_shard1_replica2
node 3: movies_shard2_replica1
node 4: movies_shard2_replica2

To use the movie_directors collection in Solr join queries with the movies collection, it needs to have a replica on each of the four nodes. In other words, movie_directors must have one shard and replication factor of four:

node 1: movie_directors_shard1_replica1
node 2: movie_directors_shard1_replica2
node 3: movie_directors_shard1_replica3
node 4: movie_directors_shard1_replica4

At query time, the JoinQParser will access the local replica of the movie_directors collection to perform the join. If a local replica is not available or active, then the query will fail. At this point, it should be clear that since you’re limited to a single shard and the data must be replicated across all nodes where it is needed, this approach works better with smaller data sets where there is a one-to-many relationship between the from collection and the to collection.
Moreover, if you add a replica to the to collection, then you also need to add a replica for the from collection.

For more information about join queries, see the Solr Wiki page on Joins. Erick Erickson has also written a blog post about join performance titled Solr and Joins.

Lucene Query Parser

The LuceneQParser extends the QParserPlugin by parsing Solr’s variant on the Lucene QueryParser syntax. This is effectively the same query parser that is used in Lucene. It uses the parameters q.op, the default operator ("OR" or "AND"), and df, the default field name.

Example:

{!lucene q.op=AND df=text}myfield:foo +bar -baz

For more information about the syntax for the Lucene Query Parser, see the Classic QueryParser javadocs.

Learning To Rank Query Parser

The LTRQParserPlugin is a special purpose parser for reranking the top results of a simple query using a more complex ranking query which is based on a machine learnt model.

Example:

{!ltr model=myModel reRankDocs=100}

Details about using the LTRQParserPlugin can be found in the Learning To Rank section.

Max Score Query Parser

The MaxScoreQParser extends the LuceneQParser but returns the Max score from the clauses. It does this by wrapping all SHOULD clauses in a DisjunctionMaxQuery with tie=1.0. Any MUST or PROHIBITED clauses are passed through as-is. Non-boolean queries, e.g., NumericRange, fall through to the LuceneQParser behavior.

Example:

{!maxscore tie=0.01}C OR (D AND E)

More Like This Query Parser

MLTQParser enables retrieving documents that are similar to a given document. It uses Lucene’s existing MoreLikeThis logic and also works in SolrCloud mode. The document identifier used here is the unique id value and not the Lucene internal document id. The list of returned documents excludes the queried document.
This query parser takes the following parameters:

qf
Specifies the fields to use for similarity.

mintf
Specifies the Minimum Term Frequency, the frequency below which terms will be ignored in the source document.

mindf
Specifies the Minimum Document Frequency, the frequency at which words will be ignored when they do not occur in at least this many documents.

maxdf
Specifies the Maximum Document Frequency, the frequency at which words will be ignored when they occur in more than this many documents.

minwl
Sets the minimum word length below which words will be ignored.

maxwl
Sets the maximum word length above which words will be ignored.

maxqt
Sets the maximum number of query terms that will be included in any generated query.

maxntp
Sets the maximum number of tokens to parse in each example document field that is not stored with TermVector support.

boost
Specifies if the query will be boosted by the interesting term relevance. It can be either "true" or "false".

Examples

Find documents like the document with id=1, using the name field for similarity.

{!mlt qf=name}1

Adding more constraints to what qualifies as similar using mintf and mindf.

{!mlt qf=name mintf=2 mindf=3}1

Nested Query Parser

The NestedParser extends the QParserPlugin and creates a nested query, with the ability for that query to redefine its type via local parameters. This is useful in specifying defaults in configuration and letting clients indirectly reference them.

Example:

{!query defType=func v=$q1}

If the q1 parameter is price, then the query would be a function query on the price field. If the q1 parameter
These parameters would be defined in solrconfig.xml, in the defaults section: {!lucene}inStock:true For more information about the possibilities of nested queries, see Yonik Seeley’s blog post Nested Queries in Solr. Payload Query Parsers These query parsers utilize payloads encoded on terms during indexing. The main query, for both of these parsers, is parsed straightforwardly from the field type’s query analysis into a SpanQuery. The generated SpanQuery will be either a SpanTermQuery or an ordered, zero slop SpanNearQuery, depending on how many tokens are emitted. Payloads can be encoded on terms using either the DelimitedPayloadTokenFilter or the NumericPayloadTokenFilter. The payload using parsers are: • PayloadScoreQParser • PayloadCheckQParser Payload Score Parser PayloadScoreQParser incorporates each matching term’s numeric (integer or float) payloads into the scores. This parser accepts the following parameters: f The field to use. This parameter is required. func The payload function. The options are: min, max, average, or sum. This parameter is required. operator A search operator. The options are or and phrase, which is the default. This defines if the search query should be an OR query or a phrase query. includeSpanScore If true, multiples the computed payload factor by the score of the original query. If false, the default, the computed payload factor is the score. Examples {!payload_score f=my_field_dpf v=some_term func=max} © 2018, Apache Software Foundation Guide Version 7.3 - Published: 2018-03-27 Page 476 of 1195 Apache Solr Reference Guide 7.3 {!payload_score f=payload_field func=sum operator=or}A B C Payload Check Parser PayloadCheckQParser only matches when the matching terms also have the specified payloads. This parser accepts the following parameters: f The field to use (required). 
payloads
A space-separated list of payloads that must match the query terms (required).

Each specified payload will be encoded using the encoder determined from the field type and encoded accordingly for matching.

DelimitedPayloadTokenFilter 'identity' encoded payloads also work here, as well as float and integer encoded ones.

Example

{!payload_check f=words_dps payloads="VERB NOUN"}searching stuff

Prefix Query Parser

PrefixQParser extends the QParserPlugin by creating a prefix query from the input value. Currently no analysis or value transformation is done to create this prefix query.

The parameter is f, the field. The string after the prefix declaration is treated as a wildcard query.

Example:

{!prefix f=myfield}foo

This would be generally equivalent to the Lucene query parser expression myfield:foo*.

Raw Query Parser

RawQParser extends the QParserPlugin by creating a term query from the input value without any text analysis or transformation. This is useful in debugging, or when raw terms are returned from the terms component (this is not the default).

The only parameter is f, which defines the field to search.

Example:

{!raw f=myfield}Foo Bar

This example constructs the query: TermQuery(Term("myfield","Foo Bar")).

For easy filter construction to drill down in faceting, the TermQParserPlugin is recommended.

For full analysis on all fields, including text fields, you may want to use the FieldQParserPlugin.

Re-Ranking Query Parser

The ReRankQParserPlugin is a special purpose parser for Re-Ranking the top results of a simple query using a more complex ranking query.

Details about using the ReRankQParserPlugin can be found in the Query Re-Ranking section.

Simple Query Parser

The Simple query parser in Solr is based on Lucene’s SimpleQueryParser.
This query parser is designed to allow users to enter queries however they want, and it will do its best to interpret the query and return results.

This parser takes the following parameters:

q.operators
Comma-separated list of names of parsing operators to enable. By default, all operations are enabled, and this parameter can be used to effectively disable specific operators as needed, by excluding them from the list. Passing an empty string with this parameter disables all operators.

AND
  Operator: +
  Description: Specifies AND
  Example query: token1+token2

OR
  Operator: |
  Description: Specifies OR
  Example query: token1|token2

NOT
  Operator: -
  Description: Specifies NOT
  Example query: -token3

PREFIX
  Operator: *
  Description: Specifies a prefix query
  Example query: term*

PHRASE
  Operator: "
  Description: Creates a phrase
  Example query: "term1 term2"

PRECEDENCE
  Operator: ( )
  Description: Specifies precedence; tokens inside the parenthesis will be analyzed first. Otherwise, normal order is left to right.
  Example query: token1 + (token2 | token3)

ESCAPE
  Operator: \
  Description: Put it in front of operators to match them literally
  Example query: C\\

WHITESPACE
  Operator: space or [\r\t\n]
  Description: Delimits tokens on whitespace. If not enabled, whitespace splitting will not be performed prior to analysis, which is usually the most desirable behavior. Not splitting whitespace is a unique feature of this parser that enables multi-word synonyms to work. However, it probably actually won’t unless synonyms are configured to normalize instead of expand to all that match a given synonym. Such a configuration requires normalizing synonyms at both index time and query time. Solr’s analysis screen can help here.
  Example query: term1 term2

FUZZY
  Operator: ~ or ~N
  Description: At the end of terms, specifies a fuzzy query. "N" is optional and may be either "1" or "2" (the default)
  Example query: term~1

NEAR
  Operator: ~N
  Description: At the end of phrases, specifies a NEAR query
  Example query: "term1 term2"~5

q.op
Defines the default operator to use if none is defined by the user. Allowed values are AND and OR. OR is used if none is specified.
qf
A list of query fields and boosts to use when building the query.

df
Defines the default field if none is defined in the Schema, or overrides the default field if it is already defined.

Any errors in syntax are ignored and the query parser will interpret queries as best it can. However, this can lead to odd results in some cases.

Spatial Query Parsers

There are two spatial QParsers in Solr: geofilt and bbox. But there are other ways to query spatially: using the frange parser with a distance function, using the standard (lucene) query parser with the range syntax to pick the corners of a rectangle, or with RPT and BBoxField you can use the standard query parser but use a special syntax within quotes that allows you to pick the spatial predicate.

All these options are documented further in the section Spatial Search.

Surround Query Parser

The SurroundQParser enables the Surround query syntax, which provides proximity search functionality. There are two positional operators: w creates an ordered span query and n creates an unordered one. Both operators take a numeric value to indicate distance between two terms. The default is 1, and the maximum is 99.

Note that the query string is not analyzed in any way.

Example:

{!surround} 3w(foo, bar)

This example finds documents where the terms "foo" and "bar" are no more than 3 terms away from each other (i.e., no more than 2 terms between them).

This query parser will also accept boolean operators (AND, OR, and NOT, in either upper- or lowercase), wildcards, quoting for phrase searches, and boosting. The w and n operators can also be expressed in upper- or lowercase.

The non-unary operators (everything but NOT) support both infix (a AND b AND c) and prefix AND(a, b, c) notation.

Switch Query Parser

SwitchQParser is a QParserPlugin that acts like a "switch" or "case" statement.
The primary input string is trimmed and then prefixed with "case." for use as a key to look up a "switch case" in the parser’s local params. If a matching local param is found, the resulting param value will then be parsed as a subquery, and returned as the parse result.

The case local param can optionally be specified as a switch case to match missing (or blank) input strings. The default local param can optionally be specified as a default case to use if the input string does not match any other switch case local params. If default is not specified, then any input which does not match a switch case local param will result in a syntax error.

In the examples below, the result of each query is "XXX":

{!switch case.foo=XXX case.bar=zzz case.yak=qqq}foo

The extra whitespace between } and bar is trimmed automatically.

{!switch case.foo=qqq case.bar=XXX case.yak=zzz} bar

The result will fall back to the default.

{!switch case.foo=qqq case.bar=zzz default=XXX}asdf

No input uses the value for case instead.

{!switch case=XXX case.bar=zzz case.yak=qqq}

A practical usage of this QParserPlugin is in specifying appends fq params in the configuration of a SearchHandler, to provide a fixed set of filter options for clients using custom parameter names.

Using the example configuration below, clients can optionally specify the custom parameters in_stock and shipping to override the default filtering behavior, but are limited to the specific set of legal values (shipping=any|free, in_stock=yes|no|all).

<lst name="defaults">
  <str name="in_stock">yes</str>
  <str name="shipping">any</str>
</lst>
<lst name="appends">
  <str name="fq">{!switch case.all='*:*'
                          case.yes='inStock:true'
                          case.no='inStock:false'
                          v=$in_stock}</str>
  <str name="fq">{!switch case.any='*:*'
                          case.free='shipping_cost:0.0'
                          v=$shipping}</str>
</lst>

Term Query Parser

TermQParser extends the QParserPlugin by creating a single term query from the input value equivalent to readableToIndexed().
This is useful for generating filter queries from the external human readable terms returned by the faceting or terms components. The only parameter is f, for the field.

Example:

{!term f=weight}1.5

For text fields, no analysis is done since raw terms are already returned from the faceting and terms components. To apply analysis to text fields as well, see the Field Query Parser, above.

If no analysis or transformation is desired for any type of field, see the Raw Query Parser, above.

Terms Query Parser

TermsQParser functions similarly to the Term Query Parser but takes in multiple values separated by commas and returns documents matching any of the specified values.

This can be useful for generating filter queries from the external human readable terms returned by the faceting or terms components, and may be more efficient in some cases than using the Standard Query Parser to generate a boolean query since the default implementation method avoids scoring.

This query parser takes the following parameters:

f
The field on which to search. This parameter is required.

separator
Separator to use when parsing the input. If set to " " (a single blank space), will trim additional white space from the input terms. Defaults to a comma (,).

method
The internal query-building implementation: termsFilter, booleanQuery, automaton, or docValuesTermsFilter. Defaults to termsFilter.

Examples

{!terms f=tags}software,apache,solr,lucene

{!terms f=categoryId method=booleanQuery separator=" "}8 6 7 5309

XML Query Parser

The XmlQParserPlugin extends the QParserPlugin and supports the creation of queries from XML.

Example:

Parameter    Value
defType      xmlparser
q            shirt plain cotton S M L

The XmlQParser implementation uses the SolrCoreParser class which extends Lucene’s CoreParser class.
XML elements are mapped to QueryBuilder classes as follows:

XML element    QueryBuilder class
               BooleanQueryBuilder
               BoostingTermBuilder
               ConstantScoreQueryBuilder
               DisjunctionMaxQueryBuilder
               MatchAllDocsQueryBuilder
               RangeQueryBuilder
               SpanFirstBuilder
               SpanNearBuilder
               SpanNotBuilder
               SpanOrBuilder
               SpanOrTermsBuilder
               SpanTermBuilder
               TermQueryBuilder
               TermsQueryBuilder
               UserInputQueryBuilder
               LegacyNumericRangeQuery(Builder) is deprecated

Customizing XML Query Parser

You can configure your own custom query builders for additional XML elements. The custom builders need to extend the SolrQueryBuilder or the SolrSpanQueryBuilder class.

Example solrconfig.xml snippet:

com.mycompany.solr.search.MyCustomQueryBuilder

JSON Request API

The JSON Request API allows a JSON body to be passed for the entire search request. The JSON Facet API is part of the JSON Request API, and allows specification of faceted analytics in JSON.

Here’s an example of a search request using query parameters only:

curl "http://localhost:8983/solr/techproducts/query?q=memory&fq=inStock:true"

The same request when passed as JSON in the body:

curl http://localhost:8983/solr/techproducts/query -d '
{
  "query" : "memory",
  "filter" : "inStock:true"
}'

Passing JSON via Request Parameter

It may sometimes be more convenient to pass the JSON body as a request parameter rather than in the actual body of the HTTP request. Solr treats a json parameter the same as a JSON body.

curl http://localhost:8983/solr/techproducts/query -d 'json={"query":"memory"}'

Smart Merging of Multiple JSON Parameters

Multiple json parameters in a single request are merged before being interpreted.

• Single-valued elements are overwritten by the last value.
• Multi-valued elements like fields and filter are appended.
• Parameters of the form json.<path>=<json_value> are merged in the appropriate place in the hierarchy. For example a json.facet parameter is the same as facet within the JSON body.
• A JSON body, or straight json parameters are always parsed first, meaning that other request parameters come after, and overwrite single valued elements.

Smart merging gives the best of both worlds…the structure of JSON with the ability to selectively separate out / decompose parts of the request!

Simple Example

curl 'http://localhost:8983/solr/techproducts/query?json.limit=5&json.filter="cat:electronics"' -d '
{
  query: "memory",
  limit: 10,
  filter: "inStock:true"
}'

Is equivalent to:

curl http://localhost:8983/solr/techproducts/query -d '
{
  query: "memory",
  limit: 5, // this single-valued parameter was overwritten.
  filter: ["inStock:true","cat:electronics"] // this multi-valued parameter was appended to.
}'

Facet Example

In fact, you don’t even need to start with a JSON body for smart merging to be very useful. Consider the following request composed entirely of request params:

curl http://localhost:8983/solr/techproducts/query -d 'q=*:*&rows=1&
json.facet.avg_price="avg(price)"&
json.facet.top_cats={type:terms,field:"cat",limit:5}'

That is equivalent to having the following JSON body or json parameter:

{
  "facet": {
    "avg_price": "avg(price)",
    "top_cats": {
      "type": "terms",
      "field": "cat",
      "limit": 5
    }
  }
}

See the JSON Facet API for more on faceting and analytics commands specified in JSON.

Debugging

If you want to see what your merged/parsed JSON looks like, you can turn on debugging (debug=true), and it will come back under the "json" key along with the other debugging information.
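The smart-merging rules described in this section can be modelled with a small sketch, which may help when predicting what the merged request will look like before turning on debug=true. This is a simplified illustration in Python, not Solr's implementation. The function names are hypothetical, only dotted json.<path> parameters are handled, and it assumes that filter and fields are the only multi-valued keys.

```python
import copy

# Assumption for this sketch: only these keys are treated as multi-valued.
MULTI_VALUED = {"filter", "fields"}


def _as_list(value):
    """Wrap a single value in a list; copy an existing list."""
    return list(value) if isinstance(value, list) else [value]


def _merge_value(target, key, value):
    """Merge one value into `target` following the rules listed above."""
    if key in MULTI_VALUED and key in target:
        # Multi-valued elements are appended.
        target[key] = _as_list(target[key]) + _as_list(value)
    elif isinstance(value, dict) and isinstance(target.get(key), dict):
        # Nested objects are merged key by key.
        for sub_key, sub_value in value.items():
            _merge_value(target[key], sub_key, sub_value)
    else:
        # Single-valued elements are overwritten by the last value.
        target[key] = copy.deepcopy(value)


def smart_merge(body, path_params):
    """Merge (path, value) pairs such as ("json.limit", 5) into a copy of `body`.

    The body is processed first; the path parameters are applied afterwards,
    in the order given.
    """
    merged = copy.deepcopy(body)
    for path, value in path_params:
        parts = path.split(".")
        if len(parts) < 2 or parts[0] != "json":
            raise ValueError("expected a parameter of the form json.<path>: " + path)
        node = merged
        for part in parts[1:-1]:
            node = node.setdefault(part, {})
        _merge_value(node, parts[-1], value)
    return merged


if __name__ == "__main__":
    body = {"query": "memory", "limit": 10, "filter": "inStock:true"}
    params = [("json.limit", 5), ("json.filter", "cat:electronics")]
    print(smart_merge(body, params))
```

Running the sketch on the Simple Example above gives a limit of 5 and a two-element filter list, matching the merged request shown there.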
Passing Parameters via JSON

We can also pass normal query request parameters in the JSON body within the params block:

curl "http://localhost:8983/solr/techproducts/query?fl=name,price" -d '
{
  params: {
    q: "memory",
    rows: 1
  }
}'

Which is equivalent to:

curl "http://localhost:8983/solr/techproducts/query?fl=name,price&q=memory&rows=1"

Parameters Mapping

Right now only some standard query parameters have JSON equivalents. Unmapped parameters can be passed through request parameters or the params block as shown above.

Table 1. Standard query parameters to JSON field

Query parameters     JSON field equivalent
q                    query
fq                   filter
start                offset
rows                 limit
sort                 sort
json.facet           facet
json.<param_name>    <param_name>

Error Detection

Because we didn’t pollute the root body of the JSON request with the normal Solr request parameters (they are all contained in the params block), we now have the ability to validate requests and return an error for unknown JSON keys.

curl http://localhost:8983/solr/techproducts/query -d '
{
  query : "memory",
  fulter : "inStock:true" // oops, we misspelled "filter"
}'

And we get an error back containing the error string:

"Unknown top-level key in JSON request : fulter"

Parameter Substitution / Macro Expansion

Request templating via parameter substitution works fully with JSON request bodies or parameters as well. For example:

curl "http://localhost:8983/solr/techproducts/query?FIELD=text&TERM=memory&HOWMANY=10" -d '
{
  query:"${FIELD}:${TERM}",
  limit:${HOWMANY}
}'

JSON Query DSL

The JSON Query DSL provides a simple yet powerful query language for the JSON Request API.

Structure of JSON Query DSL

A JSON query can be:

• A valid query string for the default defType (the standard query parser in most cases), as in, title:solr.
• A valid local parameters query string, as in, {!dismax qf=myfield}solr rocks.
• A JSON object with query parser name and arguments. The special key v in local parameters is replaced by key query in JSON object query, as in this example:

{
  "query-parser-name" : {
    "param1": "value1",
    "param2": "value2",
    "query": "a-json-query",
    "another-param": "another-json-query"
  }
}

Basic Examples

The four requests below are equivalent for searching for solr lucene in a field named content:

1. Passing all parameters on URI, with "lucene" as the default query parser.

curl -XGET "http://localhost:8983/solr/books/query?q=content:(solr lucene)"

2. Using the JSON Query DSL with a valid query string for the default defType, with "lucene" as the default query parser.

curl -XGET http://localhost:8983/solr/books/query -d '
{"query": "content:(solr lucene)"}'

3. Using the JSON Query DSL with a valid local parameters query defining the "lucene" query parser. The value of v is wrapped in escaped double quotes so that it does not end the shell's single-quoted string.

curl -XGET http://localhost:8983/solr/books/query -d '
{"query": "{!lucene df=content v=\"solr lucene\"}"}'

4. Using the JSON Query DSL in a verbose way, as a valid JSON object with parser name and arguments.

curl -XGET http://localhost:8983/solr/books/query -d '
{"query": {"lucene": {
    "df": "content",
    "query": "solr lucene"
  }
}}'

Note that the JSON query in the examples above is provided under the key query of the JSON Request API.

Nested Queries

Some query parsers accept a query as an argument. The JSON Query DSL makes it easier to write and read such complex queries.

The three requests below are equivalent for wrapping the above example query (searching for solr lucene in field content) with a boost query:

1. Passing all parameters on URI.

http://localhost:8983/solr/books/query?q={!boost b=log(popularity) v='{!lucene df=content}(lucene solr)'}

2. Converted into JSON Query DSL with use of local parameters.
As you can see, the special key v is replaced by key query.

curl -XGET http://localhost:8983/solr/books/query -d '
{
  "query" : {
    "boost": {
      "query": "{!lucene df=content}(lucene solr)",
      "b": "log(popularity)"
    }
  }
}'

3. Using a verbose JSON Query DSL without local parameters.

curl -XGET http://localhost:8983/solr/books/query -d '
{
  "query": {
    "boost": {
      "query": {
        "lucene": {
          "df": "content",
          "query": "solr lucene"
        }
      },
      "b": "log(popularity)"
    }
  }
}'

Compound Queries

With the support of the BoolQParser, the JSON Query DSL can create a very powerful nested query.

This query searches for books where content contains lucene or solr, title contains solr, and ranking is larger than 3.0:

curl -XGET http://localhost:8983/solr/books/query -d '
{
  "query": {
    "bool": {
      "must": [
        "title:solr",
        {"lucene": {"df": "content", "query": "lucene solr"}}
      ],
      "must_not": [
        {"frange": {"u": "3.0", "query": "ranking"}}
      ]
    }
  }
}'

If lucene is the default query parser, the above can be rewritten in a much less verbose way:

curl -XGET http://localhost:8983/solr/books/query -d '
{
  "query": {
    "bool": {
      "must": [
        "title:solr",
        "content:(lucene solr)"
      ],
      "must_not": "{!frange u=3.0}ranking"
    }
  }
}'

Use JSON Query DSL in JSON Request API

JSON Query DSL is not only supported with the key query but also with the key filter of the JSON Request API.
For example, the above query can be rewritten using a filter clause like this:

curl -XGET http://localhost:8983/solr/books/query -d '
{
  "query": {
    "bool": {
      "must_not": "{!frange u=3.0}ranking"
    }
  },
  "filter": [
    "title:solr",
    { "lucene" : {"df": "content", "query" : "lucene solr" }}
  ]
}'

JSON Facet API

Facet & Analytics Module

The new Facet & Analytics Module exposed via the JSON Facet API is a rewrite of Solr’s previous faceting capabilities, with the following goals:

• First class native JSON API to control faceting and analytics
  ◦ The structured nature of nested sub-facets is more naturally expressed in JSON than in the flat namespace provided by normal query parameters.
• First class integrated analytics support
• Nest any facet type under any other facet type (such as range facet, field facet, query facet)
• Ability to sort facet buckets by any calculated metric
• Easier programmatic construction of complex nested facet commands
• Support a more canonical response format that is easier for clients to parse
• Support a cleaner way to implement distributed faceting
• Support better integration with other search features
• Full integration with the JSON Request API

Faceted Search

Faceted search is about aggregating data and calculating metrics about that data.

There are two main types of facets:

• Facets that partition or categorize data (the domain) into multiple buckets
• Facets that calculate data for a given bucket (normally a metric, statistic or analytic function)

Metrics Example

By default, the domain for facets starts with all documents that match the base query and any filters.
Here’s an example that requests various metrics about the root domain:

curl http://localhost:8983/solr/techproducts/query -d '
q=memory&
fq=inStock:true&
json.facet={
  "avg_price" : "avg(price)",
  "num_suppliers" : "unique(manu_exact)",
  "median_weight" : "percentile(weight,50)"
}'

The response to the facet request above will start with documents matching the root domain (docs containing "memory" with inStock:true), then calculate and return the requested metrics:

"facets" : {
  "count" : 4,
  "avg_price" : 109.9950008392334,
  "num_suppliers" : 3,
  "median_weight" : 352.0
}

Bucketing Facet Example

Here’s an example of a bucketing facet that partitions documents into buckets based on the cat field (short for category) and returns the top 3 buckets:

curl http://localhost:8983/solr/techproducts/query -d 'q=*:*&
json.facet={
  categories : {
    type : terms,
    field : cat, // bucket documents based on the "cat" field
    limit : 3 // retrieve the top 3 buckets ranked by the number of docs in each bucket
  }
}'

The response below shows us that 32 documents match the default root domain, 12 documents have cat:electronics, 4 documents have cat:currency, etc.

[...]
"facets":{ "count":32, "categories":{ "buckets":[{ "val":"electronics", "count":12}, { "val":"currency", "count":4}, { "val":"memory", "count":3}, ] } } Making a Facet Request In this guide, we will often just present the facet command block: © 2018, Apache Software Foundation Guide Version 7.3 - Published: 2018-03-27 Page 492 of 1195 Apache Solr Reference Guide 7.3 { x : "average(mul(price,popularity))" } To execute a facet command block such as this, you’ll need to use the json.facet parameter, and provide at least a base query such as q=*:* curl http://localhost:8983/solr/techproducts/query -d 'q=*:*&json.facet= { x : "avg(mul(price,popularity))" } ' Another option is to use the JSON Request API to provide the entire request in JSON: curl http://localhost:8983/solr/techproducts/query -d ' { query : "*:*", // this is the base query filter : [ "inStock:true" ], // a list of filters facet : { x : "avg(mul(price,popularity))" // and our funky metric of average of price * popularity } } ' JSON Extensions The Noggit JSON parser that is used by Solr accepts a number of JSON extensions such as • bare words can be left unquoted • single line comments using either // or # • Multi-line comments using C style /* comments in here */ • Single quoted strings • Allow backslash escaping of any character • Allow trailing commas and extra commas. Example: [9,4,3,] • Handle nbsp (non-break space, \u00a0) as whitespace. Terms Facet The terms facet (or field facet) buckets the domain based on the unique terms / values of a field. Guide Version 7.3 - Published: 2018-03-27 © 2018, Apache Software Foundation Apache Solr Reference Guide 7.3 Page 493 of 1195 curl http://localhost:8983/solr/techproducts/query -d 'q=*:*& json.facet={ categories:{ terms: { field : cat, // bucket documents based on the "cat" field limit : 5 // retrieve the top 5 buckets ranked by the number of docs in each bucket }' Paramet Description er field The field name to facet over. 
offset        Used for paging, this skips the first N buckets. Defaults to 0.
limit         Limits the number of buckets returned. Defaults to 10.
refine        If true, turns on distributed facet refining. This uses a second phase to retrieve selected stats from shards so that every shard contributes to every returned bucket in this facet and any subfacets. This makes stats for returned buckets exact.
overrequest   Number of buckets beyond the limit to request internally during distributed search. -1 means default.
mincount      Only return buckets with a count of at least this number. Defaults to 1.
sort          Specifies how to sort the buckets produced. “count” specifies document count, “index” sorts by the index (natural) order of the bucket value. One can also sort by any facet function / statistic that occurs in the bucket. The default is “count desc”. This parameter may also be specified in JSON like sort:{count:desc}. The sort order may either be “asc” or “desc”.
missing       A boolean that specifies if a special “missing” bucket should be returned that is defined by documents without a value in the field. Defaults to false.
numBuckets    A boolean. If true, adds “numBuckets” to the response, an integer representing the number of buckets for the facet (as opposed to the number of buckets returned). Defaults to false.
allBuckets    A boolean. If true, adds an “allBuckets” bucket to the response, representing the union of all of the buckets. For multi-valued fields, this is different than a bucket for all of the documents in the domain since a single document can belong to multiple buckets. Defaults to false.
prefix        Only produce buckets for terms starting with the specified prefix.
facet         Aggregations, metrics or nested facets that will be calculated for every returned bucket.
method        This parameter indicates the facet algorithm to use:
              • "dv" DocValues, collect into ordinal array
              • "uif" UnInvertedField, collect into ordinal array
              • "dvhash" DocValues, collect into hash - improves efficiency over high cardinality fields
              • "enum" TermsEnum then intersect DocSet (stream-able)
              • "stream" Presently equivalent to "enum"
              • "smart" Pick the best method for the field type (this is the default)

Query Facet

The query facet produces a single bucket of documents that match the domain as well as the specified query.

An example of the simplest form of the query facet is "query":"query string".

{
  high_popularity : { query : "popularity:[8 TO 10]" }
}

An expanded form allows for more parameters and a facet command block to specify sub-facets (either nested facets or metrics):

{
  high_popularity : {
    type: query,
    q : "popularity:[8 TO 10]",
    facet : { average_price : "avg(price)" }
  }
}

Example response:

"high_popularity" : {
  "count" : 36,
  "average_price" : 36.75
}

Range Facet

The range facet produces multiple buckets over a date field or numeric field.

Example:

{
  prices : {
    type: range,
    field : price,
    start : 0,
    end : 100,
    gap : 20
  }
}

"prices":{
  "buckets":[
    {
      "val":0.0, // the bucket value represents the start of each range. This bucket covers 0-20
      "count":5},
    {
      "val":20.0,
      "count":3},
    {
      "val":40.0,
      "count":2},
    {
      "val":60.0,
      "count":1},
    {
      "val":80.0,
      "count":1}
  ]
}

Range Facet Parameters

To ease migration, the range facet parameter names and semantics largely mirror facet.range query-parameter style faceting. For example "start" here corresponds to "facet.range.start" in a facet.range command.
Parameter     Description
field         The numeric field or date field to produce range buckets from.
start         Lower bound of the ranges.
end           Upper bound of the ranges.
gap           Size of each range bucket produced.
hardend       A boolean, which if true means that the last bucket will end at “end” even if it is less than “gap” wide. If false, the last bucket will be “gap” wide, which may extend past “end”.
other         This parameter indicates that in addition to the counts for each range constraint between start and end, counts should also be computed for:
              • "before" all records with field values lower than the lower bound of the first range
              • "after" all records with field values greater than the upper bound of the last range
              • "between" all records with field values between the start and end bounds of all ranges
              • "none" compute none of this information
              • "all" shortcut for before, between, and after
include       By default, the ranges used to compute range faceting between start and end are inclusive of their lower bounds and exclusive of the upper bounds. The “before” range is exclusive and the “after” range is inclusive. This default, equivalent to "lower" below, will not result in double counting at the boundaries. The include parameter may be any combination of the following options:
              • "lower" all gap based ranges include their lower bound
              • "upper" all gap based ranges include their upper bound
              • "edge" the first and last gap ranges include their edge bounds (i.e., lower for the first one, upper for the last one) even if the corresponding upper/lower option is not specified
              • "outer" the “before” and “after” ranges will be inclusive of their bounds, even if the first or last ranges already include those boundaries.
              • "all" shorthand for lower, upper, edge, outer
facet         Aggregations, metrics, or nested facets that will be calculated for every returned bucket.

Filtering Facets

One can filter the domain before faceting via the filter keyword in the domain block of the facet.

Example:

{
  categories : {
    type : terms,
    field : cat,
    domain : { filter : "popularity:[5 TO 10]" }
  }
}

The value of filter can be a single query to treat as a filter, or a list of filter queries. Each one can be:

• a string containing a query in Solr query syntax
• a reference to a request parameter containing Solr query syntax, of the form: {param : <request_param_name>}

Aggregation Functions

Aggregation functions, also called facet functions, analytic functions, or metrics, calculate something interesting over a domain (each facet bucket).

Aggregation   Example                             Description
sum           sum(sales)                          summation of numeric values
avg           avg(popularity)                     average of numeric values
min           min(salary)                         minimum value
max           max(mul(price,popularity))          maximum value
unique        unique(author)                      number of unique values
hll           hll(author)                         distributed cardinality estimate via hyper-log-log algorithm
percentile    percentile(salary,50,75,99,99.9)    Percentile estimates via t-digest algorithm. When sorting by this metric, the first percentile listed is used as the sort value.
sumsq         sumsq(rent)                         sum of squares of field or function
variance      variance(rent)                      variance of numeric field or function
stddev        stddev(rent)                        standard deviation of field or function

Numeric aggregation functions such as avg can be applied to any numeric field, or to another function of multiple numeric fields such as avg(mul(price,popularity)).

Facet Sorting

The default sort for a field or terms facet is by bucket count descending. We can optionally sort ascending or descending by any facet function that appears in each bucket.
{
  categories:{
    type : terms, // terms facet creates a bucket for each indexed term in the field
    field : cat,
    sort : "x desc", // can also use sort:{x:desc}
    facet : {
      x : "avg(price)", // x = average price for each facet bucket
      y : "max(popularity)" // y = max popularity value in each facet bucket
    }
  }
}

Nested Facets

Nested facets, or sub-facets, allow one to nest bucketing facet commands like terms, range, or query facets under other facet commands. The syntax is identical to top-level facets: just add the facet command to the facet command block of the parent facet. Technically, every facet command is actually a sub-facet since we start off with a single facet bucket with a domain defined by the main query and filters.

Nested facet example

Let’s start off with a simple non-nested terms facet on the genre field:

top_genres:{
  type: terms,
  field: genre,
  limit: 5
}

Now if we wanted to add a nested facet to find the top 2 authors for each genre bucket:

top_genres:{
  type: terms,
  field: genre,
  limit: 5,
  facet:{
    top_authors:{
      type: terms, // nested terms facet on author will be calculated for each parent bucket (genre)
      field: author,
      limit: 2
    }
  }
}

And the response will look something like:

[...]
"facets":{
  "top_genres":{
    "buckets":[
      {
        "val":"Fantasy",
        "count":5432,
        "top_authors":{ // these are the top authors in the "Fantasy" genre
          "buckets":[{
              "val":"Mercedes Lackey",
              "count":121},
            {
              "val":"Piers Anthony",
              "count":98}
          ]
        }
      },
      {
        "val":"Mystery",
        "count":4322,
        "top_authors":{ // these are the top authors in the "Mystery" genre
          "buckets":[{
              "val":"James Patterson",
              "count":146},
            {
              "val":"Patricia Cornwell",
              "count":132}
          ]
        }
      },
[...]
By default "top authors" is defined by simple document count descending, but we could use our aggregation functions to sort by more interesting metrics.

References

This documentation was originally adapted largely from the following blog pages:

http://yonik.com/json-facet-api/
http://yonik.com/solr-facet-functions/
http://yonik.com/solr-subfacets/
http://yonik.com/percentiles-for-solr-faceting/

Faceting

Faceting is the arrangement of search results into categories based on indexed terms.

Searchers are presented with the indexed terms, along with numerical counts of how many matching documents were found for each term. Faceting makes it easy for users to explore search results, narrowing in on exactly the results they are looking for.

General Facet Parameters

There are two general parameters for controlling faceting.

facet
If set to true, this parameter enables facet counts in the query response. If set to false, or if the value is blank or missing, this parameter disables faceting. None of the other parameters listed below will have any effect unless this parameter is set to true. The default value is blank (false).

facet.query
This parameter allows you to specify an arbitrary query in the Lucene default syntax to generate a facet count.

By default, Solr’s faceting feature automatically determines the unique terms for a field and returns a count for each of those terms. Using facet.query, you can override this default behavior and select exactly which terms or expressions you would like to see counted. In a typical implementation of faceting, you will specify a number of facet.query parameters. This parameter can be particularly useful for numeric-range-based facets or prefix-based facets.

You can set the facet.query parameter multiple times to indicate that multiple queries should be used as separate facet constraints.
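Repeating facet.query is just a matter of sending the same parameter name more than once. The sketch below builds such a query string with Python's standard library; the helper name facet_query_params and the two price ranges are illustrative assumptions, not part of Solr.

```python
from urllib.parse import urlencode, parse_qs


def facet_query_params(q, facet_queries):
    """Build a URL-encoded query string with one facet.query per constraint.

    Hypothetical helper: a list of (name, value) pairs is used so that the
    repeated facet.query keys are all preserved.
    """
    params = [("q", q), ("facet", "true")]
    params += [("facet.query", fq) for fq in facet_queries]
    return urlencode(params)


# Two numeric-range constraints on an assumed "price" field. The first range
# excludes its upper bound so a price of exactly 100 is counted only once.
query_string = facet_query_params(
    "*:*",
    ["price:[0 TO 100}", "price:[100 TO *]"],
)
decoded = parse_qs(query_string)
```

The resulting string can be appended to a select or query URL; each facet.query value is reported as its own facet constraint.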
To use facet queries in a syntax other than the default syntax, prefix the facet query with the name of the query notation. For example, to use the hypothetical myfunc query parser, you could set the facet.query parameter like so:

facet.query={!myfunc}name~fred

Field-Value Faceting Parameters

Several parameters can be used to trigger faceting based on the indexed terms in a field.

When using these parameters, it is important to remember that "term" is a very specific concept in Lucene: it relates to the literal field/value pairs that are indexed after any analysis occurs. For text fields that include stemming, lowercasing, or word splitting, the resulting terms may not be what you expect.

If you want Solr to perform both analysis (for searching) and faceting on the full literal strings, use the copyField directive in your Schema to create two versions of the field: one Text and one String. Make sure both are indexed="true". (For more information about the copyField directive, see Documents, Fields, and Schema Design.)

Unless otherwise specified, all of the parameters below can be specified on a per-field basis with the syntax of f.<fieldname>.facet.<parameter>.

facet.field
The facet.field parameter identifies a field that should be treated as a facet. It iterates over each Term in the field and generates a facet count using that Term as the constraint. This parameter can be specified multiple times in a query to select multiple facet fields.

If you do not set this parameter to at least one field in the schema, none of the other parameters described in this section will have any effect.

facet.prefix
The facet.prefix parameter limits the terms on which to facet to those starting with the given string prefix. This does not limit the query in any way, only the facets that would be returned in response to the query.
facet.contains
The facet.contains parameter limits the terms on which to facet to those containing the given substring. This does not limit the query in any way, only the facets that would be returned in response to the query.

facet.contains.ignoreCase
If facet.contains is used, the facet.contains.ignoreCase parameter causes case to be ignored when matching the given substring against candidate facet terms.

facet.matches
Use this parameter to return only the facet buckets for terms that match a regular expression.

facet.sort
This parameter determines the ordering of the facet field constraints.

There are two options for this parameter.

count
Sort the constraints by count (highest count first).

index
Return the constraints sorted in their index order (lexicographic by indexed term). For terms in the ASCII range, this will be alphabetically sorted.

The default is count if facet.limit is greater than 0, otherwise, the default is index.

facet.limit
This parameter specifies the maximum number of constraint counts (essentially, the number of facets for a field that are returned) that should be returned for the facet fields. A negative value means that Solr will return an unlimited number of constraint counts. The default value is 100.

facet.offset
The facet.offset parameter indicates an offset into the list of constraints to allow paging. The default value is 0.

facet.mincount
The facet.mincount parameter specifies the minimum counts required for a facet field to be included in the response. If a field’s counts are below the minimum, the field’s facet is not returned. The default value is 0.

facet.missing
If set to true, this parameter indicates that, in addition to the Term-based constraints of a facet field, a count of all results that match the query but which have no facet value for the field should be computed and returned in the response.
The default value is false.

facet.method
The facet.method parameter selects the type of algorithm or method Solr should use when faceting a field.

The following methods are available.

enum
Enumerates all terms in a field, calculating the set intersection of documents that match the term with documents that match the query.

This method is recommended for faceting multi-valued fields that have only a few distinct values. The average number of values per document does not matter.

For example, faceting on a field with U.S. States such as Alabama, Alaska, … Wyoming would lead to fifty cached filters which would be used over and over again. The filterCache should be large enough to hold all the cached filters.

fc
Calculates facet counts by iterating over documents that match the query and summing the terms that appear in each document.

This is currently implemented using an UnInvertedField cache if the field either is multi-valued or is tokenized (according to FieldType.isTokenized()). Each document is looked up in the cache to see what terms/values it contains, and a tally is incremented for each value.

This method is excellent for situations where the number of indexed values for the field is high, but the number of values per document is low. For multi-valued fields, a hybrid approach is used that uses term filters from the filterCache for terms that match many documents. The letters fc stand for field cache.

fcs
Per-segment field faceting for single-valued string fields. Enable with facet.method=fcs and control the number of threads used with the threads local parameter. This parameter allows faceting to be faster in the presence of rapid index changes.

The default value is fc (except for fields using the BoolField field type and when facet.exists=true is requested) since it tends to use less memory and is faster when a field has many unique terms in the index.
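The difference between the enum and fc strategies can be illustrated on a toy in-memory index. The following is a conceptual sketch only: the data, the function names facet_enum and facet_fc, and the plain dictionaries standing in for the filterCache and the UnInvertedField cache are all assumptions made for illustration, not Solr internals.

```python
from collections import Counter

# Toy index (hypothetical data): document id -> terms in the faceted field.
DOC_TERMS = {
    1: ["electronics", "memory"],
    2: ["electronics"],
    3: ["currency"],
    4: ["memory"],
}


def facet_enum(doc_terms, matching_docs):
    """enum style: walk every term and intersect its document set with the
    set of documents matching the query."""
    postings = {}
    for doc, terms in doc_terms.items():
        for term in terms:
            postings.setdefault(term, set()).add(doc)
    counts = {}
    for term, docs in postings.items():
        hits = len(docs & matching_docs)  # one set intersection per term
        if hits:
            counts[term] = hits
    return counts


def facet_fc(doc_terms, matching_docs):
    """fc style: walk every matching document and tally the terms it holds."""
    counts = Counter()
    for doc in matching_docs:
        counts.update(set(doc_terms.get(doc, [])))
    return dict(counts)


matching = {1, 2, 4}  # pretend these documents matched the main query
enum_counts = facet_enum(DOC_TERMS, matching)
fc_counts = facet_fc(DOC_TERMS, matching)
```

Both functions return the same counts. The work done by facet_enum grows with the number of distinct terms, while the work done by facet_fc grows with the number of matching documents and the values per document, which mirrors the guidance above on when each method is a good fit.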
facet.enum.cache.minDf
This parameter indicates the minimum document frequency (the number of documents matching a term) for which the filterCache should be used when determining the constraint count for that term. This is only used with the facet.method=enum method of faceting.

A value greater than zero decreases the filterCache’s memory usage, but increases the time required for the query to be processed. If you are faceting on a field with a very large number of terms, and you wish to decrease memory usage, try setting this parameter to a value between 25 and 50, and run a few tests. Then, optimize the parameter setting as necessary.

The default value is 0, causing the filterCache to be used for all terms in the field.

facet.exists
To cap facet counts by 1, specify facet.exists=true. This parameter can be used with facet.method=enum or when it’s omitted. It can be used only on non-trie fields (such as strings). It may speed up facet counting on large indices and/or high-cardinality facet values.

facet.excludeTerms
If you want to remove terms from facet counts but keep them in the index, the facet.excludeTerms parameter allows you to do that.

facet.overrequest.count and facet.overrequest.ratio
In some situations, the accuracy in selecting the "top" constraints returned for a facet in a distributed Solr query can be improved by "over requesting" the number of desired constraints (i.e., facet.limit) from each of the individual shards. In these situations, each shard is by default asked for the top 10 + (1.5 * facet.limit) constraints.

In some situations, depending on how your docs are partitioned across your shards and what facet.limit value you used, you may find it advantageous to increase or decrease the amount of overrequesting Solr does.
This can be achieved by setting the facet.overrequest.count (defaults to 10) and facet.overrequest.ratio (defaults to 1.5) parameters.

facet.threads
This parameter will cause loading the underlying fields used in faceting to be executed in parallel with the number of threads specified. Specify as facet.threads=N where N is the maximum number of threads used.

Omitting this parameter or specifying the thread count as 0 will not spawn any threads, and only the main request thread will be used. Specifying a negative number of threads will create up to Integer.MAX_VALUE threads.

Range Faceting

You can use Range Faceting on any date field or any numeric field that supports range queries. This is particularly useful for stitching together a series of range queries (as facet by query) for things like prices.

facet.range
The facet.range parameter defines the field for which Solr should create range facets. For example:

facet.range=price&facet.range=age

facet.range=lastModified_dt

facet.range.start
The facet.range.start parameter specifies the lower bound of the ranges. You can specify this parameter on a per field basis with the syntax of f.<fieldname>.facet.range.start. For example:

f.price.facet.range.start=0.0&f.age.facet.range.start=10

f.lastModified_dt.facet.range.start=NOW/DAY-30DAYS

facet.range.end
The facet.range.end parameter specifies the upper bound of the ranges. You can specify this parameter on a per field basis with the syntax of f.<fieldname>.facet.range.end. For example:

f.price.facet.range.end=1000.0&f.age.facet.range.end=99

f.lastModified_dt.facet.range.end=NOW/DAY+30DAYS

facet.range.gap
The span of each range expressed as a value to be added to the lower bound. For date fields, this should be expressed using the DateMathParser syntax (such as, facet.range.gap=%2B1DAY … '+1DAY').
You can specify this parameter on a per-field basis with the syntax of f.<fieldname>.facet.range.gap. For example:

f.price.facet.range.gap=100&f.age.facet.range.gap=10

f.lastModified_dt.facet.range.gap=+1DAY

facet.range.hardend
The facet.range.hardend parameter is a Boolean parameter that specifies how Solr should handle cases where the facet.range.gap does not divide evenly between facet.range.start and facet.range.end.

If true, the last range constraint will have the facet.range.end value as an upper bound. If false, the last range will have the smallest possible upper bound greater than facet.range.end such that the range is the exact width of the specified range gap. The default value for this parameter is false.

This parameter can be specified on a per field basis with the syntax f.<fieldname>.facet.range.hardend.

facet.range.include
By default, the ranges used to compute range faceting between facet.range.start and facet.range.end are inclusive of their lower bounds and exclusive of the upper bounds. The "before" range defined with the facet.range.other parameter is exclusive and the "after" range is inclusive. This default, equivalent to "lower" below, will not result in double counting at the boundaries. You can use the facet.range.include parameter to modify this behavior using the following options:

• lower: All gap-based ranges include their lower bound.
• upper: All gap-based ranges include their upper bound.
• edge: The first and last gap ranges include their edge bounds (lower for the first one, upper for the last one) even if the corresponding upper/lower option is not specified.
• outer: The "before" and "after" ranges will be inclusive of their bounds, even if the first or last ranges already include those boundaries.
• all: Includes all options: lower, upper, edge, and outer.
You can specify this parameter on a per-field basis with the syntax f.<fieldname>.facet.range.include, and you can specify it multiple times to indicate multiple choices.

To ensure you avoid double-counting, do not choose both lower and upper, do not choose outer, and do not choose all.

facet.range.other
The facet.range.other parameter specifies that in addition to the counts for each range constraint between facet.range.start and facet.range.end, counts should also be computed for these options:

• before: All records with field values lower than the lower bound of the first range.
• after: All records with field values greater than the upper bound of the last range.
• between: All records with field values between the start and end bounds of all ranges.
• none: Do not compute any counts.
• all: Compute counts for before, between, and after.

This parameter can be specified on a per-field basis with the syntax f.<fieldname>.facet.range.other. As an alternative to the all option, this parameter can be specified multiple times to indicate multiple choices, but none will override all other options.

facet.range.method
The facet.range.method parameter selects the type of algorithm or method Solr should use for range faceting. Both methods produce the same results, but performance may vary.

filter
This method generates the ranges based on the other facet.range parameters, and for each of them executes a filter that is later intersected with the main query result set to get the count. It makes use of the filterCache, so it benefits from a cache large enough to contain all ranges.

dv
This method iterates over the documents that match the main query, and for each of them finds the correct range for the value. This method makes use of docValues (if enabled for the field) or the fieldCache. The dv method is not supported for field type DateRangeField or when using group.facets.

The default value for this parameter is filter.
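The way facet.range.start, facet.range.end, facet.range.gap, and facet.range.hardend interact can be sketched in a few lines of Python. This is only an illustration of the semantics described above for plain numbers, not Solr's implementation. The function name range_buckets is hypothetical. The second part builds a request query string and shows that the "+" in a date gap is percent-encoded.

```python
from urllib.parse import urlencode


def range_buckets(start, end, gap, hardend=False):
    """Return (lower, upper) bounds for each gap-based range.

    Hypothetical helper: models the documented behavior of
    facet.range.hardend for numeric values only.
    """
    if gap <= 0:
        raise ValueError("gap must be positive")
    buckets = []
    lower = start
    while lower < end:
        upper = lower + gap
        if hardend and upper > end:
            # hardend=true: the last range is clipped at facet.range.end
            upper = end
        buckets.append((lower, upper))
        lower = upper
    return buckets


# Per-field range parameters from the examples above, URL-encoded.
params = urlencode([
    ("facet", "true"),
    ("facet.range", "lastModified_dt"),
    ("f.lastModified_dt.facet.range.start", "NOW/DAY-30DAYS"),
    ("f.lastModified_dt.facet.range.end", "NOW/DAY+30DAYS"),
    ("f.lastModified_dt.facet.range.gap", "+1DAY"),
])

print(range_buckets(0, 1000, 300))                # last range overshoots the end
print(range_buckets(0, 1000, 300, hardend=True))  # last range stops at the end
print(params)
```

With a gap of 300 between 0 and 1000, the default (hardend=false) keeps every range 300 wide, so the last one extends past 1000. With hardend=true, the last range is narrower and ends exactly at 1000.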
Date Ranges & Time Zones

Range faceting on date fields is a common situation where the TZ parameter can be useful to ensure that the "facet counts per day" or "facet counts per month" are based on a meaningful definition of when a given day/month "starts" relative to a particular TimeZone. For more information, see the examples in the Working with Dates section.

facet.mincount in Range Faceting

The facet.mincount parameter, the same one used in field faceting, is also applied to range faceting. When used, no ranges with a count below the minimum will be included in the response.

Pivot (Decision Tree) Faceting

Pivoting is a summarization tool that lets you automatically sort, count, total, or average data stored in a table. The results are typically displayed in a second table showing the summarized data. Pivot faceting lets you create a summary table of the results from faceting documents by multiple fields.

Another way to look at it is that the query produces a Decision Tree, in that Solr tells you "for facet A, the constraints/counts are X/N, Y/M, etc. If you were to constrain A by X, then the constraint counts for B would be S/P, T/Q, etc.". In other words, it tells you in advance what the "next" set of facet results would be for a field if you apply a constraint from the current facet results.

facet.pivot
The facet.pivot parameter defines the fields to use for the pivot. Multiple facet.pivot values will create multiple "facet_pivot" sections in the response. Within each facet.pivot value, separate the fields with commas.

facet.pivot.mincount
The facet.pivot.mincount parameter defines the minimum number of documents that need to match in order for the facet to be included in results. The default is 1.
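The "decision tree" idea can be illustrated with a small in-memory model. The sketch below is not how Solr computes pivots. It is a toy recursion over a hypothetical list of documents (plain dictionaries with single-valued fields) that produces the same nested field/value/count/pivot shape as a "facet_pivot" section and applies a minimum count at every level, in the spirit of facet.pivot.mincount.

```python
from collections import Counter


def pivot(docs, fields, mincount=1):
    """Nested pivot counts for `fields`, in order, over a list of dicts.

    Hypothetical helper for illustration; field values are assumed to be
    single-valued and hashable.
    """
    if not fields:
        return []
    field, rest = fields[0], fields[1:]
    counts = Counter(d[field] for d in docs if field in d)
    result = []
    for value, count in counts.most_common():  # highest counts first
        if count < mincount:
            continue
        entry = {"field": field, "value": value, "count": count}
        if rest:
            # Constrain by this value, then pivot on the remaining fields.
            subset = [d for d in docs if d.get(field) == value]
            children = pivot(subset, rest, mincount)
            if children:
                entry["pivot"] = children
        result.append(entry)
    return result


# Made-up documents, loosely modeled on the techproducts field names.
docs = [
    {"cat": "electronics", "inStock": True},
    {"cat": "electronics", "inStock": True},
    {"cat": "electronics", "inStock": False},
    {"cat": "memory", "inStock": True},
]

tree = pivot(docs, ["cat", "inStock"], mincount=2)
print(tree)
```

Each entry answers the question "if I constrained by this value, what would the counts for the next field be?", which is what the nested "pivot" lists in the response convey.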
Using the “bin/solr -e techproducts” example, a query URL like this one will return the data below, with the pivot faceting results found in the section "facet_pivot":

http://localhost:8983/solr/techproducts/select?q=*:*&facet.pivot=cat,popularity,inStock&facet.pivot=popularity,cat&facet=true&facet.field=cat&facet.limit=5&rows=0&facet.pivot.mincount=2

{ "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "cat":[
        "electronics",14,
        "currency",4,
        "memory",3,
        "connector",2,
        "graphics card",2]},
    "facet_dates":{},
    "facet_ranges":{},
    "facet_pivot":{
      "cat,popularity,inStock":[{
          "field":"cat",
          "value":"electronics",
          "count":14,
          "pivot":[{
              "field":"popularity",
              "value":6,
              "count":5,
              "pivot":[{
                  "field":"inStock",
                  "value":true,
                  "count":5}]}]
      }]}}}

Combining Stats Component With Pivots

In addition to some of the general local parameters supported by other types of faceting, a stats local parameter can be used with facet.pivot to refer to stats.field instances (by tag) that you would like to have computed for each Pivot Constraint.
In the example below, two different (overlapping) sets of statistics are computed for each of the facet.pivot result hierarchies:

stats=true
stats.field={!tag=piv1,piv2 min=true max=true}price
stats.field={!tag=piv2 mean=true}popularity
facet=true
facet.pivot={!stats=piv1}cat,inStock
facet.pivot={!stats=piv2}manu,inStock

Results:

{"facet_pivot":{
  "cat,inStock":[{
      "field":"cat",
      "value":"electronics",
      "count":12,
      "pivot":[{
          "field":"inStock",
          "value":true,
          "count":8,
          "stats":{
            "stats_fields":{
              "price":{
                "min":74.98999786376953,
                "max":399.0}}}},
        {
          "field":"inStock",
          "value":false,
          "count":4,
          "stats":{
            "stats_fields":{
              "price":{
                "min":11.5,
                "max":649.989990234375}}}}],
      "stats":{
        "stats_fields":{
          "price":{
            "min":11.5,
            "max":649.989990234375}}}},
    {
      "field":"cat",
      "value":"currency",
      "count":4,
      "pivot":[{
          "field":"inStock",
          "value":true,
          "count":4,
          "stats":{
            "stats_fields":{
              "price":{
                "..."
  "manu,inStock":[{
      "field":"manu",
      "value":"inc",
      "count":8,
      "pivot":[{
          "field":"inStock",
          "value":true,
          "count":7,
          "stats":{
            "stats_fields":{
              "price":{
                "min":74.98999786376953,
                "max":2199.0},
              "popularity":{
                "mean":5.857142857142857}}}},
        {
          "field":"inStock",
          "value":false,
          "count":1,
          "stats":{
            "stats_fields":{
              "price":{
                "min":479.95001220703125,
                "max":479.95001220703125},
              "popularity":{
                "mean":7.0}}}}],
      "..."}]}}}}]}]}}

Combining Facet Queries And Facet Ranges With Pivot Facets

A query local parameter can be used with facet.pivot to refer to facet.query instances (by tag) that should be computed for each pivot constraint. Similarly, a range local parameter can be used with facet.pivot to refer to facet.range instances.
In the example below, two query facets are computed for each of the facet.pivot result hierarchies:

facet=true
facet.query={!tag=q1}manufacturedate_dt:[2006-01-01T00:00:00Z TO NOW]
facet.query={!tag=q1}price:[0 TO 100]
facet.pivot={!query=q1}cat,inStock

{"facet_counts": {
    "facet_queries": {
      "{!tag=q1}manufacturedate_dt:[2006-01-01T00:00:00Z TO NOW]": 9,
      "{!tag=q1}price:[0 TO 100]": 7 },
    "facet_fields": {},
    "facet_dates": {},
    "facet_ranges": {},
    "facet_intervals": {},
    "facet_heatmaps": {},
    "facet_pivot": {
      "cat,inStock": [
        {
          "field": "cat",
          "value": "electronics",
          "count": 12,
          "queries": {
            "{!tag=q1}manufacturedate_dt:[2006-01-01T00:00:00Z TO NOW]": 9,
            "{!tag=q1}price:[0 TO 100]": 4 },
          "pivot": [
            {
              "field": "inStock",
              "value": true,
              "count": 8,
              "queries": {
                "{!tag=q1}manufacturedate_dt:[2006-01-01T00:00:00Z TO NOW]": 6,
                "{!tag=q1}price:[0 TO 100]": 2 } },
            "..."]}]}}}

In a similar way, in the example below, the range facet tagged r1 is computed for each of the facet.pivot result hierarchies:

facet=true
facet.range={!tag=r1}manufacturedate_dt
facet.range.start=2006-01-01T00:00:00Z
facet.range.end=NOW/YEAR
facet.range.gap=+1YEAR
facet.pivot={!range=r1}cat,inStock

{"facet_counts":{
    "facet_queries":{},
    "facet_fields":{},
    "facet_dates":{},
    "facet_ranges":{
      "manufacturedate_dt":{
        "counts":[
          "2006-01-01T00:00:00Z",9,
          "2007-01-01T00:00:00Z",0,
          "2008-01-01T00:00:00Z",0,
          "2009-01-01T00:00:00Z",0,
          "2010-01-01T00:00:00Z",0,
          "2011-01-01T00:00:00Z",0,
          "2012-01-01T00:00:00Z",0,
          "2013-01-01T00:00:00Z",0,
          "2014-01-01T00:00:00Z",0],
        "gap":"+1YEAR",
        "start":"2006-01-01T00:00:00Z",
        "end":"2015-01-01T00:00:00Z"}},
    "facet_intervals":{},
    "facet_heatmaps":{},
    "facet_pivot":{
      "cat,inStock":[{
          "field":"cat",
          "value":"electronics",
          "count":12,
          "ranges":{
            "manufacturedate_dt":{
              "counts":[
                "2006-01-01T00:00:00Z",9,
                "2007-01-01T00:00:00Z",0,
                "2008-01-01T00:00:00Z",0,
                "2009-01-01T00:00:00Z",0,
                "2010-01-01T00:00:00Z",0,
                "2011-01-01T00:00:00Z",0,
                "2012-01-01T00:00:00Z",0,
                "2013-01-01T00:00:00Z",0,
                "2014-01-01T00:00:00Z",0],
              "gap":"+1YEAR",
              "start":"2006-01-01T00:00:00Z",
              "end":"2015-01-01T00:00:00Z"}},
          "pivot":[{
              "field":"inStock",
              "value":true,
              "count":8,
              "ranges":{
                "manufacturedate_dt":{
                  "counts":[
                    "2006-01-01T00:00:00Z",6,
                    "2007-01-01T00:00:00Z",0,
                    "2008-01-01T00:00:00Z",0,
                    "2009-01-01T00:00:00Z",0,
                    "2010-01-01T00:00:00Z",0,
                    "2011-01-01T00:00:00Z",0,
                    "2012-01-01T00:00:00Z",0,
                    "2013-01-01T00:00:00Z",0,
                    "2014-01-01T00:00:00Z",0],
                  "gap":"+1YEAR",
                  "start":"2006-01-01T00:00:00Z",
                  "end":"2015-01-01T00:00:00Z"}}},
            "..."]}]}}}

Additional Pivot Parameters

Although facet.pivot.mincount deviates in name from the facet.mincount parameter used by field faceting, many of the faceting parameters described above can also be used with pivot faceting:

• facet.limit
• facet.offset
• facet.sort
• facet.overrequest.count
• facet.overrequest.ratio

Interval Faceting

Another supported form of faceting is interval faceting. This sounds similar to range faceting, but the functionality is really closer to doing facet queries with range queries. Interval faceting allows you to set variable intervals and count the number of documents that have values within those intervals in the specified field.

Even though the same functionality can be achieved by using a facet query with range queries, the implementation of these two methods is very different and will provide different performance depending on the context. If you are concerned about the performance of your searches, you should test with both options.
Interval faceting tends to be better with multiple intervals for the same fields, while facet queries tend to be better in environments where the filter cache is more effective (static indexes, for example).

This method uses docValues if they are enabled for the field, and fieldCache otherwise.

Use these parameters for interval faceting:

facet.interval
This parameter indicates the field where interval faceting must be applied. It can be used multiple times in the same request to indicate multiple fields.

facet.interval=price&facet.interval=size

facet.interval.set
This parameter is used to set the intervals for the field. It can be specified multiple times to indicate multiple intervals. This parameter is global, which means that it will be used for all fields indicated with facet.interval unless there is an override for a specific field. To override this parameter on a specific field, use f.<fieldname>.facet.interval.set, for example:

f.price.facet.interval.set=[0,10]&f.price.facet.interval.set=(10,100]

Interval Syntax

Intervals must begin with either '(' or '[', be followed by the start value, then a comma (','), the end value, and finally a closing ')' or ']'.

For example:

• (1,10) -> will include values greater than 1 and lower than 10
• [1,10) -> will include values greater than or equal to 1 and lower than 10
• [1,10] -> will include values greater than or equal to 1 and lower than or equal to 10

The start and end values cannot be empty.

If the interval needs to be unbounded, the special character * can be used for both the start and end limits. When using this special character, the start syntax options (( and [) and the end syntax options () and ]) will be treated the same. [*,*] will include all documents with a value in the field.

The interval limits may be strings, but there is no need to add quotes.
All the text until the comma will be treated as the start limit, and the text after that will be the end limit. For example: [Buenos Aires,New York]. Keep in mind that a string-like, case-sensitive comparison will be done to match documents in string intervals. The comparator can’t be changed.

Commas, parentheses, and square brackets can be escaped by using \ in front of them. Whitespace before and after the values will be omitted.

The start limit can’t be greater than the end limit. Equal limits are allowed, which lets you indicate the specific values that you want to count, like [A,A], [B,B] and [C,Z].

Interval faceting supports output key replacement, described below. Output keys can be replaced in both the facet.interval parameter and in the facet.interval.set parameter. For example:

&facet.interval={!key=popularity}some_field
&facet.interval.set={!key=bad}[0,5]
&facet.interval.set={!key=good}[5,*]
&facet=true

Local Parameters for Faceting

The LocalParams syntax allows overriding global settings. It can also provide a method of adding metadata to other parameter values, much like XML attributes.

Tagging and Excluding Filters

You can tag specific filters and exclude those filters when faceting. This is useful when doing multi-select faceting.

Consider the following example query with faceting:

q=mainquery&fq=status:public&fq=doctype:pdf&facet=true&facet.field=doctype

Because everything is already constrained by the filter doctype:pdf, the facet.field=doctype facet command is currently redundant and will return 0 counts for everything except doctype:pdf.

To implement a multi-select facet for doctype, a GUI may want to still display the other doctype values and their associated counts, as if the doctype:pdf constraint had not yet been applied.
For example:

=== Document Type ===
  [ ] Word (42)
  [x] PDF  (96)
  [ ] Excel(11)
  [ ] HTML (63)

To return counts for doctype values that are currently not selected, tag filters that directly constrain doctype, and exclude those filters when faceting on doctype.

q=mainquery&fq=status:public&fq={!tag=dt}doctype:pdf&facet=true&facet.field={!ex=dt}doctype

Filter exclusion is supported for all types of facets. Both the tag and ex local parameters may specify multiple values by separating them with commas.

Changing the Output Key

To change the output key for a faceting command, specify a new name with the key local parameter. For example:

facet.field={!ex=dt key=mylabel}doctype

The parameter setting above causes the field facet results for the "doctype" field to be returned using the key "mylabel" rather than "doctype" in the response. This can be helpful when faceting on the same field multiple times with different exclusions.

Limiting Facet with Certain Terms

To limit a field facet to certain terms, specify them as a comma-separated list with the terms local parameter. Commas and quotes in terms can be escaped with a backslash, as in \,. In this case the facet is calculated in a way similar to facet.method=enum, but ignores facet.enum.cache.minDf. For example:

facet.field={!terms='alfa,betta,with\,with\',with space'}symbol

Related Topics

See also Heatmap Faceting (Spatial).

BlockJoin Faceting

BlockJoin facets allow you to aggregate children facet counts by their parents.

It is a common requirement that if a parent document has several children documents, all of them need to increment the facet value count only once. This functionality is provided by BlockJoinDocSetFacetComponent; BlockJoinFacetComponent is just an alias for compatibility.
This component is considered experimental, and must be explicitly enabled for a request handler in solrconfig.xml, in the same way as any other search component.

This example shows how you could add this search component to solrconfig.xml and define it in a request handler:

/bjqfacet bjqFacetComponent

This component can be added to any search request handler. This component works with distributed search in SolrCloud mode.

Documents should be added in children-parent blocks as described in indexing nested child documents. Examples:

Sample document

1 parent
  11 Red XL 6
  12 Red XL 7
  13 Blue L 5
2 parent
  21 Blue XL 6
  22 Blue XL 7
  23 Red L 5

Queries are constructed the same way as for a Parent Block Join query. For example:

http://localhost:8983/solr/bjqfacet?q={!parent which=type_s:parent}SIZE_s:XL&child.facet.field=COLOR_s

As a result we should have facets for Red(1) and Blue(1), because matches on children id=11 and id=12 are aggregated into a single hit on the parent with id=1. The key components of the request shown above are:

/bjqfacet?
The name of the request handler that has been defined with a block join facet component enabled.

q={!parent which=type_s:parent}SIZE_s:XL
The mandatory parent query as a main query. The parent query could also be a subordinate clause in a more complex query.

&child.facet.field=COLOR_s
The child document field, which might be repeated many times with several fields, as necessary.
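The "count each parent only once" rule can be sketched with an in-memory model of the sample documents. This is an illustration of the aggregation, not the component's actual code. The function name is hypothetical, and the mapping of the sample values to COLOR_s and SIZE_s is an assumption inferred from the example query and its stated result.

```python
from collections import defaultdict

# (child id, parent id, COLOR_s, SIZE_s), assumed from the sample documents.
children = [
    (11, 1, "Red", "XL"),
    (12, 1, "Red", "XL"),
    (13, 1, "Blue", "L"),
    (21, 2, "Blue", "XL"),
    (22, 2, "Blue", "XL"),
    (23, 2, "Red", "L"),
]


def child_facet_by_parent(children, size):
    """Facet matching children on color, counting each parent at most once.

    Hypothetical helper: mimics q={!parent ...}SIZE_s:<size> combined with
    child.facet.field=COLOR_s for this toy data set.
    """
    parents_per_color = defaultdict(set)
    for _child_id, parent_id, color, child_size in children:
        if child_size == size:
            # A set ensures several matching children of the same parent
            # contribute a single hit for that facet value.
            parents_per_color[color].add(parent_id)
    return {color: len(parents) for color, parents in parents_per_color.items()}


counts = child_facet_by_parent(children, "XL")
print(counts)
```

Children 11 and 12 both match Red/XL but share parent 1, so Red is counted once. Children 21 and 22 do the same for Blue under parent 2.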
Highlighting

Highlighting in Solr allows fragments of documents that match the user’s query to be included with the query response.

The fragments are included in a special section of the query response (the highlighting section), and the client uses the formatting clues also included to determine how to present the snippets to users. Fragments are a portion of a document field that contains matches from the query and are sometimes also referred to as "snippets" or "passages".

Highlighting is extremely configurable, perhaps more than any other part of Solr. There are many parameters each for fragment sizing, formatting, ordering, backup/alternate behavior, and more options that are hard to categorize. Nonetheless, highlighting is very simple to use.

Usage

Common Highlighter Parameters

You only need to set the hl and often hl.fl parameters to get results. The parameters below document these and some other supported parameters. Note that many highlighting parameters support per-field overrides, such as: f.title_txt.hl.snippets

hl
Use this parameter to enable or disable highlighting. The default is false. If you want to use highlighting, you must set this to true.

hl.method
The highlighting implementation to use. Acceptable values are: unified, original, fastVector. The default is original.
See the Choosing a Highlighter section below for more details on the differences between the available highlighters.

hl.fl
Specifies a list of fields to highlight. Accepts a comma- or space-delimited list of fields for which Solr should generate highlighted snippets.
A wildcard of * (asterisk) can be used to match field globs, such as text_* or even * to highlight on all fields where highlighting is possible. When using *, consider adding hl.requireFieldMatch=true.
When not defined, the defaults defined for the df query parameter will be used.
hl.q
A query to use for highlighting. This parameter allows you to highlight different terms than those being used to retrieve documents.
When not defined, the query defined with the q parameter will be used.

hl.qparser
The query parser to use for the hl.q query.
When not defined, the query parser defined with the defType query parameter will be used.

hl.requireFieldMatch
If false (the default), all query terms will be highlighted for each field to be highlighted (hl.fl), no matter what fields the parsed query refers to. If set to true, only query terms aligning with the field being highlighted will in turn be highlighted.
If the query references fields different from the field being highlighted and they have different text analysis, the query may not highlight query terms it should have and vice versa. The analysis used is that of the field being highlighted (hl.fl), not the query fields.

hl.usePhraseHighlighter
If set to true, the default, Solr will highlight phrase queries (and other advanced position-sensitive queries) accurately, as phrases. If false, the parts of the phrase will be highlighted everywhere instead of only when they form the given phrase.

hl.highlightMultiTerm
If set to true, the default, Solr will highlight wildcard queries (and other MultiTermQuery subclasses). If false, they won’t be highlighted at all.

hl.snippets
Specifies the maximum number of highlighted snippets to generate per field. It is possible for any number of snippets from zero to this value to be generated. The default is 1.

hl.fragsize
Specifies the approximate size, in characters, of fragments to consider for highlighting. The default is 100. Using 0 indicates that no fragmenting should be considered and the whole field value should be used.

hl.tag.pre
(hl.simple.pre for the Original Highlighter) Specifies the “tag” to use before a highlighted term.
This can be any string, but is most often an HTML or XML tag. The default is <em>.

hl.tag.post
(hl.simple.post for the Original Highlighter) Specifies the “tag” to use after a highlighted term. This can be any string, but is most often an HTML or XML tag. The default is </em>.

hl.encoder
If blank, the default, then the stored text will be returned without any escaping/encoding performed by the highlighter. If set to html then special HTML/XML characters will be encoded (e.g., & becomes &amp;). The pre/post snippet characters are never encoded.

hl.maxAnalyzedChars
The character limit to look for highlights, after which no highlighting will be done. This is mostly only a performance concern for an analysis-based offset source since it’s the slowest. See Schema Options and Performance Considerations. The default is 51200 characters.

There are more parameters supported as well, depending on the highlighter (via hl.method) chosen.

Highlighting in the Query Response

In the response to a query, Solr includes highlighting data in a section separate from the documents. It is up to a client to determine how to process this response and display the highlights to users.

Using the example documents included with Solr, we can see how this might work:

In response to a query such as:

http://localhost:8983/solr/gettingstarted/select?hl=on&q=apple&hl.fl=manu&fl=id,name,manu,cat

we get a response such as this (truncated slightly for space):

{
  "response": {
    "numFound": 1,
    "start": 0,
    "docs": [{
      "id": "MA147LL/A",
      "name": "Apple 60 GB iPod with Video Playback Black",
      "manu": "Apple Computer Inc.",
      "cat": [
        "electronics",
        "music"
      ]
    }]
  },
  "highlighting": {
    "MA147LL/A": {
      "manu": [
        "<em>Apple</em> Computer Inc."
      ]
    }
  }
}

Note the two sections docs and highlighting.
The docs section contains the fields of the document requested with the fl parameter of the query (only "id", "name", "manu", and "cat").

The highlighting section includes the ID of each document, and the field that contains the highlighted portion. In this example, we used the hl.fl parameter to say we wanted query terms highlighted in the "manu" field. When there is a match to the query term in that field, it will be included for each document ID in the list.

Choosing a Highlighter

Solr provides a HighlightComponent (a SearchComponent) and it’s in the default list of components for search handlers. It offers a somewhat unified API over multiple actual highlighting implementations (or simply "highlighters") that do the business of highlighting.

There are many parameters supported by more than one highlighter, and sometimes the implementation details and semantics will be a bit different, so don’t expect identical results when switching highlighters. You should use the hl.method parameter to choose a highlighter, but it’s also possible to explicitly configure an implementation by class name in solrconfig.xml.

There are three highlighters available that can be chosen at runtime with the hl.method parameter, in order of general recommendation:

Unified Highlighter (hl.method=unified)
The Unified Highlighter is the newest highlighter (as of Solr 6.4), which stands out as the most flexible and performant of the options. We recommend that you try this highlighter even though it isn’t the default (yet).
This highlighter supports the most common highlighting parameters and can handle just about any query accurately, even SpanQueries (e.g., as seen from the surround parser).
A strong benefit to this highlighter is that you can opt to configure Solr to put more information in the underlying index to speed up highlighting of large documents; multiple configurations are supported, even on a per-field basis. There is little or no such flexibility for the other highlighters. More on this below.

Original Highlighter (hl.method=original, the default)
The Original Highlighter, sometimes called the "Standard Highlighter" or "Default Highlighter", is Lucene’s original highlighter: a venerable option with a high degree of customization options. Its ability to highlight just about any query accurately is a strength shared with the Unified Highlighter (they share some code for this, in fact).
The Original Highlighter will normally analyze stored text on the fly in order to highlight. It will use full term vectors if available; however, in this mode it isn’t as fast as the Unified Highlighter or FastVector Highlighter.
This highlighter is a good choice for a wide variety of search use-cases. Where it falls short is performance; it’s often twice as slow as the Unified Highlighter. And despite being the most customizable, it doesn’t have a BreakIterator-based fragmenter (all the others do), which could pose a challenge for some languages.

FastVector Highlighter (hl.method=fastVector)
The FastVector Highlighter requires full term vector options (termVectors, termPositions, and termOffsets) on the field, and is optimized with that in mind. It is nearly as configurable as the Original Highlighter with some variability.
This highlighter notably supports multi-colored highlighting such that different query words can be denoted in the fragment with different marking, usually expressed as an HTML tag with a unique color.
This highlighter’s query-representation is less advanced than the Original or Unified Highlighters: for example it will not work well with the surround parser, and there are multiple reported bugs pertaining to queries with stop-words.
Note that both the FastVector and Original Highlighters can be used in conjunction in a search request to highlight some fields with one and some the other. In contrast, the other highlighters can only be chosen exclusively.

The Unified Highlighter is exclusively configured via search parameters. In contrast, some settings for the Original and FastVector Highlighters are set in solrconfig.xml. There’s a robust example of the latter in the “techproducts” configset.

In addition to further information below, more information can be found in the Solr javadocs.

Schema Options and Performance Considerations

Fundamental to the internals of highlighting is detecting the offsets of the individual words that match the query. Some of the highlighters can run the stored text through the analysis chain defined in the schema, some can look them up from postings, and some can look them up from term vectors. These choices have different trade-offs:

• Analysis: Supported by the Unified and Original Highlighters. If you don’t go out of your way to configure the other options below, the highlighter will analyze the stored text on the fly (during highlighting) to calculate offsets.
  The benefit of this approach is that your index won’t grow larger with any extra data that isn’t strictly necessary for highlighting.
  The downside is that highlighting speed is roughly linear with the amount of text to process, with a large factor being the complexity of your analysis chain.
  For "short" text, this is a good choice. Or maybe it’s not short but you’re prioritizing a smaller index and indexing speed over highlighting performance.
• Postings: Supported by the Unified Highlighter. Set storeOffsetsWithPositions to true. This adds a moderate amount of extra data to the index but it speeds up highlighting tremendously, especially compared to analysis with longer text fields.
  However, wildcard queries will fall back to analysis unless "light" term vectors are added.
  ◦ with Term Vectors (light): Supported only by the Unified Highlighter. To enable this mode, set termVectors to true but no other term vector related options on the field being highlighted.
    This adds even more data to the index than just storeOffsetsWithPositions but not as much as enabling all the extra term vector options. Term Vectors are only accessed by the highlighter when a wildcard query is used and will prevent a fall back to analysis of the stored text.
    This is definitely the fastest option for highlighting wildcard queries on large text fields.
• Term Vectors (full): Supported by the Unified, FastVector, and Original Highlighters. Set termVectors, termPositions, and termOffsets to true, and potentially termPayloads for advanced use cases.
  This adds substantial weight to the index, similar in size to the compressed stored text. If you are using the Unified Highlighter then this is not a recommended configuration since it’s slower and heavier than postings with light term vectors. However, this could make sense if full term vectors are already needed for another use-case.

The Unified Highlighter

The Unified Highlighter supports the following parameters in addition to the ones listed earlier:

hl.offsetSource
By default, the Unified Highlighter will usually pick the right offset source (see above). However, it may be ambiguous, such as during a migration from one offset source to another that hasn’t completed.
The offset source can be explicitly configured to one of: ANALYSIS, POSTINGS, POSTINGS_WITH_TERM_VECTORS, or TERM_VECTORS.

hl.tag.ellipsis
By default, each snippet is returned as a separate value (as is done with the other highlighters). Set this parameter to instead return one string with this text as the delimiter. Note: this is likely to be removed in the future.

hl.defaultSummary
If true, use the leading portion of the text as a snippet if a proper highlighted snippet can’t otherwise be generated. The default is false.

hl.score.k1
Specifies BM25 term frequency normalization parameter 'k1'. For example, it can be set to 0 to rank passages solely based on the number of query terms that match. The default is 1.2.

hl.score.b
Specifies BM25 length normalization parameter 'b'. For example, it can be set to 0 to ignore the length of passages entirely when ranking. The default is 0.75.

hl.score.pivot
Specifies BM25 average passage length in characters. The default is 87.

hl.bs.language
Specifies the breakiterator language for dividing the document into passages.

hl.bs.country
Specifies the breakiterator country for dividing the document into passages.

hl.bs.variant
Specifies the breakiterator variant for dividing the document into passages.

hl.bs.type
Specifies the breakiterator type for dividing the document into passages. Can be SEPARATOR, SENTENCE, WORD, CHARACTER, LINE, or WHOLE. SEPARATOR is a special value that splits text on a user-provided character in hl.bs.separator.
The default is SENTENCE.

hl.bs.separator
Indicates which character to break the text on. Use only if you have defined hl.bs.type=SEPARATOR.
This is useful when the text has already been manipulated in advance to have a special delineation character at desired highlight passage boundaries. This character will still appear in the text as the last character of a passage.
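The effect of hl.bs.type=SEPARATOR can be pictured with a small sketch. It only models the rule stated above, that the text is split on one character and that character remains as the last character of each passage. It is not the highlighter's own code. The function name and the sample text are made up for illustration. The request parameters at the end use only parameter names documented in this section.

```python
from urllib.parse import urlencode


def split_on_separator(text, separator):
    """Split text into passages, keeping the separator at the end of each.

    Hypothetical helper illustrating the documented SEPARATOR behavior.
    """
    if len(separator) != 1:
        raise ValueError("separator must be a single character")
    passages = []
    start = 0
    for i, ch in enumerate(text):
        if ch == separator:
            passages.append(text[start:i + 1])  # include the separator
            start = i + 1
    if start < len(text):
        passages.append(text[start:])  # trailing text without a separator
    return passages


passages = split_on_separator("First part.|Second part.|Tail", "|")
print(passages)

# A request selecting the Unified Highlighter with a separator-based breaker.
params = urlencode([
    ("hl", "true"),
    ("hl.method", "unified"),
    ("hl.bs.type", "SEPARATOR"),
    ("hl.bs.separator", "|"),
])
print(params)
```

Because the separator stays in the passage text, a client that does not want to display it would need to strip it from the returned snippets.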
The Original Highlighter

The Original Highlighter supports the following parameters in addition to the ones listed earlier:

hl.mergeContiguous
Instructs Solr to collapse contiguous fragments into a single fragment. A value of true indicates contiguous fragments will be collapsed into a single fragment. The default value, false, is also the backward-compatible setting.

hl.maxMultiValuedToExamine
Specifies the maximum number of entries in a multi-valued field to examine before stopping. This can potentially return zero results if the limit is reached before any matches are found. If used with hl.maxMultiValuedToMatch, whichever limit is reached first will determine when to stop looking. The default is Integer.MAX_VALUE.

hl.maxMultiValuedToMatch
Specifies the maximum number of matches in a multi-valued field that are found before stopping. If hl.maxMultiValuedToExamine is also defined, whichever limit is reached first will determine when to stop looking. The default is Integer.MAX_VALUE.

hl.alternateField
Specifies a field to be used as a backup default summary if Solr cannot generate a snippet (i.e., because no terms match).

hl.maxAlternateFieldLength
Specifies the maximum number of characters of the field to return. Any value less than or equal to 0 means the field’s length is unlimited (the default behavior). This parameter is only used in conjunction with the hl.alternateField parameter.

hl.highlightAlternate
If set to true, the default, and hl.alternateField is active, Solr will show the entire alternate field, with highlighting of occurrences. If hl.maxAlternateFieldLength=N is used, Solr returns at most N characters surrounding the best matching fragment. If set to false, or if there is no match in the alternate field either, the alternate field will be shown without highlighting.
hl.formatter
Selects a formatter for the highlighted output. Currently the only legal value is simple, which surrounds a highlighted term with a customizable pre- and post-text snippet.

hl.simple.pre, hl.simple.post
Specifies the text that should appear before (hl.simple.pre) and after (hl.simple.post) a highlighted term, when using the simple formatter. The default is <em> and </em>.

hl.fragmenter
Specifies a text snippet generator for highlighted text. The standard (default) fragmenter is gap, which creates fixed-sized fragments with gaps for multi-valued fields. Another option is regex, which tries to create fragments that resemble a specified regular expression.

hl.regex.slop
When using the regex fragmenter (hl.fragmenter=regex), this parameter defines the factor by which the fragmenter can stray from the ideal fragment size (given by hl.fragsize) to accommodate a regular expression. For instance, a slop of 0.2 with hl.fragsize=100 should yield fragments between 80 and 120 characters in length. It is usually good to provide a slightly smaller hl.fragsize value when using the regex fragmenter. The default is 0.6.

hl.regex.pattern
Specifies the regular expression for fragmenting. This could be used to extract sentences.

hl.regex.maxAnalyzedChars
Instructs Solr to analyze only this many characters from a field when using the regex fragmenter (after which, the fragmenter produces fixed-sized fragments). The default is 10000. Note that applying a complicated regex to a huge field is computationally expensive.

hl.preserveMulti
If true, multi-valued fields will return all values in the order they were saved in the index. If false, the default, only values that match the highlight request will be returned.
hl.payloads
When hl.usePhraseHighlighter is true and the indexed field has payloads but not term vectors (generally quite rare), the index’s payloads will be read into the highlighter’s memory index along with the postings. If this may happen and you know you don’t need them for highlighting (i.e., your queries don’t filter by payload) then you can save a little memory by setting this to false.

The Original Highlighter has a plugin architecture that enables new functionality to be registered in solrconfig.xml. The "techproducts" configset shows most of these settings explicitly. You can use it as a guide to provide your own components, including a SolrFormatter, SolrEncoder, and SolrFragmenter.

The FastVector Highlighter

The FastVector Highlighter (FVH) can be used in conjunction with the Original Highlighter if not all fields should be highlighted with the FVH. In such a mode, set hl.method=original and f.yourTermVecField.hl.method=fastVector for all fields that should use the FVH. One annoyance to keep in mind is that the Original Highlighter uses hl.simple.pre whereas the FVH (and other highlighters) use hl.tag.pre.

In addition to the initial listed parameters, the following parameters documented for the Original Highlighter above are also supported by the FVH:

• hl.alternateField
• hl.maxAlternateFieldLength
• hl.highlightAlternate

And here are additional parameters supported by the FVH:

hl.fragListBuilder
The snippet fragmenting algorithm. The weighted fragListBuilder uses IDF-weights to order fragments. This fragListBuilder is the default. Other options are single, which returns the entire field contents as one snippet, or simple. You can select a fragListBuilder with this parameter, or modify an existing implementation in solrconfig.xml to be the default by adding "default=true".
hl.fragmentsBuilder
The fragments builder is responsible for formatting the fragments, which uses <em> and </em> markup by default (if hl.tag.pre and hl.tag.post are not defined). Another pre-configured choice is colored, which is an example of how to use the fragments builder to insert HTML into the snippets for colored highlights if you choose. You can also implement your own if you’d like. You can select a fragments builder with this parameter, or modify an existing implementation in solrconfig.xml to be the default by adding "default=true".

hl.boundaryScanner
See Using Boundary Scanners with the FastVector Highlighter below.

hl.bs.*
See Using Boundary Scanners with the FastVector Highlighter below.

hl.phraseLimit
The maximum number of phrases to analyze when searching for the highest-scoring phrase. The default is 5000.

hl.multiValuedSeparatorChar
Text to use to separate one value from the next for a multi-valued field. The default is " " (a space).

Using Boundary Scanners with the FastVector Highlighter

The FastVector Highlighter will occasionally truncate highlighted words. To prevent this, implement a boundary scanner in solrconfig.xml, then use the hl.boundaryScanner parameter to specify the boundary scanner for highlighting.

Solr supports two boundary scanners: breakIterator and simple.

The breakIterator Boundary Scanner

The breakIterator boundary scanner offers excellent performance right out of the box by taking locale and boundary type into account. In most cases you will want to use the breakIterator boundary scanner. To implement the breakIterator boundary scanner, add this code to the highlighting section of your solrconfig.xml file, adjusting the type, language, and country values as appropriate to your application:

    <boundaryScanner name="breakIterator" class="solr.highlight.BreakIteratorBoundaryScanner">
      <lst name="defaults">
        <str name="hl.bs.type">WORD</str>
        <str name="hl.bs.language">en</str>
        <str name="hl.bs.country">US</str>
      </lst>
    </boundaryScanner>

Possible values for the hl.bs.type parameter are WORD, LINE, SENTENCE, and CHARACTER.
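As a rough illustration of how the request-side parameters from this section fit together, the following Python sketch builds a query string that highlights one field with the FVH, leaves the other field on the Original Highlighter, and selects the breakIterator boundary scanner. The function name and the field names in the usage lines are hypothetical; the hl and hl.fl parameters are the general highlighting parameters listed earlier in this guide.

```python
from urllib.parse import urlencode


def fvh_highlight_query(query, fvh_field, other_field):
    """Return a URL-encoded query string mixing the Original Highlighter and FVH.

    `fvh_field` must have full term vectors; both field names are
    placeholders supplied by the caller.
    """
    params = [
        ("q", query),
        ("hl", "true"),
        ("hl.fl", f"{fvh_field},{other_field}"),
        # Default highlighter for every field ...
        ("hl.method", "original"),
        # ... overridden per field for the one that should use the FVH.
        (f"f.{fvh_field}.hl.method", "fastVector"),
        # Boundary scanner settings apply to the FVH only.
        ("hl.boundaryScanner", "breakIterator"),
        ("hl.bs.type", "WORD"),
        ("hl.bs.language", "en"),
        ("hl.bs.country", "US"),
    ]
    return urlencode(params)


if __name__ == "__main__":
    # "features" and "name" are example field names, not requirements.
    print(fvh_highlight_query("ipod", "features", "name"))
```

The resulting string can be appended to a search handler URL after a question mark. Keeping the parameters in a list of pairs rather than a dictionary preserves their order, which makes the generated URL easier to compare against the examples in this section.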
The simple Boundary Scanner

The simple boundary scanner scans term boundaries for a specified maximum character value (hl.bs.maxScan) and for common delimiters such as punctuation marks (hl.bs.chars). To implement the simple boundary scanner, add this code to the highlighting section of your solrconfig.xml file, adjusting the values as appropriate to your application:

    <boundaryScanner name="simple" class="solr.highlight.SimpleBoundaryScanner" default="true">
      <lst name="defaults">
        <str name="hl.bs.maxScan">10</str>
        <str name="hl.bs.chars">.,!?\t\n</str>
      </lst>
    </boundaryScanner>

Spell Checking

The SpellCheck component is designed to provide inline query suggestions based on other, similar, terms. The basis for these suggestions can be terms in a field in Solr, externally created text files, or fields in other Lucene indexes.

Configuring the SpellCheckComponent

Define Spell Check in solrconfig.xml

The first step is to specify the source of terms in solrconfig.xml. There are three approaches to spell checking in Solr, discussed below.

IndexBasedSpellChecker

The IndexBasedSpellChecker uses a Solr index as the basis for a parallel index used for spell checking. It requires defining a field as the basis for the index terms; a common practice is to copy terms from some fields (such as title, body, etc.) to another field created for spell checking. Here is a simple example of configuring solrconfig.xml with the IndexBasedSpellChecker:

    <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
      <lst name="spellchecker">
        <str name="classname">solr.IndexBasedSpellChecker</str>
        <str name="spellcheckIndexDir">./spellchecker</str>
        <str name="field">content</str>
        <str name="buildOnCommit">true</str>
      </lst>
    </searchComponent>

The first element defines the searchComponent to use the solr.SpellCheckComponent. The classname is the specific implementation of the SpellCheckComponent, in this case solr.IndexBasedSpellChecker. Defining the classname is optional; if not defined, it will default to IndexBasedSpellChecker.

The spellcheckIndexDir defines the location of the directory that holds the spellcheck index, while the field defines the source field (defined in the Schema) for spell check terms.
When choosing a field for the spellcheck index, it’s best to avoid a heavily processed field to get more accurate results. If the field has many word variations from processing synonyms and/or stemming, the dictionary will be created with those variations in addition to more valid spelling data.

Finally, buildOnCommit defines whether to build the spell check index at every commit (that is, every time new documents are added to the index). It is optional, and can be omitted if you would rather set it to false.

DirectSolrSpellChecker

The DirectSolrSpellChecker uses terms from the Solr index without building a parallel index like the IndexBasedSpellChecker. This spell checker has the benefit of not having to be built regularly, meaning that the terms are always up-to-date with terms in the index. Here is how this might be configured in solrconfig.xml:

    <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
      <lst name="spellchecker">
        <str name="name">default</str>
        <str name="field">name</str>
        <str name="classname">solr.DirectSolrSpellChecker</str>
        <str name="distanceMeasure">internal</str>
        <float name="accuracy">0.5</float>
        <int name="maxEdits">2</int>
        <int name="minPrefix">1</int>
        <int name="maxInspections">5</int>
        <int name="minQueryLength">4</int>
        <float name="maxQueryFrequency">0.01</float>
        <float name="thresholdTokenFrequency">.01</float>
      </lst>
    </searchComponent>

When choosing a field to query for this spell checker, you want one which has relatively little analysis performed on it (particularly analysis such as stemming). Note that you need to specify a field to use for the suggestions, so like the IndexBasedSpellChecker, you may want to copy data from fields like title, body, etc., to a field dedicated to providing spelling suggestions.

Many of the parameters relate to how this spell checker should query the index for term suggestions. The distanceMeasure defines the metric to use during the spell check query. The value "internal" uses the default Levenshtein metric, which is the same metric used with the other spell checker implementations.

Because this spell checker is querying the main index, you may want to limit how often it queries the index to be sure to avoid any performance conflicts with user queries.
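For readers unfamiliar with the Levenshtein metric mentioned above, the following is a minimal Python sketch of the textbook dynamic-programming edit distance. It is only an illustration of what the metric measures; it is not how Solr or Lucene computes candidates internally, and the function name is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions, or substitutions
    needed to turn `a` into `b` (classic dynamic-programming formulation)."""
    # prev[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete ca
                           cur[j - 1] + 1,       # insert cb
                           prev[j - 1] + cost))  # substitute or match
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    # Misspellings used elsewhere in this section are one edit away.
    for wrong, right in (("delll", "dell"), ("toyata", "toyota")):
        print(wrong, right, levenshtein(wrong, right))
```

Seen this way, a limit of one or two edits on a candidate term simply bounds this distance between the query word and the suggested term.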
The accuracy setting defines the threshold for a valid suggestion, while maxEdits defines the number of changes to the term to allow. Since most spelling mistakes are only 1 letter off, setting this to 1 will reduce the number of possible suggestions (the default, however, is 2); the value can only be 1 or 2.

minPrefix defines the minimum number of characters the terms should share. Setting this to 1 means that the spelling suggestions will all start with the same letter, for example.

The maxInspections parameter defines the maximum number of possible matches to review before returning results; the default is 5. minQueryLength defines how many characters must be in the query before suggestions are provided; the default is 4.

First, the spellchecker analyzes incoming query words by looking them up in the index. Only query words that are absent from the index or too rare (below maxQueryFrequency) are considered misspelled and used for finding suggestions. Words that are more frequent than maxQueryFrequency bypass the spellchecker unchanged. After suggestions for every misspelled word are found, they are filtered for sufficient frequency, with thresholdTokenFrequency as the boundary value. These parameters (maxQueryFrequency and thresholdTokenFrequency) can be a percentage (such as .01, or 1%) or an absolute value (such as 4).

FileBasedSpellChecker

The FileBasedSpellChecker uses an external file as a spelling dictionary. This can be useful if using Solr as a spelling server, or if spelling suggestions don’t need to be based on actual terms in the index. In solrconfig.xml, you would define the searchComponent like so:

    <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
      <lst name="spellchecker">
        <str name="classname">solr.FileBasedSpellChecker</str>
        <str name="name">file</str>
        <str name="sourceLocation">spellings.txt</str>
        <str name="characterEncoding">UTF-8</str>
        <str name="spellcheckIndexDir">./spellcheckerFile</str>
      </lst>
    </searchComponent>

The differences here are the use of the sourceLocation to define the location of the file of terms and the use of characterEncoding to define the encoding of the terms file.
Note: In the previous example, name is used to name this specific definition of the spellchecker. Multiple definitions can co-exist in a single solrconfig.xml, and the name helps to differentiate them. If only defining one spellchecker, no name is required.

WordBreakSolrSpellChecker

WordBreakSolrSpellChecker offers suggestions by combining adjacent query terms and/or breaking terms into multiple words. It is a SpellCheckComponent enhancement, leveraging Lucene’s WordBreakSpellChecker. It can detect spelling errors resulting from misplaced whitespace without the use of shingle-based dictionaries and provides collation support for word-break errors, including cases where the user has a mix of single-word spelling errors and word-break errors in the same query. It also provides shard support.

Here is how it might be configured in solrconfig.xml:

    <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
      <lst name="spellchecker">
        <str name="name">wordbreak</str>
        <str name="classname">solr.WordBreakSolrSpellChecker</str>
        <str name="field">lowerfilt</str>
        <str name="combineWords">true</str>
        <str name="breakWords">true</str>
        <int name="maxChanges">10</int>
      </lst>
    </searchComponent>

Some of the parameters will be familiar from the discussion of the other spell checkers, such as name, classname, and field. New for this spell checker is combineWords, which defines whether words should be combined in a dictionary search (default is true); breakWords, which defines if words should be broken during a dictionary search (default is true); and maxChanges, an integer which defines how many times the spell checker should check collation possibilities against the index (default is 10).

The spellchecker can be configured together with a traditional checker (such as DirectSolrSpellChecker). The results are combined and collations can contain a mix of corrections from both spellcheckers.

Add It to a Request Handler

Queries will be sent to a RequestHandler.
If every request should generate a suggestion, then you would add the following to the requestHandler that you are using:

    <str name="spellcheck">true</str>

One of the possible parameters is the spellcheck.dictionary to use, and multiples can be defined. With multiple dictionaries, all specified dictionaries are consulted and results are interleaved. Collations are created with combinations from the different spellcheckers, with care taken that multiple overlapping corrections do not occur in the same collation.

Here is an example with multiple dictionaries (only the contents of the request handler definition are shown):

    <lst name="defaults">
      <str name="spellcheck.dictionary">default</str>
      <str name="spellcheck.dictionary">wordbreak</str>
      <str name="spellcheck.count">20</str>
    </lst>
    <arr name="last-components">
      <str>spellcheck</str>
    </arr>

Spell Check Parameters

The SpellCheck component accepts the parameters described below.

spellcheck
This parameter turns on SpellCheck suggestions for the request. If true, then spelling suggestions will be generated. This is required if spell checking is desired.

spellcheck.q or q
This parameter specifies the query to spellcheck. If spellcheck.q is defined, then it is used; otherwise the original input query is used. The spellcheck.q parameter is intended to be the original query, minus any extra markup like field names, boosts, and so on. If the q parameter is specified, then the SpellingQueryConverter class is used to parse it into tokens; otherwise the WhitespaceTokenizer is used. The choice of which one to use is up to the application. Essentially, if you have a spelling "ready" version in your application, then it is probably better to use spellcheck.q. Otherwise, if you just want Solr to do the job, use the q parameter.

Note: The SpellingQueryConverter class does not deal properly with non-ASCII characters. In this case, you have to either use spellcheck.q or implement your own QueryConverter.

spellcheck.build
If set to true, this parameter creates the dictionary to be used for spell-checking.
In a typical search application, you will need to build the dictionary before using the spell check. However, it’s not always necessary to build a dictionary first. For example, you can configure the spellchecker to use a dictionary that already exists. The dictionary will take some time to build, so this parameter should not be sent with every request.

spellcheck.reload
If set to true, this parameter reloads the spellchecker. The results depend on the implementation of SolrSpellChecker.reload(). In a typical implementation, reloading the spellchecker means reloading the dictionary.

spellcheck.count
This parameter specifies the maximum number of suggestions that the spellchecker should return for a term. If this parameter isn’t set, the value defaults to 1. If the parameter is set but not assigned a number, the value defaults to 5. If the parameter is set to a positive integer, that number becomes the maximum number of suggestions returned by the spellchecker.

spellcheck.queryAnalyzerFieldtype
A field type from Solr’s schema. The analyzer configured for the provided field type is used by the QueryConverter to tokenize the value for "q" parameter. The field type specified by this parameter should do minimal transformations. It’s usually a best practice to avoid types that aggressively stem or NGram, for instance, since those types of analysis can throw off spell checking.

spellcheck.onlyMorePopular
If true, Solr will return suggestions that result in more hits for the query than the existing query. Note that this will return more popular suggestions even when the given query term is present in the index and considered "correct".

spellcheck.maxResultsForSuggest
If, for example, this is set to 5 and the user’s query returns 5 or fewer results, the spellchecker will report "correctlySpelled=false" and also offer suggestions (and collations if requested). Setting this greater than zero is useful for creating "did-you-mean?"
suggestions for queries that return a low number of hits.

spellcheck.alternativeTermCount
Defines the number of suggestions to return for each query term existing in the index and/or dictionary. Presumably, users will want fewer suggestions for words with docFrequency>0. Also, setting this value enables context-sensitive spell suggestions.

spellcheck.extendedResults
If true, this parameter causes Solr to return additional information about spellcheck results, such as the frequency of each original term in the index (origFreq) as well as the frequency of each suggestion in the index (frequency). Note that this result format differs from the non-extended one as the returned suggestion for a word is actually an array of lists, where each list holds the suggested term and its frequency.

spellcheck.collate
If true, this parameter directs Solr to take the best suggestion for each token (if one exists) and construct a new query from the suggestions. For example, if the input query was "jawa class lording" and the best suggestion for "jawa" was "java" and "lording" was "loading", then the resulting collation would be "java class loading".

The spellcheck.collate parameter only returns collations that are guaranteed to result in hits if requeried, even when applying original fq parameters. This is especially helpful when there is more than one correction per query.

Note: This only returns a query to be used. It does not actually run the suggested query.

spellcheck.maxCollations
The maximum number of collations to return. The default is 1. This parameter is ignored if spellcheck.collate is false.

spellcheck.maxCollationTries
This parameter specifies the number of collation possibilities for Solr to try before giving up. Lower values ensure better performance. Higher values may be necessary to find a collation that can return results.
The default value is 0, which is equivalent to not checking collations. This parameter is ignored if spellcheck.collate is false.

spellcheck.maxCollationEvaluations
This parameter specifies the maximum number of word correction combinations to rank and evaluate prior to deciding which collation candidates to test against the index. This is a performance safety-net in case a user enters a query with many misspelled words. The default is 10000 combinations, which should work well in most situations.

spellcheck.collateExtendedResults
If true, this parameter returns an expanded response format detailing the collations Solr found. The default value is false and this is ignored if spellcheck.collate is false.

spellcheck.collateMaxCollectDocs
This parameter specifies the maximum number of documents that should be collected when testing potential collations against the index. A value of 0 indicates that all documents should be collected, resulting in exact hit-counts. Otherwise an estimation is provided as a performance optimization in cases where exact hit-counts are unnecessary; the higher the value specified, the more precise the estimation.

The default value for this parameter is 0, but when spellcheck.collateExtendedResults is false, the optimization is always used as if 1 had been specified.

spellcheck.collateParam.* Prefix
This parameter prefix can be used to specify any additional parameters that you want the spellchecker to use when internally validating collation queries. For example, even if your regular search results allow for loose matching of one or more query terms via parameters like q.op=OR and mm=20%, you can specify override parameters such as spellcheck.collateParam.q.op=AND&spellcheck.collateParam.mm=100% to require that only collations consisting of words that are all found in at least one document may be returned.
spellcheck.dictionary
This parameter causes Solr to use the dictionary named in the parameter’s argument. The default setting is default. This parameter can be used to invoke a specific spellchecker on a per request basis.

spellcheck.accuracy
Specifies an accuracy value to be used by the spell checking implementation to decide whether a result is worthwhile or not. The value is a float between 0 and 1. Defaults to Float.MIN_VALUE.

spellcheck.<DICT_NAME>.key
Specifies a key/value pair for the implementation handling a given dictionary. The value that is passed through is just key=value (spellcheck.<DICT_NAME>. is stripped off).

For example, given a dictionary called foo, spellcheck.foo.myKey=myValue would result in myKey=myValue being passed through to the implementation handling the dictionary foo.

Spell Check Example

Using Solr’s bin/solr -e techproducts example, this query shows the results of a simple request that defines a query using the spellcheck.q parameter, and forces the collations to require that all input terms match:

    http://localhost:8983/solr/techproducts/spell?df=text&spellcheck.q=delll+ultra+sharp&spellcheck=true&spellcheck.collateParam.q.op=AND&wt=xml

Results:

    <lst name="spellcheck">
      <lst name="suggestions">
        <lst name="delll">
          <int name="numFound">1</int>
          <int name="startOffset">0</int>
          <int name="endOffset">5</int>
          <int name="origFreq">0</int>
          <arr name="suggestion">
            <lst>
              <str name="word">dell</str>
              <int name="freq">1</int>
            </lst>
          </arr>
        </lst>
        <lst name="ultra sharp">
          <int name="numFound">1</int>
          <int name="startOffset">6</int>
          <int name="endOffset">17</int>
          <int name="origFreq">0</int>
          <arr name="suggestion">
            <lst>
              <str name="word">ultrasharp</str>
              <int name="freq">1</int>
            </lst>
          </arr>
        </lst>
      </lst>
      <bool name="correctlySpelled">false</bool>
      <lst name="collations">
        <lst name="collation">
          <str name="collationQuery">dell ultrasharp</str>
          <int name="hits">1</int>
          <lst name="misspellingsAndCorrections">
            <str name="delll">dell</str>
            <str name="ultra sharp">ultrasharp</str>
          </lst>
        </lst>
      </lst>
    </lst>

Distributed SpellCheck

The SpellCheckComponent also supports spellchecking on distributed indexes. If you are using the SpellCheckComponent on a request handler other than "/select", you must provide the following two parameters:

shards
Specifies the shards in your distributed indexing configuration. For more information about distributed indexing, see Distributed Search with Index Sharding.

shards.qt
Specifies the request handler Solr uses for requests to shards.
This parameter is not required for the /select request handler.

For example:

    http://localhost:8983/solr/techproducts/spell?spellcheck=true&spellcheck.build=true&spellcheck.q=toyata&shards.qt=/spell&shards=solr-shard1:8983/solr/techproducts,solr-shard2:8983/solr/techproducts

In case of a distributed request to the SpellCheckComponent, the shards are requested for at least five suggestions even if the spellcheck.count parameter value is less than five. Once the suggestions are collected, they are ranked by the configured distance measure (Levenshtein distance by default) and then by aggregate frequency.

Query Re-Ranking

Query Re-Ranking allows you to run a simple query (A) for matching documents and then re-rank the top N documents using the scores from a more complex query (B). Since the more costly ranking from query B is only applied to the top N documents, it will have less impact on performance than just using the complex query B by itself. The trade-off is that documents which score very low using the simple query A may not be considered during the re-ranking phase, even if they would score very highly using query B.

Specifying a Ranking Query

A ranking query can be specified using the rq request parameter. The rq parameter must specify a query string that, when parsed, produces a RankQuery.

Three rank queries are currently included in the Solr distribution. You can also configure a custom QParserPlugin you have written, but most users can just use a parser provided with Solr.
Parser | QParserPlugin class
rerank | ReRankQParserPlugin
xport | ExportQParserPlugin
ltr | LTRQParserPlugin

ReRank Query Parser

The rerank parser wraps a query specified by a local parameter, along with additional parameters indicating how many documents should be re-ranked, and how the final scores should be computed:

reRankQuery
The query string for your complex ranking query. In most cases a variable will be used to refer to another request parameter. This parameter is required.

reRankDocs
The number of top N documents from the original query that should be re-ranked. This number will be treated as a minimum, and may be increased internally automatically in order to rank enough documents to satisfy the query (i.e., start+rows). The default is 200.

reRankWeight
A multiplicative factor that will be applied to the score from the reRankQuery for each of the top matching documents, before that score is added to the original score. The default is 2.0.

In the example below, the top 1000 documents matching the query "greetings" will be re-ranked using the query "(hi hello hey hiya)". The resulting scores for each of those 1000 documents will be 3 times their score from the "(hi hello hey hiya)" query, plus the score from the original "greetings" query:

    q=greetings&rq={!rerank reRankQuery=$rqq reRankDocs=1000 reRankWeight=3}&rqq=(hi+hello+hey+hiya)

If a document matches the original query, but does not match the re-ranking query, the document’s original score will remain.

LTR Query Parser

The ltr parser name stands for Learning To Rank. Please see Learning To Rank for more detailed information.

Combining Ranking Queries with Other Solr Features

The rq parameter and the re-ranking feature in general work well with other Solr features. For example, it can be used in conjunction with the collapse parser to re-rank the group heads after they’ve been collapsed.
It also preserves the order of documents elevated by the elevation component. And it even has its own custom explain so you can see how the re-ranking scores were derived when looking at debug information.

Learning To Rank

With the Learning To Rank (or LTR for short) contrib module you can configure and run machine learned ranking models in Solr.

The module also supports feature extraction inside Solr. The only thing you need to do outside Solr is train your own ranking model.

Learning to Rank Concepts

Re-Ranking

Re-Ranking allows you to run a simple query for matching documents and then re-rank the top N documents using the scores from a different, more complex query. This page describes the use of LTR complex queries; information on other rank queries included in the Solr distribution can be found on the Query Re-Ranking page.

Learning To Rank Models

In information retrieval systems, Learning to Rank is used to re-rank the top N retrieved documents using trained machine learning models. The hope is that such sophisticated models can make more nuanced ranking decisions than standard ranking functions like TF-IDF or BM25.

Ranking Model

A ranking model computes the scores used to rerank documents. Irrespective of any particular algorithm or implementation, a ranking model’s computation can use three types of inputs:

• parameters that represent the scoring algorithm
• features that represent the document being scored
• features that represent the query for which the document is being scored

Feature

A feature is a value, a number, that represents some quantity or quality of the document being scored or of the query for which documents are being scored. For example, documents often have a 'recency' quality, and 'number of past purchases' might be a quantity that is passed to Solr as part of the search query.
Normalizer

Some ranking models expect features on a particular scale. A normalizer can be used to translate arbitrary feature values into normalized values, e.g., on a 0..1 or 0..100 scale.

Training Models

Feature Engineering

The LTR contrib module includes several feature classes as well as support for custom features. Each feature class’s javadocs contain an example to illustrate use of that class. The process of feature engineering itself is then entirely up to your domain expertise and creativity.

Feature | Class | Example parameters | External Feature Information
field length | FieldLengthFeature | {"field":"title"} | not (yet) supported
field value | FieldValueFeature | {"field":"hits"} | not (yet) supported
original score | OriginalScoreFeature | {} | not applicable
solr query | SolrFeature | {"q":"{!func}recip(ms(NOW,last_modified),3.16e-11,1,1)"} | supported
solr filter query | SolrFeature | {"fq":["{!terms f=category}book"]} | supported
solr query + filter query | SolrFeature | {"q":"{!func}recip(ms(NOW,last_modified),3.16e-11,1,1)", "fq":["{!terms f=category}book"]} | supported
value | ValueFeature | {"value":"${userFromMobile}","required":true} | supported
(custom) | (custom class extending Feature) | | supported

Normalizer | Class | Example parameters
Identity | IdentityNormalizer | {}
MinMax | MinMaxNormalizer | {"min":"0", "max":"50"}
Standard | StandardNormalizer | {"avg":"42","std":"6"}
(custom) | (custom class extending Normalizer) |

Feature Extraction

The ltr contrib module includes a [features] transformer to support the calculation and return of feature values for feature extraction purposes, including and especially when you do not yet have an actual reranking model.

Feature Selection and Model Training

Feature selection and model training take place offline and outside Solr. The ltr contrib module supports two generalized forms of models as well as custom models.
Each model class’s javadocs contain an example to illustrate configuration of that class. Your trained model or models (e.g., different models for different customer geographies) can then be uploaded directly into Solr as JSON files using the provided REST APIs.

General form | Class | Specific examples
Linear | LinearModel | RankSVM, Pranking
Multiple Additive Trees | MultipleAdditiveTreesModel | LambdaMART, Gradient Boosted Regression Trees (GBRT)
Neural Network | NeuralNetworkModel | RankNet
(wrapper) | DefaultWrapperModel | (not applicable)
(custom) | (custom class extending AdapterModel) | (not applicable)
(custom) | (custom class extending LTRScoringModel) | (not applicable)

Quick Start with LTR

The "techproducts" example included with Solr is pre-configured with the plugins required for learning-to-rank, but they are disabled by default.

To enable the plugins, specify the solr.ltr.enabled JVM System Property when running the example:

    bin/solr start -e techproducts -Dsolr.ltr.enabled=true

Uploading Features

To upload features in a /path/myFeatures.json file, run:

    curl -XPUT 'http://localhost:8983/solr/techproducts/schema/feature-store' --data-binary "@/path/myFeatures.json" -H 'Content-type:application/json'

To view the features you just uploaded, open the following URL in a browser:

    http://localhost:8983/solr/techproducts/schema/feature-store/_DEFAULT_

Example: /path/myFeatures.json

    [
      {
        "name" : "documentRecency",
        "class" : "org.apache.solr.ltr.feature.SolrFeature",
        "params" : {
          "q" : "{!func}recip( ms(NOW,last_modified), 3.16e-11, 1, 1)"
        }
      },
      {
        "name" : "isBook",
        "class" : "org.apache.solr.ltr.feature.SolrFeature",
        "params" : {
          "fq": ["{!terms f=cat}book"]
        }
      },
      {
        "name" : "originalScore",
        "class" : "org.apache.solr.ltr.feature.OriginalScoreFeature",
        "params" : {}
      }
    ]

Extracting Features

To extract features as part of a query, add
[features] to the fl parameter, for example:

    http://localhost:8983/solr/techproducts/query?q=test&fl=id,score,[features]

The response will include feature values as a comma-separated list, resembling the output shown here:

    {
      "responseHeader":{
        "status":0,
        "QTime":0,
        "params":{
          "q":"test",
          "fl":"id,score,[features]"}},
      "response":{"numFound":2,"start":0,"maxScore":1.959392,"docs":[
          {
            "id":"GB18030TEST",
            "score":1.959392,
            "[features]":"documentRecency=0.020893794,isBook=0.0,originalScore=1.959392"},
          {
            "id":"UTF8TEST",
            "score":1.5513437,
            "[features]":"documentRecency=0.020893794,isBook=0.0,originalScore=1.5513437"}]
      }}

Uploading a Model

To upload the model in a /path/myModel.json file, run:

    curl -XPUT 'http://localhost:8983/solr/techproducts/schema/model-store' --data-binary "@/path/myModel.json" -H 'Content-type:application/json'

To view the model you just uploaded, open the following URL in a browser:

    http://localhost:8983/solr/techproducts/schema/model-store

Example: /path/myModel.json

    {
      "class" : "org.apache.solr.ltr.model.LinearModel",
      "name" : "myModel",
      "features" : [
        { "name" : "documentRecency" },
        { "name" : "isBook" },
        { "name" : "originalScore" }
      ],
      "params" : {
        "weights" : {
          "documentRecency" : 1.0,
          "isBook" : 0.1,
          "originalScore" : 0.5
        }
      }
    }

Running a Rerank Query

To rerank the results of a query, add the rq parameter to your search, for example:

    http://localhost:8983/solr/techproducts/query?q=test&rq={!ltr model=myModel reRankDocs=100}&fl=id,score

The addition of the rq parameter will not change the format of the search output.
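To make the reranked scores easier to interpret, the following sketch parses a dense [features] string and applies the weights from the myModel.json example above. It assumes that a linear model scores a document as the weighted sum of its feature values; the function names parse_features and linear_score are hypothetical and not part of Solr.

```python
def parse_features(csv):
    """Turn a dense feature string such as 'a=0.1,b=0.2' into a dict of floats."""
    pairs = (item.split("=", 1) for item in csv.split(","))
    return {name: float(value) for name, value in pairs}


def linear_score(weights, features):
    """Assumed linear model: sum of weight * feature value over the model's features."""
    return sum(weight * features[name] for name, weight in weights.items())


# Weights copied from the myModel.json example.
MY_MODEL_WEIGHTS = {"documentRecency": 1.0, "isBook": 0.1, "originalScore": 0.5}

if __name__ == "__main__":
    # Feature values copied from the GB18030TEST document in the extraction output.
    fv = parse_features("documentRecency=0.020893794,isBook=0.0,originalScore=1.959392")
    print(linear_score(MY_MODEL_WEIGHTS, fv))
```

For the GB18030TEST feature values this comes out at about 1.0006, that is, 1.0 * 0.020893794 + 0.1 * 0.0 + 0.5 * 1.959392.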
To obtain the feature values computed during reranking, add [features] to the fl parameter, for example:

    http://localhost:8983/solr/techproducts/query?q=test&rq={!ltr model=myModel reRankDocs=100}&fl=id,score,[features]

The response will include feature values as a comma-separated list, resembling the output shown here:

    {
      "responseHeader":{
        "status":0,
        "QTime":0,
        "params":{
          "q":"test",
          "fl":"id,score,[features]",
          "rq":"{!ltr model=myModel reRankDocs=100}"}},
      "response":{"numFound":2,"start":0,"maxScore":1.0005897,"docs":[
          {
            "id":"GB18030TEST",
            "score":1.0005897,
            "[features]":"documentRecency=0.020893792,isBook=0.0,originalScore=1.959392"},
          {
            "id":"UTF8TEST",
            "score":0.79656565,
            "[features]":"documentRecency=0.020893792,isBook=0.0,originalScore=1.5513437"}]
      }}

External Feature Information

The ValueFeature and SolrFeature classes support the use of external feature information, efi for short.

Uploading Features

To upload features in a /path/myEfiFeatures.json file, run:

    curl -XPUT 'http://localhost:8983/solr/techproducts/schema/feature-store' --data-binary "@/path/myEfiFeatures.json" -H 'Content-type:application/json'

To view the features you just uploaded, open the following URL in a browser:

    http://localhost:8983/solr/techproducts/schema/feature-store/myEfiFeatureStore

Example: /path/myEfiFeatures.json

    [
      {
        "store" : "myEfiFeatureStore",
        "name" : "isPreferredManufacturer",
        "class" : "org.apache.solr.ltr.feature.SolrFeature",
        "params" : { "fq" : [ "{!field f=manu}${preferredManufacturer}" ] }
      },
      {
        "store" : "myEfiFeatureStore",
        "name" : "userAnswerValue",
        "class" : "org.apache.solr.ltr.feature.ValueFeature",
        "params" : { "value" : "${answer:42}" }
      },
      {
        "store" : "myEfiFeatureStore",
        "name" : "userFromMobileValue",
        "class" : "org.apache.solr.ltr.feature.ValueFeature",
        "params" : { "value" : "${fromMobile}", "required" : true }
      },
      {
        "store" : "myEfiFeatureStore",
        "name" : "userTextCat",
        "class" : "org.apache.solr.ltr.feature.SolrFeature",
        "params" : { "q" : "{!field f=cat}${text}" }
      }
    ]

As an aside, you may have noticed that the myEfiFeatures.json example uses "store":"myEfiFeatureStore" attributes: read more about feature stores in the LTR Lifecycle section of this page.

Extracting Features

To extract myEfiFeatureStore features as part of a query, add efi.* parameters to the [features] part of the fl parameter, for example:

    http://localhost:8983/solr/techproducts/query?q=test&fl=id,cat,manu,score,[features store=myEfiFeatureStore efi.text=test efi.preferredManufacturer=Apache efi.fromMobile=1]

    http://localhost:8983/solr/techproducts/query?q=test&fl=id,cat,manu,score,[features store=myEfiFeatureStore efi.text=test efi.preferredManufacturer=Apache efi.fromMobile=0 efi.answer=13]

Uploading a Model

To upload the model in a /path/myEfiModel.json file, run:

    curl -XPUT 'http://localhost:8983/solr/techproducts/schema/model-store' --data-binary "@/path/myEfiModel.json" -H 'Content-type:application/json'

To view the model you just uploaded, open the following URL in a browser:

    http://localhost:8983/solr/techproducts/schema/model-store

Example: /path/myEfiModel.json

    {
      "store" : "myEfiFeatureStore",
      "name" : "myEfiModel",
      "class" : "org.apache.solr.ltr.model.LinearModel",
      "features" : [
        { "name" : "isPreferredManufacturer" },
        { "name" : "userAnswerValue" },
        { "name" : "userFromMobileValue" },
        { "name" : "userTextCat" }
      ],
      "params" : {
        "weights" : {
          "isPreferredManufacturer" : 0.2,
          "userAnswerValue" : 1.0,
          "userFromMobileValue" : 1.0,
          "userTextCat" : 0.1
        }
      }
    }

Running a Rerank Query

To obtain the feature values computed during reranking, add [features] to the fl parameter and efi.* parameters to the rq parameter, for example:
    http://localhost:8983/solr/techproducts/query?q=test&rq={!ltr model=myEfiModel efi.text=test efi.preferredManufacturer=Apache efi.fromMobile=1}&fl=id,cat,manu,score,[features]

    http://localhost:8983/solr/techproducts/query?q=test&rq={!ltr model=myEfiModel efi.text=test efi.preferredManufacturer=Apache efi.fromMobile=0 efi.answer=13}&fl=id,cat,manu,score,[features]

Notice the absence of efi.* parameters in the [features] part of the fl parameter.

Extracting Features While Reranking

To extract features for myEfiFeatureStore while still reranking with myModel:

    http://localhost:8983/solr/techproducts/query?q=test&rq={!ltr model=myModel}&fl=id,cat,manu,score,[features store=myEfiFeatureStore efi.text=test efi.preferredManufacturer=Apache efi.fromMobile=1]

Notice the absence of efi.* parameters in the rq parameter (because myModel does not use efi features) and the presence of efi.* parameters in the [features] part of the fl parameter (because myEfiFeatureStore contains efi features).

Read more about model evolution in the LTR Lifecycle section of this page.

Training Example

Example training data and a demo train_and_upload_demo_model.py script can be found in the solr/contrib/ltr/example folder in the Apache lucene-solr Git repository (mirrored on github.com). This example folder is not shipped in the Solr binary release.

Installation of LTR

The ltr contrib module requires the dist/solr-ltr-*.jar JARs.

LTR Configuration

Learning-To-Rank is a contrib module and therefore its plugins must be configured in solrconfig.xml.

Minimum Requirements

• Include the required contrib JARs. Note that by default paths are relative to the Solr core, so they may need adjustments to your configuration, or an explicit specification of the $solr.install.dir.
• Declaration of the ltr query parser.
• Configuration of the feature values cache.
• Declaration of the [features] transformer.

    QUERY_DOC_FV

Advanced Options

LTRThreadModule

A thread module can be configured for the query parser and/or the transformer to parallelize the creation of feature weights. For details, please refer to the LTRThreadModule javadocs.

Feature Vector Customization

The features transformer returns dense CSV values such as featureA=0.1,featureB=0.2,featureC=0.3,featureD=0.0.

For sparse CSV output such as featureA:0.1 featureB:0.2 featureC:0.3 you can customize the feature logger transformer declaration in solrconfig.xml as follows:

    QUERY_DOC_FV sparse :

Implementation and Contributions

How does Solr Learning-To-Rank work under the hood?

Please refer to the ltr javadocs for an implementation overview.

How could I write additional models and/or features?

Contributions for further models, features and normalizers are welcome. Related links:

• LTRScoringModel javadocs
• Feature javadocs
• Normalizer javadocs
• http://wiki.apache.org/solr/HowToContribute
• http://wiki.apache.org/lucene-java/HowToContribute

LTR Lifecycle

Feature Stores

It is recommended that you organise all your features into stores, which are akin to namespaces:

• Features within a store must be named uniquely.
• Across stores identical or similar features can share the same name.
• If no store name is specified then the default _DEFAULT_ feature store will be used.

To discover the names of all your feature stores:

    http://localhost:8983/solr/techproducts/schema/feature-store

To inspect the content of the commonFeatureStore feature store:

    http://localhost:8983/solr/techproducts/schema/feature-store/commonFeatureStore

Models

• A model uses features from exactly one feature store.
• If no store is specified then the default _DEFAULT_ feature store will be used.
• A model need not use all the features defined in a feature store.
• Multiple models can use the same feature store.

To extract the features of currentFeatureStore:

    http://localhost:8983/solr/techproducts/query?q=test&fl=id,score,[features store=currentFeatureStore]

To extract the features of nextFeatureStore whilst reranking with currentModel based on currentFeatureStore:

    http://localhost:8983/solr/techproducts/query?q=test&rq={!ltr model=currentModel reRankDocs=100}&fl=id,score,[features store=nextFeatureStore]

To view all models:

    http://localhost:8983/solr/techproducts/schema/model-store

To delete the currentModel model:

    curl -XDELETE 'http://localhost:8983/solr/techproducts/schema/model-store/currentModel'

Note: A feature store may be deleted only when there are no models using it.

To delete the currentFeatureStore feature store:

    curl -XDELETE 'http://localhost:8983/solr/techproducts/schema/feature-store/currentFeatureStore'

Using large models

With SolrCloud, large models may fail to upload because of the limit on ZooKeeper’s buffer size. In this case, DefaultWrapperModel may help you to separate the model definition from the uploaded file.

Suppose you want to use a large model stored at /path/to/models/myModel.json through DefaultWrapperModel:

    {
      "store" : "largeModelsFeatureStore",
      "name" : "myModel",
      "class" : ...,
      "features" : [ ... ],
      "params" : { ...
      }
    }

First, add the directory to Solr’s resource paths using a lib directive in solrconfig.xml.

Then, configure DefaultWrapperModel to wrap myModel.json:

    {
      "store" : "largeModelsFeatureStore",
      "name" : "myWrapperModel",
      "class" : "org.apache.solr.ltr.model.DefaultWrapperModel",
      "params" : { "resource" : "myModel.json" }
    }

myModel.json will be loaded during initialization and can then be used by specifying model=myWrapperModel.

Note: No "features" are configured in myWrapperModel because the features of the wrapped model (myModel) will be used. Also note that the "store" configured for the wrapper model must match that of the wrapped model, i.e., in this example the feature store called largeModelsFeatureStore is used.

Note: A lib directive that points at the model files themselves, rather than at their directory, does not work as expected in this case, because SolrResourceLoader treats the given resources as JARs when the directive indicates files.

Applying Changes

The feature store and the model store are both Managed Resources. Changes made to managed resources are not applied to the active Solr components until the Solr collection (or Solr core in single server mode) is reloaded.

LTR Examples

One Feature Store, Multiple Ranking Models

• leftModel and rightModel both use features from commonFeatureStore, and the only difference between the two models is the weights attached to each feature.
• Conventions used:
  ◦ commonFeatureStore.json file contains features for the commonFeatureStore feature store
  ◦ leftModel.json file contains the model named leftModel
  ◦ rightModel.json file contains the model named rightModel
  ◦ The model’s features and weights are sorted alphabetically by name; this makes it easy to see the commonalities and differences between the two models.
  ◦ The store’s features are sorted alphabetically by name; this makes it easy to look up features used in the models.

Example: /path/commonFeatureStore.json

    [
      {
        "store" : "commonFeatureStore",
        "name" : "documentRecency",
        "class" : "org.apache.solr.ltr.feature.SolrFeature",
        "params" : {
          "q" : "{!func}recip( ms(NOW,last_modified), 3.16e-11, 1, 1)"
        }
      },
      {
        "store" : "commonFeatureStore",
        "name" : "isBook",
        "class" : "org.apache.solr.ltr.feature.SolrFeature",
        "params" : {
          "fq": [ "{!terms f=category}book" ]
        }
      },
      {
        "store" : "commonFeatureStore",
        "name" : "originalScore",
        "class" : "org.apache.solr.ltr.feature.OriginalScoreFeature",
        "params" : {}
      }
    ]

Example: /path/leftModel.json

    {
      "store" : "commonFeatureStore",
      "name" : "leftModel",
      "class" : "org.apache.solr.ltr.model.LinearModel",
      "features" : [
        { "name" : "documentRecency" },
        { "name" : "isBook" },
        { "name" : "originalScore" }
      ],
      "params" : {
        "weights" : {
          "documentRecency" : 0.1,
          "isBook" : 1.0,
          "originalScore" : 0.5
        }
      }
    }

Example: /path/rightModel.json

    {
      "store" : "commonFeatureStore",
      "name" : "rightModel",
      "class" : "org.apache.solr.ltr.model.LinearModel",
      "features" : [
        { "name" : "documentRecency" },
        { "name" : "isBook" },
        { "name" : "originalScore" }
      ],
      "params" : {
        "weights" : {
          "documentRecency" : 1.0,
          "isBook" : 0.1,
          "originalScore" : 0.5
        }
      }
    }

Model Evolution

• linearModel201701 uses features from featureStore201701
• treesModel201702 uses features from featureStore201702
• linearModel201701 and treesModel201702 and their feature stores can co-exist whilst both are needed.
• When linearModel201701 has been deleted then featureStore201701 can also be deleted.
• Conventions used:
  ◦ <store>.json file contains features for the <store> feature store
  ◦ <model>.json file contains the model named <model>
  ◦ a 'generation' id (e.g., YYYYMM year-month) is part of the feature store and model names
  ◦ The model’s features and weights are sorted alphabetically by name; this makes it easy to see the commonalities and differences between the two models.
  ◦ The store’s features are sorted alphabetically by name; this makes it easy to see the commonalities and differences between the two feature stores.

Example: /path/featureStore201701.json

    [
      {
        "store" : "featureStore201701",
        "name" : "documentRecency",
        "class" : "org.apache.solr.ltr.feature.SolrFeature",
        "params" : {
          "q" : "{!func}recip( ms(NOW,last_modified), 3.16e-11, 1, 1)"
        }
      },
      {
        "store" : "featureStore201701",
        "name" : "isBook",
        "class" : "org.apache.solr.ltr.feature.SolrFeature",
        "params" : {
          "fq": [ "{!terms f=category}book" ]
        }
      },
      {
        "store" : "featureStore201701",
        "name" : "originalScore",
        "class" : "org.apache.solr.ltr.feature.OriginalScoreFeature",
        "params" : {}
      }
    ]

Example: /path/linearModel201701.json

    {
      "store" : "featureStore201701",
      "name" : "linearModel201701",
      "class" : "org.apache.solr.ltr.model.LinearModel",
      "features" : [
        { "name" : "documentRecency" },
        { "name" : "isBook" },
        { "name" : "originalScore" }
      ],
      "params" : {
        "weights" : {
          "documentRecency" : 0.1,
          "isBook" : 1.0,
          "originalScore" : 0.5
        }
      }
    }

Example: /path/featureStore201702.json

    [
      {
        "store" : "featureStore201702",
        "name" : "isBook",
        "class" : "org.apache.solr.ltr.feature.SolrFeature",
        "params" : {
          "fq": [ "{!terms f=category}book" ]
        }
      },
      {
        "store" : "featureStore201702",
        "name" : "originalScore",
        "class" : "org.apache.solr.ltr.feature.OriginalScoreFeature",
        "params" : {}
      }
    ]
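treesModel201702, which uses the featureStore201702 features just listed, is a tree-based model rather than a linear one. The sketch below shows one plausible way such a model could be evaluated: each tree is walked from its root to a leaf, and the leaf values are summed after multiplying by the tree weights. The TREES literal mirrors the "trees" parameter of the treesModel201702.json example on this page. The rule that values less than or equal to the threshold go left is an assumption made for this sketch, as are the function names; consult the MultipleAdditiveTreesModel javadocs for the authoritative behaviour.

```python
def score_node(node, features):
    """Walk one regression tree and return the value of the leaf that is reached."""
    if "value" in node:  # leaf node
        return float(node["value"])
    # Assumption: feature values <= threshold branch left, larger values branch right.
    if features[node["feature"]] <= float(node["threshold"]):
        return score_node(node["left"], features)
    return score_node(node["right"], features)


def score_trees(trees, features):
    """Sum of each tree's leaf value multiplied by that tree's weight."""
    return sum(float(tree["weight"]) * score_node(tree["root"], features)
               for tree in trees)


# Mirrors the "trees" parameter of the treesModel201702.json example.
TREES = [
    {"weight": "1",
     "root": {"feature": "isBook", "threshold": "0.5",
              "left": {"value": "-100"},
              "right": {"feature": "originalScore", "threshold": "10.0",
                        "left": {"value": "50"},
                        "right": {"value": "75"}}}},
    {"weight": "2", "root": {"value": "-10"}},
]

if __name__ == "__main__":
    print(score_trees(TREES, {"isBook": 1.0, "originalScore": 1.96}))
    print(score_trees(TREES, {"isBook": 0.0, "originalScore": 1.96}))
```

Under these assumptions a book with a low original score reaches the leaf 50 in the first tree and -10 in the second, giving 1 * 50 + 2 * (-10) = 30, while a non-book gives 1 * (-100) + 2 * (-10) = -120.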
Example: /path/treesModel201702.json

    {
      "store" : "featureStore201702",
      "name" : "treesModel201702",
      "class" : "org.apache.solr.ltr.model.MultipleAdditiveTreesModel",
      "features" : [
        { "name" : "isBook" },
        { "name" : "originalScore" }
      ],
      "params" : {
        "trees" : [
          {
            "weight" : "1",
            "root" : {
              "feature" : "isBook",
              "threshold" : "0.5",
              "left" : { "value" : "-100" },
              "right" : {
                "feature" : "originalScore",
                "threshold" : "10.0",
                "left" : { "value" : "50" },
                "right" : { "value" : "75" }
              }
            }
          },
          {
            "weight" : "2",
            "root" : { "value" : "-10" }
          }
        ]
      }
    }

Additional LTR Resources

• "Learning to Rank in Solr" presentation at Lucene/Solr Revolution 2015 in Austin:
  ◦ Slides: http://www.slideshare.net/lucidworks/learning-to-rank-in-solr-presented-by-michael-nilssondiego-ceccarelli-bloomberg-lp
  ◦ Video: https://www.youtube.com/watch?v=M7BKwJoh96s

Transforming Result Documents

Document Transformers can be used to modify the information returned about each document in the results of a query.

Using Document Transformers

When executing a request, a document transformer can be used by including it in the fl parameter using square brackets, for example:

    fl=id,name,score,[shard]

Some transformers allow, or require, local parameters, which can be specified as key-value pairs inside the brackets:

    fl=id,name,score,[explain style=nl]

As with regular fields, you can change the key used when a Transformer adds a field to a document via a prefix:

    fl=id,name,score,my_val_a:[value v=42 t=int],my_val_b:[value v=7 t=float]

The sections below discuss exactly what these various transformers do.
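Because transformer expressions contain brackets, spaces, colons and equals signs, a client that builds request URLs programmatically has to percent-encode the fl value. The sketch below shows one way to do this with Python's standard library. The helper name build_query is hypothetical, and the host, core and /query handler are simply the ones used in examples elsewhere in this guide.

```python
from urllib.parse import parse_qs, urlencode


def build_query(base_url, q, fields, transformers):
    """Return a request URL whose fl combines plain fields and [transformers]."""
    fl = ",".join(list(fields) + list(transformers))
    # urlencode percent-encodes brackets, colons and '=' and turns spaces into '+'.
    return base_url + "?" + urlencode({"q": q, "fl": fl})


if __name__ == "__main__":
    url = build_query(
        "http://localhost:8983/solr/techproducts/query",  # assumed host, core and handler
        "*:*",
        ["id", "name", "score"],
        ["[explain style=nl]", "my_val_a:[value v=42 t=int]"],
    )
    print(url)
    # Decoding the query string again recovers the original fl value.
    print(parse_qs(url.split("?", 1)[1])["fl"][0])
```

Keeping the plain fields and the transformers in separate lists is only a convenience of this sketch; Solr itself sees a single comma-separated fl value.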
Available Transformers

[value] - ValueAugmenterFactory

Modifies every document to include the exact same value, as if it were a stored field in every document:

    q=*:*&fl=id,greeting:[value v='hello']&wt=xml

The above query would produce results like the following:

    1 hello ...

By default, values are returned as a String, but a “t” parameter can be specified using a value of int, float, double, or date to force a specific return type:

    q=*:*&fl=id,my_number:[value v=42 t=int],my_string:[value v=42]

In addition to using these request parameters, you can configure additional named instances of ValueAugmenterFactory, or override the default behavior of the existing [value] transformer in your solrconfig.xml file:

    5 5

The “value” option forces an explicit value to always be used, while the “defaultValue” option provides a default that can still be overridden using the “v” and “t” local parameters.

[explain] - ExplainAugmenterFactory

Augments each document with an inline explanation of its score, exactly like the information available about each document in the debug section:

    q=features:cache&fl=id,[explain style=nl]

Supported values for style are text, html, and nl, which returns the information as structured data:

    {
      "response":{"numFound":2,"start":0,"docs":[
          {
            "id":"6H500F0",
            "[explain]":{
              "match":true,
              "value":1.052226,
              "description":"weight(features:cache in 2) [DefaultSimilarity], result of:",
              "details":[{ }]}}]}}

A default style can be configured by specifying an "args" parameter in your configuration:

    nl

[child] - ChildDocTransformerFactory

This transformer returns all descendant documents of each parent document matching your query in a flat list nested inside the matching parent document.
This is useful when you have indexed nested child documents and want to retrieve the child documents for the relevant parent documents for any type of search query.

    fl=id,[child parentFilter=doc_type:book childFilter=doc_type:chapter limit=100]

Note that this transformer can be used even though the query itself is not a Block Join query.

When using this transformer, the parentFilter parameter must be specified, and it works the same as in all Block Join Queries. Additional optional parameters are:

• childFilter - query to filter which child documents should be included; this can be particularly useful when you have multiple levels of hierarchical documents (default: all children)
• limit - the maximum number of child documents to be returned per parent document (default: 10)

[shard] - ShardAugmenterFactory

This transformer adds information about what shard each individual document came from in a distributed request.

ShardAugmenterFactory does not support any request parameters or configuration options.

[docid] - DocIdAugmenterFactory

This transformer adds the internal Lucene document id to each document. This is primarily useful for debugging purposes.

DocIdAugmenterFactory does not support any request parameters or configuration options.

[elevated] and [excluded]

These transformers are available only when using the Query Elevation Component.

• [elevated] annotates each document to indicate if it was elevated or not.
• [excluded] annotates each document to indicate if it would have been excluded; this is only supported if you also use the markExcludes parameter.
    fl=id,[elevated],[excluded]&excludeIds=GB18030TEST&elevateIds=6H500F0&markExcludes=true

    {
      "response":{"numFound":32,"start":0,"docs":[
          {
            "id":"6H500F0",
            "[elevated]":true,
            "[excluded]":false},
          {
            "id":"GB18030TEST",
            "[elevated]":false,
            "[excluded]":true},
          {
            "id":"SP2514N",
            "[elevated]":false,
            "[excluded]":false}
      ]}}

[json] / [xml]

These transformers replace a field value containing a string representation of a valid XML or JSON structure with the actual raw XML or JSON structure rather than just the string value. Each applies only to the specific writer, such that [json] only applies to wt=json and [xml] only applies to wt=xml.

    fl=id,source_s:[json]&wt=json

[subquery]

This transformer executes a separate query for each document being transformed, passing document fields as input for the subquery parameters. It’s usually used with the {!join} and {!parent} query parsers, and is intended to be an improvement on [child].

• It must be given a unique name: fl=*,children:[subquery]
• There can be several of them, e.g., fl=*,sons:[subquery],daughters:[subquery].
• Every [subquery] occurrence adds a field with the given name to the result document. The value of this field is a document list, which is the result of executing the subquery using document fields as input.

Here is how it looks in various formats:

    1 vdczoypirs 2 vdczoypirs ...
    {
      "response":{
        "numFound":2,
        "start":0,
        "docs":[
          {
            "id":1,
            "subject":["parentDocument"],
            "title":["xrxvomgu"],
            "children":{
              "numFound":1,
              "start":0,
              "docs":[
                { "id":2, "cat":["childDocument"] }
              ]
            }}]}}

    SolrDocumentList subResults = (SolrDocumentList)doc.getFieldValue("children");

Subquery Result Fields

To appear in the subquery document list, a field must be specified in the subquery’s fl parameter, e.g., foo.fl (it is not necessary to specify it in the main query’s fl). You can also use a wildcard in this parameter. For example, if the field title should appear in the categories subquery, this can be done in either of these ways:

    fl=...id,categories:[subquery]&categories.fl=title&categories.q=...
    fl=...id,categories:[subquery]&categories.fl=*&categories.q=...

Subquery Parameters Shift

If a subquery is declared as fl=*,foo:[subquery], subquery parameters are prefixed with the given name and a period. For example:

    q=*:*&fl=*,foo:[subquery]&foo.q=to be continued&foo.rows=10&foo.sort=id desc

Document Field as an Input for Subquery Parameters

It’s necessary to pass some document field values as parameters for the subquery. This is supported via the implicit row.fieldname parameter, which can be (but need not only be) referenced via Local Parameters syntax:

    q=name:john&fl=name,id,depts:[subquery]&depts.q={!terms f=id v=$row.dept_id}&depts.rows=10

Here departments are retrieved for every employee in the search result. This is comparable to an SQL join ON emp.dept_id=dept.id.

Note that when a document field has multiple values, they are concatenated with a comma by default. This can be changed with the separator local parameter, e.g., foo:[subquery separator=' ']; the default mimics {!terms} so that the two work smoothly together.
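The parameter shift and the implicit row.fieldname parameters described above can be pictured with a small simulation. This is a simplified model of the documented behaviour, not Solr's implementation; the function name subquery_params and the sample employee document are hypothetical.

```python
def subquery_params(request_params, name, parent_doc, separator=","):
    """Derive the parameters seen by the subquery `name` for one parent document."""
    prefix = name + "."
    # Parameter shift: 'depts.q' becomes 'q', 'depts.rows' becomes 'rows', and so on.
    params = {key[len(prefix):]: value
              for key, value in request_params.items()
              if key.startswith(prefix)}
    # Implicit row.<fieldname> parameters carry the parent document's field values;
    # multi-valued fields are joined with the separator (a comma by default).
    for field, value in parent_doc.items():
        if isinstance(value, (list, tuple)):
            value = separator.join(str(v) for v in value)
        params["row." + field] = str(value)
    return params


if __name__ == "__main__":
    request = {
        "q": "name:john",
        "fl": "name,id,depts:[subquery]",
        "depts.q": "{!terms f=id v=$row.dept_id}",
        "depts.rows": "10",
    }
    employee = {"id": "7", "name": "john", "dept_id": ["10", "20"]}  # made-up document
    for key, value in sorted(subquery_params(request, "depts", employee).items()):
        print(key, "=", value)
```

In this model the $row.dept_id reference inside depts.q is left untouched; it would be resolved against the row.dept_id parameter when the subquery is parsed.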
To log substituted subquery request parameters, add the corresponding parameter names, as in depts.logParamsList=q,fl,rows,row.dept_id

Cores and Collections in SolrCloud

Use foo:[subquery fromIndex=departments] to invoke the subquery on another core on the same node; this is what {!join} does in non-SolrCloud mode. In SolrCloud, however, just (and only) explicitly specify the subquery’s native parameters such as collection and shards, e.g.:

    q=*:*&fl=*,foo:[subquery]&foo.q=cloud&foo.collection=departments

Note: If the subquery collection has a different unique key field name (say foo_id, in contrast to id in the primary collection), add the following parameters to accommodate this difference: foo.fl=id:foo_id&foo.distrib.singlePass=true. Otherwise you’ll get a NullPointerException from QueryComponent.mergeIds.

[geo] - Geospatial formatter

Formats spatial data from a spatial field using a designated format type name. Two inner parameters are required: f for the field name, and w for the format name. Example: geojson:[geo f=mySpatialField w=GeoJSON].

Normally you’ll simply be consistent in choosing the format type you want by setting the format attribute on the spatial field type to WKT or GeoJSON – see the section Spatial Search for more information. If you are consistent, it’ll come out the way you stored it. This transformer offers a convenience to transform the spatial format to something different on retrieval.

In addition, this feature is very useful with the RptWithGeometrySpatialField to avoid double-storage of the potentially large vector geometry. This transformer will detect that field type and fetch the geometry from an internal compact binary representation on disk (in docValues), and then format it as desired. As such, you needn’t mark the field as stored, which would be redundant.
In a sense this double-storage between docValues and stored-value storage isn’t unique to spatial, but with polygonal geometry it can be a lot of data, and furthermore you’d like to avoid storing it in a verbose format (like GeoJSON or WKT).

[features] - LTRFeatureLoggerTransformerFactory

The "LTR" prefix stands for Learning To Rank. This transformer returns the values of features, and it can be used for feature extraction and feature logging.

    fl=id,[features store=yourFeatureStore]

This will return the values of the features in the yourFeatureStore store.

    fl=id,[features]&rq={!ltr model=yourModel}

If you use [features] together with a Learning-To-Rank reranking query, then the values of the features in the reranking model (yourModel) will be returned.

Suggester

The SuggestComponent in Solr provides users with automatic suggestions for query terms. You can use this to implement a powerful auto-suggest feature in your search application.

Although it is possible to use the Spell Checking functionality to power autosuggest behavior, Solr has a dedicated SuggestComponent designed for this functionality.

This approach utilizes Lucene’s Suggester implementation and supports all of the lookup implementations available in Lucene.

The main features of this Suggester are:

• Lookup implementation pluggability
• Term dictionary pluggability, giving you the flexibility to choose the dictionary implementation
• Distributed support

The solrconfig.xml found in Solr’s “techproducts” example has a Suggester implementation configured already. For more on search components, see the section RequestHandlers and SearchComponents in SolrConfig.
The “techproducts” example solrconfig.xml has a suggest search component and a /suggest request handler already configured. You can use that as the basis for your configuration, or create it from scratch, as detailed below.

Adding the Suggest Search Component

The first step is to add a search component to solrconfig.xml and tell it to use the SuggestComponent. Here is some sample code that could be used.

    mySuggester FuzzyLookupFactory DocumentDictionaryFactory cat price string false

Suggester Search Component Parameters

The Suggester search component takes several configuration parameters.

The choice of the lookup implementation (lookupImpl, how terms are found in the suggestion dictionary) and the dictionary implementation (dictionaryImpl, how terms are stored in the suggestion dictionary) will dictate some of the parameters required.

Below are the main parameters that can be used no matter what lookup or dictionary implementation is used. In the following sections additional parameters are provided for each implementation.

searchComponent name
    Arbitrary name for the search component.

name
    A symbolic name for this suggester. You can refer to this name in the URL parameters and in the SearchHandler configuration. It is possible to have multiples of these in one solrconfig.xml file.

lookupImpl
    Lookup implementation. There are several possible implementations, described below in the section Lookup Implementations. If not set, the default lookup is JaspellLookupFactory.

dictionaryImpl
    The dictionary implementation to use. There are several possible implementations, described below in the section Dictionary Implementations.

    If not set, the default dictionary implementation is HighFrequencyDictionaryFactory. However, if a sourceLocation is used, the dictionary implementation will be FileDictionaryFactory.
field
A field from the index to use as the basis of suggestion terms. If sourceLocation is empty (meaning any dictionary implementation other than FileDictionaryFactory), then terms from this field in the index will be used.

To be used as the basis for a suggestion, the field must be stored. You may want to use copyField rules to create a special 'suggest' field composed of terms from other fields in documents. In any event, you very likely want a minimal amount of analysis on the field, so an additional option is to create a field type in your schema that only uses basic tokenizers or filters. One option for such a field type is shown here:

However, this minimal analysis is not required if you want more analysis to occur on terms. If using the AnalyzingLookupFactory as your lookupImpl, however, you have the option of defining the field type rules to use for index and query time analysis.

sourceLocation
The path to the dictionary file if using the FileDictionaryFactory. If this value is empty then the main index will be used as a source of terms and weights.

storeDir
The location to store the dictionary file.

buildOnCommit and buildOnOptimize
If true, the lookup data structure will be rebuilt after soft-commit. If false, the default, then the lookup data will be built only when requested by URL parameter suggest.build=true. Use buildOnCommit to rebuild the dictionary with every soft-commit, or buildOnOptimize to build the dictionary only when the index is optimized.

Some lookup implementations may take a long time to build, especially with large indexes. In such cases, using buildOnCommit or buildOnOptimize, particularly with a high frequency of softCommits, is not recommended; it’s recommended instead to build the suggester at a lower frequency by manually issuing requests with suggest.build=true.
buildOnStartup
If true, then the lookup data structure will be built when Solr starts or when the core is reloaded. If this parameter is not specified, the suggester will check if the lookup data structure is present on disk and build it if not found.

Setting this to true could lead to the core taking longer to load (or reload) as the suggester data structure needs to be built, which can sometimes take a long time. It’s usually preferred to have this setting set to false, the default, and build suggesters manually by issuing requests with suggest.build=true.

Lookup Implementations

The lookupImpl parameter defines the algorithms used to look up terms in the suggest index. There are several possible implementations to choose from, and some require additional parameters to be configured.

AnalyzingLookupFactory

A lookup that first analyzes the incoming text and adds the analyzed form to a weighted FST, and then does the same thing at lookup time.

This implementation uses the following additional properties:

suggestAnalyzerFieldType
The field type to use for the query-time and build-time term suggestion analysis.

exactMatchFirst
If true, the default, exact suggestions are returned first, even if they are prefixes of other strings in the FST that have larger weights.

preserveSep
If true, the default, then a separator between tokens is preserved. This means that suggestions are sensitive to tokenization (e.g., baseball is different from base ball).

preservePositionIncrements
If true, the suggester will preserve position increments. This means that when token filters leave gaps (for example, when StopFilter matches a stopword), those positions are respected when building the suggester. The default is false.

FuzzyLookupFactory

This is a suggester which is an extension of the AnalyzingSuggester but is fuzzy in nature.
The similarity is measured by the Levenshtein algorithm.

This implementation uses the following additional properties:

exactMatchFirst
If true, the default, exact suggestions are returned first, even if they are prefixes of other strings in the FST that have larger weights.

preserveSep
If true, the default, then a separator between tokens is preserved. This means that suggestions are sensitive to tokenization (e.g., baseball is different from base ball).

maxSurfaceFormsPerAnalyzedForm
The maximum number of surface forms to keep for a single analyzed form. When there are too many surface forms we discard the lowest weighted ones.

maxGraphExpansions
When building the FST ("index-time"), we add each path through the tokenstream graph as an individual entry. This places an upper-bound on how many expansions will be added for a single suggestion. The default is -1 which means there is no limit.

preservePositionIncrements
If true, the suggester will preserve position increments. This means that when token filters leave gaps (for example, when StopFilter matches a stopword), those positions are respected when building the suggester. The default is false.

maxEdits
The maximum number of string edits allowed. The system’s hard limit is 2. The default is 1.

transpositions
If true, the default, transpositions should be treated as a primitive edit operation.

nonFuzzyPrefix
The length of the common non-fuzzy prefix that must match a suggestion exactly. The default is 1.

minFuzzyLength
The minimum query length before any string edits are allowed. The default is 3.

unicodeAware
If true, the maxEdits, minFuzzyLength, transpositions and nonFuzzyPrefix parameters will be measured in Unicode code points (actual letters) instead of bytes. The default is false.

AnalyzingInfixLookupFactory

Analyzes the input text and then suggests matches based on prefix matches to any tokens in the indexed text. This uses a Lucene index for its dictionary.
This implementation uses the following additional properties.

indexPath
When using AnalyzingInfixSuggester you can provide your own path where the index will get built. The default is analyzingInfixSuggesterIndexDir and will be created in your collection’s data/ directory.

minPrefixChars
Minimum number of leading characters before PrefixQuery is used (default is 4). Prefixes shorter than this are indexed as character ngrams (increasing index size but making lookups faster).

allTermsRequired
Boolean option for multiple terms. The default is true, all terms will be required.

highlight
Highlight suggest terms. Default is true.

This implementation supports Context Filtering.

BlendedInfixLookupFactory

An extension of the AnalyzingInfixSuggester which provides additional functionality to weight prefix matches across the matched documents. You can tell it to score higher if a hit is closer to the start of the suggestion or vice versa.

This implementation uses the following additional properties:

blenderType
Used to calculate weight coefficient using the position of the first matching word. Available options are:

position_linear
weightFieldValue * (1 - 0.10*position): Matches to the start will be given a higher score. This is the default.

position_reciprocal
weightFieldValue / (1 + position): Matches closer to the start will be given a higher score, because the weight is divided by a value that grows with the position.

exponent
An optional configuration variable for position_reciprocal to control how fast the score will increase or decrease. Default 2.0.

numFactor
The factor to multiply the number of searched elements from which results will be pruned. Default is 10.

indexPath
When using BlendedInfixSuggester you can provide your own path where the index will get built. The default directory name is blendedInfixSuggesterIndexDir and will be created in your collection’s data directory.
minPrefixChars
Minimum number of leading characters before PrefixQuery is used (the default is 4). Prefixes shorter than this are indexed as character ngrams, which increases index size but makes lookups faster.

This implementation supports Context Filtering.

FreeTextLookupFactory

It looks at the last tokens plus the prefix of whatever final token the user is typing, if present, to predict the most likely next token. The number of previous tokens that need to be considered can also be specified. This suggester would only be used as a fallback, when the primary suggester fails to find any suggestions.

This implementation uses the following additional properties:

suggestFreeTextAnalyzerFieldType
The analyzer used at "query-time" and "build-time" to analyze suggestions. This parameter is required.

ngrams
The maximum number of tokens out of which shingles will be made for the dictionary. The default value is 2. Increasing this would mean you want more than the previous 2 tokens to be taken into consideration when making the suggestions.

FSTLookupFactory

An automaton-based lookup. This implementation is slower to build, but provides the lowest memory cost. We recommend using this implementation unless you need more sophisticated matching results, in which case you should use the Jaspell implementation.

This implementation uses the following additional properties:

exactMatchFirst
If true, the default, exact suggestions are returned first, even if they are prefixes of other strings in the FST that have larger weights.

weightBuckets
The number of separate buckets for weights which the suggester will use while building its dictionary.

TSTLookupFactory

A simple compact ternary trie based lookup.

WFSTLookupFactory

A weighted automaton representation which is an alternative to FSTLookup for more fine-grained ranking.
WFSTLookup does not use buckets, but instead a shortest path algorithm.

Note that it expects weights to be whole numbers. If weight is missing it’s assumed to be 1.0. Weights affect the sorting of matching suggestions when spellcheck.onlyMorePopular=true is selected: weights are treated as a "popularity" score, and suggestions with higher weights are preferred over those with lower weights.

JaspellLookupFactory

A more complex lookup based on a ternary trie from the JaSpell project. Use this implementation if you need more sophisticated matching results.

Dictionary Implementations

The dictionary implementations define how terms are stored. There are several options, and multiple dictionaries can be used in a single request if necessary.

DocumentDictionaryFactory

A dictionary with terms, weights, and an optional payload taken from the index.

This dictionary implementation takes the following parameters in addition to parameters described for the Suggester generally and for the lookup implementation:

weightField
A field that is stored or a numeric DocValue field. This parameter is optional.

payloadField
The payloadField should be a field that is stored. This parameter is optional.

contextField
Field to be used for context filtering. Note that only some lookup implementations support filtering.

DocumentExpressionDictionaryFactory

This dictionary implementation is the same as the DocumentDictionaryFactory but allows users to specify an arbitrary expression in the weightExpression tag.

This dictionary implementation takes the following parameters in addition to parameters described for the Suggester generally and for the lookup implementation:

payloadField
The payloadField should be a field that is stored. This parameter is optional.

weightExpression
An arbitrary expression used for scoring the suggestions. The fields used must be numeric fields.
This parameter is required.

contextField
Field to be used for context filtering. Note that only some lookup implementations support filtering.

HighFrequencyDictionaryFactory

This dictionary implementation allows adding a threshold to prune out less frequent terms in cases where very common terms may overwhelm other terms.

This dictionary implementation takes one parameter in addition to parameters described for the Suggester generally and for the lookup implementation:

threshold
A value between zero and one representing the minimum fraction of the total documents where a term should appear in order to be added to the lookup dictionary.

FileDictionaryFactory

This dictionary implementation allows using an external file that contains suggest entries. Weights and payloads can also be used.

If using a dictionary file, it should be a plain text file in UTF-8 encoding. You can use both single terms and phrases in the dictionary file. If adding weights or payloads, those should be separated from terms using the delimiter defined with the fieldDelimiter property (the default is '\t', the tab representation). If using payloads, the first line in the file must specify a payload.

This dictionary implementation takes one parameter in addition to parameters described for the Suggester generally and for the lookup implementation:

fieldDelimiter
Specifies the delimiter to be used separating the entries, weights and payloads. The default is tab (\t).

Example File

acquire
accidentally	2.0
accommodate	3.0

Multiple Dictionaries

It is possible to include multiple dictionaryImpl definitions in a single SuggestComponent definition.
To do this, simply define separate suggesters, as in this example:

<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">mySuggester</str>
    <str name="lookupImpl">FuzzyLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">cat</str>
    <str name="weightField">price</str>
    <str name="suggestAnalyzerFieldType">string</str>
  </lst>
  <lst name="suggester">
    <str name="name">altSuggester</str>
    <str name="dictionaryImpl">DocumentExpressionDictionaryFactory</str>
    <str name="lookupImpl">FuzzyLookupFactory</str>
    <str name="field">product_name</str>
    <str name="weightExpression">((price * 2) + ln(popularity))</str>
    <str name="sortField">weight</str>
    <str name="sortField">price</str>
    <str name="storeDir">suggest_fuzzy_doc_expr_dict</str>
    <str name="suggestAnalyzerFieldType">text_en</str>
  </lst>
</searchComponent>

When using these Suggesters in a query, you would define multiple suggest.dictionary parameters in the request, referring to the names given for each Suggester in the search component definition. The response will include the terms in sections for each Suggester. See the Example Usages section below for an example request and response.

Adding the Suggest Request Handler

After adding the search component, a request handler must be added to solrconfig.xml. This request handler works the same as any other request handler, and allows you to configure default parameters for serving suggestion requests. The request handler definition must incorporate the "suggest" search component defined previously.

<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.count">10</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>

Suggest Request Handler Parameters

The following parameters allow you to set defaults for the Suggest request handler:

suggest=true
This parameter should always be true, because we always want to run the Suggester for queries submitted to this handler.

suggest.dictionary
The name of the dictionary component configured in the search component. This is a mandatory parameter. It can be set in the request handler, or sent as a parameter at query time.

suggest.q
The query to use for suggestion lookups.

suggest.count
Specifies the number of suggestions for Solr to return.

suggest.cfq
A Context Filter Query used to filter suggestions based on the context field, if supported by the suggester.

suggest.build
If true, it will build the suggester index.
This is likely useful only for initial requests; you would probably not want to build the dictionary on every request, particularly in a production system. If you would like to keep your dictionary up to date, you should use the buildOnCommit or buildOnOptimize parameter for the search component.

suggest.reload
If true, it will reload the suggester index.

suggest.buildAll
If true, it will build all suggester indexes.

suggest.reloadAll
If true, it will reload all suggester indexes.

These properties can also be overridden at query time, or not set in the request handler at all and always sent at query time.

Context Filtering
Context filtering (suggest.cfq) is currently only supported by AnalyzingInfixLookupFactory and BlendedInfixLookupFactory, and only when backed by a Document*Dictionary. All other implementations will return unfiltered matches as if filtering was not requested.

Example Usages

Get Suggestions with Weights

This is a basic suggestion using a single dictionary and a single Solr core.

Example query:

http://localhost:8983/solr/techproducts/suggest?suggest=true&suggest.build=true&suggest.dictionary=mySuggester&suggest.q=elec

In this example, we’ve simply requested the string 'elec' with the suggest.q parameter and requested that the suggestion dictionary be built with suggest.build (note, however, that you would likely not want to build the index on every query; instead you should use buildOnCommit or buildOnOptimize if you have regularly changing documents).
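A request like the example query above can also be assembled programmatically. The following is a minimal sketch in Python, using only the standard library; the helper name suggest_url and its arguments are illustrative assumptions, not part of Solr or of any Solr client. The host, collection, and suggester name are taken from the example query.

```python
from urllib.parse import urlencode

# Base address copied from the example query; adjust for your own installation.
BASE = "http://localhost:8983/solr/techproducts/suggest"


def suggest_url(base, dictionaries, q, count=None, build=False, cfq=None):
    """Assemble a /suggest request URL (hypothetical helper, not a Solr API)."""
    params = [("suggest", "true")]
    if build:
        # Only intended for initial requests, as noted in the text above.
        params.append(("suggest.build", "true"))
    # suggest.dictionary is repeated once per suggester to consult.
    params += [("suggest.dictionary", name) for name in dictionaries]
    params.append(("suggest.q", q))
    if count is not None:
        params.append(("suggest.count", str(count)))
    if cfq is not None:
        params.append(("suggest.cfq", cfq))
    return base + "?" + urlencode(params)


url = suggest_url(BASE, ["mySuggester"], "elec", build=True)
print(url)
```

Sending the resulting URL with any HTTP client issues the same request as the example query. Passing more than one name in dictionaries repeats the suggest.dictionary parameter, which is how several suggesters are consulted in one request.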
Example response:

{
  "responseHeader": {
    "status": 0,
    "QTime": 35
  },
  "command": "build",
  "suggest": {
    "mySuggester": {
      "elec": {
        "numFound": 3,
        "suggestions": [
          {
            "term": "electronics and computer1",
            "weight": 2199,
            "payload": ""
          },
          {
            "term": "electronics",
            "weight": 649,
            "payload": ""
          },
          {
            "term": "electronics and stuff2",
            "weight": 279,
            "payload": ""
          }
        ]
      }
    }
  }
}

Using Multiple Dictionaries

If you have defined multiple dictionaries, you can use them in queries.

Example query:

http://localhost:8983/solr/techproducts/suggest?suggest=true&suggest.dictionary=mySuggester&suggest.dictionary=altSuggester&suggest.q=elec

In this example we have sent the string 'elec' as the suggest.q parameter and named two suggest.dictionary definitions to be used.

Example response:

{
  "responseHeader": {
    "status": 0,
    "QTime": 3
  },
  "suggest": {
    "mySuggester": {
      "elec": {
        "numFound": 1,
        "suggestions": [
          {
            "term": "electronics and computer1",
            "weight": 100,
            "payload": ""
          }
        ]
      }
    },
    "altSuggester": {
      "elec": {
        "numFound": 1,
        "suggestions": [
          {
            "term": "electronics and computer1",
            "weight": 10,
            "payload": ""
          }
        ]
      }
    }
  }
}

Context Filtering

Context filtering lets you filter suggestions by a separate context field, such as category, department or any other token. The AnalyzingInfixLookupFactory and BlendedInfixLookupFactory currently support this feature, when backed by DocumentDictionaryFactory.

Add contextField to your suggester configuration.
This example will suggest names and allow filtering by category:

solrconfig.xml

<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">mySuggester</str>
    <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">name</str>
    <str name="weightField">price</str>
    <str name="contextField">cat</str>
    <str name="suggestAnalyzerFieldType">string</str>
    <str name="buildOnStartup">false</str>
  </lst>
</searchComponent>

Example context filtering suggest query:

http://localhost:8983/solr/techproducts/suggest?suggest=true&suggest.build=true&suggest.dictionary=mySuggester&suggest.q=c&suggest.cfq=memory

The suggester will only bring back suggestions for products tagged with 'cat=memory'.

MoreLikeThis

The MoreLikeThis search component enables users to query for documents similar to a document in their result list. It does this by using terms from the original document to find similar documents in the index.

There are three ways to use MoreLikeThis. The first, and most common, is to use it as a request handler. In this case, you would send text to the MoreLikeThis request handler as needed (for example, when a user clicks on a "similar documents" link).

The second is to use it as a search component. This is less desirable since it performs the MoreLikeThis analysis on every document returned. This may slow search results.

The final approach is to use it as a request handler but with externally supplied text. This case, also referred to as the MoreLikeThisHandler, will supply information about similar documents in the index based on the text of the input document.

How MoreLikeThis Works

MoreLikeThis constructs a Lucene query based on terms in a document. It does this by pulling terms from the defined list of fields (see the mlt.fl parameter, below). For best results, the fields should have stored term vectors in schema.xml. For example:

If term vectors are not stored, MoreLikeThis will generate terms from stored fields.
A uniqueKey must also be stored in order for MoreLikeThis to work properly.

The next phase filters terms from the original document using thresholds defined with the MoreLikeThis parameters. Finally, a query is run with these terms, and any other query parameters that have been defined (see the mlt.qf parameter, below) and a new document set is returned.

Common Parameters for MoreLikeThis

The table below summarizes the MoreLikeThis parameters supported by Lucene/Solr. These parameters can be used with any of the three possible MoreLikeThis approaches.

mlt.fl
Specifies the fields to use for similarity. If possible, these should have stored termVectors.

mlt.mintf
Specifies the Minimum Term Frequency, the frequency below which terms will be ignored in the source document.

mlt.mindf
Specifies the Minimum Document Frequency: words that do not occur in at least this many documents will be ignored.

mlt.maxdf
Specifies the Maximum Document Frequency: words that occur in more than this many documents will be ignored.

mlt.maxdfpct
Specifies the Maximum Document Frequency using a relative ratio to the number of documents in the index. The argument must be an integer between 0 and 100. For example 75 means the word will be ignored if it occurs in more than 75 percent of the documents in the index.

mlt.minwl
Sets the minimum word length below which words will be ignored.

mlt.maxwl
Sets the maximum word length above which words will be ignored.

mlt.maxqt
Sets the maximum number of query terms that will be included in any generated query.

mlt.maxntp
Sets the maximum number of tokens to parse in each example document field that is not stored with TermVector support.

mlt.boost
Specifies if the query will be boosted by the interesting term relevance. It can be either "true" or "false".
mlt.qf
Query fields and their boosts using the same format as that used by the DisMax Query Parser. These fields must also be specified in mlt.fl.

Parameters for the MoreLikeThisComponent

Using MoreLikeThis as a search component returns similar documents for each document in the response set. In addition to the common parameters, these additional options are available:

mlt
If set to true, activates the MoreLikeThis component and enables Solr to return MoreLikeThis results.

mlt.count
Specifies the number of similar documents to be returned for each result. The default value is 5.

Parameters for the MoreLikeThisHandler

The table below summarizes parameters accessible through the MoreLikeThisHandler. It supports faceting, paging, and filtering using common query parameters, but does not work well with alternate query parsers.

mlt.match.include
Specifies whether or not the response should include the matched document. If set to false, the response will look like a normal select response.

mlt.match.offset
Specifies an offset into the main query search results to locate the document on which the MoreLikeThis query should operate. By default, the query operates on the first result for the q parameter.

mlt.interestingTerms
Controls how the MoreLikeThis component presents the "interesting" terms (the top TF/IDF terms) for the query. Supports three settings. The setting list lists the terms. The setting none lists no terms. The setting details lists the terms along with the boost value used for each term. Unless mlt.boost=true, all terms will have boost=1.0.

MoreLikeThis Query Parser

The mlt query parser provides a mechanism to retrieve documents similar to a given document, like the handler. More information on the usage of the mlt query parser can be found in the section Other Parsers.
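To show how the component parameters above combine in one request, here is a minimal sketch in Python that builds a query string with MoreLikeThis enabled. The helper name mlt_request_url, the /select handler path, the query id:SP2514N, and the field names manu and cat are assumptions chosen for illustration; only the mlt.* parameter names come from the tables above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def mlt_request_url(base, query, fields, count=5, mintf=None, mindf=None, boost=False):
    """Build a search URL with the MoreLikeThis component switched on.

    Hypothetical helper: it only assembles parameters, it does not contact Solr.
    """
    params = {
        "q": query,
        "mlt": "true",               # activates the MoreLikeThis component
        "mlt.fl": ",".join(fields),  # fields used to judge similarity
        "mlt.count": count,          # similar documents returned per result
    }
    if mintf is not None:
        params["mlt.mintf"] = mintf  # minimum term frequency in the source doc
    if mindf is not None:
        params["mlt.mindf"] = mindf  # minimum document frequency in the index
    if boost:
        params["mlt.boost"] = "true"
    return base + "?" + urlencode(params)


# Assumed handler path and placeholder query; adjust for your own collection.
url = mlt_request_url(
    "http://localhost:8983/solr/techproducts/select",
    "id:SP2514N",
    ["manu", "cat"],
    mintf=1,
    mindf=1,
)
parsed = parse_qs(urlsplit(url).query)
print(parsed["mlt.fl"])
```

Thresholds left as None are omitted from the request, so Solr applies its own defaults for them.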
Pagination of Results

In most search applications, the "top" matching results (sorted by score, or some other criteria) are displayed to some human user.

In many applications the UI for these sorted results is displayed to the user in "pages" containing a fixed number of matching results, and users don’t typically look at results past the first few pages' worth of results.

Basic Pagination

In Solr, this basic paginated searching is supported using the start and rows parameters, and performance of this common behaviour can be tuned by utilizing the queryResultCache and adjusting the queryResultWindowSize configuration options based on your expected page sizes.

Basic Pagination Examples

The easiest way to think about simple pagination is to multiply the page number you want (treating the "first" page number as "0") by the number of rows per page, as in the following pseudo-code:

function fetch_solr_page($page_number, $rows_per_page) {
  $start = $page_number * $rows_per_page
  $params = [ q = $some_query, rows = $rows_per_page, start = $start ]
  return fetch_solr($params)
}

How Basic Pagination is Affected by Index Updates

The start parameter specified in a request to Solr indicates an absolute "offset" in the complete sorted list of matches that the client wants Solr to use as the beginning of the current "page".

If an index modification (such as adding or removing documents) which affects the sequence of ordered documents matching a query occurs in between two requests from a client for subsequent pages of results, then it is possible that these modifications can result in the same document being returned on multiple pages, or documents being "skipped" as the result set shrinks or grows.
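The skipping and repetition described above can be reproduced without a Solr server. The sketch below is an in-memory simulation in Python: fetch_page is a hypothetical stand-in for a start/rows request sorted by name, and the index is an ordinary list of 26 documents with ids 1 to 26 and names A to Z. Both are assumptions for illustration, not Solr APIs.

```python
def fetch_page(index, start, rows):
    """Mimic start/rows paging: sort every match by name, then slice by offset."""
    # sorted() is stable, so documents with equal names keep their index order.
    ordered = sorted(index, key=lambda doc: doc["name"])
    return [doc["id"] for doc in ordered[start:start + rows]]


# 26 documents: id 1 is named "A", id 2 is named "B", ... id 26 is named "Z".
index = [{"id": i + 1, "name": chr(ord("A") + i)} for i in range(26)]

page1 = fetch_page(index, start=0, rows=5)

# Document 3 is deleted between the first and second request.
index = [doc for doc in index if doc["id"] != 3]
page2 = fetch_page(index, start=5, rows=5)

# Three documents named "A" are added between the second and third request.
index += [{"id": new_id, "name": "A"} for new_id in (90, 91, 92)]
page3 = fetch_page(index, start=10, rows=5)

print(page1)  # [1, 2, 3, 4, 5]
print(page2)  # [7, 8, 9, 10, 11]   (document 6 was never returned)
print(page3)  # [9, 10, 11, 12, 13] (9, 10 and 11 are returned a second time)
```

Because each request re-sorts the whole result set and then applies an absolute offset, any change that shifts documents across the offset boundary shows up as a skipped or repeated document.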
For example, consider an index containing 26 documents like so:

id    name
1     A
2     B
…
26    Z

Followed by the following requests & index modifications interleaved:

• A client requests q=*:*&rows=5&start=0&sort=name asc
  ◦ Documents with the ids 1-5 will be returned to the client
• Document id 3 is deleted
• The client requests "page #2" using q=*:*&rows=5&start=5&sort=name asc
  ◦ Documents 7-11 will be returned
  ◦ Document 6 has been skipped, since it is now the 5th document in the sorted set of all matching results; it would be returned on a new request for "page #1"
• 3 new documents are now added with the ids 90, 91, and 92; all three documents have a name of A
• The client requests "page #3" using q=*:*&rows=5&start=10&sort=name asc
  ◦ Documents 9-13 will be returned
  ◦ Documents 9, 10, and 11 have now been returned on both page #2 and page #3 since they moved farther back in the list of sorted results

In typical situations these impacts from index changes on paginated searching don’t significantly affect user experience, either because they happen extremely infrequently in fairly static collections, or because the users recognize that the collection of data is constantly evolving and expect to see documents shift up and down in the result sets.

Performance Problems with "Deep Paging"

In some situations, the results of a Solr search are not destined for a simple paginated user interface.

When you wish to fetch a very large number of sorted results from Solr to feed into an external system, using very large values for the start or rows parameters can be very inefficient. Pagination using start and rows not only requires Solr to compute (and sort) in memory all of the matching documents that should be fetched for the current page, but also all of the documents that would have appeared on previous pages.
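The cost therefore grows with start + rows rather than with rows alone. As a rough illustration, the Python sketch below counts how many sorted entries have to be gathered to serve one page; the function name entries_to_collect is hypothetical, and the per-shard multiplication assumes that every shard of a distributed index must supply its own top start + rows candidates before the results can be merged.

```python
def entries_to_collect(start, rows, shards=1):
    """Number of sorted entries needed to serve one start/rows page.

    Assumption for this sketch: every shard must supply its own top
    (start + rows) candidates, since any of them might fall on the
    requested page once the shard results are merged.
    """
    per_shard = start + rows
    return per_shard * shards


print(entries_to_collect(start=0, rows=1_000_000))             # 1000000
print(entries_to_collect(start=999_000, rows=1_000))           # 1000000
print(entries_to_collect(start=999_000, rows=1_000, shards=10))  # 10000000
```

The first two calls show that a small page at a deep offset costs as much as one very large page, and the third shows how the figure scales with the number of shards.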
While a request for start=0&rows=1000000 may be obviously inefficient because it requires Solr to maintain & sort in memory a set of 1 million documents, a request for start=999000&rows=1000 is equally inefficient for the same reasons. Solr can’t compute which matching document is the 999001st result in sorted order without first determining what the first 999000 matching sorted results are. If the index is distributed, which is common when running in SolrCloud mode, then 1 million documents are retrieved from each shard. For a ten shard index, ten million entries must be retrieved and sorted to figure out the 1000 documents that match those query parameters.

Fetching A Large Number of Sorted Results: Cursors

As an alternative to increasing the "start" parameter to request subsequent pages of sorted results, Solr supports using a "Cursor" to scan through results.

Cursors in Solr are a logical concept that doesn’t involve caching any state information on the server. Instead the sort values of the last document returned to the client are used to compute a "mark" representing a logical point in the ordered space of sort values. That "mark" can be specified in the parameters of subsequent requests to tell Solr where to continue.

Using Cursors

To use a cursor with Solr, specify a cursorMark parameter with the value of *. You can think of this being analogous to start=0 as a way to tell Solr "start at the beginning of my sorted results" except that it also informs Solr that you want to use a Cursor.

In addition to returning the top N sorted results (where you can control N using the rows parameter) the Solr response will also include an encoded String named nextCursorMark. You then take the nextCursorMark String value from the response, and pass it back to Solr as the cursorMark parameter for your next request.
You can repeat this process until you’ve fetched as many docs as you want, or until the nextCursorMark returned matches the cursorMark you’ve already specified, indicating that there are no more results.

Constraints when using Cursors

There are a few important constraints to be aware of when using the cursorMark parameter in a Solr request:

1. cursorMark and start are mutually exclusive parameters.
   ◦ Your requests must either not include a start parameter, or it must be specified with a value of “0”.
2. sort clauses must include the uniqueKey field (either asc or desc).
   ◦ If id is your uniqueKey field, then sort params like id asc and name asc, id desc would both work fine, but name asc by itself would not.
3. Sorts including Date Math based functions that involve calculations relative to NOW will cause confusing results, since every document will get a new sort value on every subsequent request. This can easily result in cursors that never end, and constantly return the same documents over and over, even if the documents are never updated. In this situation, choose & re-use a fixed value for the NOW request param in all of your cursor requests.

Cursor mark values are computed based on the sort values of each document in the result, which means multiple documents with identical sort values will produce identical Cursor mark values if one of them is the last document on a page of results. In that situation, the subsequent request using that cursorMark would not know which of the documents with the identical mark values should be skipped. Requiring that the uniqueKey field be used as a clause in the sort criteria guarantees that a deterministic ordering will be returned, and that every cursorMark value will identify a unique point in the sequence of documents.
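The need for the uniqueKey tiebreaker can be seen in a small in-memory simulation. The Python sketch below is not how Solr encodes or evaluates cursor marks: here the mark is simply the sort key of the last document returned, None plays the role of cursorMark=*, and fetch_page and fetch_all are hypothetical helpers. Solr itself would not accept the name-only sort with a cursor, per constraint 2 above; that case is simulated only to show what the constraint prevents.

```python
def fetch_page(index, rows, sort_key, mark=None):
    """Return (ids, next_mark) for the documents sorting strictly after mark."""
    ordered = sorted(index, key=sort_key)
    if mark is not None:
        ordered = [doc for doc in ordered if sort_key(doc) > mark]
    page = ordered[:rows]
    # With no more documents, hand back the same mark to signal the end.
    next_mark = sort_key(page[-1]) if page else mark
    return [doc["id"] for doc in page], next_mark


def fetch_all(index, rows, sort_key):
    """Follow marks until the returned mark equals the one that was sent."""
    seen, mark = [], None
    while True:
        ids, next_mark = fetch_page(index, rows, sort_key, mark)
        seen.extend(ids)
        if next_mark == mark:
            return seen
        mark = next_mark


# Five documents that all share the same name, so a name-only sort has ties.
index = [{"id": i, "name": "A"} for i in range(1, 6)]

with_unique_key = fetch_all(index, rows=2, sort_key=lambda d: (d["name"], d["id"]))
name_only = fetch_all(index, rows=2, sort_key=lambda d: d["name"])

print(with_unique_key)  # [1, 2, 3, 4, 5]
print(name_only)        # [1, 2]  (the mark "A" cannot tell the tied docs apart)
```

With the id included in the sort key, every mark identifies exactly one position in the ordering, so the scan visits each document once. With the name alone, the first mark already matches every remaining document and the scan stops early.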
Cursor Examples

Fetch All Docs

The pseudo-code here shows the basic logic involved in fetching all documents matching a query using a cursor:

    // when fetching all docs, you might as well use a simple id sort
    // unless you really need the docs to come back in a specific order
    $params = [ q => $some_query, sort => 'id asc', rows => $r, cursorMark => '*' ]
    $done = false
    while (not $done) {
      $results = fetch_solr($params)
      // do something with $results
      if ($params[cursorMark] == $results[nextCursorMark]) {
        $done = true
      }
      $params[cursorMark] = $results[nextCursorMark]
    }

Using SolrJ, this pseudo-code would be:

    SolrQuery q = (new SolrQuery(some_query)).setRows(r).setSort(SortClause.asc("id"));
    String cursorMark = CursorMarkParams.CURSOR_MARK_START;
    boolean done = false;
    while (! done) {
      q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
      QueryResponse rsp = solrServer.query(q);
      String nextCursorMark = rsp.getNextCursorMark();
      doCustomProcessingOfResults(rsp);
      if (cursorMark.equals(nextCursorMark)) {
        done = true;
      }
      cursorMark = nextCursorMark;
    }

If you wanted to do this by hand using curl, the sequence of requests would look something like this:

    $ curl '...&rows=10&sort=id+asc&cursorMark=*'
    {
      "response":{"numFound":32,"start":0,"docs":[
        // ... 10 docs here ...
      ]},
      "nextCursorMark":"AoEjR0JQ"}
    $ curl '...&rows=10&sort=id+asc&cursorMark=AoEjR0JQ'
    {
      "response":{"numFound":32,"start":0,"docs":[
        // ... 10 more docs here ...
      ]},
      "nextCursorMark":"AoEpVkRCREIxQTE2"}
    $ curl '...&rows=10&sort=id+asc&cursorMark=AoEpVkRCREIxQTE2'
    {
      "response":{"numFound":32,"start":0,"docs":[
        // ... 10 more docs here ...
      ]},
      "nextCursorMark":"AoEmbWF4dG9y"}
    $ curl '...&rows=10&sort=id+asc&cursorMark=AoEmbWF4dG9y'
    {
      "response":{"numFound":32,"start":0,"docs":[
        // ... 2 docs here because we've reached the end.
      ]},
      "nextCursorMark":"AoEpdmlld3Nvbmlj"}
    $ curl '...&rows=10&sort=id+asc&cursorMark=AoEpdmlld3Nvbmlj'
    {
      "response":{"numFound":32,"start":0,"docs":[
        // no more docs here, and note that the nextCursorMark
        // matches the cursorMark param we used
      ]},
      "nextCursorMark":"AoEpdmlld3Nvbmlj"}

Fetch First N docs, Based on Post Processing

Since the cursor is stateless from Solr’s perspective, your client code can stop fetching additional results as soon as you have decided you have enough information:

    while (! done) {
      q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
      QueryResponse rsp = solrServer.query(q);
      String nextCursorMark = rsp.getNextCursorMark();
      boolean hadEnough = doCustomProcessingOfResults(rsp);
      if (hadEnough || cursorMark.equals(nextCursorMark)) {
        done = true;
      }
      cursorMark = nextCursorMark;
    }

How Cursors are Affected by Index Updates

Unlike basic pagination, Cursor pagination does not rely on using an absolute "offset" into the complete sorted list of matching documents. Instead, the cursorMark specified in a request encapsulates information about the relative position of the last document returned, based on the absolute sort values of that document. This means that the impact of index modifications is much smaller when using a cursor compared to basic pagination.
Consider the same example index described when discussing basic pagination:

    id   name
    1    A
    2    B
    …
    26   Z

• A client requests q=*:*&rows=5&start=0&sort=name asc, id asc&cursorMark=*
  ◦ Documents with the ids 1-5 will be returned to the client in order
• Document id 3 is deleted
• The client requests 5 more documents using the nextCursorMark from the previous response
  ◦ Documents 6-10 will be returned: the deletion of a document that’s already been returned doesn’t affect the relative position of the cursor
• 3 new documents are now added with the ids 90, 91, and 92; all three documents have a name of A
• The client requests 5 more documents using the nextCursorMark from the previous response
  ◦ Documents 11-15 will be returned: the addition of new documents with sort values already past does not affect the relative position of the cursor
• Document id 1 is updated to change its 'name' to Q
• Document id 17 is updated to change its 'name' to A
• The client requests 5 more documents using the nextCursorMark from the previous response
  ◦ The resulting documents are 16,1,18,19,20 in that order
  ◦ Because the sort value of document 1 changed so that it is after the cursor position, the document is returned to the client twice
  ◦ Because the sort value of document 17 changed so that it is before the cursor position, the document has been "skipped" and will not be returned to the client as the cursor continues to progress

In a nutshell: When fetching all results matching a query using cursorMark, the only way index modifications can result in a document being skipped, or returned twice, is if the sort value of the document changes.

One way to ensure that a document will never be returned more than once is to use the uniqueKey field as the primary (and therefore: only significant) sort criterion. In this situation, you will be guaranteed that each document is only returned once, no matter how it may be modified during the use of the cursor.
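
The walkthrough above can be reproduced with a small in-memory model. This is a simplified Python sketch, not Solr's implementation: it assumes the cursor mark is simply the (name, id) sort values of the last document returned, whereas real cursorMark values are opaque encoded strings. The index dict and the fetch_page function are hypothetical names used only for this illustration.

```python
import string


def fetch_page(index, mark, rows=5):
    """Return (ids, next_mark) for the sort 'name asc, id asc'.

    `index` maps id -> name. `mark` is None at the start (like
    cursorMark=*), otherwise the (name, id) sort values of the last
    document already returned.
    """
    ordered = sorted((name, doc_id) for doc_id, name in index.items())
    if mark is not None:
        # Keep only documents whose sort values are strictly after the mark.
        ordered = [key for key in ordered if key > mark]
    page = ordered[:rows]
    # When nothing is left, the mark comes back unchanged (end of results).
    next_mark = page[-1] if page else mark
    return [doc_id for _, doc_id in page], next_mark


# Ids 1..26 named A..Z, as in the example index.
index = {i + 1: letter for i, letter in enumerate(string.ascii_uppercase)}

page1, mark = fetch_page(index, None)      # ids 1-5
del index[3]                               # delete an already returned doc
page2, mark = fetch_page(index, mark)      # ids 6-10
index.update({90: "A", 91: "A", 92: "A"})  # new docs that sort before the mark
page3, mark = fetch_page(index, mark)      # ids 11-15
index[1] = "Q"                             # doc 1 now sorts after the mark
index[17] = "A"                            # doc 17 now sorts before the mark
page4, mark = fetch_page(index, mark)
print(page4)                               # prints [16, 1, 18, 19, 20]
```

Because the model keeps no state other than the mark, deletions and additions before the mark have no effect on later pages, while documents whose sort values move across the mark are either returned again (document 1) or skipped (document 17).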
"Tailing" a Cursor

Because Cursor requests are stateless, and the cursorMark values encapsulate the absolute sort values of the last document returned from a search, it’s possible to "continue" fetching additional results from a cursor that has already reached its end if new documents are added (or existing documents are updated) to the end of the results. You can think of this as similar to using something like "tail -f" in Unix.

The most common example of how this can be useful is when you have a "timestamp" field recording when a document has been added/updated in your index. Client applications can continuously poll a cursor using a sort=timestamp asc, id asc for documents matching a query, and always be notified when a document is added or updated matching the request criteria. Another common example is when you have uniqueKey values that always increase as new documents are created, and you can continuously poll a cursor using sort=id asc to be notified about new documents.

The pseudo-code for tailing a cursor is only a slight modification of our earlier example for processing all docs matching a query:

    while (true) {
      $doneForNow = false
      while (not $doneForNow) {
        $results = fetch_solr($params)
        // do something with $results
        if ($params[cursorMark] == $results[nextCursorMark]) {
          $doneForNow = true
        }
        $params[cursorMark] = $results[nextCursorMark]
      }
      sleep($some_configured_delay)
    }

For certain specialized cases, the /export handler may be an option.

Collapse and Expand Results

The Collapsing query parser and the Expand component combine to form an approach to grouping documents for field collapsing in search results.
The Collapsing query parser groups documents (collapsing the result set) according to your parameters, while the Expand component provides access to documents in the collapsed group for use in results display or other processing by a client application. Collapse & Expand can together do what the older Result Grouping (group=true) does for most use-cases but not all. Generally, you should prefer Collapse & Expand.

In order to use these features with SolrCloud, the documents must be located on the same shard. To ensure document co-location, you can define the router.name parameter as compositeId when creating the collection. For more information on this option, see the section Document Routing.

Collapsing Query Parser

The CollapsingQParser is really a post filter that provides more performant field collapsing than Solr’s standard approach when the number of distinct groups in the result set is high. This parser collapses the result set to a single document per group before it forwards the result set to the rest of the search components. So all downstream components (faceting, highlighting, etc.) will work with the collapsed result set.

The CollapsingQParser accepts the following local parameters:

field
    The field that is being collapsed on. The field must be a single valued String, Int or Float-type of field.

min or max
    Selects the group head document for each group based on which document has the min or max value of the specified numeric field or function query. At most only one of the min, max, or sort (see below) parameters may be specified. If none are specified, the group head document of each group will be selected based on the highest scoring document in that group. The default is none.

sort
    Selects the group head document for each group based on which document comes first according to the specified sort string. At most only one of the min, max, (see above) or sort parameters may be specified. If none are specified, the group head document of each group will be selected based on the highest scoring document in that group. The default is none.

nullPolicy
    There are three available null policies:
    • ignore: removes documents with a null value in the collapse field. This is the default.
    • expand: treats each document with a null value in the collapse field as a separate group.
    • collapse: collapses all documents with a null value into a single group using either highest score, or minimum/maximum.
    The default is ignore.

hint
    Currently there is only one hint available: top_fc, which stands for top level FieldCache. The top_fc hint is only available when collapsing on String fields. top_fc usually provides the best query time speed but takes the longest to warm on startup or following a commit. top_fc will also result in having the collapsed field cached in memory twice if it’s used for faceting or sorting. For very high cardinality (high distinct count) fields, top_fc may not fare so well. The default is none.

size
    Sets the initial size of the collapse data structures when collapsing on a numeric field only. The data structures used for collapsing grow dynamically when collapsing on numeric fields. Setting the size above the number of results expected in the result set will eliminate the resizing cost. The default is 100,000.
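
To make the group head selection and the nullPolicy options concrete, here is a small in-memory Python model of the semantics described above. It is only a sketch, not how the CollapsingQParser is implemented: the function name collapse, the dict-based documents, and the field names group, score and price are hypothetical. It models min, max and the default highest-score selection but not the sort parameter, keeps the first document seen on ties, and returns group heads in first-seen order rather than in any Solr sort order.

```python
def collapse(docs, field, min_field=None, max_field=None, null_policy="ignore"):
    """Return one group head per distinct value of `field`.

    With neither min_field nor max_field, the highest-scoring document
    of each group is selected.
    """
    if min_field and max_field:
        raise ValueError("at most one of min/max may be specified")
    if null_policy not in ("ignore", "expand", "collapse"):
        raise ValueError("unknown nullPolicy: " + null_policy)

    def better(candidate, current):
        if min_field:
            return candidate[min_field] < current[min_field]
        if max_field:
            return candidate[max_field] > current[max_field]
        return candidate["score"] > current["score"]

    heads = {}  # group key -> current group head
    for position, doc in enumerate(docs):
        value = doc.get(field)
        if value is None:
            if null_policy == "ignore":
                continue                   # drop docs without a value
            if null_policy == "expand":
                key = ("null", position)   # every null doc is its own group
            else:
                key = ("null", None)       # all null docs share one group
        else:
            key = ("value", value)
        if key not in heads or better(doc, heads[key]):
            heads[key] = doc
    return list(heads.values())


def ids(result):
    return [doc["id"] for doc in result]


docs = [
    {"id": "a", "group": "x", "score": 1.0, "price": 30},
    {"id": "b", "group": "x", "score": 2.0, "price": 10},
    {"id": "c", "group": "y", "score": 0.5, "price": 20},
    {"id": "d", "score": 0.7, "price": 5},
    {"id": "e", "score": 0.9, "price": 7},
]
print(ids(collapse(docs, "group")))                          # ['b', 'c']
print(ids(collapse(docs, "group", max_field="price")))       # ['a', 'c']
print(ids(collapse(docs, "group", null_policy="collapse")))  # ['b', 'c', 'e']
```

With the default nullPolicy the two documents without a group value disappear from the result; with collapse they compete for a single extra slot, and with expand each of them would appear as its own group.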
Sample Usage Syntax

Collapse on group_field, selecting the highest scoring document in each group:

    fq={!collapse field=group_field}

Collapse on group_field, selecting the document in each group with the minimum value of numeric_field:

    fq={!collapse field=group_field min=numeric_field}

Collapse on group_field, selecting the document in each group with the maximum value of numeric_field:

    fq={!collapse field=group_field max=numeric_field}

Collapse on group_field, selecting the document in each group with the maximum value of a function. Note that the cscore() function can be used with the min/max options to use the score of the current document being collapsed.

    fq={!collapse field=group_field max=sum(cscore(),numeric_field)}

Collapse on group_field with a null policy so that all docs that do not have a value in the group_field will be treated as a single group. For each group, the selected document will be based first on a numeric_field, but ties will be broken by score:

    fq={!collapse field=group_field nullPolicy=collapse sort='numeric_field asc, score desc'}

Collapse on group_field with a hint to use the top level field cache:

    fq={!collapse field=group_field hint=top_fc}

The CollapsingQParserPlugin fully supports the QueryElevationComponent.

Expand Component

The ExpandComponent can be used to expand the groups that were collapsed by the CollapsingQParserPlugin.

Example usage with the CollapsingQParserPlugin:

    q=foo&fq={!collapse field=ISBN}

In the query above, the CollapsingQParserPlugin will collapse the search results on the ISBN field. The main search results will contain the highest ranking document from each book.

The ExpandComponent can now be used to expand the results so you can see the documents grouped by ISBN. For example:

    q=foo&fq={!collapse field=ISBN}&expand=true

The “expand=true” parameter turns on the ExpandComponent. The ExpandComponent adds a new section to the search output labeled “expanded”. Inside the expanded section there is a map with each group head pointing to the expanded documents that are within the group. As applications iterate the main collapsed result set, they can access the expanded map to retrieve the expanded groups.

The ExpandComponent has the following parameters:

expand.sort
    Orders the documents within the expanded groups. The default is score desc.

expand.rows
    The number of rows to display in each group. The default is 5 rows.

expand.q
    Overrides the main query (q), determines which documents to include in the main group. The default is to use the main query.

expand.fq
    Overrides main filter queries (fq), determines which documents to include in the main group. The default is to use the main filter queries.

Result Grouping

Result Grouping groups documents with a common field value into groups and returns the top documents for each group. For example, if you searched for "DVD" on an electronic retailer’s e-commerce site, you might be returned three categories such as "TV and Video", "Movies", and "Computers" with three results per category. In this case, the query term "DVD" appeared in all three categories, so Solr groups them together in order to increase relevancy for the user.

Prefer Collapse & Expand instead
Solr’s Collapse and Expand feature is newer and mostly overlaps with Result Grouping. There are features unique to both, and they have different performance characteristics. That said, in most cases Collapse and Expand is preferable to Result Grouping.

Result Grouping is separate from Faceting.
Though it is conceptually similar, faceting returns all relevant results and allows the user to refine the results based on the facet category. For example, if you search for "shoes" on a footwear retailer’s e-commerce site, Solr would return all results for that query term, along with selectable facets such as "size," "color," "brand," and so on.

You can however combine grouping with faceting. Grouped faceting supports facet.field and facet.range but currently doesn’t support date and pivot faceting. The facet counts are computed based on the first group.field parameter, and other group.field parameters are ignored.

Grouped faceting differs from non-grouped faceting, where (sum of all facets) == (total of products with that property), as shown in the following example:

Object 1
• name: Phaser 4620a
• ppm: 62
• product_range: 6

Object 2
• name: Phaser 4620i
• ppm: 65
• product_range: 6

Object 3
• name: ML6512
• ppm: 62
• product_range: 7

If you ask Solr to group these documents by "product_range", then the total number of groups is 2, but the facets for ppm are 2 for 62 and 1 for 65.

Grouping Parameters

Result Grouping takes the following request parameters. Any number of these request parameters can be included in a single request:

group
    If true, query results will be grouped.

group.field
    The name of the field by which to group results. The field must be single-valued, and either be indexed or a field type that has a value source and works in a function query, such as ExternalFileField. It must also be a string-based field, such as StrField or TextField.

group.func
    Group based on the unique values of a function query.
    This option does not work with distributed searches.

group.query
    Return a single group of documents that match the given query.

rows
    The number of groups to return. The default value is 10.

start
    Specifies an initial offset for the list of groups.

group.limit
    Specifies the number of results to return for each group. The default value is 1.

group.offset
    Specifies an initial offset for the document list of each group.

sort
    Specifies how Solr sorts the groups relative to each other. For example, sort=popularity desc will cause the groups to be sorted according to the highest popularity document in each group. The default value is score desc.

group.sort
    Specifies how Solr sorts documents within each group. The default behavior if group.sort is not specified is to use the same effective value as the sort parameter.

group.format
    If this parameter is set to simple, the grouped documents are presented in a single flat list, and the start and rows parameters affect the numbers of documents instead of groups. An alternate value for this parameter is grouped.

group.main
    If true, the result of the first field grouping command is used as the main result list in the response, using group.format=simple.

group.ngroups
    If true, Solr includes the number of groups that have matched the query in the results. The default value is false.
    See below for Distributed Result Grouping Caveats when using sharded indexes.

group.truncate
    If true, facet counts are based on the most relevant document of each group matching the query. The default value is false.

group.facet
    Determines whether to compute grouped facets for the field facets specified in facet.field parameters. Grouped facets are computed based on the first specified group. As with normal field faceting, fields shouldn’t be tokenized (otherwise counts are computed for each token). Grouped faceting supports single and multivalued fields. Default is false.
    There can be a heavy performance cost to this option.
    See below for Distributed Result Grouping Caveats when using sharded indexes.
group.cache.percent
    Setting this parameter to a number greater than 0 enables caching for result grouping. Result Grouping executes two searches; this option caches the second search. The default value is 0. The maximum value is 100.
    Testing has shown that group caching only improves search time with Boolean, wildcard, and fuzzy queries. For simple queries like term or "match all" queries, group caching degrades performance.

Any number of group commands (e.g., group.field, group.func, group.query) may be specified in a single request.

Grouping Examples

All of the following sample queries work with Solr’s “bin/solr -e techproducts” example.

Grouping Results by Field

In this example, we will group results based on the manu_exact field, which specifies the manufacturer of the items in the sample dataset.

    http://localhost:8983/solr/techproducts/select?fl=id,name&q=solr+memory&group=true&group.field=manu_exact

    {
      "..."
      "grouped":{
        "manu_exact":{
          "matches":6,
          "groups":[{
              "groupValue":"Apache Software Foundation",
              "doclist":{"numFound":1,"start":0,"docs":[
                  {
                    "id":"SOLR1000",
                    "name":"Solr, the Enterprise Search Server"}]
              }},
            {
              "groupValue":"Corsair Microsystems Inc.",
              "doclist":{"numFound":2,"start":0,"docs":[
                  {
                    "id":"VS1GB400C3",
                    "name":"CORSAIR ValueSelect 1GB 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) System Memory - Retail"}]
              }},
            {
              "groupValue":"A-DATA Technology Inc.",
              "doclist":{"numFound":1,"start":0,"docs":[
                  {
                    "id":"VDBDB1A16",
                    "name":"A-DATA V-Series 1GB 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) System Memory - OEM"}]
              }},
            {
              "groupValue":"Canon Inc.",
              "doclist":{"numFound":1,"start":0,"docs":[
                  {
                    "id":"0579B002",
                    "name":"Canon PIXMA MP500 All-In-One Photo Printer"}]
              }},
            {
              "groupValue":"ASUS Computer Inc.",
              "doclist":{"numFound":1,"start":0,"docs":[
                  {
                    "id":"EN7800GTX/2DHTV/256M",
                    "name":"ASUS Extreme N7800GTX/2DHTV (256 MB)"}]
              }
            }]}}}

The response indicates that there are six total matches for our query. For each of the five unique values of group.field, Solr returns a docList for that groupValue such that the numFound indicates the total number of documents in that group, and the top documents are returned according to the implicit default group.limit=1 and group.sort=score desc parameters. The resulting groups are then sorted by the score of the top document within each group based on the implicit sort=score desc, and the number of groups returned is limited to the implicit rows=10.

We can run the same query with the request parameter group.main=true. This will format the results as a single flat document list. This flat format does not include as much information as the normal result grouping query results (notably the numFound in each group), but it may be easier for existing Solr clients to parse.
    http://localhost:8983/solr/techproducts/select?fl=id,name,manufacturer&q=solr+memory&group=true&group.field=manu_exact&group.main=true

    {
      "responseHeader":{
        "status":0,
        "QTime":1,
        "params":{
          "fl":"id,name,manufacturer",
          "indent":"true",
          "q":"solr memory",
          "group.field":"manu_exact",
          "group.main":"true",
          "group":"true"}},
      "grouped":{},
      "response":{"numFound":6,"start":0,"docs":[
          {
            "id":"SOLR1000",
            "name":"Solr, the Enterprise Search Server"},
          {
            "id":"VS1GB400C3",
            "name":"CORSAIR ValueSelect 1GB 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) System Memory - Retail"},
          {
            "id":"VDBDB1A16",
            "name":"A-DATA V-Series 1GB 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) System Memory - OEM"},
          {
            "id":"0579B002",
            "name":"Canon PIXMA MP500 All-In-One Photo Printer"},
          {
            "id":"EN7800GTX/2DHTV/256M",
            "name":"ASUS Extreme N7800GTX/2DHTV (256 MB)"}]
      }
    }

Grouping by Query

In this example, we will use the group.query parameter to find the top three results for "memory" in two different price ranges: 0.00 to 99.99, and over 100.

    http://localhost:8983/solr/techproducts/select?indent=true&fl=name,price&q=memory&group=true&group.query=price:[0+TO+99.99]&group.query=price:[100+TO+*]&group.limit=3

    {
      "responseHeader":{
        "status":0,
        "QTime":42,
        "params":{
          "fl":"name,price",
          "indent":"true",
          "q":"memory",
          "group.limit":"3",
          "group.query":["price:[0 TO 99.99]",
            "price:[100 TO *]"],
          "group":"true"}},
      "grouped":{
        "price:[0 TO 99.99]":{
          "matches":5,
          "doclist":{"numFound":1,"start":0,"docs":[
              {
                "name":"CORSAIR ValueSelect 1GB 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) System Memory - Retail",
                "price":74.99}]
          }},
        "price:[100 TO *]":{
          "matches":5,
          "doclist":{"numFound":3,"start":0,"docs":[
              {
                "name":"CORSAIR XMS 2GB (2 x 1GB) 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) Dual Channel Kit System Memory - Retail",
                "price":185.0},
              {
                "name":"Canon PIXMA MP500 All-In-One Photo Printer",
                "price":179.99},
              {
                "name":"ASUS Extreme N7800GTX/2DHTV (256 MB)",
                "price":479.95}]
          }
        }
      }
    }

In this case, Solr found five matches for "memory," but only returns four results grouped by price. This is because one result for "memory" did not have a price assigned to it.

Distributed Result Grouping Caveats

Grouping is supported for distributed searches, with some caveats:

• Currently group.func is not supported in any distributed searches.
• group.ngroups and group.facet require that all documents in each group be co-located on the same shard in order for accurate counts to be returned. Document routing via composite keys can be a useful solution in many situations.

Result Clustering

The clustering (or cluster analysis) plugin attempts to automatically discover groups of related search hits (documents) and assign human-readable labels to these groups.
By default in Solr, the clustering algorithm is applied to the search result of each single query; this is called on-line clustering. While Solr contains an extension for full-index clustering (off-line clustering), this section will focus on discussing on-line clustering only.

Clusters discovered for a given query can be perceived as dynamic facets. This is beneficial when regular faceting is difficult (field values are not known in advance) or when the queries are exploratory in nature. Take a look at the Carrot2 project’s demo page to see an example of search results clustering in action (the groups in the visualization have been discovered automatically in search results to the right, there is no external information involved).

The query issued to the system was Solr. It seems clear that faceting could not yield a similar set of groups, although the goals of both techniques are similar: to let the user explore the set of search results and either rephrase the query or narrow the focus to a subset of current documents. Clustering is also similar to Result Grouping in that it can help to look deeper into search results, beyond the top few hits.

Clustering Concepts

Each document passed to the clustering component is composed of several logical parts:

• a unique identifier,
• origin URL,
• the title,
• the main content,
• a language code of the title and content.

The identifier part is mandatory; everything else is optional, but at least one of the text fields (title or content) will be required to make the clustering process reasonable. It is important to remember that logical document parts must be mapped to a particular schema and its fields. The content (text) for clustering can be sourced from either a stored text field or context-filtered using a highlighter; all these options are explained below in the configuration section.
A clustering algorithm is the actual logic (implementation) that discovers relationships among the documents in the search result and forms human-readable cluster labels. Depending on the choice of the algorithm the clusters may (and probably will) vary. Solr comes with several algorithms implemented in the open source Carrot2 project; commercial alternatives also exist.

Clustering Quick Start Example

The “techproducts” example included with Solr is pre-configured with all the necessary components for result clustering, but they are disabled by default.

To enable the clustering component contrib and a dedicated search handler configured to use it, specify a JVM System Property when running the example:

    bin/solr start -e techproducts -Dsolr.clustering.enabled=true

You can now try out the clustering handler by opening the following URL in a browser:

    http://localhost:8983/solr/techproducts/clustering?q=*:*&rows=100&wt=xml

The output XML should include search hits and an array of automatically discovered clusters at the end, resembling the output shown here:

    <response>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">299</int>
      </lst>
      <result name="response">
        <doc>
          <str name="id">GB18030TEST</str>
          <str name="name">Test with some GB18030 encoded characters</str>
          <arr name="features">
            <str>No accents here</str>
            <str>这是一个功能</str>
            <str>This is a feature (translated)</str>
            <str>这份文件是很有光泽</str>
            <str>This document is very shiny (translated)</str>
          </arr>
          <float name="price">0.0</float>
          <str name="price_c">0,USD</str>
          <bool name="inStock">true</bool>
          <long name="_version_">1448955395025403904</long>
          <float name="score">1.0</float>
        </doc>
        <!-- ... -->
      </result>
      <arr name="clusters">
        <lst>
          <arr name="labels">
            <str>DDR</str>
          </arr>
          <double name="score">3.9599865057283354</double>
          <arr name="docs">
            <str>TWINX2048-3200PRO</str>
            <str>VS1GB400C3</str>
            <str>VDBDB1A16</str>
          </arr>
        </lst>
        <lst>
          <arr name="labels">
            <str>iPod</str>
          </arr>
          <double name="score">11.959228467119022</double>
          <arr name="docs">
            <str>F8V7067-APL-KIT</str>
            <str>IW-02</str>
            <str>MA147LL/A</str>
          </arr>
        </lst>
        <!-- ... -->
        <lst>
          <arr name="labels">
            <str>Other Topics</str>
          </arr>
          <double name="score">0.0</double>
          <bool name="other-topics">true</bool>
          <arr name="docs">
            <str>adata</str>
            <str>apple</str>
            <str>asus</str>
            <str>ati</str>
            <!-- ... -->
          </arr>
        </lst>
      </arr>
    </response>

There were a few clusters discovered for this query (*:*), separating search hits into various categories: DDR, iPod, Hard Drive, etc. Each cluster has a label and score that indicates the "goodness" of the cluster.
The score is algorithm-specific and is meaningful only in relation to the scores of other clusters in the same set. In other words, if cluster A has a higher score than cluster B, cluster A should be of better quality (have a better label and/or more coherent document set). Each cluster has an array of identifiers of documents belonging to it. These identifiers correspond to the uniqueKey field declared in the schema.

Depending on the quality of input documents, some clusters may not make much sense. Some documents may be left out and not be clustered at all; these will be assigned to the synthetic Other Topics group, marked with the other-topics property set to true (see the XML dump above for an example). The score of the other topics group is zero.

Installing the Clustering Contrib

The clustering contrib extension requires dist/solr-clustering-*.jar and all JARs under contrib/clustering/lib.

Clustering Configuration

Declaration of the Clustering Search Component and Request Handler

The clustering extension is a search component and must be declared in solrconfig.xml. Such a component can then be appended to a request handler as the last component in the chain (because it requires search results which must be previously fetched by the search component).

An example configuration could look as shown below.

1. Include the required contrib JARs. Note that by default paths are relative to the Solr core so they may need adjustments to your configuration, or an explicit specification of the $solr.install.dir.

2. Declaration of the search component. Each component can also declare multiple clustering pipelines ("engines"), which can be selected at runtime by passing clustering.engine=(engine name) URL parameter.

    <searchComponent name="clustering" class="solr.clustering.ClusteringComponent">
      <lst name="engine">
        <str name="name">lingo</str>
        <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
      </lst>
      <lst name="engine">
        <str name="name">stc</str>
        <str name="carrot.algorithm">org.carrot2.clustering.stc.STCClusteringAlgorithm</str>
      </lst>
    </searchComponent>

3. A request handler to which we append the clustering component declared above.

    <requestHandler name="/clustering" class="solr.SearchHandler">
      <lst name="defaults">
        <bool name="clustering">true</bool>
        <bool name="clustering.results">true</bool>
        <str name="carrot.url">id</str>
        <str name="carrot.title">doctitle</str>
        <str name="carrot.snippet">content</str>
        <str name="rows">100</str>
        <str name="fl">*,score</str>
      </lst>
      <arr name="last-components">
        <str>clustering</str>
      </arr>
    </requestHandler>

Configuration Parameters of the Clustering Component

The following parameters of each clustering engine or the entire clustering component (depending where they are declared) are available.

clustering
    When true, clustering component is enabled.

clustering.engine
    Declares which clustering engine to use. If not present, the first declared engine will become the default one.

clustering.results
    When true, the component will perform clustering of search results (this should be enabled).

clustering.collection
    When true, the component will perform clustering of the whole document index (this section does not cover full-index clustering).

At the engine declaration level, the following parameters are supported.

carrot.algorithm
    The algorithm class.

carrot.resourcesDir
    Algorithm-specific resources and configuration files (stop words, other lexical resources, default settings). By default points to conf/clustering/carrot2/

carrot.outputSubClusters
    If true and the algorithm supports hierarchical clustering, sub-clusters will also be emitted. Default value: true.

carrot.numDescriptions
    Maximum number of per-cluster labels to return (if the algorithm assigns more than one label to a cluster).

The carrot.algorithm parameter should contain a fully qualified class name of an algorithm supported by the Carrot2 framework.
Currently, the following algorithms are available:

• org.carrot2.clustering.lingo.LingoClusteringAlgorithm (open source)
• org.carrot2.clustering.stc.STCClusteringAlgorithm (open source)
• org.carrot2.clustering.kmeans.BisectingKMeansClusteringAlgorithm (open source)
• com.carrotsearch.lingo3g.Lingo3GClusteringAlgorithm (commercial)

For a comparison of the characteristics of these algorithms see the following links:

• http://doc.carrot2.org/#section.advanced-topics.fine-tuning.choosing-algorithm
• http://project.carrot2.org/algorithms.html
• http://carrotsearch.com/lingo3g-comparison.html

The question of which algorithm to choose depends on the amount of traffic (STC is faster than Lingo but arguably produces less intuitive clusters; Lingo3G is the fastest algorithm but is not free or open source), the expected result (Lingo3G provides hierarchical clusters, Lingo and STC provide flat clusters), and the input data (each algorithm will cluster the input slightly differently). There is no single answer to which algorithm is "the best".

Contextual and Full Field Clustering

The clustering engine can apply clustering to the full content of (stored) fields, or it can run an internal highlighter pass to extract context snippets before clustering. Highlighting is recommended when the logical snippet field contains a lot of content (which would otherwise affect clustering performance). Highlighting can also increase the quality of clustering because the content passed to the algorithm will be more focused around the query (it will be query-specific context). The following parameters control the internal highlighter.

carrot.produceSummary
When true, the clustering component will run a highlighter pass on the content of the logical fields pointed to by carrot.title and carrot.snippet. Otherwise the full content of those fields will be clustered.
carrot.fragSize
The size, in characters, of the snippets (aka fragments) created by the highlighter. If not specified, the default highlighting fragsize (hl.fragsize) will be used.

carrot.summarySnippets
The number of summary snippets to generate for clustering. If not specified, the default highlighting snippet count (hl.snippets) will be used.

Logical to Document Field Mapping

As already mentioned in Clustering Concepts, the clustering component clusters "documents" consisting of logical parts that need to be mapped onto the physical schema of data stored in Solr. The field mapping attributes provide a connection between fields and logical document parts. Note that the content of title and snippet fields must be stored so that it can be retrieved at search time.

carrot.title
The field (alternatively a comma- or space-separated list of fields) that should be mapped to the logical document’s title. The clustering algorithms typically give more weight to the content of the title field compared to the content (snippet). For best results, the field should contain concise, noise-free content. If there is no clear title in your data, you can leave this parameter blank.

carrot.snippet
The field (alternatively a comma- or space-separated list of fields) that should be mapped to the logical document’s main content. If this mapping points to very large content fields, the performance of clustering may drop significantly. An alternative then is to use query-context snippets for clustering instead of full field content. See the description of the carrot.produceSummary parameter for details.

carrot.url
The field that should be mapped to the logical document’s content URL. Leave blank if not required.

Clustering Multilingual Content

The field mapping specification can include a carrot.lang parameter, which defines the field that stores the ISO 639-1 code of the language in which the title and content of the document are written.
This information can be stored in the index based on a priori knowledge of the documents' source or on a language detection filter applied at indexing time. All algorithms inside the Carrot2 framework will accept ISO codes of languages defined in the LanguageCode enum.

The language hint makes it easier for clustering algorithms to separate documents from different languages on input and to pick the right language resources for clustering. If you have multilingual query results (or query results in a language other than English), it is strongly advised to map the language field appropriately.

carrot.lang
The field that stores the ISO 639-1 code of the language of the document’s text fields.

carrot.lcmap
A mapping of arbitrary strings into ISO 639 two-letter codes used by carrot.lang. The syntax of this parameter is the same as langid.map.lcmap, for example: langid.map.lcmap=japanese:ja polish:pl english:en

The default language can also be set using Carrot2-specific algorithm attributes (in this case the MultilingualClustering.defaultLanguage attribute).

Tweaking Algorithm Settings

The algorithms that come with Solr use their default settings, which may not be adequate for every data set. All algorithms have lexical resources and settings (stop words, stemmers, parameters) that may require tweaking to get better clusters (and cluster labels). For Carrot2-based algorithms it is probably best to refer to a dedicated tuning application called Carrot2 Workbench (screenshot below). From this application one can export a set of algorithm attributes as an XML file, which can then be placed under the location pointed to by carrot.resourcesDir.
Providing Defaults for Clustering

The default attributes for all engines (algorithms) declared in the clustering component are placed under carrot.resourcesDir, with an expected file name of engineName-attributes.xml. So for an engine named lingo and the default value of carrot.resourcesDir, the attributes would be read from the file conf/clustering/carrot2/lingo-attributes.xml.

An example XML file changing the default language of documents to Polish is shown below.

Tweaking Algorithms at Query-Time

The clustering component and Carrot2 clustering algorithms can accept query-time attribute overrides. Note that certain things (for example lexical resources) can only be initialized once (at startup, via the XML configuration files).

An example query that changes the LingoClusteringAlgorithm.desiredClusterCountBase parameter for the Lingo algorithm:

http://localhost:8983/solr/techproducts/clustering?q=*:*&rows=100&LingoClusteringAlgorithm.desiredClusterCountBase=20

The clustering engine (the algorithm declared in solrconfig.xml) can also be changed at runtime by passing the clustering.engine=name request parameter:

http://localhost:8983/solr/techproducts/clustering?q=*:*&rows=100&clustering.engine=kmeans

Performance Considerations with Dynamic Clustering

Dynamic clustering of search results comes with two major performance penalties:

• Increased cost of fetching a larger-than-usual number of search results (50, 100 or more documents),
• Additional computational cost of the clustering itself.

For simple queries, the clustering time will usually dominate the fetch time. If the document content is very long, the retrieval of stored content can become a bottleneck.
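The query-time overrides described above are ordinary request parameters, so a client can assemble them programmatically. The following Python sketch is only an illustration: the helper name clustering_url is hypothetical, and the host, collection, and /clustering handler path are simply the ones used in the example URLs of this section.

```python
from urllib.parse import urlencode


def clustering_url(base, engine=None, overrides=None):
    """Build a clustering request URL (hypothetical helper, not part of Solr).

    base      -- handler URL, e.g. the /clustering handler from the examples
    engine    -- optional value for the clustering.engine parameter
    overrides -- optional dict of query-time algorithm attribute overrides
    """
    params = {"q": "*:*", "rows": 100}
    if engine is not None:
        params["clustering.engine"] = engine
    if overrides:
        params.update(overrides)
    # Keep '*' and ':' readable so the output resembles the URLs shown above.
    return base + "?" + urlencode(params, safe="*:")


BASE = "http://localhost:8983/solr/techproducts/clustering"
url = clustering_url(
    BASE, overrides={"LingoClusteringAlgorithm.desiredClusterCountBase": 20}
)
print(url)
```

Passing engine="kmeans" instead of an override produces the second example URL, assuming an engine with that name has been declared in solrconfig.xml.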
The performance impact of clustering can be lowered in several ways:

• feed less content to the clustering algorithm by enabling the carrot.produceSummary attribute,
• perform clustering on selected fields (titles only) to make the input smaller,
• use a faster algorithm (STC instead of Lingo, Lingo3G instead of STC),
• tune the performance attributes related directly to a specific algorithm.

Some of these techniques are described in the Apache SOLR and Carrot2 integration strategies document, available at http://carrot2.github.io/solr-integration-strategies. The topic of improving performance is also included in the Carrot2 manual at http://doc.carrot2.org/#section.advanced-topics.fine-tuning.performance.

Additional Resources

The following resources provide additional information about the clustering component in Solr and its potential applications.

• Apache Solr and Carrot2 integration strategies: http://carrot2.github.io/solr-integration-strategies
• Clustering and Visualization of Solr search results (video from Berlin BuzzWords conference, 2011): http://vimeo.com/26616444

Spatial Search

Solr supports location data for use in spatial/geospatial searches. Using spatial search, you can:

• Index points or other shapes
• Filter search results by a bounding box or circle or by other shapes
• Sort or boost scoring by distance between points, or relative area between rectangles
• Generate a 2D grid of facet count numbers for heatmap generation or point-plotting.
There are four main field types available for spatial search:

• LatLonPointSpatialField
• LatLonType (now deprecated) and its non-geodetic twin PointType
• SpatialRecursivePrefixTreeFieldType (RPT for short), including RptWithGeometrySpatialField, a derivative
• BBoxField

LatLonPointSpatialField is the ideal field type for the most common use cases for lat-lon point data. It replaces LatLonType, which still exists for backwards compatibility. RPT offers more features for advanced or custom use cases, with options like polygons and heatmaps.

RptWithGeometrySpatialField is for indexing and searching non-point data, though it can do points too. It can’t do sorting/boosting.

BBoxField is for indexing bounding boxes, querying by a box, specifying a search predicate (Intersects,Within,Contains,Disjoint,Equals), and a relevancy sort/boost like overlapRatio or simply the area.

Some esoteric details that are not in this guide can be found at http://wiki.apache.org/solr/SpatialSearch.

LatLonPointSpatialField

Here’s how LatLonPointSpatialField (LLPSF) should usually be configured in the schema:

LLPSF supports toggling indexed, stored, docValues, and multiValued. LLPSF internally uses a 2-dimensional Lucene "Points" (BKD tree) index when "indexed" is enabled (the default). When "docValues" is enabled, a latitude and longitude pair is bit-interleaved into 64 bits and put into Lucene DocValues. The accuracy of the docValues data is about a centimeter.

Indexing Points

For indexing geodetic points (latitude and longitude), supply them in "lat,lon" order (comma separated).

For indexing non-geodetic points, it depends. Use x y (a space) if RPT. For PointType, however, use x,y (a comma).

If you’d rather use a standard industry format, Solr supports WKT and GeoJSON. However, these are much bulkier than the raw coordinates for such simple data.
(Not supported by the deprecated LatLonType or PointType)

Indexing GeoJSON and WKT

Using the bin/post tool:

bin/post -type "application/json" -url "http://localhost:8983/solr/mycollection/update?format=geojson" /path/to/geojson.file

The key parameter to pass in with your request is:

format
The format of the file to pass in. Accepted values: WKT or geojson.

Searching with Query Parsers

There are two spatial Solr "query parsers" for geospatial search: geofilt and bbox. They take the following parameters:

d
The radial distance, usually in kilometers. RPT & BBoxField can set other units via the setting distanceUnits.

pt
The center point using the format "lat,lon" if latitude & longitude. Otherwise, "x,y" for PointType or "x y" for RPT field types.

sfield
A spatial indexed field.

score
(Advanced option; not supported by LatLonType (deprecated) or PointType) If the query is used in a scoring context (e.g., as the main query in q), this local parameter determines what scores will be produced. Valid values are:

• none: A fixed score of 1.0. (the default)
• kilometers: distance in kilometers between the field value and the specified center point
• miles: distance in miles between the field value and the specified center point
• degrees: distance in degrees between the field value and the specified center point
• distance: distance between the field value and the specified center point in the distanceUnits configured for this field
• recipDistance: 1 / the distance

Don’t use this for indexed non-point shapes (e.g., polygons). The results will be erroneous. And with RPT, it’s only recommended for multi-valued point data, as the implementation doesn’t scale very well; for single-valued fields, you should instead use a separate non-RPT field purely for distance sorting.
When used with BBoxField, additional options are supported:

• overlapRatio: The relative overlap between the indexed shape & query shape.
• area: haversine-based area of the overlapping shapes expressed in terms of the distanceUnits configured for this field
• area2D: cartesian coordinates based area of the overlapping shapes expressed in terms of the distanceUnits configured for this field

filter
(Advanced option; not supported by LatLonType (deprecated) or PointType). If you only want the query to score (with the above score local parameter), not filter, then set this local parameter to false.

geofilt

The geofilt filter allows you to retrieve results based on the geospatial distance (AKA the "great circle distance") from a given point. Another way of looking at it is that it creates a circular shape filter. For example, to find all documents within five kilometers of a given lat/lon point, you could enter:

&q=*:*&fq={!geofilt sfield=store}&pt=45.15,-93.85&d=5

This filter returns all results within a circle of the given radius around the initial point:

bbox

The bbox filter is very similar to geofilt except it uses the bounding box of the calculated circle. See the blue box in the diagram below. It takes the same parameters as geofilt.

Here’s a sample query:

&q=*:*&fq={!bbox sfield=store}&pt=45.15,-93.85&d=5

The rectangular shape is faster to compute, so it’s sometimes used as an alternative to geofilt when it’s acceptable to return points outside of the radius. However, if the ideal goal is a circle but you want it to run faster, then instead consider using the RPT field and try a large distErrPct value like 0.1 (10% radius). This will return results outside the radius, but it will do so somewhat uniformly around the shape.
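To make the difference between the two filters concrete, here is a small Python sketch of the underlying geometry. It is not Solr’s implementation: it uses the haversine formula with an approximate mean Earth radius (an assumption of this sketch), ignores poles and dateline crossing, and all function names are hypothetical. It shows a point that falls inside the bounding box of a 5 km circle but outside the circle itself.

```python
import math

EARTH_RADIUS_KM = 6371.0  # approximate mean Earth radius (assumption)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def in_circle(center, point, d_km):
    """geofilt-style test: is the point within d_km of the center?"""
    return haversine_km(center[0], center[1], point[0], point[1]) <= d_km


def bounding_box(center, d_km):
    """Box enclosing the circle as (min_lat, min_lon, max_lat, max_lon).

    Simplified: assumes the circle touches neither a pole nor the dateline.
    """
    lat, lon = center
    angular = d_km / EARTH_RADIUS_KM
    dlat = math.degrees(angular)
    dlon = math.degrees(math.asin(math.sin(angular) / math.cos(math.radians(lat))))
    return (lat - dlat, lon - dlon, lat + dlat, lon + dlon)


def in_box(center, point, d_km):
    """bbox-style test: is the point inside the circle's bounding box?"""
    min_lat, min_lon, max_lat, max_lon = bounding_box(center, d_km)
    return min_lat <= point[0] <= max_lat and min_lon <= point[1] <= max_lon


center = (45.15, -93.85)
corner_point = (45.19, -93.79)  # near a corner of the box
print(in_box(center, corner_point, 5), in_circle(center, corner_point, 5))
```

Points like corner_point are the extra results a bbox filter may return compared with geofilt for the same pt and d.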
When a bounding box includes a pole, the bounding box ends up being a "bounding bowl" (a spherical cap) that includes all values north of the lowest latitude of the circle if it touches the north pole (or south of the highest latitude if it touches the south pole).

Filtering by an Arbitrary Rectangle

Sometimes the spatial search requirement calls for finding everything in a rectangular area, such as the area covered by a map the user is looking at. For this case, geofilt and bbox won’t cut it. This is somewhat of a trick, but you can use Solr’s range query syntax for this by supplying the lower-left corner as the start of the range and the upper-right corner as the end of the range.

Here’s an example:

&q=*:*&fq=store:[45,-94 TO 46,-93]

LatLonType (deprecated) does not support rectangles that cross the dateline. For RPT and BBoxField, if you are using non-geospatial coordinates (geo="false") then you must quote the points due to the space, e.g., "x y".

Optimizing: Cache or Not

It’s most common to put a spatial query into an "fq" parameter, that is, a filter query. By default, Solr will cache the query in the filter cache.

If you know the filter query (be it spatial or not) is fairly unique and not likely to get a cache hit, then specify cache="false" as a local-param as seen in the following example. The only spatial types which stand to benefit from this technique are LatLonPointSpatialField and LatLonType (deprecated). Enable docValues on the field (if it isn’t already). LatLonType (deprecated) additionally requires a cost="100" (or more) local-param.

&q=...mykeywords...&fq=...someotherfilters...&fq={!geofilt cache=false}&sfield=store&pt=45.15,-93.85&d=5

LLPSF does not support Solr’s "PostFilter".
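The range-query trick described under Filtering by an Arbitrary Rectangle is easy to get wrong by swapping corners or forgetting the quotes for non-geospatial fields. The Python sketch below, with a hypothetical helper name, formats such a filter from two corner tuples. It only builds the string; it is a sketch of the syntax shown above, not an official client API.

```python
def rectangle_filter(field, lower_left, upper_right, geo=True):
    """Format a rectangle filter in Solr's range query syntax (hypothetical helper).

    For geo=True the corners are (lat, lon) tuples; for geo=False they are
    (x, y) tuples and each point is quoted because it contains a space.
    """
    a1, b1 = lower_left
    a2, b2 = upper_right
    if geo:
        if a1 > a2:
            raise ValueError("lower-left latitude must not exceed upper-right latitude")
        return f"{field}:[{a1},{b1} TO {a2},{b2}]"
    if a1 > a2 or b1 > b2:
        raise ValueError("lower-left corner must not exceed upper-right corner")
    return f'{field}:["{a1} {b1}" TO "{a2} {b2}"]'


print(rectangle_filter("store", (45, -94), (46, -93)))
print(rectangle_filter("geom", (0, 0), (10, 20), geo=False))
```

The longitude order is deliberately not validated in the geodetic case, since a rectangle that crosses the dateline has a lower-left longitude greater than its upper-right longitude.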
Distance Sorting or Boosting (Function Queries)

There are four distance function queries:

• geodist, see below, usually the most appropriate;
• dist, to calculate the p-norm distance between multi-dimensional vectors;
• hsin, to calculate the distance between two points on a sphere;
• sqedist, to calculate the squared Euclidean distance between two points.

For more information about these function queries, see the section on Function Queries.

geodist

geodist is a distance function that takes three optional parameters: (sfield,latitude,longitude). You can use the geodist function to sort results by distance or to return the distance as the score of each result.

For example, to sort your results by ascending distance, use a request like:

&q=*:*&fq={!geofilt}&sfield=store&pt=45.15,-93.85&d=50&sort=geodist() asc

To return the distance as the document score, use a request like:

&q={!func}geodist()&sfield=store&pt=45.15,-93.85&sort=score+asc&fl=*,score

More Spatial Search Examples

Here are a few more useful examples of what you can do with spatial search in Solr.

Use as a Sub-Query to Expand Search Results

Here we will query for results in Jacksonville, Florida, or within 50 kilometers of 45.15,-93.85 (near Buffalo, Minnesota):

&q=*:*&fq=(state:"FL" AND city:"Jacksonville") OR {!geofilt}&sfield=store&pt=45.15,-93.85&d=50&sort=geodist()+asc

Facet by Distance

To facet by distance, you can use the frange query parser:

&q=*:*&sfield=store&pt=45.15,-93.85&facet.query={!frange l=0 u=5}geodist()&facet.query={!frange l=5.001 u=3000}geodist()

There are other ways to do it too, like using a {!geofilt} in each facet.query.
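When more than two distance buckets are needed, the facet.query parameters in the Facet by Distance example can be generated rather than written by hand. The Python sketch below uses a hypothetical helper that takes explicit (lower, upper) pairs, mirroring the l=0 u=5 and l=5.001 u=3000 ranges above, and returns a list of request parameters. Adding facet=true is an assumption of this sketch; the example above omits it.

```python
def distance_facet_params(sfield, pt, ranges):
    """Build request parameters for distance facets (hypothetical helper).

    ranges is a list of (lower, upper) distance pairs, one per facet bucket.
    """
    params = [
        ("q", "*:*"),
        ("sfield", sfield),
        ("pt", pt),
        ("facet", "true"),  # assumption: faceting is enabled explicitly
    ]
    for lower, upper in ranges:
        # Doubled braces produce literal '{' and '}' in the local-params prefix.
        params.append(("facet.query", f"{{!frange l={lower} u={upper}}}geodist()"))
    return params


params = distance_facet_params("store", "45.15,-93.85", [(0, 5), (5.001, 3000)])
for name, value in params:
    print(f"{name}={value}")
```

A list of pairs is used instead of a dictionary because facet.query is repeated once per bucket.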
Boost Nearest Results

Using the DisMax or Extended DisMax, you can combine spatial search with the boost function to boost the nearest results:

&q.alt=*:*&fq={!geofilt}&sfield=store&pt=45.15,-93.85&d=50&bf=recip(geodist(),2,200,20)&sort=score desc

RPT

RPT refers to either SpatialRecursivePrefixTreeFieldType (aka simply RPT) or an extended version: RptWithGeometrySpatialField (aka RPT with Geometry). RPT offers several functional improvements over LatLonPointSpatialField:

• Non-geodetic: geo=false for general x & y (not latitude and longitude), if desired
• Query by polygons and other complex shapes, in addition to circles & rectangles
• Ability to index non-point shapes (e.g., polygons) as well as points; see RptWithGeometrySpatialField
• Heatmap grid faceting

RPT shares various features in common with LatLonPointSpatialField. Some are listed here:

• Latitude/Longitude indexed point data; possibly multi-valued
• Fast filtering with geofilt, bbox filters, and range query syntax (dateline crossing is supported)
• Well-Known-Text (WKT) shape syntax (required for specifying polygons & other complex shapes), and GeoJSON too. In addition to indexing and searching, this works with the wt=geojson (GeoJSON Solr response-writer) and [geo f=myfield] (geo Solr document-transformer).
• Sort/boost via geodist, although this is not recommended

Although RPT supports distance sorting/boosting, it is so inefficient at doing this that it might be removed in the future. Fortunately, you can use LatLonPointSpatialField as well as RPT. Use LLPSF for the distance sorting/boosting; it only needs to have docValues for this; the indexed attribute can be disabled as it won’t be used.

Schema Configuration for RPT

To use RPT, the field type must be registered and configured in schema.xml. There are many options for this field type.

name
The name of the field type.
class
This should be solr.SpatialRecursivePrefixTreeFieldType. But be aware that the Lucene spatial module includes other so-called "spatial strategies" besides RPT, notably TermQueryPT*, BBox, PointVector*, and SerializedDV. Solr requires a field type to parallel these in order to use them. The asterisked ones have them.

spatialContextFactory
This is a Java class name to an internal extension point governing support for shape definitions & parsing. If you require polygon support, set this to JTS, an alias for org.locationtech.spatial4j.context.jts.JtsSpatialContextFactory; otherwise it can be omitted. See important info below about JTS. (Note: prior to Solr 6, the "org.locationtech.spatial4j" part was "com.spatial4j.core" and there used to be no convenience JTS alias.)

geo
If true, the default, latitude and longitude coordinates will be used and the mathematical model will generally be a sphere. If false, the coordinates will be generic X & Y on a 2D plane using Euclidean/Cartesian geometry.

format
Defines the shape syntax/format to be used. Defaults to WKT but GeoJSON is another popular format. Spatial4j governs this feature and supports other formats. If a given shape is parseable as "lat,lon" or "x y" then that is always supported.

distanceUnits
This is used to specify the units for distance measurements used throughout the use of this field. This can be degrees, kilometers or miles. It is applied to nearly all distance measurements involving the field: maxDistErr, distErr, d, geodist and the score when score is distance, area, or area2D. However, it doesn’t affect distances embedded in WKT strings (e.g., BUFFER(POINT(200 10),0.2)), which are still in degrees.

distanceUnits defaults to either kilometers if geo is true, or degrees if geo is false.
distanceUnits replaces the units attribute, which is now deprecated and mutually exclusive with this attribute.

distErrPct
Defines the default precision of non-point shapes (both index & query), as a fraction from 0.0 (fully precise) to 0.5. The closer this number is to zero, the more accurate the shape will be. However, more precise indexed shapes use more disk space and take longer to index.

Bigger distErrPct values will make queries faster but less accurate. At query time this can be overridden in the query syntax, such as to 0.0 so as to not approximate the search shape. The default for the RPT field is 0.025.

For RptWithGeometrySpatialField (see below), there’s always complete accuracy with the serialized geometry, so this doesn’t control accuracy so much as it controls the trade-off of how big the index should be. distErrPct defaults to 0.15 for that field.

maxDistErr
Defines the highest level of detail required for indexed data. If left blank, the default is one meter, which is just a bit less than 0.000009 degrees. This setting is used internally to compute an appropriate maxLevels (see below).

worldBounds
Defines the valid numerical ranges for x and y, in the format of ENVELOPE(minX, maxX, maxY, minY). If geo="true", the standard lat-lon world boundaries are assumed. If geo=false, you should define your boundaries.

distCalculator
Defines the distance calculation algorithm. If geo=true, haversine is the default. If geo=false, cartesian will be the default. Other possible values are lawOfCosines, vincentySphere and cartesian^2.

prefixTree
Defines the spatial grid implementation. Since a PrefixTree (such as RecursivePrefixTree) maps the world as a grid, each grid cell is decomposed to another set of grid cells at the next level. If geo=true then the default prefix tree is geohash, otherwise it’s quad.
Geohash has 32 children at each level, quad has 4. Geohash can only be used for geo=true as it’s strictly geospatial.

A third choice is packedQuad, which is generally more efficient than quad, provided there are many levels (perhaps 20 or more).

maxLevels
Sets the maximum grid depth for indexed data. Instead, it’s usually more intuitive to compute an appropriate maxLevels by specifying maxDistErr.

And there are others: normWrapLongitude, datelineRule, validationRule, autoIndex, allowMultiOverlap, precisionModel. For further info, see notes below about spatialContextFactory implementations referenced above, especially the link to the JTS based one.

Standard Shapes

The RPT field types support a set of standard shapes: points, circles (aka buffered points), envelopes (aka rectangles or bounding boxes), line strings, polygons, and "multi" variants of these. The envelopes and line strings are Euclidean/cartesian (flat 2D) shapes. Underlying Solr is the Spatial4j library which implements them. To support other shapes, you can configure the spatialContextFactory attribute on the field type to reference other options. Two are available: JTS and Geo3D.

JTS and Polygons (flat)

The JTS Topology Suite is a popular computational geometry library with a Euclidean/cartesian (flat 2D) model. It supports a variety of shapes including polygons, buffering shapes, and some invalid polygon repair fall-backs. With the help of Spatial4j, included with Solr, the polygons support dateline (anti-meridian) crossing.

Unfortunately Solr cannot include JTS due to its LGPL license. You must download it (a JAR file) and put that in a special location internal to Solr: SOLR_INSTALL/server/solr-webapp/webapp/WEB-INF/lib/. You can readily download it here: https://repo1.maven.org/maven2/com/vividsolutions/jts-core/. It will not work if placed in other more typical Solr lib directories, unfortunately. JTS’s license is expected to be transitioned to BSD by the end of 2017.
Set the spatialContextFactory attribute on the field type to JTS.

When activated, there are additional configuration attributes available; see org.locationtech.spatial4j.context.jts.JtsSpatialContextFactory for the Javadocs, and remember to look at the superclass’s options as well. One option in particular you should most likely enable is autoIndex (i.e., use JTS’s PreparedGeometry) as it’s been shown to be a major performance boost for non-trivial polygons.

Once the field type has been defined, define a field that uses it.

Here’s an example polygon query for a field "geo" that can be either solr.SpatialRecursivePrefixTreeFieldType or RptWithGeometrySpatialField:

&q=*:*&fq={!field f=geo}Intersects(POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30)))

Inside the parentheses following the search predicate is the shape definition. The format of that shape is governed by the format attribute on the field type, defaulting to WKT. If you prefer GeoJSON, you can specify that instead.

Beyond this Reference Guide and Spatial4j’s docs, there are some details that remain at the Solr Wiki at http://wiki.apache.org/solr/SolrAdaptersForLuceneSpatial4.

Geo3D and Polygons (on the ellipsoid)

Geo3D is the colloquial name of the Lucene spatial-3d module, included with Solr. It’s a computational geometry library implementing a variety of shapes (including polygons) on a sphere or WGS84 ellipsoid. Geo3D is particularly suited for spatial applications where the geometries cover large distances across the globe. Geo3D is so named because its internal implementation uses geocentric coordinates (X,Y,Z), not because it supports 3-dimensional geometry, which it does not. Despite these internal details, you still supply latitude and longitude as you would normally in Solr.

Set the spatialContextFactory attribute on the field type to Geo3D.
Once the field type has been defined, define a field that uses it.

RptWithGeometrySpatialField

The RptWithGeometrySpatialField field type is a derivative of SpatialRecursivePrefixTreeFieldType that also stores the original geometry internally in Lucene DocValues, which it uses to achieve accurate search. It can also be used for indexed point fields. The Intersects predicate (the default) is particularly fast, since many search results can be returned as an accurate hit without requiring a geometry check. This field type is configured just like RPT except that the default distErrPct is 0.15 (higher than 0.025) because the grid squares are purely for performance and not to fundamentally represent the shape.

An optional in-memory cache can be defined in solrconfig.xml, which should be done when the data tends to have shapes with many vertices. Assuming you name your field "geom", you can configure an optional cache in solrconfig.xml by adding the following (notice the suffix of the cache name):

When using this field type, you will likely not want to mark the field as stored because it’s redundant with the DocValues data and surely larger because of the formatting (be it WKT or GeoJSON). To retrieve the spatial data in search results from DocValues, use the [geo] transformer (see Transforming Result Documents).

Heatmap Faceting

The RPT field supports generating a 2D grid of facet counts for documents having spatial data in each grid cell. For high-detail grids, this can be used to plot points, and for lesser detail it can be used for heatmap generation. The grid cells are determined at index-time based on RPT’s configuration. At facet counting time, the indexed cells in the region of interest are traversed and a grid of counters, one per cell, is incremented.
Solr can return the data in a straightforward 2D array of integers or in a PNG, which compresses better for larger data sets but must be decoded.

The heatmap feature is accessed from Solr’s faceting feature. As a part of faceting, it supports the key local parameter as well as excluding tagged filter queries, just like other types of faceting do. This allows multiple heatmaps to be returned on the same field with different filters.

facet
Set to true to enable faceting.

facet.heatmap
The field name of type RPT.

facet.heatmap.geom
The region to compute the heatmap on, specified using the rectangle-range syntax or WKT. It defaults to the world. Example: ["-180 -90" TO "180 90"].

facet.heatmap.gridLevel
A specific grid level, which determines how big each grid cell is. Defaults to being computed via distErrPct (or distErr).

facet.heatmap.distErrPct
A fraction of the size of geom used to compute gridLevel. Defaults to 0.15. It’s computed the same as a similarly named parameter for RPT.

facet.heatmap.distErr
A cell error distance used to pick the grid level indirectly. It’s computed the same as a similarly named parameter for RPT.

facet.heatmap.format
The format, either ints2D (default) or png.

You’ll experiment with different distErrPct values (probably 0.10 - 0.20) with various input geometries until the default size is what you’re looking for. The specific details of how it’s computed aren’t important. For high-detail grids used in point-plotting (loosely one cell per pixel), set distErr to be the number of decimal degrees of several pixels or so of the map being displayed. Also, you probably don’t want to use a geohash-based grid because the cell orientation between grid levels flip-flops between being square and rectangular. Quad is consistent and has more levels, albeit at the expense of a larger index.
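The advice about choosing distErr for point-plotting amounts to a small piece of arithmetic: the width of the displayed region in decimal degrees, divided by the width of the map in pixels, times the "several pixels" you are willing to merge into one cell. The Python sketch below illustrates this under the assumption of a geo=true field and a map whose width spans the requested longitudes linearly; the helper names and the choice of q and rows are hypothetical, while the facet.heatmap parameter names are the ones listed above.

```python
def dist_err_for_pixels(min_x, max_x, map_width_px, pixels=3):
    """Decimal degrees covered by `pixels` pixels of the displayed map."""
    degrees_per_pixel = (max_x - min_x) / map_width_px
    return degrees_per_pixel * pixels


def heatmap_params(field, min_x, min_y, max_x, max_y, map_width_px, pixels=3, fmt="ints2D"):
    """Build heatmap facet parameters for a displayed map region (a sketch)."""
    if fmt not in ("ints2D", "png"):
        raise ValueError("format must be ints2D or png")
    return {
        "q": "*:*",
        "rows": 0,  # only the facet data is of interest in this sketch
        "facet": "true",
        "facet.heatmap": field,
        # Rectangle-range syntax, as in the facet.heatmap.geom example.
        "facet.heatmap.geom": f'["{min_x} {min_y}" TO "{max_x} {max_y}"]',
        "facet.heatmap.distErr": dist_err_for_pixels(min_x, max_x, map_width_px, pixels),
        "facet.heatmap.format": fmt,
    }


params = heatmap_params("geom", -180, -90, 180, 90, map_width_px=720, pixels=2)
for name, value in params.items():
    print(f"{name}={value}")
```

For a world view rendered 720 pixels wide, two pixels correspond to one decimal degree, so the sketch requests a distErr of 1.0.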
Here’s some sample output in JSON (with "…" inserted for brevity):

{gridLevel=6,columns=64,rows=64,minX=-180.0,maxX=180.0,minY=-90.0,maxY=90.0,
counts_ints2D=[[0, 0, 2, 1, ....],[1, 1, 3, 2, ...],...]}

The output shows the gridLevel, which is interesting since it’s often computed from other parameters. If an interface being developed allows an explicit resolution increase/decrease feature then subsequent requests can specify the gridLevel explicitly.

The minX, maxX, minY, and maxY values report the region where the counts are. This is the minimally enclosing bounding rectangle of the input geom at the target grid level. This may wrap the dateline. The columns and rows values are how many columns and rows the output rectangle is to be divided into evenly. Note: Don’t divide an on-screen projected map rectangle evenly to plot these rectangles/points since the cell data is in the coordinate space of decimal degrees if geo=true or whatever units were given if geo=false. This could be arranged to be the same as an on-screen map but won’t necessarily be.

The counts_ints2D key has a 2D array of integers. The initial outer level is in row order (top-down), then the inner arrays are the columns (left-right). If any array would be all zeros, a null is returned instead for efficiency reasons. The entire value is null if there is no matching spatial data.

If format=png then the output key is counts_png. It’s a base-64 encoded string of a PNG with 4 bytes per pixel. The PNG logically holds exactly the same data that the ints2D format does. Note that the alpha channel byte is flipped to make it easier to view the PNG for diagnostic purposes, since otherwise counts would have to exceed 2^24 before it becomes non-opaque. Thus counts greater than this value will become opaque.

BBoxField

The BBoxField field type indexes a single rectangle (bounding box) per document field and supports searching via a bounding box.
It supports most spatial search predicates and has enhanced relevancy modes based on the overlap or area between the search rectangle and the indexed rectangle. It’s particularly useful for its relevancy modes. To configure it in the schema, use a configuration like this:

BBoxField is actually based off of 4 instances of another field type referred to by numberType. It also uses a boolean to flag a dateline cross. Assuming you want to use the relevancy feature, docValues is required. Some of the attributes are in common with the RPT field like geo, units, worldBounds, and spatialContextFactory because they share some of the same spatial infrastructure.

To index a box, add a field value to a bbox field that’s a string in the WKT/CQL ENVELOPE syntax. Example: ENVELOPE(-10, 20, 15, 10), which is in minX, maxX, maxY, minY order. The parameter ordering is unintuitive but that’s what the spec calls for. Alternatively, you could provide a rectangular polygon in WKT (or GeoJSON if you set format="GeoJSON").

To search, you can use the {!bbox} query parser, or the range syntax, e.g., [10,-10 TO 15,20], or the ENVELOPE syntax wrapped in parentheses with a leading search predicate. The latter is the only way to choose a predicate other than Intersects. For example:

&q={!field f=bbox}Contains(ENVELOPE(-10, 20, 15, 10))

Now to sort the results by one of the relevancy modes, use it like this:

&q={!field f=bbox score=overlapRatio}Intersects(ENVELOPE(-10, 20, 15, 10))

The score local parameter can be one of overlapRatio, area, and area2D. area scores by the document area using surface-of-a-sphere (assuming geo=true) math, while area2D uses simple width * height. overlapRatio computes a [0-1] ranged score based on how much overlap exists relative to the document’s area and the query area.
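Because the ENVELOPE argument order is easy to get wrong, a client may want a small helper that takes the corners in the more familiar min/max order. The sketch below is only an illustration; the helper names envelope and bbox_query are hypothetical, and the field name bbox matches the examples above.

```python
def envelope(min_x, max_x, min_y, max_y):
    """Format a WKT/CQL ENVELOPE string in minX, maxX, maxY, minY order."""
    if min_y > max_y:
        raise ValueError("min_y must not exceed max_y")
    # min_x may exceed max_x when the box crosses the dateline, so x is not checked.
    return f"ENVELOPE({min_x}, {max_x}, {max_y}, {min_y})"


def bbox_query(field, predicate, env, score=None):
    """Build a {!field} query string with an optional score local parameter."""
    local = f"{{!field f={field}"
    if score:
        local += f" score={score}"
    local += "}"
    return f"{local}{predicate}({env})"


env = envelope(-10, 20, 10, 15)
print(bbox_query("bbox", "Intersects", env, score="overlapRatio"))
```

The string still has to be URL-encoded before it is sent as the q parameter.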
The javadocs of BBoxOverlapRatioValueSource have more info on the formula. There is an additional parameter queryTargetProportion that allows you to weight the query side of the formula to the index (target) side of the formula. You can also use &debug=results to see useful score computation info.

The Terms Component

The Terms Component provides access to the indexed terms in a field and the number of documents that match each term. This can be useful for building an auto-suggest feature or any other feature that operates at the term level instead of the search or document level. Retrieving terms in index order is very fast since the implementation directly uses Lucene’s TermsEnum to iterate over the term dictionary.

In a sense, this search component provides fast field-faceting over the whole index, not restricted by the base query or any filters. The document frequencies returned are the number of documents that match the term, including any documents that have been marked for deletion but not yet removed from the index.

Configuring the Terms Component

By default, the Terms Component is already configured in solrconfig.xml for each collection.

Defining the Terms Component

Defining the Terms search component is straightforward: simply give it a name and use the class solr.TermsComponent.

This makes the component available for use, but it is not usable until it is included in a request handler.

Using the Terms Component in a Request Handler

The terms component is included with the /terms request handler, which is among Solr’s out-of-the-box request handlers (see Implicit RequestHandlers).

Note that the defaults for this request handler set the parameter "terms" to true, which allows terms to be returned on request. The parameter "distrib" is set to false, which allows this handler to be used only on a single Solr core.
You could add this component to another handler if you wanted to, and pass "terms=true" in the HTTP request in order to get terms back. If it is only defined in a separate handler, you must use that handler when querying in order to get terms and not regular documents as results.

Terms Component Parameters

The parameters below allow you to control what terms are returned. You can configure any of these with the request handler if you’d like to set them permanently, or you can add them to the query request. These parameters are:

terms
If set to true, enables the Terms Component. By default, the Terms Component is off (false).
Example: terms=true

terms.fl
Specifies the field from which to retrieve terms. This parameter is required if terms=true.
Example: terms.fl=title

terms.list
Fetches the document frequency for a comma-delimited list of terms. Terms are always returned in index order. If terms.ttf is set to true, also returns their total term frequency. If multiple terms.fl are defined, these statistics will be returned for each term in each requested field.
Example: terms.list=termA,termB,termC

terms.limit
Specifies the maximum number of terms to return. The default is 10. If the limit is set to a number less than 0, then no maximum limit is enforced. Although this is not required, either this parameter or terms.upper must be defined.
Example: terms.limit=20

terms.lower
Specifies the term at which to start. If not specified, the empty string is used, causing Solr to start at the beginning of the field.
Example: terms.lower=orange

terms.lower.incl
If set to true, includes the lower-bound term (specified with terms.lower) in the result set.
Example: terms.lower.incl=false

terms.mincount
Specifies the minimum document frequency to return in order for a term to be included in a query response.
Results are inclusive of the mincount (that is, >= mincount).
Example: terms.mincount=5

terms.maxcount
Specifies the maximum document frequency a term must have in order to be included in a query response. The default setting is -1, which sets no upper bound. Results are inclusive of the maxcount (that is, <= maxcount).
Example: terms.maxcount=25

terms.prefix
Restricts matches to terms that begin with the specified string.
Example: terms.prefix=inter

terms.raw
If set to true, returns the raw characters of the indexed term, regardless of whether it is human-readable. For instance, the indexed form of numeric values is not human-readable.
Example: terms.raw=true

terms.regex
Restricts matches to terms that match the regular expression.
Example: terms.regex=.*pedist

terms.regex.flag
Defines a Java regex flag to use when evaluating the regular expression defined with terms.regex. See http://docs.oracle.com/javase/tutorial/essential/regex/pattern.html for details of each flag. Valid options are:
• case_insensitive
• comments
• multiline
• literal
• dotall
• unicode_case
• canon_eq
• unix_lines
Example: terms.regex.flag=case_insensitive

terms.stats
Include index statistics in the results. Currently returns only the numDocs for a collection. When combined with terms.list it provides enough information to compute inverse document frequency (IDF) for a list of terms.

terms.sort
Defines how to sort the terms returned. Valid options are count, which sorts by the term frequency, with the highest term frequency first, or index, which sorts in index order.
Example: terms.sort=index

terms.ttf
If set to true, returns both df (docFreq) and ttf (totalTermFreq) statistics for each requested term in terms.list. In this case, each term in the response carries both values, for example a df of 22 and a ttf of 73.

terms.upper
Specifies the term to stop at.
Although this parameter is not required, either this parameter or terms.limit must be defined.
Example: terms.upper=plum

terms.upper.incl
If set to true, the upper bound term is included in the result set. The default is false.
Example: terms.upper.incl=true

The response to a terms request is a list of the terms and their document frequency values.

You may also be interested in the TermsComponent javadoc.

Terms Component Examples

All of the following sample queries work with Solr’s “bin/solr -e techproducts” example.

Get Top 10 Terms

This query requests the first ten terms in the name field:

http://localhost:8983/solr/techproducts/terms?terms.fl=name&wt=xml

Results (term names omitted): status 0, QTime 2, and ten term counts of 5, 3, 3, 3, 3, 3, 3, 3, 3, 3.

Get First 10 Terms Starting with Letter 'a'

This query requests the first ten terms in the name field, in index order (instead of the top 10 results by document count):

http://localhost:8983/solr/techproducts/terms?terms.fl=name&terms.lower=a&terms.sort=index&wt=xml

Results (term names omitted): status 0, QTime 0, and ten term counts of 1 each.

SolrJ Invocation

SolrQuery query = new SolrQuery();
query.setRequestHandler("/terms");
query.setTerms(true);
query.setTermsLimit(5);
query.setTermsLower("s");
query.setTermsPrefix("s");
query.addTermsField("terms_s");
query.setTermsMinCount(1);
QueryRequest request = new QueryRequest(query);
List<TermsResponse.Term> terms = request.process(getSolrClient()).getTermsResponse().getTerms("terms_s");

Using the Terms Component for an Auto-Suggest Feature

If the Suggester doesn’t suit your needs, you can use the Terms component in Solr to build a similar feature for your own search application. Simply submit a query specifying whatever characters the user has typed so far as a prefix.
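A client-side helper for this could look roughly like the sketch below. It only builds the request URL and parses a response body; no network call is made. The function names suggest_url and parse_terms are hypothetical, and the parser assumes the flat JSON layout in which term names and counts alternate in one array.

```python
import json
from urllib.parse import urlencode


def suggest_url(base, field, typed, limit=10):
    """Build a /terms request URL for the characters typed so far."""
    params = {
        "terms.fl": field,
        "terms.prefix": typed,
        "terms.limit": limit,
        "omitHeader": "true",
        "wt": "json",
    }
    return f"{base}/terms?{urlencode(params)}"


def parse_terms(body, field):
    """Turn the alternating [term, count, term, count, ...] list into pairs."""
    flat = json.loads(body)["terms"][field]
    return list(zip(flat[0::2], flat[1::2]))


print(suggest_url("http://localhost:8983/solr/techproducts", "name", "at"))
# Sample body shaped like a /terms JSON response; the values are illustrative.
print(parse_terms('{"terms": {"name": ["ata", 1, "ati", 1]}}', "name"))
```

Keeping the limit small keeps the response short enough to request on every keystroke.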
For example, if the user has typed "at", the search engine’s interface would submit the following query:

http://localhost:8983/solr/techproducts/terms?terms.fl=name&terms.prefix=at&wt=xml

Result (term names omitted): status 0, QTime 1, and two term counts of 1 each.

You can use the parameter omitHeader=true to omit the response header from the query response, as in this example, which also returns the response in JSON format:

http://localhost:8983/solr/techproducts/terms?terms.fl=name&terms.prefix=at&omitHeader=true

Result:

{
  "terms": {
    "name": [
      "ata", 1,
      "ati", 1
    ]
  }
}

Distributed Search Support

The TermsComponent also supports distributed indexes. For the /terms request handler, you must provide the following two parameters:

shards
Specifies the shards in your distributed indexing configuration. For more information about distributed indexing, see Distributed Search with Index Sharding.

shards.qt
Specifies the request handler Solr uses for requests to shards.

The Term Vector Component

The TermVectorComponent is a search component designed to return additional information about documents matching your search.

For each document in the response, the TermVectorComponent can return the term vector, the term frequency, inverse document frequency, position, and offset information.

Term Vector Component Configuration

The TermVectorComponent is not enabled implicitly in Solr; it must be explicitly configured in your solrconfig.xml file. The examples on this page show how it is configured in Solr’s “techproducts” example:

bin/solr -e techproducts

To enable this component, you need to configure it using a searchComponent element:

A request handler must then be configured to use this component name.
In the techproducts example, the component is associated with a special request handler named /tvrh that enables term vectors by default using the tv=true parameter, but you can associate it with any request handler. In that handler definition the tv default is set to true and the component is referenced by the name tvComponent.

Once your handler is defined, you may use it in conjunction with any schema (that has a uniqueKeyField) to fetch term vectors for fields configured with the termVector attribute, such as in the techproducts sample schema.

Invoking the Term Vector Component

The example below shows an invocation of this component using the above configuration:

http://localhost:8983/solr/techproducts/tvrh?q=*:*&start=0&rows=10&fl=id,includes&wt=xml

The id values appearing in the response are: GB18030TEST, EN7800GTX/2DHTV/256M, 100-435805, 3007WFP, SOLR1000, 0579B002, UTF8TEST, 9885A004, adata, apple.

Term Vector Request Parameters

The example below shows some of the available request parameters for this component:

http://localhost:8983/solr/techproducts/tvrh?q=includes:[* TO *]&rows=10&indent=true&tv=true&tv.tf=true&tv.df=true&tv.positions=true&tv.offsets=true&tv.payloads=true&tv.fl=includes

tv
If true, the Term Vector Component will run.

tv.docIds
For a given comma-separated list of Lucene document IDs (not the Solr Unique Key), term vectors will be returned.

tv.fl
For a given comma-separated list of fields, term vectors will be returned. If not specified, the fl parameter is used.

tv.all
If true, all the boolean parameters listed below (tv.df, tv.offsets, tv.positions, tv.payloads, tv.tf and tv.tf_idf) will be enabled.

tv.df
If true, returns the Document Frequency (DF) of the term in the collection.
This can be computationally expensive.

tv.offsets
If true, returns offset information for each term in the document.

tv.positions
If true, returns position information.

tv.payloads
If true, returns payload information.

tv.tf
If true, returns document term frequency info for each term in the document.

tv.tf_idf
If true, calculates TF / DF (i.e., TF * IDF) for each term. Please note that this is a literal calculation of "Term Frequency multiplied by Inverse Document Frequency" and not a classical TF-IDF similarity measure.
This parameter requires both tv.tf and tv.df to be "true". This can be computationally expensive. (The results are not shown in example output.)

To see an example of TermVector component output, see the Wiki page: http://wiki.apache.org/solr/TermVectorComponentExampleOptions

For schema requirements, see also the section Field Properties by Use Case.

SolrJ and the Term Vector Component

Neither the SolrQuery class nor the QueryResponse class offers specific method calls to set Term Vector Component parameters or get the "termVectors" output. However, there is a patch for it: SOLR-949.

The Stats Component

The Stats component returns simple statistics for numeric, string, and date fields within the document set.

The sample queries in this section assume you are running the “techproducts” example included with Solr:

bin/solr -e techproducts

Stats Component Parameters

The Stats Component accepts the following parameters:

stats
If true, then invokes the Stats component.

stats.field
Specifies a field for which statistics should be generated. This parameter may be invoked multiple times in a query in order to request statistics on multiple fields.
Local Parameters may be used to indicate which subset of the supported statistics should be computed, and/or that statistics should be computed over the results of an arbitrary numeric function (or query) instead of a simple field name. See the examples below.

Stats Component Example

The query below demonstrates computing stats against two different numeric fields, as well as stats over the results of a termfreq() function call using the text field:

http://localhost:8983/solr/techproducts/select?q=*:*&wt=xml&stats=true&stats.field={!func}termfreq('text','memory')&stats.field=price&stats.field=popularity&rows=0&indent=true

Results (min, max, count, missing, sum, sumOfSquares, mean, stddev):

termfreq('text','memory') function: 0.0, 3.0, 32, 0, 10.0, 22.0, 0.3125, 0.7803018439949604
price: 0.0, 2199.0, 16, 16, 5251.270030975342, 6038619.175900028, 328.20437693595886, 536.3536996709846
popularity: 0.0, 10.0, 15, 17, 85.0, 603.0, 5.666666666666667, 2.943920288775949

Statistics Supported

The table below explains the statistics supported by the Stats component. Not all statistics are supported for all field types, and not all statistics are computed by default (see Local Parameters with the Stats Component below for details).

min
The minimum value of the field/function in all documents in the set. This statistic is computed for all field types and is computed by default.

max
The maximum value of the field/function in all documents in the set. This statistic is computed for all field types and is computed by default.

sum
The sum of all values of the field/function in all documents in the set. This statistic is computed for numeric and date field types and is computed by default.

count
The number of values found in all documents in the set for this field/function. This statistic is computed for all field types and is computed by default.
missing
The number of documents in the set which do not have a value for this field/function. This statistic is computed for all field types and is computed by default.

sumOfSquares
Sum of all values squared (a by-product of computing stddev). This statistic is computed for numeric and date field types and is computed by default.

mean
The average (v1 + v2 + … + vN)/N. This statistic is computed for numeric and date field types and is computed by default.

stddev
Standard deviation, measuring how widely spread the values in the data set are. This statistic is computed for numeric and date field types and is computed by default.

percentiles
A list of percentile values based on cut-off points specified by the parameter value, such as 1,99,99.9. These values are an approximation, using the t-digest algorithm. This statistic is computed for numeric field types and is not computed by default.

distinctValues
The set of all distinct values for the field/function in all of the documents in the set. This calculation can be very expensive for fields that do not have a tiny cardinality. This statistic is computed for all field types but is not computed by default.

countDistinct
The exact number of distinct values in the field/function in all of the documents in the set. This calculation can be very expensive for fields that do not have a tiny cardinality. This statistic is computed for all field types but is not computed by default.

cardinality
A statistical approximation (currently using the HyperLogLog algorithm) of the number of distinct values in the field/function in all of the documents in the set. This calculation is much more efficient than using the countDistinct option, but may not be 100% accurate. Input for this option can be a floating point number between 0.0 and 1.0 indicating how aggressively the algorithm should try to be accurate: 0.0 means use as little memory as possible; 1.0 means use as much memory as needed to be as accurate as possible.
true is supported as an alias for 0.3. This statistic is computed for all field types but is not computed by default.

Local Parameters with the Stats Component

Similar to the Facet Component, the stats.field parameter supports local parameters for:

• Tagging & Excluding Filters: stats.field={!ex=filterA}price
• Changing the Output Key: stats.field={!key=my_price_stats}price
• Tagging stats for use with facet.pivot: stats.field={!tag=my_pivot_stats}price

Local parameters can also be used to specify individual statistics by name, overriding the set of statistics computed by default, e.g., stats.field={!min=true max=true percentiles='99,99.9,99.99'}price.

If any supported statistics are specified via local parameters, then the entire set of default statistics is overridden and only the requested statistics are computed.

Additional "Expert" local params are supported in some cases for affecting the behavior of some statistics:

• percentiles
  ◦ tdigestCompression - a positive numeric value defaulting to 100.0 controlling the compression factor of the T-Digest. Larger values mean more accuracy, but also use more memory.
• cardinality
  ◦ hllPreHashed - a boolean option indicating that the statistics are being computed over a "long" field that has already been hashed at index time, allowing the HLL computation to skip this step.
  ◦ hllLog2m - an integer value specifying an explicit "log2m" value to use, overriding the heuristic value determined by the cardinality local param and the field type; see the java-hll documentation for more details.
  ◦ hllRegwidth - an integer value specifying an explicit "regwidth" value to use, overriding the heuristic value determined by the cardinality local param and the field type; see the java-hll documentation for more details.

Examples with Local Parameters

Here we compute some statistics for the price field.
The min, max, mean, 90th, and 99th percentile price values are computed against all products that are in stock (q=*:* and fq=inStock:true), and independently all of the default statistics are computed against all products regardless of whether they are in stock or not (by excluding that filter).

http://localhost:8983/solr/techproducts/select?q=*:*&fq={!tag=stock_check}inStock:true&stats=true&stats.field={!ex=stock_check+key=instock_prices+min=true+max=true+mean=true+percentiles='90,99'}price&stats.field={!key=all_prices}price&rows=0&indent=true&wt=xml

Results:

instock_prices: min 0.0, max 2199.0, mean 328.20437693595886, 90th percentile 564.9700012207031, 99th percentile 1966.6484985351556
all_prices: min 0.0, max 2199.0, count 12, missing 5, sum 4089.880027770996, sumOfSquares 5385249.921747174, mean 340.823335647583, stddev 602.3683083752779

Note that in the request as written the ex=stock_check exclusion is attached to the entry keyed instock_prices, so that entry is the one computed without the inStock:true filter, while the entry keyed all_prices is computed with the filter applied (count 12, missing 5).

The Stats Component and Faceting

Sets of stats.field parameters can be referenced by 'tag' when using Pivot Faceting to compute multiple statistics at every level (i.e., field) in the tree of pivot constraints.

For more information and a detailed example, please see Combining Stats Component With Pivots.

The Query Elevation Component

The Query Elevation Component lets you configure the top results for a given query regardless of the normal Lucene scoring. This is sometimes called "sponsored search", "editorial boosting", or "best bets." This component matches the user query text to a configured map of top results. The text can be any string or non-string IDs, as long as it’s indexed. Although this component will work with any QueryParser, it makes the most sense to use with DisMax or eDisMax.

The Query Elevation Component also supports distributed searching.
All of the sample configuration and queries used in this section assume you are running Solr’s “techproducts” example:

bin/solr -e techproducts

Configuring the Query Elevation Component

You can configure the Query Elevation Component in the solrconfig.xml file. Search components like QueryElevationComponent may be added to any request handler; a dedicated request handler is used here for brevity. In the sample configuration, queryFieldType is set to string and config-file to elevate.xml, and the request handler uses explicit parameter echoing with the elevator component attached.

Optionally, in the Query Elevation Component configuration you can also specify the following to distinguish editorial results from "normal" results:

foo

The Query Elevation Search Component takes the following parameters:

queryFieldType
Specifies which fieldType should be used to analyze the incoming text. For example, it may be appropriate to use a fieldType with a LowerCaseFilter.

config-file
Path to the file that defines query elevation. This file must exist in <instanceDir>/conf/<config-file> or <dataDir>/<config-file>. If the file exists in the conf/ directory it will be loaded once at startup. If it exists in the data/ directory, it will be reloaded for each IndexReader.

forceElevation
By default, this component respects the requested sort parameter: if the request asks to sort by date, it will order the results by date. If forceElevation=true (the default is false), results will first return the boosted docs, then order by date.

The elevate.xml File

Elevated query results can be configured in an external XML file specified in the config-file argument. An elevate.xml file might look like this:

In this example, the query "foo bar" would first return documents 1, 2 and 3, then whatever normally appears for the same query. For the query "ipod", it would first return "MA147LL/A", and would make sure that "IW-02" is not in the result set.
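To make the behavior above concrete, the sketch below embeds an elevate.xml document that would produce it and reads it into a plain mapping. The XML is a reconstruction that assumes the elevate/query/doc element layout with text, id, and exclude attributes; the function name load_elevations is hypothetical and this is not how Solr itself loads the file.

```python
import xml.etree.ElementTree as ET

# Assumed elevate.xml content matching the behavior described above.
ELEVATE_XML = """
<elevate>
  <query text="foo bar">
    <doc id="1" />
    <doc id="2" />
    <doc id="3" />
  </query>
  <query text="ipod">
    <doc id="MA147LL/A" />
    <doc id="IW-02" exclude="true" />
  </query>
</elevate>
"""


def load_elevations(xml_text):
    """Map each query text to its elevated and excluded document ids."""
    result = {}
    for query in ET.fromstring(xml_text.strip()).findall("query"):
        elevated, excluded = [], []
        for doc in query.findall("doc"):
            # Docs flagged exclude="true" are removed rather than boosted.
            target = excluded if doc.get("exclude") == "true" else elevated
            target.append(doc.get("id"))
        result[query.get("text")] = {"elevate": elevated, "exclude": excluded}
    return result


for text, rules in load_elevations(ELEVATE_XML).items():
    print(text, rules)
```

The order of the doc entries matters for elevated documents, since it determines the order in which they appear at the top of the results.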
If documents to be elevated are not defined in the elevate.xml file, they should be passed in at query time with the elevateIds parameter.

Using the Query Elevation Component

The enableElevation Parameter

For debugging it may be useful to see results with and without the elevated docs. To hide the elevated results, use enableElevation=false:

http://localhost:8983/solr/techproducts/elevate?q=ipod&df=text&debugQuery=true&enableElevation=true

http://localhost:8983/solr/techproducts/elevate?q=ipod&df=text&debugQuery=true&enableElevation=false

The forceElevation Parameter

You can force elevation during runtime by adding forceElevation=true to the query URL:

http://localhost:8983/solr/techproducts/elevate?q=ipod&df=text&debugQuery=true&enableElevation=true&forceElevation=true

The exclusive Parameter

You can force Solr to return only the results specified in the elevation file by adding exclusive=true to the URL:

http://localhost:8983/solr/techproducts/elevate?q=ipod&df=text&debugQuery=true&exclusive=true

Document Transformers and the markExcludes Parameter

The [elevated] Document Transformer can be used to annotate each document with information about whether or not it was elevated:

http://localhost:8983/solr/techproducts/elevate?q=ipod&df=text&fl=id,[elevated]

Likewise, it can be helpful when troubleshooting to see all matching documents, including documents that the elevation configuration would normally exclude. This is possible by using the markExcludes=true parameter, and then using the [excluded] transformer:

http://localhost:8983/solr/techproducts/elevate?q=ipod&df=text&markExcludes=true&fl=id,[elevated],[excluded]

The elevateIds and excludeIds Parameters

When the elevation component is in use, the pre-configured list of elevations for a query can be overridden at request time to use the unique keys specified in these request parameters.
For example, in the request below documents 3007WFP and 9885A004 will be elevated, and document IW-02 will be excluded, regardless of what elevations or exclusions are configured for the query "cable" in elevate.xml:

http://localhost:8983/solr/techproducts/elevate?q=cable&df=text&excludeIds=IW-02&elevateIds=3007WFP,9885A004

If either one of these parameters is specified at request time, the entire elevation configuration for the query is ignored.

For example, in the request below documents IW-02 and F8V7067-APL-KIT will be elevated, and no documents will be excluded, regardless of what elevations or exclusions are configured for the query "ipod" in elevate.xml:

http://localhost:8983/solr/techproducts/elevate?q=ipod&df=text&elevateIds=IW-02,F8V7067-APL-KIT

The fq Parameter with Elevation

Query elevation respects the standard filter query (fq) parameter. That is, if the query contains the fq parameter, all results will be within that filter even if elevate.xml adds other documents to the result set.

Response Writers

A Response Writer generates the formatted response of a search.

Solr supports a variety of Response Writers to ensure that query responses can be parsed by the appropriate language or application.

The wt parameter selects the Response Writer to be used. The list below shows the most common settings for the wt parameter, with links to further sections that discuss them in more detail.

• csv
• geojson
• javabin
• json
• php
• phps
• python
• ruby
• smile
• velocity
• xlsx
• xml
• xslt

JSON Response Writer

The default Solr Response Writer is the JsonResponseWriter, which formats output in JavaScript Object Notation (JSON), a lightweight data interchange format specified in RFC 4627.
If you do not set the wt parameter in your request, you will get JSON by default.

Here is a sample response for a simple query like q=id:VS1GB400C3:

{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":7,
    "params":{
      "q":"id:VS1GB400C3"}},
  "response":{"numFound":1,"start":0,"maxScore":2.3025851,"docs":[
      {
        "id":"VS1GB400C3",
        "name":["CORSAIR ValueSelect 1GB 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) System Memory - Retail"],
        "manu":["Corsair Microsystems Inc."],
        "manu_id_s":"corsair",
        "cat":["electronics", "memory"],
        "price":[74.99],
        "popularity":[7],
        "inStock":[true],
        "store":["37.7752,-100.0232"],
        "manufacturedate_dt":"2006-02-13T15:26:37Z",
        "payloads":["electronics|4.0 memory|2.0"],
        "_version_":1549728120626479104}]
  }}

The default mime type for the JSON writer is application/json; however, this can be overridden in solrconfig.xml. The “techproducts” configuration, for example, sets it to text/plain.

JSON-Specific Parameters

json.nl

This parameter controls the output format of NamedLists, where order is more important than access by name. NamedList is currently used for field faceting data.

The json.nl parameter takes the following values:

flat
The default. NamedList is represented as a flat array, alternating names and values.
With input of NamedList("a"=1, "bar"="foo", null=3, null=null), the output would be ["a",1, "bar","foo", null,3, null,null].

map
NamedList is represented as a JSON object. Although this is the simplest mapping, a NamedList can have optional keys and repeated keys, and it preserves order. Using a JSON object (essentially a map or hash) for a NamedList results in the loss of some information.
With input of NamedList("a"=1, "bar"="foo", null=3, null=null), the output would be {"a":1, "bar":"foo", "":3, "":null}.

arrarr
NamedList is represented as an array of two-element arrays. With input of NamedList("a"=1, "bar"="foo", null=3, null=null), the output would be [["a",1], ["bar","foo"], [null,3], [null,null]].

arrmap
NamedList is represented as an array of JSON objects. With input of NamedList("a"=1, "bar"="foo", null=3, null=null), the output would be [{"a":1}, {"bar":"foo"}, 3, null].

arrntv
NamedList is represented as an array of Name Type Value JSON objects. With input of NamedList("a"=1, "bar"="foo", null=3, null=null), the output would be [{"name":"a","type":"int","value":1}, {"name":"bar","type":"str","value":"foo"}, {"name":null,"type":"int","value":3}, {"name":null,"type":"null","value":null}].

json.wrf
json.wrf=function adds a wrapper function around the JSON response, useful in AJAX with dynamic script tags for specifying a JavaScript callback function. For background, see:

• http://www.xml.com/pub/a/2005/12/21/json-dynamic-script-tag.html
• http://www.theurer.cc/blog/2005/12/15/web-services-json-dump-your-proxy/

Standard XML Response Writer

The XML Response Writer is the most general purpose and reusable Response Writer currently included with Solr. It is the format used in most discussions and documentation about the response of Solr queries.

Note that the XSLT Response Writer can be used to convert the XML produced by this writer to other vocabularies or text-based formats.

The behavior of the XML Response Writer can be driven by the following query parameters.

version
The version parameter determines the XML protocol used in the response. Clients are strongly encouraged to always specify the protocol version, so as to ensure that the format of the response they receive does not change unexpectedly if the Solr server is upgraded and a new default format is introduced.
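As an illustration of pinning the protocol version on the client side, the following Python sketch builds a request URL with wt=xml and an explicit version, then reads the status, hit count, and document ids out of an XML response. The helper names (build_xml_url, summarize) and the SAMPLE_XML string are hypothetical and hand-written for this sketch; the sample assumes the usual layout of a responseHeader list followed by a result element and was not captured from a running server.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Host, port and collection follow the examples used elsewhere in this guide.
SELECT_URL = "http://localhost:8983/solr/techproducts/select"


def build_xml_url(query, version="2.2"):
    """Return a select URL that requests XML and pins the protocol version."""
    params = {"q": query, "wt": "xml", "version": version}
    return SELECT_URL + "?" + urlencode(params)


def summarize(xml_text):
    """Extract status, numFound and document ids from an XML response."""
    root = ET.fromstring(xml_text)
    status = root.find("lst[@name='responseHeader']/int[@name='status']")
    result = root.find("result[@name='response']")
    ids = [doc.findtext("str[@name='id']") for doc in result.findall("doc")]
    return {
        "status": int(status.text),
        "numFound": int(result.get("numFound")),
        "ids": ids,
    }


# Hand-written sample (an assumption about the response shape, see lead-in).
SAMPLE_XML = """<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">7</int></lst>
<result name="response" numFound="1" start="0">
  <doc><str name="id">VS1GB400C3</str></doc>
</result>
</response>"""

if __name__ == "__main__":
    print(build_xml_url("id:VS1GB400C3"))
    print(summarize(SAMPLE_XML))
```

In a real client the text passed to summarize would come from an HTTP GET on the URL returned by build_xml_url; that step is left out so the sketch runs without a Solr server.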
The only currently supported version value is 2.2. As of this version, the responseHeader uses the same structure as the rest of the response. The default value is the latest supported.

stylesheet
The stylesheet parameter can be used to direct Solr to include a <?xml-stylesheet type="text/xsl" href="..."?> declaration in the XML response it returns.

The default behavior is not to return any stylesheet declaration at all.

Use of the stylesheet parameter is discouraged, as there is currently no way to specify external stylesheets, and no stylesheets are provided in the Solr distributions. This is a legacy parameter, which may be developed further in a future release.

indent
If the indent parameter is used, and has a non-blank value, then Solr will make some attempts at indenting its XML response to make it more readable by humans.

The default behavior is not to indent.

XSLT Response Writer

The XSLT Response Writer applies an XML stylesheet to output. It can be used for tasks such as formatting results for an RSS feed.

tr Parameter

The XSLT Response Writer accepts one parameter: the tr parameter, which identifies the XML transformation to use. The transformation must be found in the Solr conf/xslt directory.

The Content-Type of the response is set according to the <xsl:output> statement in the XSLT transform, for example:

    <xsl:output media-type="text/html"/>

XSLT Configuration

The example below, from the sample_techproducts_configs config set in the Solr distribution, shows how the XSLT Response Writer is configured.

    <queryResponseWriter name="xslt" class="solr.XSLTResponseWriter">
      <int name="xsltCacheLifetimeSeconds">5</int>
    </queryResponseWriter>

A value of 5 for xsltCacheLifetimeSeconds is good for development, to see XSLT changes quickly. For production you probably want a much higher value.

Binary Response Writer

This is a custom binary format used by Solr for inter-node communication as well as client-server communication.
SolrJ uses this as the default for indexing as well as querying. See Client APIs for more details.

GeoJSON Response Writer

Returns Solr results in GeoJSON augmented with Solr-specific JSON. To use this, set wt=geojson and geojson.field to the name of a spatial Solr field. Not all spatial field types are supported, and you'll get an error if you use an unsupported one.

Python Response Writer

Solr has an optional Python response format that extends its JSON output in the following ways to allow the response to be safely evaluated by the python interpreter:

• true and false changed to True and False
• Python unicode strings are used where needed
• ASCII output (with unicode escapes) is used for less error-prone interoperability
• newlines are escaped
• null changed to None

PHP Response Writer and PHP Serialized Response Writer

Solr has a PHP response format that outputs an array (as PHP code) which can be evaluated. Setting the wt parameter to php invokes the PHP Response Writer.

Example usage:

    $code = file_get_contents('http://localhost:8983/solr/techproducts/select?q=iPod&wt=php');
    // Single quotes keep $result from being interpolated before eval() runs
    eval('$result = ' . $code . ';');
    print_r($result);

Solr also includes a PHP Serialized Response Writer that formats output in a serialized array. Setting the wt parameter to phps invokes the PHP Serialized Response Writer.

Example usage:

    $serializedResult = file_get_contents('http://localhost:8983/solr/techproducts/select?q=iPod&wt=phps');
    $result = unserialize($serializedResult);
    print_r($result);

Ruby Response Writer

Solr has an optional Ruby response format that extends its JSON output in the following ways to allow the response to be safely evaluated by Ruby's interpreter:

• Ruby's single quoted strings are used to prevent possible string exploits.
• \ and ' are the only two characters escaped.
• Unicode escapes are not used. Data is written as raw UTF-8.
• nil used for null.
• => is used as the key/value separator in maps.

Here is a simple example of how one may query Solr using the Ruby response format:

    require 'net/http'

    h = Net::HTTP.new('localhost', 8983)
    # Net::HTTP#get returns a response object; the payload is in its body
    hresp = h.get('/solr/techproducts/select?q=iPod&wt=ruby')
    rsp = eval(hresp.body)

    puts 'number of matches = ' + rsp['response']['numFound'].to_s
    # print out the name field for each returned document
    rsp['response']['docs'].each { |doc| puts 'name field = ' + doc['name'].to_s }

CSV Response Writer

The CSV response writer returns a list of documents in comma-separated values (CSV) format. Other information that would normally be included in a response, such as facet information, is excluded.

The CSV response writer supports multi-valued fields, as well as pseudo-fields, and the output of this CSV format is compatible with Solr's CSV update format.

CSV Parameters

These parameters specify the CSV format that will be returned. You can accept the default values or specify your own.

    Parameter           Default Value
    csv.encapsulator    "
    csv.escape          None
    csv.separator       ,
    csv.header          Defaults to true. If false, Solr does not print the column headers.
    csv.newline         \n
    csv.null            Defaults to a zero length string. Use this parameter when a document has no value for a particular field.

Multi-Valued Field CSV Parameters

These parameters specify how multi-valued fields are encoded. Per-field overrides for these values can be done using f.<fieldname>.csv.separator=|.

    Parameter              Default Value
    csv.mv.encapsulator    None
    csv.mv.escape          \
    csv.mv.separator       Defaults to the csv.separator value.
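The parameters above are passed like any other request parameter, and the returned text can be read with a standard CSV parser. The Python sketch below shows one way a client might do this. The helpers build_csv_url and parse_csv_rows are hypothetical names introduced for illustration, and SAMPLE is a hand-copied pair of rows in the default format rather than live server output. Splitting multi-valued fields on the separator is a simplification that ignores csv.mv.escape.

```python
import csv
import io
from urllib.parse import urlencode

# Host, port and collection follow the examples used elsewhere in this guide.
SELECT_URL = "http://localhost:8983/solr/techproducts/select"


def build_csv_url(query, fields, csv_params=None):
    """Return a select URL asking for CSV output (hypothetical helper)."""
    params = {"q": query, "fl": ",".join(fields), "wt": "csv"}
    params.update(csv_params or {})
    return SELECT_URL + "?" + urlencode(params)


def parse_csv_rows(text, multi_valued=(), mv_separator=","):
    """Parse CSV text into dicts, splitting the named multi-valued fields.

    Simplification: escaped separators inside a value (csv.mv.escape)
    are not handled here.
    """
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for name in multi_valued:
            row[name] = row[name].split(mv_separator) if row[name] else []
        rows.append(row)
    return rows


# Two rows in the default CSV format; the multi-valued "cat" field is quoted
# because its values are joined with the same separator as the columns.
SAMPLE = (
    "id,cat,name,popularity,price,score\n"
    'IW-02,"electronics,connector",iPod & iPod Mini USB 2.0 Cable,1,11.5,0.98867977\n'
    'MA147LL/A,"electronics,music",Apple 60 GB iPod with Video Playback Black,10,399.0,0.2446348\n'
)

if __name__ == "__main__":
    print(build_csv_url("ipod", ["id", "cat", "name"], {"csv.mv.separator": "|"}))
    for row in parse_csv_rows(SAMPLE, multi_valued=["cat"]):
        print(row["id"], row["cat"], row["price"])
```

All values come back as strings, so numeric fields such as price still need converting on the client side.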
CSV Writer Example

    http://localhost:8983/solr/techproducts/select?q=ipod&fl=id,cat,name,popularity,price,score&wt=csv

returns:

    id,cat,name,popularity,price,score
    IW-02,"electronics,connector",iPod & iPod Mini USB 2.0 Cable,1,11.5,0.98867977
    F8V7067-APL-KIT,"electronics,connector",Belkin Mobile Power Cord for iPod w/ Dock,1,19.95,0.6523595
    MA147LL/A,"electronics,music",Apple 60 GB iPod with Video Playback Black,10,399.0,0.2446348

Velocity Response Writer

The VelocityResponseWriter processes the Solr response and request context through Apache Velocity templating.

See the Velocity Response Writer section for details.

Smile Response Writer

The Smile format is a JSON-compatible binary format, described in detail here: http://wiki.fasterxml.com/SmileFormat.

XLSX Response Writer

Use this to get the response as a spreadsheet in the .xlsx (Microsoft Excel) format. It accepts parameters in the form colwidth.<field-name> and colname.<field-name>, which help you customize the column widths and column names.

This response writer has been added as part of the extraction library, and will only work if the extraction contrib is present in the server classpath. Defining the classpath with the lib directive is not sufficient. Instead, you will need to copy the necessary .jars to the Solr webapp's lib directory manually. You can run these commands from your $SOLR_INSTALL directory:

    cp contrib/extraction/lib/*.jar server/solr-webapp/webapp/WEB-INF/lib/
    cp dist/solr-cell-7.3.0.jar server/solr-webapp/webapp/WEB-INF/lib/

Once the libraries are in place, you can add wt=xlsx to your request, and results will be returned as an XLSX sheet.

Velocity Response Writer

The VelocityResponseWriter is an optional plugin available in the contrib/velocity directory. It powers the /browse user interfaces when using configurations such as "_default", "techproducts", and "example/files".
Its JAR and dependencies must be added (via <lib> or solr/home lib inclusion), and must be registered in solrconfig.xml like this:

    <queryResponseWriter name="velocity" class="solr.VelocityResponseWriter">
      <str name="template.base.dir">${velocity.template.base.dir:}</str>

    <!--
      <str name="init.properties.file">velocity-init.properties</str>
      <bool name="params.resource.loader.enabled">true</bool>
      <bool name="solr.resource.loader.enabled">false</bool>
      <lst name="tools">
        <str name="mytool">com.example.MyCustomTool</str>
      </lst>
    -->
    </queryResponseWriter>

The above example shows the optional initialization and custom tool parameters used by VelocityResponseWriter; these are detailed below. These initialization parameters are only specified in the writer registration in solrconfig.xml, not as request-time parameters. See further below for request-time parameters.

Configuration & Usage

VelocityResponseWriter Initialization Parameters

template.base.dir
If specified and exists as a file system directory, a file resource loader will be added for this directory. Templates in this directory will override "solr" resource loader templates.

init.properties.file
Specifies a properties file name which must exist in the Solr conf/ directory (not under a velocity/ subdirectory) or root of a JAR file in a <lib>.

params.resource.loader.enabled
The "params" resource loader allows templates to be specified in Solr request parameters. For example:

    http://localhost:8983/solr/gettingstarted/select?q=*:*&wt=velocity&v.template=custom&v.template.custom=CUSTOM%3A%20%23core_name

where v.template=custom says to render a template called "custom" and the value of v.template.custom is the custom template. This is false by default; enabling it is only needed for niche, unusual use cases.

solr.resource.loader.enabled
The "solr" resource loader is the only template loader registered by default. Templates are served from resources visible to the SolrResourceLoader under a velocity/ subdirectory. The VelocityResponseWriter itself has some built-in templates (in its JAR file, under velocity/) that are available automatically through this loader.
These built-in templates can be overridden when the same template name is in conf/velocity/ or by using the template.base.dir option.

tools
External "tools" can be specified as a list of string name/value (tool name / class name) pairs. Tools, in the Velocity context, are simply Java objects. Tool classes are constructed using a no-arg constructor (or a single-SolrCore-arg constructor if it exists) and added to the Velocity context with the specified name.

A custom registered tool can override the built-in context objects with the same name, except for $request, $response, $page, and $debug (these tools are designed to not be overridden).

VelocityResponseWriter Request Parameters

v.template
Specifies the name of the template to render.

v.layout
Specifies a template name to use as the layout around the main template (the one named by v.template).

The main template is rendered into a string value that is included in the layout rendering as $content.

v.layout.enabled
Determines if the main template should have a layout wrapped around it. The default is true, but it requires v.layout to be specified as well.

v.contentType
Specifies the content type used in the HTTP response. If not specified, the default will depend on whether v.json is specified or not.

The default without v.json=wrf: text/html;charset=UTF-8.

The default with v.json=wrf: application/json;charset=UTF-8.

v.json
Specifies a function name to wrap around the response rendered as JSON. If specified, the content type used in the response will be "application/json;charset=UTF-8", unless overridden by v.contentType.

Output will be in this format (with v.json=wrf):

    wrf("result":"<Velocity generated response string, with quotes and backslashes escaped>")

v.locale
Locale to use with the $resource tool and other LocaleConfig implementing tools. The default locale is Locale.ROOT.
Localized resources are loaded from standard Java resource bundles named resources[_locale-code].properties.

Resource bundles can be added by providing a JAR file visible by the SolrResourceLoader with resource bundles under a velocity sub-directory. Resource bundles are not loadable under conf/, as only the class loader aspect of SolrResourceLoader can be used here.

v.template.template_name
When the "params" resource loader is enabled, templates can be specified as part of the Solr request.

VelocityResponseWriter Context Objects

    Context Reference    Description
    request              SolrQueryRequest javadocs
    response             QueryResponse most of the time, but in some cases where QueryResponse doesn't like the request handler's output (AnalysisRequestHandler, for example, causes a ClassCastException parsing "response"), the response will be a SolrResponseBase object.
    esc                  A Velocity EscapeTool instance
    date                 A Velocity ComparisonDateTool instance
    list                 A Velocity ListTool instance
    math                 A Velocity MathTool instance
    number               A Velocity NumberTool instance
    sort                 A Velocity SortTool instance
    display              A Velocity DisplayTool instance
    resource             A Velocity ResourceTool instance
    engine               The current VelocityEngine instance
    page                 An instance of Solr's PageTool (only included if the response is a QueryResponse where paging makes sense)
    debug                A shortcut to the debug part of the response, or null if debug is not on. This is handy for having debug-only sections in a template using #if($debug)…#end
    content              The rendered output of the main template, when rendering the layout (v.layout.enabled=true and v.layout=