Apache Solr Reference Guide
For Solr 7.7
Written by the Apache Lucene/Solr Project
Published 2019-03-04
Table of Contents
Apache Solr Reference Guide
About This Guide
  Hosts and Port Examples
  Directory Paths
  API Examples
  Special Inline Notes
Getting Started
  Solr Tutorial
  A Quick Overview
  Solr System Requirements
  Installing Solr
Deployment and Operations
  Solr Control Script Reference
  Solr Configuration Files
  Taking Solr to Production
  Making and Restoring Backups
  Running Solr on HDFS
  SolrCloud on AWS EC2
  Upgrading a Solr Cluster
  Solr Upgrade Notes
Using the Solr Administration User Interface
  Overview of the Solr Admin UI
  Logging
  Cloud Screens
  Collections / Core Admin
  Java Properties
  Thread Dump
  Suggestions Screen
  Collection-Specific Tools
  Core-Specific Tools
Documents, Fields, and Schema Design
  Overview of Documents, Fields, and Schema Design
  Solr Field Types
  Defining Fields
  Copying Fields
  Dynamic Fields
  Other Schema Elements
  Schema API
  Putting the Pieces Together
  DocValues
  Schemaless Mode
Understanding Analyzers, Tokenizers, and Filters
  Using Analyzers, Tokenizers, and Filters
  Analyzers
  About Tokenizers
  About Filters
  Tokenizers
  Filter Descriptions
  CharFilterFactories
  Language Analysis
  Phonetic Matching
  Running Your Analyzer
Indexing and Basic Data Operations
  Indexing Using Client APIs
  Introduction to Solr Indexing
  Post Tool
  Uploading Data with Index Handlers
  Uploading Data with Solr Cell using Apache Tika
  Uploading Structured Data Store Data with the Data Import Handler
  Updating Parts of Documents
  Detecting Languages During Indexing
  De-Duplication
  Content Streams
Searching
  Overview of Searching in Solr
  Velocity Search UI
  Relevance
  Query Syntax and Parsing
  JSON Request API
  JSON Facet API
  Faceting
  Highlighting
  Spell Checking
  Query Re-Ranking
  Transforming Result Documents
  Suggester
  MoreLikeThis
  Pagination of Results
  Collapse and Expand Results
  Result Grouping
  Result Clustering
  Spatial Search
  The Terms Component
  The Term Vector Component
  The Stats Component
  The Query Elevation Component
  The Tagger Handler
  Response Writers
  Near Real Time Searching
  RealTime Get
  Exporting Result Sets
  Parallel SQL Interface
  Analytics Component
Streaming Expressions
  Stream Language Basics
  Types of Streaming Expressions
  Stream Source Reference
  Stream Decorator Reference
  Stream Evaluator Reference
  Math Expressions
  Graph Traversal
SolrCloud
  Getting Started with SolrCloud
  How SolrCloud Works
  SolrCloud Resilience
  SolrCloud Configuration and Parameters
  Rule-based Replica Placement
  Cross Data Center Replication (CDCR)
  SolrCloud Autoscaling
  Colocating Collections
Legacy Scaling and Distribution
  Introduction to Scaling and Distribution
  Distributed Search with Index Sharding
  Index Replication
  Combining Distribution and Replication
  Merging Indexes
The Well-Configured Solr Instance
  Configuring solrconfig.xml
  Solr Cores and solr.xml
  Configuration APIs
  Implicit RequestHandlers
  Solr Plugins
  JVM Settings
  v2 API
Monitoring Solr
  Metrics Reporting
  Metrics History
  MBean Request Handler
  Configuring Logging
  Using JMX with Solr
  Monitoring Solr with Prometheus and Grafana
  Performance Statistics Reference
Securing Solr
  Authentication and Authorization Plugins
  Enabling SSL
Client APIs
  Introduction to Client APIs
  Choosing an Output Format
  Client API Lineup
  Using JavaScript
  Using Python
  Using SolrJ
  Using Solr From Ruby
Further Assistance
Solr Glossary
  Solr Terms
Errata
  Errata For This Documentation
How to Contribute to Solr Documentation
Licenses
Licensed to the Apache Software Foundation (ASF) under one or more contributor
license agreements. See the NOTICE file distributed with this work for additional
information regarding copyright ownership. The ASF licenses this file to you under
the Apache License, Version 2.0 (the "License"); you may not use this file except in
compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0.
Unless required by applicable law or agreed to in writing, software distributed under the License is
distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied. See the License for the specific language governing permissions and limitations under the License.
Apache and the Apache feather logo are trademarks of The Apache Software Foundation. Apache Lucene,
Apache Solr and their respective logos are trademarks of the Apache Software Foundation. Please see the
Apache Trademark Policy for more information.
© 2019, Apache Software Foundation
Apache Solr Reference Guide
Welcome to Apache Solr™, the open source solution for search and analytics.
Solr is the fast open source search platform built on Apache Lucene™ that provides scalable indexing and
search, as well as faceting, hit highlighting and advanced analysis/tokenization capabilities. Solr and Lucene
are managed by the Apache Software Foundation.
This Reference Guide is the official Solr documentation, written and published by Lucene/Solr committers.
The Guide includes the following sections:
Getting Started with Solr
The Getting Started section guides you through the installation and setup of Solr. A detailed tutorial
for first-time users shows many of Solr’s features.
Using the Solr Administration User Interface: This section introduces the Web-based interface for
administering Solr. From your browser you can view configuration files, submit queries, view logfile
settings and Java environment settings, and monitor and control distributed configurations.
Deploying Solr
Deployment and Operations: Once you have Solr configured, you want to deploy it to production and
keep it up to date. This section explains how to take Solr to production, run it in HDFS or AWS, upgrade
Solr, and manage Solr from the command line.
Monitoring Solr: Solr includes options for keeping an eye on the performance of your Solr cluster with
the web-based administration console, through the command line interface, or using REST APIs.
Indexing Documents
Indexing and Basic Data Operations: This section describes the indexing process and basic index
operations, such as commit, optimize, and rollback.
Documents, Fields, and Schema Design: This section describes how Solr organizes data in the index.
It explains how a Solr schema defines the fields and field types which Solr uses to organize data within
the document files it indexes.
Understanding Analyzers, Tokenizers, and Filters: This section explains how Solr prepares text for
indexing and searching. Analyzers parse text and produce a stream of tokens, lexical units used for
indexing and searching. Tokenizers break field data down into tokens. Filters perform other
transformational or selective work on token streams.
Searching Documents
Searching: This section presents an overview of the search process in Solr. It describes the main
components used in searches, including request handlers, query parsers, and response writers. It lists
the query parameters that can be passed to Solr, and it describes features such as boosting and
faceting, which can be used to fine-tune search results.
Streaming Expressions: A stream processing language for Solr, with a suite of functions to perform
many types of queries and parallel execution tasks.
Client APIs: This section tells you how to access Solr through various client APIs, including JavaScript,
JSON, and Ruby.
Scaling Solr
SolrCloud: This section describes SolrCloud, which provides comprehensive distributed capabilities.
Legacy Scaling and Distribution: This section tells you how to grow a Solr distribution by dividing a
large index into sections called shards, which are then distributed across multiple servers, or by
replicating a single index across multiple services.
Advanced Configuration
Securing Solr: When planning how to secure Solr, you should consider which of the available features
or approaches are right for you.
The Well-Configured Solr Instance: This section discusses performance tuning for Solr. It begins with
an overview of the solrconfig.xml file, then tells you how to configure cores with solr.xml, how to
configure the Lucene index writer, and more.
About This Guide
This guide describes all of the important features and functions of Apache Solr.
Solr is free to download from http://lucene.apache.org/solr/.
Designed to provide high-level documentation, this guide is intended to be more encyclopedic and less of a
cookbook. It is structured to address a broad spectrum of needs, ranging from new developers getting
started to well-experienced developers extending their application or troubleshooting. It will be of use at
any point in the application life cycle, for whenever you need authoritative information about Solr.
The material as presented assumes that you are familiar with some basic search concepts and that you can
read XML. It does not assume that you are a Java programmer, although knowledge of Java is helpful when
working directly with Lucene or when developing custom extensions to a Lucene/Solr installation.
Hosts and Port Examples
The default port when running Solr is 8983. The samples, URLs and screenshots in this guide may show
different ports, because the port number that Solr uses is configurable.
If you have not customized your installation of Solr, please make sure that you use port 8983 when following
the examples, or configure your own installation to use the port numbers shown in the examples. For
information about configuring port numbers, see the section Monitoring Solr.
Similarly, URL examples use localhost throughout; if you are accessing Solr from a location remote to the
server hosting Solr, replace localhost with the proper domain or IP where Solr is running.
For example, we might provide a sample query like:
http://localhost:8983/solr/gettingstarted/select?q=brown+cow
There are several items in this URL you might need to change locally. First, if your server is running at
"www.example.com", you’ll replace "localhost" with the proper domain. If you aren’t using port 8983, you’ll
replace that also. Finally, you’ll want to replace "gettingstarted" (the collection or core name) with the
proper one in use in your implementation. The URL would then become:
http://www.example.com/solr/mycollection/select?q=brown+cow
Directory Paths
Path information is given relative to solr.home, which is the location under the main Solr installation where
Solr’s collections and their conf and data directories are stored.
In many cases, this is in the server/solr directory of your installation. However, there can be exceptions,
particularly if your installation has customized this.
In several places in this Guide, our examples are built from the "techproducts" example (i.e., you have
started Solr with the command bin/solr -e techproducts). In this case, solr.home will be a sub-directory
of the example/ directory created for you automatically.
See also the section Solr Home for further details on what is contained in this directory.
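If you want to confirm where solr.home is on your system, a quick directory listing is usually enough. The sketch below assumes a default installation that has not relocated solr.home; the exact contents may differ slightly between releases:

~$ ls solr-7.7.0/server/solr
configsets  README.txt  solr.xml  zoo.cfg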
API Examples
Solr has two styles of APIs that currently co-exist. The first has grown somewhat organically as Solr has
developed over time, but the second, referred to as the "V2 API", redesigns many of the original APIs with a
modernized and self-documenting API interface.
In many cases, but not all, the parameters and outputs of API calls are the same between the two styles. In
all cases the paths and endpoints used are different.
Throughout this Guide, we have added examples of both styles with sections labeled "V1 API" and "V2 API".
As of the 7.2 version of this Guide, these examples are not yet complete - more coverage will be added as
future versions of the Guide are released.
The section V2 API provides more information about how to work with the new API structure, including how
to disable it if you choose to do so.
All APIs return a response header that includes the status of the request and the time to process it. Some
APIs will also include the parameters used for the request. Many of the examples in this Guide omit this
header information, which you can do locally by adding the parameter omitHeader=true to any request.
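For example, to reproduce the trimmed-down responses shown throughout the Guide, you can add that parameter to a request yourself. A minimal sketch, assuming the "techproducts" collection used in the tutorial:

curl "http://localhost:8983/solr/techproducts/select?q=*:*&omitHeader=true"

The response then contains only the response block, without the responseHeader.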
Special Inline Notes
Special notes are included throughout these pages. There are several types of notes:
Information blocks provide additional information that’s useful for you to know.
Important blocks provide information that we want to make sure you are aware of.
Tip blocks provide helpful tips.
Caution blocks provide details on scenarios or configurations you should be careful with.
Warning blocks are used to warn you about a potentially dangerous change or action.
Getting Started
Solr makes it easy for programmers to develop sophisticated, high-performance
search applications with advanced features.
This section introduces you to the basic Solr architecture and features to help you get up and running
quickly. It covers the following topics:
Solr Tutorial: This tutorial covers getting Solr up and running.
A Quick Overview: A high-level overview of how Solr works.
Solr System Requirements: The system requirements for running Solr.
Installing Solr: A walkthrough of the Solr installation process.
Solr Tutorial
This tutorial covers getting Solr up and running, ingesting a variety of data sources into Solr collections, and
getting a feel for the Solr administrative and search interfaces.
The tutorial is organized into three sections that each build on the one before it. The first exercise will ask
you to start Solr, create a collection, index some basic documents, and then perform some searches.
The second exercise works with a different set of data, and explores requesting facets with the dataset.
The third exercise encourages you to begin to work with your own data and start a plan for your
implementation.
Finally, we’ll introduce spatial search and show you how to get your Solr instance back into a clean state.
Before You Begin
To follow along with this tutorial, you will need…
1. To meet the system requirements
2. An Apache Solr release download. This tutorial is designed for Apache Solr 7.7.
For best results, please run the browser showing this tutorial and the Solr server on the same machine so
tutorial links will correctly point to your Solr server.
Unpack Solr
Begin by unzipping the Solr release and changing your working directory to the subdirectory where Solr was
installed. For example, with a shell in UNIX, Cygwin, or MacOS:
~$ ls solr*
solr-7.7.0.zip
~$ unzip -q solr-7.7.0.zip
~$ cd solr-7.7.0/
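If you downloaded the .tgz archive instead of the .zip, the equivalent steps look like this (a sketch; substitute the exact archive name you downloaded):

~$ tar xzf solr-7.7.0.tgz
~$ cd solr-7.7.0/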
If you’d like to know more about Solr’s directory layout before moving to the first exercise, see the section
Directory Layout for details.
Exercise 1: Index Techproducts Example Data
This exercise will walk you through how to start Solr as a two-node cluster (both nodes on the same
machine) and create a collection during startup. Then you will index some sample data that ships with Solr
and do some basic searches.
Launch Solr in SolrCloud Mode
To launch Solr, run: bin/solr start -e cloud on Unix or MacOS; bin\solr.cmd start -e cloud on
Windows.
This will start an interactive session that will start two Solr "servers" on your machine. This command has an
option to run without prompting you for input (-noprompt), but we want to modify two of the defaults so we
won’t use that option now.
solr-7.7.0:$ ./bin/solr start -e cloud
Welcome to the SolrCloud example!
This interactive session will help you launch a SolrCloud cluster on your local workstation.
To begin, how many Solr nodes would you like to run in your local cluster? (specify 1-4 nodes)
[2]:
The first prompt asks how many nodes we want to run. Note the [2] at the end of the last line; that is the
default number of nodes. Two is what we want for this example, so you can simply press enter.
Ok, let's start up 2 Solr nodes for your example SolrCloud cluster.
Please enter the port for node1 [8983]:
This will be the port that the first node runs on. Unless you know you have something else running on port
8983 on your machine, accept this default option also by pressing enter. If something is already using that
port, you will be asked to choose another port.
Please enter the port for node2 [7574]:
This is the port the second node will run on. Again, unless you know you have something else running on
port 7574 on your machine, accept this default option also by pressing enter. If something is already using
that port, you will be asked to choose another port.
Solr will now initialize itself and start running on those two nodes. The script will print the commands it uses
for your reference.
Starting up 2 Solr nodes for your example SolrCloud cluster.
Creating Solr home directory /solr-7.7.0/example/cloud/node1/solr
Cloning /solr-7.7.0/example/cloud/node1 into
/solr-7.7.0/example/cloud/node2
Starting up Solr on port 8983 using command:
"bin/solr" start -cloud -p 8983 -s "example/cloud/node1/solr"
Waiting up to 180 seconds to see Solr running on port 8983 [\]
Started Solr server on port 8983 (pid=34942). Happy searching!
Starting up Solr on port 7574 using command:
"bin/solr" start -cloud -p 7574 -s "example/cloud/node2/solr" -z localhost:9983
Waiting up to 180 seconds to see Solr running on port 7574 [\]
Started Solr server on port 7574 (pid=35036). Happy searching!
INFO - 2017-07-27 12:28:02.835; org.apache.solr.client.solrj.impl.ZkClientClusterStateProvider;
Cluster at localhost:9983 ready
Notice that two instances of Solr have started on two nodes. Because we are starting in SolrCloud mode, and
did not define any details about an external ZooKeeper cluster, Solr launches its own ZooKeeper and
connects both nodes to it.
After startup is complete, you’ll be prompted to create a collection to use for indexing data.
Now let's create a new collection for indexing documents in your 2-node cluster.
Please provide a name for your new collection: [gettingstarted]
Here’s the first place where we’ll deviate from the default options. This tutorial will ask you to index some
sample data included with Solr, called the "techproducts" data. Let’s name our collection "techproducts" so
it’s easy to differentiate from other collections we’ll create later. Enter techproducts at the prompt and hit
enter.
How many shards would you like to split techproducts into? [2]
This is asking how many shards you want to split your index into across the two nodes. Choosing "2" (the
default) means we will split the index relatively evenly across both nodes, which is a good way to start.
Accept the default by hitting enter.
How many replicas per shard would you like to create? [2]
A replica is a copy of the index that’s used for failover (see also the Solr Glossary definition). Again, the
default of "2" is fine to start with here also, so accept the default by hitting enter.
Please choose a configuration for the techproducts collection, available options are:
_default or sample_techproducts_configs [_default]
We’ve reached another point where we will deviate from the default option. Solr has two sample sets of
configuration files (called a configSet) available out-of-the-box.
A collection must have a configSet, which at a minimum includes the two main configuration files for Solr:
the schema file (named either managed-schema or schema.xml), and solrconfig.xml. The question here is
which configSet you would like to start with. The _default is a bare-bones option, but note there’s one
whose name includes "techproducts", the same as we named our collection. This configSet is specifically
designed to support the sample data we want to use, so enter sample_techproducts_configs at the prompt
and hit enter.
At this point, Solr will create the collection and again output to the screen the commands it issues.
Uploading /solr-7.7.0/server/solr/configsets/_default/conf for config techproducts to ZooKeeper
at localhost:9983
Connecting to ZooKeeper at localhost:9983 ...
INFO - 2017-07-27 12:48:59.289; org.apache.solr.client.solrj.impl.ZkClientClusterStateProvider;
Cluster at localhost:9983 ready
Uploading /solr-7.7.0/server/solr/configsets/sample_techproducts_configs/conf for config
techproducts to ZooKeeper at localhost:9983
Creating new collection 'techproducts' using command:
http://localhost:8983/solr/admin/collections?action=CREATE&name=techproducts&numShards=2&replicationFactor=2&maxShardsPerNode=2&collection.configName=techproducts
{
"responseHeader":{
"status":0,
"QTime":5460},
"success":{
"192.168.0.110:7574_solr":{
"responseHeader":{
"status":0,
"QTime":4056},
"core":"techproducts_shard1_replica_n1"},
"192.168.0.110:8983_solr":{
"responseHeader":{
"status":0,
"QTime":4056},
"core":"techproducts_shard2_replica_n2"}}}
Enabling auto soft-commits with maxTime 3 secs using the Config API
POSTing request to Config API: http://localhost:8983/solr/techproducts/config
{"set-property":{"updateHandler.autoSoftCommit.maxTime":"3000"}}
Successfully set-property updateHandler.autoSoftCommit.maxTime to 3000
SolrCloud example running, please visit: http://localhost:8983/solr
Congratulations! Solr is ready for data!
You can see that Solr is running by launching the Solr Admin UI in your web browser: http://localhost:8983/solr/.
This is the main starting point for administering Solr.
Solr will now be running two "nodes", one on port 7574 and one on port 8983. There is one collection
created automatically, techproducts, a two shard collection, each with two replicas.
The Cloud tab in the Admin UI diagrams the collection nicely:
SolrCloud Diagram
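If you'd rather check the layout from the command line than from the Admin UI, the Collections API can report the same information. A sketch using the CLUSTERSTATUS action, which returns the shards and replicas of every collection as JSON:

curl "http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS"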
Index the Techproducts Data
Your Solr server is up and running, but it doesn’t contain any data yet, so we can’t do any queries.
Solr includes the bin/post tool in order to facilitate indexing various types of documents easily. We’ll use
this tool for the indexing examples below.
You’ll need a command shell to run some of the following examples, rooted in the Solr install directory; the
shell from where you launched Solr works just fine.
Currently the bin/post tool does not have a comparable Windows script, but the
underlying Java program invoked is available. We’ll show examples below for Windows, but
you can also see the Windows section of the Post Tool documentation for more details.
The data we will index is in the example/exampledocs directory. The documents are in a mix of document
formats (JSON, CSV, etc.), and fortunately we can index them all at once:
Linux/Mac
solr-7.7.0:$ bin/post -c techproducts example/exampledocs/*
Windows
C:\solr-7.7.0> java -jar -Dc=techproducts -Dauto example\exampledocs\post.jar example\exampledocs\*
You should see output similar to the following:
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/techproducts/update...
Entering auto mode. File endings considered are
xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file books.csv (text/csv) to [base]
POSTing file books.json (application/json) to [base]/json/docs
POSTing file gb18030-example.xml (application/xml) to [base]
POSTing file hd.xml (application/xml) to [base]
POSTing file ipod_other.xml (application/xml) to [base]
POSTing file ipod_video.xml (application/xml) to [base]
POSTing file manufacturers.xml (application/xml) to [base]
POSTing file mem.xml (application/xml) to [base]
POSTing file money.xml (application/xml) to [base]
POSTing file monitor.xml (application/xml) to [base]
POSTing file monitor2.xml (application/xml) to [base]
POSTing file more_books.jsonl (application/json) to [base]/json/docs
POSTing file mp500.xml (application/xml) to [base]
POSTing file post.jar (application/octet-stream) to [base]/extract
POSTing file sample.html (text/html) to [base]/extract
POSTing file sd500.xml (application/xml) to [base]
POSTing file solr-word.pdf (application/pdf) to [base]/extract
POSTing file solr.xml (application/xml) to [base]
POSTing file test_utf8.sh (application/octet-stream) to [base]/extract
POSTing file utf8-example.xml (application/xml) to [base]
POSTing file vidcard.xml (application/xml) to [base]
21 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/techproducts/update...
Time spent: 0:00:00.822
Congratulations again! You have data in your Solr!
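If you want to double-check that the documents really made it in, a quick count query is handy. In this sketch, rows=0 asks Solr to return no documents at all, only the numFound total in the response:

curl "http://localhost:8983/solr/techproducts/select?q=*:*&rows=0"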
Now we’re ready to start searching.
Basic Searching
Solr can be queried via REST clients, curl, wget, Chrome POSTMAN, etc., as well as via native clients available
for many programming languages.
The Solr Admin UI includes a query builder interface via the Query tab for the techproducts collection (at
http://localhost:8983/solr/#/techproducts/query). If you click the [ Execute Query ] button without
changing anything in the form, you’ll get 10 documents in JSON format:
Query Screen
The URL sent by the Admin UI to Solr is shown in light grey near the top right of the above screenshot. If you
click on it, your browser will show you the raw response.
To use curl, give the same URL shown in your browser in quotes on the command line:
curl "http://localhost:8983/solr/techproducts/select?indent=on&q=*:*"
What’s happening here is that we are using Solr’s query parameter (q) with a special syntax that requests all
documents in the index (*:*). Not all of the documents are returned to us, however, because of the default
for a parameter called rows, which you can see in the form is 10. You can change the parameter in the UI or
in the defaults if you wish.
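For example, to pull back just the first three matching documents you can set rows directly on the request; a minimal sketch:

curl "http://localhost:8983/solr/techproducts/select?q=*:*&rows=3"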
Solr has very powerful search options, and this tutorial won’t be able to cover all of them. But we can cover
some of the most common types of queries.
Search for a Single Term
To search for a term, enter it as the q parameter value in the Solr Admin UI Query screen, replacing *:* with
the term you want to find.
Enter "foundation" and hit [ Execute Query ] again.
If you prefer curl, enter something like this:
curl "http://localhost:8983/solr/techproducts/select?q=foundation"
You’ll see something like this:
{
"responseHeader":{
"zkConnected":true,
"status":0,
"QTime":8,
"params":{
"q":"foundation"}},
"response":{"numFound":4,"start":0,"maxScore":2.7879646,"docs":[
{
"id":"0553293354",
"cat":["book"],
"name":"Foundation",
"price":7.99,
"price_c":"7.99,USD",
"inStock":true,
"author":"Isaac Asimov",
"author_s":"Isaac Asimov",
"series_t":"Foundation Novels",
"sequence_i":1,
"genre_s":"scifi",
"_version_":1574100232473411586,
"price_c____l_ns":799}]
}}
The response indicates that there are 4 hits ("numFound":4). We’ve only included one document in the above
sample output, but since 4 hits is lower than the rows parameter default of 10 documents to return, you should see
all 4 of them.
Note the responseHeader before the documents. This header will include the parameters you have set for
the search. By default it shows only the parameters you have set for this query, which in this case is only
your query term.
The documents we got back include all the fields for each document that were indexed. This is, again,
default behavior. If you want to restrict the fields in the response, you can use the fl parameter, which takes
a comma-separated list of field names. This is one of the available fields on the query form in the Admin UI.
Put "id" (without quotes) in the "fl" box and hit [ Execute Query ] again. Or, to specify it with curl:
curl "http://localhost:8983/solr/techproducts/select?q=foundation&fl=id"
You should only see the IDs of the matching records returned.
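The fl parameter accepts any comma-separated list of stored field names, so you can return a few fields without returning everything. A sketch using field names that appear in the sample data:

curl "http://localhost:8983/solr/techproducts/select?q=foundation&fl=id,name,price"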
Field Searches
All Solr queries look for documents using some field. Often you want to query across multiple fields at the
same time, and this is what we’ve done so far with the "foundation" query. This is possible with the use of
copy fields, which are set up already with this set of configurations. We’ll cover copy fields a little bit more in
Exercise 2.
Sometimes, though, you want to limit your query to a single field. This can make your queries more efficient
and the results more relevant for users.
Much of the data in our small sample data set is related to products. Let’s say we want to find all the
"electronics" products in the index. In the Query screen, enter "electronics" (without quotes) in the q box
and hit [ Execute Query ]. You should get 14 results, such as:
{
"responseHeader":{
"zkConnected":true,
"status":0,
"QTime":6,
"params":{
"q":"electronics"}},
"response":{"numFound":14,"start":0,"maxScore":1.5579545,"docs":[
{
"id":"IW-02",
"name":"iPod & iPod Mini USB 2.0 Cable",
"manu":"Belkin",
"manu_id_s":"belkin",
"cat":["electronics",
"connector"],
"features":["car power adapter for iPod, white"],
"weight":2.0,
"price":11.5,
"price_c":"11.50,USD",
"popularity":1,
"inStock":false,
"store":"37.7752,-122.4232",
"manufacturedate_dt":"2006-02-14T23:55:59Z",
"_version_":1574100232554151936,
"price_c____l_ns":1150}]
}}
This search finds all documents that contain the term "electronics" anywhere in the indexed fields. However,
we can see from the above there is a cat field (for "category"). If we limit our search for only documents
with the category "electronics", the results will be more precise for our users.
Update your query in the q field of the Admin UI so it’s cat:electronics. Now you get 12 results:
{
"responseHeader":{
"zkConnected":true,
"status":0,
"QTime":6,
"params":{
"q":"cat:electronics"}},
"response":{"numFound":12,"start":0,"maxScore":0.9614112,"docs":[
{
"id":"SP2514N",
"name":"Samsung SpinPoint P120 SP2514N - hard drive - 250 GB - ATA-133",
"manu":"Samsung Electronics Co. Ltd.",
"manu_id_s":"samsung",
"cat":["electronics",
"hard drive"],
"features":["7200RPM, 8MB cache, IDE Ultra ATA-133",
"NoiseGuard, SilentSeek technology, Fluid Dynamic Bearing (FDB) motor"],
"price":92.0,
"price_c":"92.0,USD",
"popularity":6,
"inStock":true,
"manufacturedate_dt":"2006-02-13T15:26:37Z",
"store":"35.0752,-97.032",
"_version_":1574100232511160320,
"price_c____l_ns":9200}]
}}
Using curl, this query would look like this:
curl "http://localhost:8983/solr/techproducts/select?q=cat:electronics"
Phrase Search
To search for a multi-term phrase, enclose it in double quotes: q="multiple terms here". For example,
search for "CAS latency" by entering that phrase in quotes to the q box in the Admin UI.
If you’re following along with curl, note that the space between terms must be converted to "+" in a URL, as
so:
curl "http://localhost:8983/solr/techproducts/select?q=\"CAS+latency\""
We get 2 results:
{
"responseHeader":{
"zkConnected":true,
"status":0,
"QTime":7,
"params":{
"q":"\"CAS latency\""}},
"response":{"numFound":2,"start":0,"maxScore":5.937691,"docs":[
{
"id":"VDBDB1A16",
"name":"A-DATA V-Series 1GB 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) System Memory
- OEM",
"manu":"A-DATA Technology Inc.",
"manu_id_s":"corsair",
"cat":["electronics",
"memory"],
"features":["CAS latency 3,
2.7v"],
"popularity":0,
"inStock":true,
"store":"45.18414,-93.88141",
"manufacturedate_dt":"2006-02-13T15:26:37Z",
"payloads":"electronics|0.9 memory|0.1",
"_version_":1574100232590852096},
{
"id":"TWINX2048-3200PRO",
"name":"CORSAIR XMS 2GB (2 x 1GB) 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) Dual
Channel Kit System Memory - Retail",
"manu":"Corsair Microsystems Inc.",
"manu_id_s":"corsair",
"cat":["electronics",
"memory"],
"features":["CAS latency 2, 2-3-3-6 timing, 2.75v, unbuffered, heat-spreader"],
"price":185.0,
"price_c":"185.00,USD",
"popularity":5,
"inStock":true,
"store":"37.7752,-122.4232",
"manufacturedate_dt":"2006-02-13T15:26:37Z",
"payloads":"electronics|6.0 memory|3.0",
"_version_":1574100232584560640,
"price_c____l_ns":18500}]
}}
Combining Searches
By default, when you search for multiple terms and/or phrases in a single query, Solr will only require that
one of them is present in order for a document to match. Documents containing more terms will be sorted
higher in the results list.
You can require that a term or phrase is present by prefixing it with a +; conversely, to disallow the presence
of a term or phrase, prefix it with a -.
To find documents that contain both terms "electronics" and "music", enter +electronics +music in the q
box in the Admin UI Query tab.
If you’re using curl, you must encode the + character because it has a reserved purpose in URLs (encoding
the space character). The encoding for + is %2B as in:
curl "http://localhost:8983/solr/techproducts/select?q=%2Belectronics%20%2Bmusic"
You should only get a single result.
To search for documents that contain the term "electronics" but don’t contain the term "music", enter
+electronics -music in the q box in the Admin UI. For curl, again, URL encode + as %2B as in:
curl "http://localhost:8983/solr/techproducts/select?q=%2Belectronics+-music"
This time you get 13 results.
More Information on Searching
We have only scratched the surface of the search options available in Solr. For more Solr search options, see
the section on Searching.
Exercise 1 Wrap Up
At this point, you’ve seen how Solr can index data and have done some basic queries. You can choose now
to continue to the next example which will introduce more Solr concepts, such as faceting results and
managing your schema, or you can strike out on your own.
If you decide not to continue with this tutorial, the data we’ve indexed so far is likely of little value to you.
You can delete your installation and start over, or you can use the bin/solr script we started out with to
delete this collection:
bin/solr delete -c techproducts
And then create a new collection:
bin/solr create -c <yourCollection> -s 2 -rf 2
To stop both of the Solr nodes we started, issue the command:
bin/solr stop -all
For more information on start/stop and collection options with bin/solr, see Solr Control Script Reference.
Exercise 2: Modify the Schema and Index Films Data
This exercise will build on the last one and introduce you to the index schema and Solr’s powerful faceting
features.
Restart Solr
Did you stop Solr after the last exercise? No? Then go ahead to the next section.
If you did, though, and need to restart Solr, issue these commands:
./bin/solr start -c -p 8983 -s example/cloud/node1/solr
This starts the first node. When it’s done, start the second node and tell it how to connect to ZooKeeper:
./bin/solr start -c -p 7574 -s example/cloud/node2/solr -z localhost:9983
If you have defined ZK_HOST in solr.in.sh/solr.in.cmd (see instructions) you can omit -z
from the above command.
Create a New Collection
We’re going to use a whole new data set in this exercise, so it would be better to have a new collection
instead of trying to reuse the one we had before.
One reason for this is we’re going to use a feature in Solr called "field guessing", where Solr attempts to
guess what type of data is in a field while it’s indexing it. It also automatically creates new fields in the
schema for new fields that appear in incoming documents. This mode is called "Schemaless". We’ll see the
benefits and limitations of this approach to help you decide how and where to use it in your real application.
What is a "schema" and why do I need one?
Solr’s schema is a single file (in XML) that stores the details about the fields and field types Solr is
expected to understand. The schema defines not only the field or field type names, but also any
modifications that should happen to a field before it is indexed. For example, if you want to ensure that
a user who enters "abc" and a user who enters "ABC" can both find a document containing the term
"ABC", you will want to normalize (lower-case it, in this case) "ABC" when it is indexed, and normalize
the user query to be sure of a match. These rules are defined in your schema.
Earlier in the tutorial we mentioned copy fields, which are fields made up of data that originated from
other fields. You can also define dynamic fields, which use wildcards (such as *_t or *_s) to dynamically
create fields of a specific field type. These types of rules are also defined in the schema.
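As a rough sketch (the wildcard name and field type here are placeholders, and the films collection referenced in the URL is only created a little further below), a dynamic field rule can be added with the same Schema API we use later in this exercise:
curl -X POST -H 'Content-type:application/json' --data-binary '{"add-dynamic-field": {"name":"*_note", "type":"text_general", "stored":true}}' http://localhost:8983/solr/films/schema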
When you initially started Solr in the first exercise, we had a choice of a configSet to use. The one we chose
had a schema that was pre-defined for the data we later indexed. This time, we’re going to use a configSet
that has a very minimal schema and let Solr figure out from the data what fields to add.
The data you’re going to index is related to movies, so start by creating a collection named "films" that uses
the _default configSet:
bin/solr create -c films -s 2 -rf 2
Whoa, wait. We didn’t specify a configSet! That’s fine, the _default is appropriately named, since it’s the
default and is used if you don’t specify one at all.
We did, however, set two parameters -s and -rf. Those are the number of shards to split the collection
across (2) and how many replicas to create (2). This is equivalent to the options we had during the
interactive example from the first exercise.
You should see output like:
WARNING: Using _default configset. Data driven schema functionality is enabled by default, which is NOT RECOMMENDED for production use.
To turn it off:
bin/solr config -c films -p 7574 -action set-user-property -property update.autoCreateFields -value false
Connecting to ZooKeeper at localhost:9983 ...
INFO - 2017-07-27 15:07:46.191; org.apache.solr.client.solrj.impl.ZkClientClusterStateProvider;
Cluster at localhost:9983 ready
Uploading /7.7.0/server/solr/configsets/_default/conf for config films to ZooKeeper at
localhost:9983
Creating new collection 'films' using command:
http://localhost:7574/solr/admin/collections?action=CREATE&name=films&numShards=2&replicationFactor=2&maxShardsPerNode=2&collection.configName=films
{
"responseHeader":{
"status":0,
"QTime":3830},
"success":{
"192.168.0.110:8983_solr":{
"responseHeader":{
"status":0,
"QTime":2076},
"core":"films_shard2_replica_n1"},
"192.168.0.110:7574_solr":{
"responseHeader":{
"status":0,
"QTime":2494},
"core":"films_shard1_replica_n2"}}}
The first thing the command printed was a warning about not using this configSet in production. That’s due
to some of the limitations we’ll cover shortly.
Otherwise, though, the collection should be created. If we go to the Admin UI at http://localhost:8983/solr/#/films/collection-overview we should see the overview screen.
Preparing Schemaless for the Films Data
There are two parallel things happening with the schema that comes with the _default configSet.
First, we are using a "managed schema", which is configured to only be modified by Solr’s Schema API. That
means we should not hand-edit it so there isn’t confusion about which edits come from which source. Solr’s
Schema API allows us to make changes to fields, field types, and other types of schema rules.
Second, we are using "field guessing", which is configured in the solrconfig.xml file (and includes most of
Solr’s various configuration settings). Field guessing is designed to allow us to start using Solr without
having to define all the fields we think will be in our documents before trying to index them. This is why we
call it "schemaless", because you can start quickly and let Solr create fields for you as it encounters them in
documents.
Sounds great! Well, not really, there are limitations. It’s a bit brute force, and if it guesses wrong, you can’t
change much about a field after data has been indexed without having to re-index. If we only have a few
thousand documents that might not be bad, but if you have millions and millions of documents, or, worse,
don’t have access to the original data anymore, this can be a real problem.
For these reasons, the Solr community does not recommend going to production without a schema that you
have defined yourself. By this we mean that the schemaless features are fine to start with, but you should
still always make sure your schema matches your expectations for how you want your data indexed and how
users are going to query it.
It is possible to mix schemaless features with a defined schema. Using the Schema API, you can define a few
fields that you know you want to control, and let Solr guess others that are less important or which you are
confident (through testing) will be guessed to your satisfaction. That’s what we’re going to do here.
Create the "names" Field
The films data we are going to index has a small number of fields for each movie: an ID, director name(s),
film name, release date, and genre(s).
If you look at one of the files in example/films, you’ll see the first film is named .45, released in 2006. As the
first document in the dataset, Solr is going to guess the field type based on the data in the record. If we go
ahead and index this data, that first film name is going to indicate to Solr that that field type is a "float"
numeric field, and will create a "name" field with a type FloatPointField. All data after this record will be
expected to be a float.
Well, that’s not going to work. We have titles like A Mighty Wind and Chicken Run, which are strings, decidedly not numeric and not floats. If we let Solr guess the "name" field is a float, what will happen is later
titles will cause an error and indexing will fail. That’s not going to get us very far.
What we can do is set up the "name" field in Solr before we index the data to be sure Solr always interprets
it as a string. At the command line, enter this curl command:
curl -X POST -H 'Content-type:application/json' --data-binary '{"add-field": {"name":"name",
"type":"text_general", "multiValued":false, "stored":true}}'
http://localhost:8983/solr/films/schema
This command uses the Schema API to explicitly define a field named "name" that has the field type
"text_general" (a text field). It will not be permitted to have multiple values, but it will be stored (meaning it
can be retrieved by queries).
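If you want to double-check what Solr now knows about the field, the Schema API can also be read back with a plain GET; for example (a quick sanity check, not a step from the original exercise):
curl http://localhost:8983/solr/films/schema/fields/name
The response echoes back the definition you just created.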
You can also use the Admin UI to create fields, but it offers a bit less control over the properties of your field.
It will work for our case, though:
Creating a field
Create a "catchall" Copy Field
There’s one more change to make before we start indexing.
In the first exercise when we queried the documents we had indexed, we didn’t have to specify a field to
search because the configuration we used was set up to copy fields into a text field, and that field was the
default when no other field was defined in the query.
The configuration we’re using now doesn’t have that rule. We would need to define a field to search for
every query. We can, however, set up a "catchall field" by defining a copy field that will take all data from all
fields and index it into a field named _text_. Let’s do that now.
You can use either the Admin UI or the Schema API for this.
At the command line, use the Schema API again to define a copy field:
curl -X POST -H 'Content-type:application/json' --data-binary '{"add-copy-field" :
{"source":"*","dest":"_text_"}}' http://localhost:8983/solr/films/schema
In the Admin UI, choose [ Add Copy Field ], then fill out the source and destination for your field, as in this
screenshot.
Creating a copy field
What this does is make a copy of all fields and put the data into the "_text_" field.
It can be very expensive to do this with your production data because it tells Solr to
effectively index everything twice. It will make indexing slower, and make your index larger.
With your production data, you will want to be sure you only copy fields that really warrant
it for your application.
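For example, if only the film name needed to be searchable by default, a narrower rule (shown here purely as a hypothetical alternative to the catch-all above) would copy just that one field:
curl -X POST -H 'Content-type:application/json' --data-binary '{"add-copy-field": {"source":"name", "dest":"_text_"}}' http://localhost:8983/solr/films/schema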
OK, now we’re ready to index the data and start playing around with it.
Index Sample Film Data
The films data we will index is located in the example/films directory of your installation. It comes in three
formats: JSON, XML and CSV. Pick one of the formats and index it into the "films" collection (in each
example, one command is for Unix/MacOS and the other is for Windows):
To Index JSON Format
bin/post -c films example/films/films.json
C:\solr-7.7.0> java -jar -Dc=films -Dauto example\exampledocs\post.jar example\films\*.json
To Index XML Format
bin/post -c films example/films/films.xml
C:\solr-7.7.0> java -jar -Dc=films -Dauto example\exampledocs\post.jar example\films\*.xml
To Index CSV Format
bin/post -c films example/films/films.csv -params
"f.genre.split=true&f.directed_by.split=true&f.genre.separator=|&f.directed_by.separator=|"
C:\solr-7.7.0> java -jar -Dc=films -Dparams=f.genre.split=true&f.directed_by.split=true&f.genre.separator=|&f.directed_by.separator=| -Dauto example\exampledocs\post.jar example\films\*.csv
Each command includes these main parameters:
• -c films: this is the Solr collection to index data to.
• example/films/films.json (or films.xml or films.csv): this is the path to the data file to index. You
could simply supply the directory where this file resides, but since you know the format you want to
index, specifying the exact file for that format is more efficient.
Note the CSV command includes extra parameters. This is to ensure multi-valued entries in the "genre" and
"directed_by" columns are split by the pipe (|) character, used in this file as a separator. Telling Solr to split
these columns this way will ensure proper indexing of the data.
Each command will produce output similar to the below seen while indexing JSON:
$ ./bin/post -c films example/films/films.json
/bin/java -classpath /solr-7.7.0/dist/solr-core-7.7.0.jar -Dauto=yes -Dc=films -Ddata=files
org.apache.solr.util.SimplePostTool example/films/films.json
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/films/update...
Entering auto mode. File endings considered are
xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file films.json (application/json) to [base]/json/docs
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/films/update...
Time spent: 0:00:00.878
Hooray!
If you go to the Query screen in the Admin UI for films (http://localhost:8983/solr/#/films/query) and hit
[ Execute Query ] you should see 1100 results, with the first 10 returned to the screen.
Let’s do a query to see if the "catchall" field worked properly. Enter "comedy" in the q box and hit [ Execute
Query ] again. You should see 417 results. Feel free to play around with other searches before we move
on to faceting.
Faceting
One of Solr’s most popular features is faceting. Faceting allows the search results to be arranged into
subsets (or buckets, or categories), providing a count for each subset. There are several types of faceting:
field values, numeric and date ranges, pivots (decision tree), and arbitrary query faceting.
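Arbitrary query facets are not walked through below, but as a small sketch (the field and date range are assumptions based on the films data indexed above), you can count how many documents match any query you like with facet.query:
curl "http://localhost:8983/solr/films/select?q=*:*&rows=0&facet=on&facet.query=initial_release_date:%5BNOW-5YEAR%20TO%20NOW%5D"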
Field Facets
In addition to providing search results, a Solr query can return the number of documents that contain each
unique value in the whole result set.
On the Admin UI Query tab, if you check the facet checkbox, you’ll see a few facet-related options appear:
Facet options in the Query screen
To see facet counts from all documents (q=*:*): turn on faceting (facet=true), and specify the field to facet
on via the facet.field parameter. If you only want facets, and no document contents, specify rows=0. The
curl command below will return facet counts for the genre_str field:
curl "http://localhost:8983/solr/films/select?q=*:*&rows=0&facet=true&facet.field=genre_str"
In your terminal, you’ll see something like:
{
"responseHeader":{
"zkConnected":true,
"status":0,
"QTime":11,
"params":{
"q":"*:*",
"facet.field":"genre_str",
"rows":"0",
"facet":"true"}},
"response":{"numFound":1100,"start":0,"maxScore":1.0,"docs":[]
},
"facet_counts":{
"facet_queries":{},
"facet_fields":{
"genre_str":[
"Drama",552,
"Comedy",389,
"Romance Film",270,
"Thriller",259,
"Action Film",196,
"Crime Fiction",170,
"World cinema",167]},
"facet_ranges":{},
"facet_intervals":{},
"facet_heatmaps":{}}}
We’ve truncated the output here a little bit, but in the facet_counts section, you see by default you get a
count of the number of documents using each genre for every genre in the index. Solr has a parameter
facet.mincount that you could use to limit the facets to only those that contain a certain number of
documents (this parameter is not shown in the UI). Or, perhaps you do want all the facets, and you’ll let your
application’s front-end control how it’s displayed to users.
If you wanted to control the number of items in a bucket, you could do something like this:
curl "http://localhost:8983/solr/films/select?q=*:*&facet.field=genre_str&facet.mincount=200&facet=on&rows=0"
You should only see 4 facets returned.
There are a great deal of other parameters available to help you control how Solr constructs the facets and
facet lists. We’ll cover some of them in this exercise, but you can also see the section Faceting for more
detail.
Range Facets
For numerics or dates, it’s often desirable to partition the facet counts into ranges rather than discrete
values. A prime example of numeric range faceting, using the example techproducts data from our previous
exercise, is price. In the /browse UI, it looks like this:
Range facets
The films data includes the release date for films, and we could use that to create date range facets, which
are another common use for range facets.
The Solr Admin UI doesn’t yet support range facet options, so you will need to use curl or similar command
line tool for the following examples.
If we construct a query that looks like this:
curl 'http://localhost:8983/solr/films/select?q=*:*&rows=0'\
'&facet=true'\
'&facet.range=initial_release_date'\
'&facet.range.start=NOW-20YEAR'\
'&facet.range.end=NOW'\
'&facet.range.gap=%2B1YEAR'
This will request all films and ask for them to be grouped by year starting with 20 years ago (our earliest
release date is in 2000) and ending today. Note that this query again URL encodes a + as %2B.
In the terminal you will see:
{
"responseHeader":{
"zkConnected":true,
"status":0,
"QTime":8,
"params":{
"facet.range":"initial_release_date",
"facet.limit":"300",
"q":"*:*",
"facet.range.gap":"+1YEAR",
"rows":"0",
"facet":"on",
"facet.range.start":"NOW-20YEAR",
"facet.range.end":"NOW"}},
"response":{"numFound":1100,"start":0,"maxScore":1.0,"docs":[]
},
"facet_counts":{
"facet_queries":{},
"facet_fields":{},
"facet_ranges":{
"initial_release_date":{
"counts":[
"1997-07-28T17:12:06.919Z",0,
"1998-07-28T17:12:06.919Z",0,
"1999-07-28T17:12:06.919Z",48,
"2000-07-28T17:12:06.919Z",82,
"2001-07-28T17:12:06.919Z",103,
"2002-07-28T17:12:06.919Z",131,
"2003-07-28T17:12:06.919Z",137,
"2004-07-28T17:12:06.919Z",163,
"2005-07-28T17:12:06.919Z",189,
"2006-07-28T17:12:06.919Z",92,
"2007-07-28T17:12:06.919Z",26,
"2008-07-28T17:12:06.919Z",7,
"2009-07-28T17:12:06.919Z",3,
"2010-07-28T17:12:06.919Z",0,
"2011-07-28T17:12:06.919Z",0,
"2012-07-28T17:12:06.919Z",1,
"2013-07-28T17:12:06.919Z",1,
"2014-07-28T17:12:06.919Z",1,
"2015-07-28T17:12:06.919Z",0,
"2016-07-28T17:12:06.919Z",0],
"gap":"+1YEAR",
"start":"1997-07-28T17:12:06.919Z",
"end":"2017-07-28T17:12:06.919Z"}},
"facet_intervals":{},
"facet_heatmaps":{}}}
Pivot Facets
Another faceting type is pivot facets, also known as "decision trees", allowing two or more fields to be
nested for all the various possible combinations. Using the films data, pivot facets can be used to see how
many of the films in the "Drama" category (the genre_str field) are directed by a director. Here’s how to get
at the raw data for this scenario:
curl "http://localhost:8983/solr/films/select?q=*:*&rows=0&facet=on&facet.pivot=genre_str,directed_by_str"
This results in the following response, which shows a facet for each category and director combination:
{"responseHeader":{
"zkConnected":true,
"status":0,
"QTime":1147,
"params":{
"q":"*:*",
"facet.pivot":"genre_str,directed_by_str",
"rows":"0",
"facet":"on"}},
"response":{"numFound":1100,"start":0,"maxScore":1.0,"docs":[]
},
"facet_counts":{
"facet_queries":{},
"facet_fields":{},
"facet_ranges":{},
"facet_intervals":{},
"facet_heatmaps":{},
"facet_pivot":{
"genre_str,directed_by_str":[{
"field":"genre_str",
"value":"Drama",
"count":552,
"pivot":[{
"field":"directed_by_str",
"value":"Ridley Scott",
"count":5},
{
"field":"directed_by_str",
"value":"Steven Soderbergh",
"count":5},
{
"field":"directed_by_str",
"value":"Michael Winterbottom",
"count":4}}]}]}}}
We’ve truncated this output as well - you will see a lot of genres and directors in your screen.
Exercise 2 Wrap Up
In this exercise, we learned a little bit more about how Solr organizes data in the indexes, and how to work
with the Schema API to manipulate the schema file. We also learned a bit about facets in Solr, including
range facets and pivot facets. In both of these things, we’ve only scratched the surface of the available
options. If you can dream it, it might be possible!
Like our previous exercise, this data may not be relevant to your needs. We can clean up our work by
deleting the collection. To do that, issue this command at the command line:
bin/solr delete -c films
Exercise 3: Index Your Own Data
For this last exercise, work with a dataset of your choice. This can be files on your local hard drive, a set of
data you have worked with before, or maybe a sample of the data you intend to index to Solr for your
production application.
This exercise is intended to get you thinking about what you will need to do for your application:
• What sorts of data do you need to index?
• What will you need to do to prepare Solr for your data (such as, create specific fields, set up copy fields,
determine analysis rules, etc.)
• What kinds of search options do you want to provide to users?
• How much testing will you need to do to ensure everything works the way you expect?
Create Your Own Collection
Before you get started, create a new collection, named whatever you’d like. In this example, the collection
will be named "localDocs"; replace that name with whatever name you choose if you want to.
./bin/solr create -c localDocs -s 2 -rf 2
Again, as we saw from Exercise 2 above, this will use the _default configSet and all the schemaless features
it provides. As we noted previously, this may cause problems when we index our data. You may need to
iterate on indexing a few times before you get the schema right.
Indexing Ideas
Solr has lots of ways to index data. Choose one of the approaches below and try it out with your system:
Local Files with bin/post
If you have a local directory of files, the Post Tool (bin/post) can index a directory of files. We saw this in
action in our first exercise.
We used only JSON, XML and CSV in our exercises, but the Post Tool can also handle HTML, PDF, Microsoft
Office formats (such as MS Word), plain text, and more.
In this example, assume there is a directory named "Documents" locally. To index it, we would issue a
command like this (correcting the collection name after the -c parameter as needed):
./bin/post -c localDocs ~/Documents
You may get errors as it works through your documents. These might be caused by the field guessing, or
the file type may not be supported. Indexing content such as this demonstrates the need to plan Solr for
your data, which requires understanding it and perhaps also some trial and error.
DataImportHandler
Solr includes a tool called the Data Import Handler (DIH) which can connect to databases (if you have a
jdbc driver), mail servers, or other structured data sources. There are several examples included for
feeds, GMail, and a small HSQL database.
The README.txt file in example/example-DIH will give you details on how to start working with this tool.
SolrJ
SolrJ is a Java-based client for interacting with Solr. Use SolrJ for JVM-based languages or other Solr clients
to programmatically create documents to send to Solr.
Documents Screen
Use the Admin UI Documents tab (at http://localhost:8983/solr/#/localDocs/documents) to paste in a
document to be indexed, or select Document Builder from the Document Type dropdown to build a
document one field at a time. Click on the [ Submit Document ] button below the form to index your
document.
Updating Data
You may notice that even if you index content in this tutorial more than once, it does not duplicate the
results found. This is because the example Solr schema (a file named either managed-schema or schema.xml)
specifies a uniqueKey field called id. Whenever you POST commands to Solr to add a document with the
same value for the uniqueKey as an existing document, it automatically replaces it for you.
You can see that that has happened by looking at the values for numDocs and maxDoc in the core-specific
Overview section of the Solr Admin UI.
numDocs represents the number of searchable documents in the index (and will be larger than the number
of XML, JSON, or CSV files since some files contained more than one document). The maxDoc value may be
larger as the maxDoc count includes logically deleted documents that have not yet been physically removed
from the index. You can re-post the sample files over and over again as much as you want and numDocs will
never increase, because the new documents will constantly be replacing the old.
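A quick way to see this behavior for yourself (the document id and field values below are made up, and the films collection from Exercise 2 is assumed to still exist) is to post the same id twice and watch numDocs stay the same:
curl -X POST -H 'Content-type:application/json' --data-binary '[{"id":"demo-1","name":"First version"}]' "http://localhost:8983/solr/films/update?commit=true"
curl -X POST -H 'Content-type:application/json' --data-binary '[{"id":"demo-1","name":"Second version"}]' "http://localhost:8983/solr/films/update?commit=true"
Only the second version remains searchable afterwards.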
Go ahead and edit any of the existing example data files, change some of the data, and re-run the PostTool
(bin/post). You’ll see your changes reflected in subsequent searches.
Deleting Data
If you need to iterate a few times to get your schema right, you may want to delete documents to clear out
the collection and try again. Note, however, that merely removing documents doesn’t change the
underlying field definitions. Essentially, this will allow you to re-index your data after making changes to
fields for your needs.
You can delete data by POSTing a delete command to the update URL and specifying the value of the
document’s unique key field, or a query that matches multiple documents (be careful with that one!). We
can use bin/post to delete documents also if we structure the request properly.
Execute the following command to delete a specific document:
bin/post -c localDocs -d "SP2514N "
To delete all documents, you can use the "delete-by-query" command like:
bin/post -c localDocs -d "<delete><query>*:*</query></delete>"
You can also modify the above to only delete documents that match a specific query.
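For instance, a delete limited to documents whose name field contains "test", either through bin/post or by POSTing JSON to the update URL directly, might look like this (the field and value are placeholders):
bin/post -c localDocs -d "<delete><query>name:test</query></delete>"
curl -X POST -H 'Content-type:application/json' --data-binary '{"delete":{"query":"name:test"}}' "http://localhost:8983/solr/localDocs/update?commit=true"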
Exercise 3 Wrap Up
At this point, you’re ready to start working on your own.
Jump ahead to the overall wrap up when you’re ready to stop Solr and remove all the examples you worked
with and start fresh.
Spatial Queries
Solr has sophisticated geospatial support, including searching within a specified distance range of a given
location (or within a bounding box), sorting by distance, or even boosting results by the distance.
Some of the example techproducts documents we indexed in Exercise 1 have locations associated with them
to illustrate the spatial capabilities. To re-index this data, see Exercise 1.
Spatial queries can be combined with any other types of queries, such as in this example of querying for
"ipod" within 10 kilometers from San Francisco:
Spatial queries and results
This is from Solr’s example search UI (called /browse), which has a nice feature to show a map for each item
and allow easy selection of the location to search near. You can see this yourself by going to http://localhost:8983/solr/techproducts/browse?q=ipod&pt=37.7752%2C-122.4232&d=10&sfield=store&fq=%7B%21bbox%7D&queryOpts=spatial&queryOpts=spatial in a browser.
To learn more about Solr’s spatial capabilities, see the section Spatial Search.
Wrapping Up
If you’ve run the full set of commands in this quick start guide you have done the following:
• Launched Solr into SolrCloud mode, two nodes, two collections including shards and replicas
• Indexed several types of files
• Used the Schema API to modify your schema
• Opened the admin console, used its query interface to get results
• Opened the /browse interface to explore Solr’s features in a more friendly and familiar interface
Nice work!
Cleanup
As you work through this tutorial, you may want to stop Solr and reset the environment back to the starting
point. The following command line will stop Solr and remove the directories for each of the two nodes that
were created all the way back in Exercise 1:
bin/solr stop -all ; rm -Rf example/cloud/
Where to next?
This Guide will be your best resource for learning more about Solr.
Solr also has a robust community made up of people happy to help you get started. For more information,
check out the Solr website’s Resources page.
A Quick Overview
Solr is a search server built on top of Apache Lucene, an open source, Java-based, information retrieval
library. It is designed to drive powerful document retrieval applications - wherever you need to serve data to
users based on their queries, Solr can work for you.
Here is an example of how Solr could integrate with an application:
Solr integration with applications
In the scenario above, Solr runs alongside other server applications. For example, an online store application
would provide a user interface, a shopping cart, and a way to make purchases for end users; while an
inventory management application would allow store employees to edit product information. The product
metadata would be kept in some kind of database, as well as in Solr.
Solr makes it easy to add the capability to search through the online store through the following steps:
1. Define a schema. The schema tells Solr about the contents of documents it will be indexing. In the online
store example, the schema would define fields for the product name, description, price, manufacturer,
and so on. Solr’s schema is powerful and flexible and allows you to tailor Solr’s behavior to your
application. See Documents, Fields, and Schema Design for all the details.
2. Feed Solr documents for which your users will search.
3. Expose search functionality in your application.
Because Solr is based on open standards, it is highly extensible. Solr queries are simple HTTP request URLs
and the response is a structured document: mainly JSON, but it could also be XML, CSV, or other formats.
This means that a wide variety of clients will be able to use Solr, from other web applications to browser
clients, rich client applications, and mobile devices. Any platform capable of HTTP can talk to Solr. See Client
APIs for details on client APIs.
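As a small illustration (this assumes the techproducts example from the tutorial is still running), a query and its JSON response are just an HTTP round trip:
curl "http://localhost:8983/solr/techproducts/select?q=name:ipod"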
Solr offers support for the simplest keyword searching through to complex queries on multiple fields and
faceted search results. Searching has more information about searching and queries.
If Solr’s capabilities are not impressive enough, its ability to handle very high-volume applications should do
the trick.
A relatively common scenario is that you have so much data, or so many queries, that a single Solr server is
unable to handle your entire workload. In this case, you can scale up the capabilities of your application
using SolrCloud to better distribute the data, and the processing of requests, across many servers. Multiple
options can be mixed and matched depending on the scalability you need.
For example: "Sharding" is a scaling technique in which a collection is split into multiple logical pieces called
"shards" in order to scale up the number of documents in a collection beyond what could physically fit on a
single server. Incoming queries are distributed to every shard in the collection, which respond with merged
results. Another technique available is to increase the "Replication Factor" of your collection, which allows
you to add servers with additional copies of your collection to handle higher concurrent query load by
spreading the requests around to multiple machines. Sharding and replication are not mutually exclusive,
and together make Solr an extremely powerful and scalable platform.
Best of all, this talk about high-volume applications is not just hypothetical: some of the famous Internet
sites that use Solr today are Macy’s, eBay, and Zappos. For more examples, take a look at
https://wiki.apache.org/solr/PublicServers.
Solr System Requirements
You can install Solr in any system where a suitable Java Runtime Environment (JRE) is available, as detailed
below.
Currently this includes Linux, MacOS/OS X, and Microsoft Windows.
Installation Requirements
Java Requirements
You will need the Java Runtime Environment (JRE) version 1.8 or higher. At a command line, check your Java
version like this:
$ java -version
java version "1.8.0_60"
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
The exact output will vary, but you need to make sure you meet the minimum version requirement. We also
recommend choosing a version that is not end-of-life from its vendor. Oracle or OpenJDK are the most
tested JREs and are recommended. It’s also recommended to use the latest available official release when
possible.
Some versions of Java VM have bugs that may impact your implementation. To be sure, check the page
Lucene JavaBugs.
If you don’t have the required version, or if the java command is not found, download and install the latest
version from Oracle at http://www.oracle.com/technetwork/java/javase/downloads/index.html.
Supported Operating Systems
Solr is tested on several versions of Linux, MacOS, and Windows.
Installing Solr
Installation of Solr on Unix-compatible or Windows servers generally requires simply extracting (or,
unzipping) the download package.
Please be sure to review the Solr System Requirements before starting Solr.
Available Solr Packages
Solr is available from the Solr website. Download the latest release from https://lucene.apache.org/solr/mirrors-solr-latest-redir.html.
There are three separate packages:
• solr-7.7.0.tgz for Linux/Unix/OSX systems
• solr-7.7.0.zip for Microsoft Windows systems
• solr-7.7.0-src.tgz, the Solr source code package. This is useful if you want to develop on Solr without
using the official Git repository.
Preparing for Installation
When getting started with Solr, all you need to do is extract the Solr distribution archive to a directory of
your choosing. This will suffice as an initial development environment, but take care not to overtax this "toy"
installation before setting up your true development and production environments.
When you’ve progressed past initial evaluation of Solr, you’ll want to take care to plan your implementation.
You may need to reinstall Solr on another server or make a clustered SolrCloud environment.
When you’re ready to setup Solr for a production environment, please refer to the instructions provided on
the Taking Solr to Production page.
What Size Server Do I Need?
How to size your Solr installation is a complex question that relies on a number of factors,
including the number and structure of documents, how many fields you intend to store, the
number of users, etc.
It’s highly recommended that you spend a bit of time thinking about the factors that will
impact hardware sizing for your Solr implementation. A very good blog post that discusses
the issues to consider is Sizing Hardware in the Abstract: Why We Don’t have a Definitive
Answer.
One thing to note when planning your installation is that a hard limit exists in Lucene for the number of
documents in a single index: approximately 2.14 billion documents (2,147,483,647 to be exact). In practice, it
is highly unlikely that such a large number of documents would fit and perform well in a single index, and
you will likely need to distribute your index across a cluster before you ever approach this number. If you
know you will exceed this number of documents in total before you’ve even started indexing, it’s best to
plan your installation with SolrCloud as part of your design from the start.
Package Installation
To keep things simple for now, extract the Solr distribution archive to your local home directory, for instance
on Linux, do:
cd ~/
tar zxf solr-7.7.0.tgz
Once extracted, you are now ready to run Solr using the instructions provided in the Starting Solr section
below.
Directory Layout
After installing Solr, you’ll see the following directories and files within them:
bin/
This directory includes several important scripts that will make using Solr easier.
solr and solr.cmd
This is Solr’s Control Script, also known as bin/solr (*nix) / bin/solr.cmd (Windows). This script is the
preferred tool to start and stop Solr. You can also create collections or cores, configure authentication,
and work with configuration files when running in SolrCloud mode.
post
The PostTool, which provides a simple command line interface for POSTing content to Solr.
solr.in.sh and solr.in.cmd
These are property files for *nix and Windows systems, respectively. System-level properties for Java,
Jetty, and Solr are configured here. Many of these settings can be overridden when using bin/solr /
bin/solr.cmd, but this allows you to set all the properties in one place.
install_solr_services.sh
This script is used on *nix systems to install Solr as a service. It is described in more detail in the
section Taking Solr to Production.
contrib/
Solr’s contrib directory includes add-on plugins for specialized features of Solr.
dist/
The dist directory contains the main Solr .jar files.
docs/
The docs directory includes a link to online Javadocs for Solr.
example/
The example directory includes several types of examples that demonstrate various Solr capabilities. See
the section Solr Examples below for more details on what is in this directory.
licenses/
The licenses directory includes all of the licenses for 3rd party libraries used by Solr.
server/
This directory is where the heart of the Solr application resides. A README in this directory provides a
detailed overview, but here are some highlights:
• Solr’s Admin UI (server/solr-webapp)
• Jetty libraries (server/lib)
• Log files (server/logs) and log configurations (server/resources). See the section Configuring
Logging for more details on how to customize Solr’s default logging.
• Sample configsets (server/solr/configsets)
Solr Examples
Solr includes a number of example documents and configurations to use when getting started. If you ran
through the Solr Tutorial, you have already interacted with some of these files.
Here are the examples included with Solr:
exampledocs
This is a small set of simple CSV, XML, and JSON files that can be used with bin/post when first getting
started with Solr. For more information about using bin/post with these files, see Post Tool.
example-DIH
This directory includes a few example DataImport Handler (DIH) configurations to help you get started
with importing structured content in a database, an email server, or even an Atom feed. Each example
will index a different set of data; see the README there for more details about these examples.
files
The files directory provides a basic search UI for documents such as Word or PDF that you may have
stored locally. See the README there for details on how to use this example.
films
The films directory includes a robust set of data about movies in three formats: CSV, XML, and JSON. See
the README there for details on how to use this dataset.
Starting Solr
Solr includes a command line interface tool called bin/solr (Linux/MacOS) or bin\solr.cmd (Windows). This
tool allows you to start and stop Solr, create cores and collections, configure authentication, and check the
status of your system.
To use it to start Solr you can simply enter:
bin/solr start
If you are running Windows, you can start Solr by running bin\solr.cmd instead.
bin\solr.cmd start
This will start Solr in the background, listening on port 8983.
When you start Solr in the background, the script will wait to make sure Solr starts correctly before returning
to the command line prompt.
All of the options for the Solr CLI are described in the section Solr Control Script Reference.
Start Solr with a Specific Bundled Example
Solr also provides a number of useful examples to help you learn about key features. You can launch the
examples using the -e flag. For instance, to launch the "techproducts" example, you would do:
bin/solr -e techproducts
Currently, the available examples you can run are: techproducts, dih, schemaless, and cloud. See the section
Running with Example Configurations for details on each example.
Getting Started with SolrCloud
Running the cloud example starts Solr in SolrCloud mode. For more information on
starting Solr in SolrCloud mode, see the section Getting Started with SolrCloud.
Check if Solr is Running
If you’re not sure if Solr is running locally, you can use the status command:
bin/solr status
This will search for running Solr instances on your computer and then gather basic information about them,
such as the version and memory usage.
That’s it! Solr is running. If you need convincing, use a Web browser to see the Admin Console.
http://localhost:8983/solr/
The Solr Admin interface.
If Solr is not running, your browser will complain that it cannot connect to the server. Check your port
number and try again.
Create a Core
If you did not start Solr with an example configuration, you would need to create a core in order to be able
to index and search. You can do so by running:
bin/solr create -c <name>
This will create a core that uses a data-driven schema which tries to guess the correct field type when you
add documents to the index.
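For example, to create a core named "mycore" (any name of your choosing works the same way):
bin/solr create -c mycore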
To see all available options for creating a new core, execute:
bin/solr create -help
Deployment and Operations
An important aspect of Solr is that all operations and deployment can be done online, with minimal or no impact to running applications. This includes minor upgrades, provisioning and removing nodes, backing up and restoring indexes, and editing configurations.
Common administrative tasks include:
Solr Control Script Reference: This section provides information about all of the options available to the
bin/solr / bin\solr.cmd scripts, which can start and stop Solr, configure authentication, and create or
remove collections and cores.
Solr Configuration Files: Overview of the installation layout and major configuration files.
Taking Solr to Production: Detailed steps to help you install Solr as a service and take your application to
production.
Making and Restoring Backups: Describes backup strategies for your Solr indexes.
Running Solr on HDFS: How to use HDFS to store your Solr indexes and transaction logs.
SolrCloud on AWS EC2: A tutorial on deploying Solr in Amazon Web Services (AWS) using EC2 instances.
Upgrading a Solr Cluster: Information for upgrading a production SolrCloud cluster.
Solr Upgrade Notes: Information about changes made in Solr releases.
Solr Control Script Reference
Solr includes a script known as “bin/solr” that allows you to perform many common operations on your Solr
installation or cluster.
You can start and stop Solr, create and delete collections or cores, perform operations on ZooKeeper and
check the status of Solr and configured shards.
You can find the script in the bin/ directory of your Solr installation. The bin/solr script makes Solr easier
to work with by providing simple commands and options to quickly accomplish common goals.
More examples of bin/solr in use are available throughout the Solr Reference Guide, but particularly in the
sections Starting Solr and Getting Started with SolrCloud.
Starting and Stopping
Start and Restart
The start command starts Solr. The restart command allows you to restart Solr while it is already running
or if it has been stopped already.
The start and restart commands have several options to allow you to run in SolrCloud mode, use an
example configuration set, start with a hostname or port that is not the default and point to a local
ZooKeeper ensemble.
bin/solr start [options]
bin/solr start -help
bin/solr restart [options]
bin/solr restart -help
When using the restart command, you must pass all of the parameters you initially passed when you
started Solr. Behind the scenes, a stop request is initiated, so Solr will be stopped before being started again.
If no nodes are already running, restart will skip the step to stop and proceed to starting Solr.
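For example, if a node was started as in the tutorial with bin/solr start -c -p 8983 -s example/cloud/node1/solr, a sketch of the matching restart would be:
bin/solr restart -c -p 8983 -s example/cloud/node1/solr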
Start Parameters
The bin/solr script provides many options to allow you to customize the server in common ways, such as
changing the listening port. However, most of the defaults are adequate for most Solr installations,
especially when just getting started.
-a ""
Start Solr with additional JVM parameters, such as those starting with -X. If you are passing JVM
parameters that begin with "-D", you can omit the -a option.
Example:
bin/solr start -a "-Xdebug -Xrunjdwp:transport=dt_socket, server=y,suspend=n,address=1044"
-cloud
Start Solr in SolrCloud mode, which will also launch the embedded ZooKeeper instance included with Solr.
This option can be shortened to simply -c.
If you are already running a ZooKeeper ensemble that you want to use instead of the embedded (single-node) ZooKeeper, you should also either specify ZK_HOST in solr.in.sh/solr.in.cmd (see instructions) or pass the -z parameter.
For more details, see the section SolrCloud Mode below.
Example: bin/solr start -c
-d <dir>
Define a server directory, defaults to server (as in, $SOLR_HOME/server). It is uncommon to override this
option. When running multiple instances of Solr on the same host, it is more common to use the same
server directory for each instance and use a unique Solr home directory using the -s option.
Example: bin/solr start -d newServerDir
-e <name>
Start Solr with an example configuration. These examples are provided to help you get started faster with
Solr generally, or just try a specific feature.
The available options are:
• cloud
• techproducts
• dih
• schemaless
See the section Running with Example Configurations below for more details on the example
configurations.
Example: bin/solr start -e schemaless
-f
Start Solr in the foreground; you cannot use this option when running examples with the -e option.
Example: bin/solr start -f
-h <hostname>
Start Solr with the defined hostname. If this is not specified, 'localhost' will be assumed.
Example: bin/solr start -h search.mysolr.com
-m <memory>
Start Solr with the defined value as the min (-Xms) and max (-Xmx) heap size for the JVM.
Example: bin/solr start -m 1g
-noprompt
Start Solr and suppress any prompts that may be seen with another option. This would have the side
effect of accepting all defaults implicitly.
For example, when using the "cloud" example, an interactive session guides you through several options
for your SolrCloud cluster. If you want to accept all of the defaults, you can simply add the -noprompt
option to your request.
Example: bin/solr start -e cloud -noprompt
-p <port>
Start Solr on the defined port. If this is not specified, '8983' will be used.
Example: bin/solr start -p 8655
-s <dir>
Sets the solr.solr.home system property; Solr will create core directories under this directory. This
allows you to run multiple Solr instances on the same host while reusing the same server directory set
using the -d parameter.
If set, the specified directory should contain a solr.xml file, unless solr.xml exists in ZooKeeper. The
default value is server/solr.
This parameter is ignored when running examples (-e), as the solr.solr.home depends on which
example is run.
Example: bin/solr start -s newHome
-v
Be more verbose. This changes the logging level of log4j from INFO to DEBUG, having the same effect as if
you edited log4j2.xml accordingly.
Example: bin/solr start -f -v
-q
Be more quiet. This changes the logging level of log4j from INFO to WARN, having the same effect as if you
edited log4j2.xml accordingly. This can be useful in a production setting where you want to limit logging
to warnings and errors.
Example: bin/solr start -f -q
-V
Start Solr with verbose messages from the start script.
Example: bin/solr start -V
-z <zkHost>
Start Solr with the defined ZooKeeper connection string. This option is only used with the -c option, to
start Solr in SolrCloud mode. If ZK_HOST is not specified in solr.in.sh/solr.in.cmd and this option is not
provided, Solr will start the embedded ZooKeeper instance and use that instance for SolrCloud
operations.
Example: bin/solr start -c -z server1:2181,server2:2181
-force
If attempting to start Solr as the root user, the script will exit with a warning that running Solr as "root"
can cause problems. It is possible to override this warning with the -force parameter.
Example: sudo bin/solr start -force
To emphasize how the default settings work, take a moment to understand that the following commands are equivalent:
bin/solr start
bin/solr start -h localhost -p 8983 -d server -s solr -m 512m
It is not necessary to define all of the options when starting if the defaults are fine for your needs.
Setting Java System Properties
The bin/solr script will pass any additional parameters that begin with -D to the JVM, which allows you to
set arbitrary Java system properties.
For example, to set the auto soft-commit frequency to 3 seconds, you can do:
bin/solr start -Dsolr.autoSoftCommit.maxTime=3000
SolrCloud Mode
The -c and -cloud options are equivalent:
bin/solr start -c
bin/solr start -cloud
If you specify a ZooKeeper connection string, such as -z 192.168.1.4:2181, then Solr will connect to
ZooKeeper and join the cluster.
If you have defined ZK_HOST in solr.in.sh/solr.in.cmd (see instructions) you can omit -z
from all bin/solr commands.
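For reference, a minimal sketch of that setting in solr.in.sh (the hostnames and chroot are placeholders) looks like:
ZK_HOST="zk1:2181,zk2:2181,zk3:2181/solr"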
When starting Solr in SolrCloud mode, if you do not define ZK_HOST in solr.in.sh/solr.in.cmd nor specify the -z option, then Solr will launch an embedded ZooKeeper server listening on the Solr port + 1000, i.e., if Solr is running on port 8983, then the embedded ZooKeeper will be listening on port 9983.
If your ZooKeeper connection string uses a chroot, such as localhost:2181/solr, then you need to create the /solr znode before launching SolrCloud using the bin/solr script. To do this, use the mkroot command outlined below, for example: bin/solr zk mkroot /solr -z 192.168.1.4:2181
When starting in SolrCloud mode, the interactive script session will prompt you to choose a configset to use.
For more information about starting Solr in SolrCloud mode, see also the section Getting Started with
SolrCloud.
Running with Example Configurations
bin/solr start -e <name>
The example configurations allow you to get started quickly with a configuration that mirrors what you hope
to accomplish with Solr.
Each example launches Solr with a managed schema, which allows use of the Schema API to make schema
edits, but does not allow manual editing of a Schema file.
If you would prefer to manually modify a schema.xml file directly, you can change this default as described
in the section Schema Factory Definition in SolrConfig.
Unless otherwise noted in the descriptions below, the examples do not enable SolrCloud nor schemaless
mode.
The following examples are provided:
• cloud: This example starts a 1-4 node SolrCloud cluster on a single machine. When chosen, an interactive
session will start to guide you through options to select the initial configset to use, the number of nodes
for your example cluster, the ports to use, and name of the collection to be created.
When using this example, you can choose from any of the available configsets found in
$SOLR_HOME/server/solr/configsets.
• techproducts: This example starts Solr in standalone mode with a schema designed for the sample
documents included in the $SOLR_HOME/example/exampledocs directory.
The configset used can be found in
$SOLR_HOME/server/solr/configsets/sample_techproducts_configs.
• dih: This example starts Solr in standalone mode with the DataImportHandler (DIH) enabled and several
example dataconfig.xml files pre-configured for different types of data supported with DIH (such as,
database contents, email, RSS feeds, etc.).
The configset used is customized for DIH, and is found in $SOLR_HOME/example/example-DIH/solr/conf.
For more information about DIH, see the section Uploading Structured Data Store Data with the Data
Import Handler.
• schemaless: This example starts Solr in standalone mode using a managed schema, as described in the
section Schema Factory Definition in SolrConfig, and provides a very minimal pre-defined schema. Solr
will run in Schemaless Mode with this configuration, where Solr will create fields in the schema on the fly
and will guess field types used in incoming documents.
The configset used can be found in $SOLR_HOME/server/solr/configsets/_default.
The run in-foreground option (-f) is not compatible with the -e option since the script
needs to perform additional tasks after starting the Solr server.
Stop
The stop command sends a STOP request to a running Solr node, which allows it to shutdown gracefully.
The command will wait up to 180 seconds for Solr to stop gracefully and then will forcefully kill the process
(kill -9).
bin/solr stop [options]
bin/solr stop -help
Stop Parameters
-p <port>
Stop Solr running on the given port. If you are running more than one instance, or are running in
SolrCloud mode, you either need to specify the ports in separate requests or use the -all option.
Example: bin/solr stop -p 8983
-all
Stop all running Solr instances that have a valid PID.
Example: bin/solr stop -all
-k <key>
Stop key used to protect from stopping Solr inadvertently; default is "solrrocks".
Example: bin/solr stop -k solrrocks
System Information
Version
The version command simply returns the version of Solr currently installed and immediately exits.
$ bin/solr version
X.Y.0
Status
The status command displays basic JSON-formatted information for any Solr nodes found running on the
local system.
The status command uses the SOLR_PID_DIR environment variable to locate Solr process ID files to find
running Solr instances, which defaults to the bin directory.
bin/solr status
The output will include a status of each node of the cluster, as in this example:
Found 2 Solr nodes:

Solr process 39920 running on port 7574
{
  "solr_home":"/Applications/Solr/example/cloud/node2/solr/",
  "version":"X.Y.0",
  "startTime":"2015-02-10T17:19:54.739Z",
  "uptime":"1 days, 23 hours, 55 minutes, 48 seconds",
  "memory":"77.2 MB (%15.7) of 490.7 MB",
  "cloud":{
    "ZooKeeper":"localhost:9865",
    "liveNodes":"2",
    "collections":"2"}}

Solr process 39827 running on port 8865
{
  "solr_home":"/Applications/Solr/example/cloud/node1/solr/",
  "version":"X.Y.0",
  "startTime":"2015-02-10T17:19:49.057Z",
  "uptime":"1 days, 23 hours, 55 minutes, 54 seconds",
  "memory":"94.2 MB (%19.2) of 490.7 MB",
  "cloud":{
    "ZooKeeper":"localhost:9865",
    "liveNodes":"2",
    "collections":"2"}}
Assert
The assert command sanity checks common issues with Solr installations. These include checking the
ownership/existence of particular directories, and ensuring Solr is available on the expected URL. The
command can either output a specified error message, or change its exit code to indicate errors.
As an example:
bin/solr assert --exists /opt/bin/solr
Results in the output below:
ERROR: Directory /opt/bin/solr does not exist.
Use bin/solr assert -help for a full list of options.
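Because assert can signal problems through its output, it can serve as a simple pre-flight check in deployment scripts. A rough sketch, assuming /var/solr/data is your Solr home and using only the --exists option shown above:

# abort if the expected Solr home directory is missing
if bin/solr assert --exists /var/solr/data 2>&1 | grep -q ERROR; then
  echo "Solr home /var/solr/data not found; aborting" >&2
  exit 1
fi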
Healthcheck
The healthcheck command generates a JSON-formatted health report for a collection when running in
SolrCloud mode. The health report provides information about the state of every replica for all shards in a
collection, including the number of committed documents and its current state.
bin/solr healthcheck [options]
bin/solr healthcheck -help
Healthcheck Parameters
-c
Name of the collection to run a healthcheck against (required).
Example: bin/solr healthcheck -c gettingstarted
-z
ZooKeeper connection string, defaults to localhost:9983. If you are running Solr on a port other than
8983, you will have to specify the ZooKeeper connection string. By default, this will be the Solr port + 1000.
Unnecessary if ZK_HOST is defined in solr.in.sh or solr.in.cmd.
Example: bin/solr healthcheck -z localhost:2181
Below is an example healthcheck request and response using a non-standard ZooKeeper connect string,
with 2 nodes running:
$ bin/solr healthcheck -c gettingstarted -z localhost:9865
{
  "collection":"gettingstarted",
  "status":"healthy",
  "numDocs":0,
  "numShards":2,
  "shards":[
    {
      "shard":"shard1",
      "status":"healthy",
      "replicas":[
        {
          "name":"core_node1",
          "url":"http://10.0.1.10:8865/solr/gettingstarted_shard1_replica2/",
          "numDocs":0,
          "status":"active",
          "uptime":"2 days, 1 hours, 18 minutes, 48 seconds",
          "memory":"25.6 MB (%5.2) of 490.7 MB",
          "leader":true},
        {
          "name":"core_node4",
          "url":"http://10.0.1.10:7574/solr/gettingstarted_shard1_replica1/",
          "numDocs":0,
          "status":"active",
          "uptime":"2 days, 1 hours, 18 minutes, 42 seconds",
          "memory":"95.3 MB (%19.4) of 490.7 MB"}]},
    {
      "shard":"shard2",
      "status":"healthy",
      "replicas":[
        {
          "name":"core_node2",
          "url":"http://10.0.1.10:8865/solr/gettingstarted_shard2_replica2/",
          "numDocs":0,
          "status":"active",
          "uptime":"2 days, 1 hours, 18 minutes, 48 seconds",
          "memory":"25.8 MB (%5.3) of 490.7 MB"},
        {
          "name":"core_node3",
          "url":"http://10.0.1.10:7574/solr/gettingstarted_shard2_replica1/",
          "numDocs":0,
          "status":"active",
          "uptime":"2 days, 1 hours, 18 minutes, 42 seconds",
          "memory":"95.4 MB (%19.4) of 490.7 MB",
          "leader":true}]}]}
Collections and Cores
The bin/solr script can also help you create new collections (in SolrCloud mode) or cores (in standalone
mode), or delete collections.
Create a Core or Collection
The create command detects the mode that Solr is running in (standalone or SolrCloud) and then creates a
core or collection depending on the mode.
bin/solr create [options]
bin/solr create -help
Create Core or Collection Parameters
-c
Name of the core or collection to create (required).
Example: bin/solr create -c mycollection
-d
The configuration directory. This defaults to _default.
See the section Configuration Directories and SolrCloud below for more details about this option when
running in SolrCloud mode.
Example: bin/solr create -d _default
-n
The configuration name. This defaults to the same name as the core or collection.
Example: bin/solr create -n basic
-p
Port of a local Solr instance to send the create command to; by default the script tries to detect the port
by looking for running Solr instances.
This option is useful if you are running multiple standalone Solr instances on the same host, thus
requiring you to be specific about which instance to create the core in.
Example: bin/solr create -p 8983
-s or -shards
Number of shards to split a collection into, default is 1; only applies when Solr is running in SolrCloud
mode.
Example: bin/solr create -s 2
-rf or -replicationFactor
Number of copies of each document in the collection. The default is 1 (no replication).
Example: bin/solr create -rf 2
-force
If attempting to run create as "root" user, the script will exit with a warning that running Solr or actions
against Solr as "root" can cause problems. It is possible to override this warning with the -force
parameter.
Example: bin/solr create -c foo -force
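Putting several of these parameters together, a sketch that creates a two-shard, two-replica collection from one of the built-in configsets (the collection name is illustrative):

bin/solr create -c mycollection -d sample_techproducts_configs -s 2 -rf 2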
Configuration Directories and SolrCloud
Before creating a collection in SolrCloud, the configuration directory used by the collection must be
uploaded to ZooKeeper. The create command supports several use cases for how collections and
configuration directories work. The main decision you need to make is whether a configuration directory in
ZooKeeper should be shared across multiple collections.
Let’s work through a few examples to illustrate how configuration directories work in SolrCloud.
First, if you don’t provide the -d or -n options, then the default configuration
($SOLR_HOME/server/solr/configsets/_default/conf) is uploaded to ZooKeeper using the same name as
the collection.
For example, the following command will result in the _default configuration being uploaded to
/configs/contacts in ZooKeeper: bin/solr create -c contacts.
If you create another collection with bin/solr create -c contacts2, then another copy of the _default
directory will be uploaded to ZooKeeper under /configs/contacts2.
Any changes you make to the configuration for the contacts collection will not affect the contacts2
collection. Put simply, the default behavior creates a unique copy of the configuration directory for each
collection you create.
You can override the name given to the configuration directory in ZooKeeper by using the -n option. For
instance, the command bin/solr create -c logs -d _default -n basic will upload the
server/solr/configsets/_default/conf directory to ZooKeeper as /configs/basic.
Notice that we used the -d option to specify a different configuration than the default. Solr provides several
built-in configurations under server/solr/configsets. However, you can also provide the path to your own
configuration directory using the -d option. For instance, the command bin/solr create -c mycoll -d
/tmp/myconfigs will upload /tmp/myconfigs into ZooKeeper under /configs/mycoll.
To reiterate, the configuration directory is named after the collection unless you override it using the -n
option.
Other collections can share the same configuration by specifying the name of the shared configuration
using the -n option. For instance, the following command will create a new collection that shares the basic
configuration created previously: bin/solr create -c logs2 -n basic.
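Putting these steps together, a sketch of creating two collections that share one uploaded configuration and then inspecting the result (the zk ls command is described later on this page; the ZooKeeper address assumes the embedded ZooKeeper default):

bin/solr create -c logs -d _default -n basic    # uploads _default to /configs/basic
bin/solr create -c logs2 -n basic               # reuses the existing /configs/basic
bin/solr zk ls /configs -z localhost:9983       # lists the uploaded configuration sets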
Data-driven Schema and Shared Configurations
The _default schema can mutate as data is indexed, since it has schemaless functionality (i.e., data-driven
changes to the schema). Consequently, we recommend that you do not share data-driven configurations
between collections unless you are certain that all collections should inherit the changes made when
indexing data into one of the collections. You can turn off schemaless functionality (i.e., data-driven changes
to the schema) for a collection with the following command, assuming the collection name is mycollection
(see Set or Unset Configuration Properties):
bin/solr config -c mycollection -p 8983 -action set-user-property -property
update.autoCreateFields -value false
Delete Core or Collection
The delete command detects the mode that Solr is running in (standalone or SolrCloud) and then deletes
the specified core (standalone) or collection (SolrCloud) as appropriate.
bin/solr delete [options]
bin/solr delete -help
If running in SolrCloud mode, the delete command checks if the configuration directory used by the
collection you are deleting is being used by other collections. If not, then the configuration directory is also
deleted from ZooKeeper.
For example, if you created a collection with bin/solr create -c contacts, then the delete command
bin/solr delete -c contacts will check to see if the /configs/contacts configuration directory is being
used by any other collections. If not, then the /configs/contacts directory is removed from ZooKeeper.
Delete Core or Collection Parameters
-c
Name of the core / collection to delete (required).
Example: bin/solr delete -c mycoll
-deleteConfig
Whether or not the configuration directory should also be deleted from ZooKeeper. The default is true.
If the configuration directory is being used by another collection, then it will not be deleted even if you
pass -deleteConfig as true.
Example: bin/solr delete -deleteConfig false
-p
The port of a local Solr instance to send the delete command to. By default the script tries to detect the
port by looking for running Solr instances.
This option is useful if you are running multiple standalone Solr instances on the same host, thus
requiring you to be specific about which instance to delete the core from.
Example: bin/solr delete -p 8983
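As an example, a sketch that deletes a collection while keeping its configuration directory in ZooKeeper for later reuse (the collection name is illustrative):

bin/solr delete -c contacts -deleteConfig false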
Authentication
The bin/solr script can enable or disable Basic Authentication, allowing you to configure authentication
from the command line.
Currently, this script only enables Basic Authentication, and is only available when using SolrCloud mode.
Enabling Basic Authentication
The command bin/solr auth enable configures Solr to use Basic Authentication when accessing the User
Interface, when using bin/solr, and when making any API requests.
For more information about Solr’s authentication plugins, see the section Securing Solr. For
more information on Basic Authentication support specifically, see the section Basic
Authentication Plugin.
The bin/solr auth enable command makes several changes to enable Basic Authentication:
• Creates a security.json file and uploads it to ZooKeeper. The security.json file will look similar to:
{
  "authentication":{
    "blockUnknown": false,
    "class":"solr.BasicAuthPlugin",
    "credentials":{"user":"vgGVo69YJeUg/O6AcFiowWsdyOUdqfQvOLsrpIPMCzk= 7iTnaKOWe+Uj5ZfGoKKK2G6hrcF10h6xezMQK+LBvpI="}
  },
  "authorization":{
    "class":"solr.RuleBasedAuthorizationPlugin",
    "permissions":[
      {"name":"security-edit", "role":"admin"},
      {"name":"collection-admin-edit", "role":"admin"},
      {"name":"core-admin-edit", "role":"admin"}
    ],
    "user-role":{"user":"admin"}
  }
}
• Adds two lines to bin/solr.in.sh or bin\solr.in.cmd to set the authentication type, and the path to
basicAuth.conf:
# The following lines added by ./solr for enabling BasicAuth
SOLR_AUTH_TYPE="basic"
SOLR_AUTHENTICATION_OPTS="-Dsolr.httpclient.config=/path/to/solr7.7.0/server/solr/basicAuth.conf"
• Creates the file server/solr/basicAuth.conf to store the credential information that is used with
bin/solr commands.
The command takes the following parameters:
-credentials
The username and password in the format of username:password of the initial user.
If you prefer not to pass the username and password as an argument to the script, you can choose the
-prompt option. Either -credentials or -prompt must be specified.
-prompt
If prompt is preferred, pass true as a parameter to request the script to prompt the user to enter a
username and password.
Either -credentials or -prompt must be specified.
-blockUnknown
When true, blocks all unauthenticated users from accessing Solr. This defaults to false, which means
unauthenticated users will still be able to access Solr.
-updateIncludeFileOnly
When true, only the settings in bin/solr.in.sh or bin\solr.in.cmd will be updated, and security.json
will not be created.
-z
Defines the ZooKeeper connect string. This is useful if you want to enable authentication before all your
Solr nodes have come up. Unnecessary if ZK_HOST is defined in solr.in.sh or solr.in.cmd.
-d
Defines the Solr server directory, by default $SOLR_HOME/server. It is not common to need to override the
default, and is only needed if you have customized the $SOLR_HOME directory path.
-s
Defines the location of solr.solr.home, which by default is server/solr. If you have multiple instances
of Solr on the same host, or if you have customized the $SOLR_HOME directory path, you likely need to
define this.
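For example, a sketch that enables Basic Authentication with an initial admin user and also blocks unauthenticated requests (the username and password are placeholders you should replace):

bin/solr auth enable -credentials admin:adminPassword -blockUnknown true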
Disabling Basic Authentication
You can disable Basic Authentication with bin/solr auth disable.
If the -updateIncludeFileOnly option is set to true, then only the settings in bin/solr.in.sh or
bin\solr.in.cmd will be updated, and security.json will not be removed.
If the -updateIncludeFileOnly option is set to false, then the settings in bin/solr.in.sh or
bin\solr.in.cmd will be updated, and security.json will be removed. However, the basicAuth.conf file is
not removed with either option.
Set or Unset Configuration Properties
The bin/solr script enables a subset of the Config API: (un)setting common properties and (un)setting
user-defined properties.
bin/solr config [options]
bin/solr config -help
Set or Unset Common Properties
To set the common property updateHandler.autoCommit.maxDocs to 100 on collection mycollection:
bin/solr config -c mycollection -p 8983 -action set-property -property updateHandler.autoCommit.maxDocs -value 100
The default -action is set-property, so the above can be shortened by not mentioning it:
bin/solr config -c mycollection -p 8983 -property updateHandler.autoCommit.maxDocs -value
100
To unset a previously set common property, specify -action unset-property with no -value:
bin/solr config -c mycollection -p 8983 -action unset-property -property
updateHandler.autoCommit.maxDocs
Set or Unset User-defined Properties
To set the user-defined property update.autoCreateFields to false (to disable Schemaless Mode):
bin/solr config -c mycollection -p 8983 -action set-user-property -property
update.autoCreateFields -value false
To unset a previously set user-defined property, specify -action unset-user-property with no -value:
bin/solr config -c mycollection -p 8983 -action unset-user-property -property
update.autoCreateFields
Config Parameters
-c
Name of the core or collection on which to change configuration (required).
-action
Config API action, one of: set-property, unset-property, set-user-property, unset-user-property;
defaults to set-property.
-property
Name of the property to change (required).
-value
Set the property to this value.
-z
The ZooKeeper connection string, usable in SolrCloud mode. Unnecessary if ZK_HOST is defined in
solr.in.sh or solr.in.cmd.
-p
localhost port of the Solr node to use when applying the configuration change.
-solrUrl
Base Solr URL, which can be used in SolrCloud mode to determine the ZooKeeper connection string if
that’s not known.
ZooKeeper Operations
The bin/solr script allows certain operations affecting ZooKeeper. These operations are for SolrCloud mode
only. The operations are available as sub-commands, which each have their own set of options.
bin/solr zk [sub-command] [options]
bin/solr zk -help
Solr should have been started at least once before issuing these commands to initialize
ZooKeeper with the znodes Solr expects. Once ZooKeeper is initialized, Solr doesn’t need to
be running on any node to use these commands.
Upload a Configuration Set
Use the zk upconfig command to upload one of the pre-configured configuration sets, or a customized
configuration set, to ZooKeeper.
ZK Upload Parameters
All parameters below are required.
-n
Name of the configuration set in ZooKeeper. This command will upload the configuration set to the
"configs" ZooKeeper node giving it the name specified.
You can see all uploaded configuration sets in the Admin UI via the Cloud screens. Choose Cloud -> Tree
-> configs to see them.
If a pre-existing configuration set is specified, it will be overwritten in ZooKeeper.
Example: -n myconfig
-d
The path of the configuration set to upload. It should have a conf directory immediately below it that in
turn contains solrconfig.xml etc.
If just a name is supplied, $SOLR_HOME/server/solr/configsets will be checked for this name. An
absolute path may be supplied instead.
Examples:
• -d directory_under_configsets
• -d /path/to/configset/source
-z
The ZooKeeper connection string. Unnecessary if ZK_HOST is defined in solr.in.sh or solr.in.cmd.
Example: -z 123.321.23.43:2181
An example of this command with all of the parameters is:
bin/solr zk upconfig -z 111.222.333.444:2181 -n mynewconfig -d /path/to/configset
Reload Collections When Changing Configurations
This command does not automatically make changes effective! It simply uploads the
configuration sets to ZooKeeper. You can use the Collections API's RELOAD command to
reload any collections that use this configuration set.
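For example, a sketch of the full cycle: upload a modified configuration set and then reload the collection that uses it through the Collections API (the names, path, and ports are illustrative):

bin/solr zk upconfig -n basic -d /path/to/modified/configset -z localhost:2181
curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=logs"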
Download a Configuration Set
Use the zk downconfig command to download a configuration set from ZooKeeper to the local filesystem.
ZK Download Parameters
All parameters listed below are required.
-n
Name of the configset in ZooKeeper to download. The Admin UI Cloud -> Tree -> configs node lists all
available configuration sets.
Example: -n myconfig
-d
The path to write the downloaded configuration set into. If just a name is supplied,
$SOLR_HOME/server/solr/configsets will be the parent. An absolute path may be supplied as well.
In either case, pre-existing configurations at the destination will be overwritten!
Examples:
• -d directory_under_configsets
• -d /path/to/configset/destination
-z
The ZooKeeper connection string. Unnecessary if ZK_HOST is defined in solr.in.sh or solr.in.cmd.
Example: -z 123.321.23.43:2181
An example of this command with all parameters is:
bin/solr zk downconfig -z 111.222.333.444:2181 -n mynewconfig -d /path/to/configset
A "best practice" is to keep your configuration sets in some form of version control as the system-of-record.
In that scenario, downconfig should rarely be used.
Copy between Local Files and ZooKeeper znodes
Use the zk cp command for transferring files and directories between ZooKeeper znodes and your local
drive. This command will copy from the local drive to ZooKeeper, from ZooKeeper to the local drive or from
ZooKeeper to ZooKeeper.
ZK Copy Parameters
-r
Optional. Do a recursive copy. The command will fail if the <src> has children unless -r is specified.
Example: -r
<src>
The file or path to copy from. If prefixed with zk: then the source is presumed to be ZooKeeper. If there is
no prefix or the prefix is file:, it is the local drive. At least one of <src> or <dest> must be prefixed by zk:
or the command will fail.
Examples:
• zk:/configs/myconfigs/solrconfig.xml
• file:/Users/apache/configs/src
<dest>
The file or path to copy to. If prefixed with zk: then the destination is presumed to be ZooKeeper. If there
is no prefix or the prefix is file:, it is the local drive.
At least one of <src> or <dest> must be prefixed by zk: or the command will fail. If <dest> ends in a slash
character it names a directory.
Examples:
• zk:/configs/myconfigs/solrconfig.xml
• file:/Users/apache/configs/src
-z
The ZooKeeper connection string. Unnecessary if ZK_HOST is defined in solr.in.sh or solr.in.cmd.
Example: -z 123.321.23.43:2181
An example of this command with the parameters is:
Recursively copy a directory from local to ZooKeeper.
bin/solr zk cp -r file:/apache/confgs/whatever/conf zk:/configs/myconf -z
111.222.333.444:2181
Copy a single file from ZooKeeper to local.
bin/solr zk cp zk:/configs/myconf/managed_schema /configs/myconf/managed_schema -z
111.222.333.444:2181
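As another illustration, a sketch that recursively copies every configuration set out of ZooKeeper into a local backup directory (the destination path is an assumption):

bin/solr zk cp -r zk:/configs file:/backups/solr-configs -z 111.222.333.444:2181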
Remove a znode from ZooKeeper
Use the zk rm command to remove a znode (and optionally all child nodes) from ZooKeeper.
ZK Remove Parameters
-r
Optional. Do a recursive removal. The command will fail if the <path> has children unless -r is specified.
Example: -r
<path>
The path to remove from ZooKeeper, either a parent or leaf node.
There are limited safety checks; you cannot remove '/' or '/zookeeper' nodes.
The path is assumed to be a ZooKeeper node, so no zk: prefix is necessary.
Examples:
• /configs
• /configs/myconfigset
• /configs/myconfigset/solrconfig.xml
-z
The ZooKeeper connection string. Unnecessary if ZK_HOST is defined in solr.in.sh or solr.in.cmd.
Example: -z 123.321.23.43:2181
Examples of this command with the parameters are:
bin/solr zk rm -r /configs
bin/solr zk rm /configs/myconfigset/schema.xml
Move One ZooKeeper znode to Another (Rename)
Use the zk mv command to move (rename) a ZooKeeper znode.
ZK Move Parameters
<src>
The znode to rename. The zk: prefix is assumed.
Example: /configs/oldconfigset
<dest>
The new name of the znode. The zk: prefix is assumed.
Example: /configs/newconfigset
-z
The ZooKeeper connection string. Unnecessary if ZK_HOST is defined in solr.in.sh or solr.in.cmd.
Example: -z 123.321.23.43:2181
An example of this command is:
bin/solr zk mv /configs/oldconfigset /configs/newconfigset
List a ZooKeeper znode’s Children
Use the zk ls command to see the children of a znode.
ZK List Parameters
-r
Optional. Recursively list all descendants of a znode.
Example: -r
<path>
The path on ZooKeeper to list.
Example: /collections/mycollection
-z
The ZooKeeper connection string. Unnecessary if ZK_HOST is defined in solr.in.sh or solr.in.cmd.
Example: -z 123.321.23.43:2181
An example of this command with the parameters is:
bin/solr zk ls -r /collections/mycollection
bin/solr zk ls /collections
Create a znode (supports chroot)
Use the zk mkroot command to create a znode. The primary use case for this command is to support
ZooKeeper’s "chroot" concept. However, it can also be used to create arbitrary paths.
Create znode Parameters
<path>
The path on ZooKeeper to create. Intermediate znodes will be created if necessary. A leading slash is
assumed even if not specified.
Example: /solr
-z
The ZooKeeper connection string. Unnecessary if ZK_HOST is defined in solr.in.sh or solr.in.cmd.
Example: -z 123.321.23.43:2181
Examples of this command:
bin/solr zk mkroot /solr -z 123.321.23.43:2181
bin/solr zk mkroot /solr/production
Solr Configuration Files
Solr has several configuration files that you will interact with during your implementation.
Many of these files are in XML format, although APIs that interact with configuration settings tend to accept
JSON for programmatic access as needed.
Solr Home
When Solr runs, it needs access to a home directory.
When you first install Solr, your home directory is server/solr. However, some examples may change this
location (for example, if you run bin/solr start -e cloud, your home directory will be example/cloud).
The home directory contains important configuration information and is the place where Solr will store its
index. The layout of the home directory will look a little different when you are running Solr in standalone
mode vs. when you are running in SolrCloud mode.
The crucial parts of the Solr home directory are shown in these examples:
Standalone Mode
/
solr.xml
core_name1/
core.properties
conf/
solrconfig.xml
managed-schema
data/
core_name2/
core.properties
conf/
solrconfig.xml
managed-schema
data/
SolrCloud Mode
/
solr.xml
core_name1/
core.properties
data/
core_name2/
core.properties
data/
You may see other files, but the main ones you need to know are discussed in the next section.
Configuration Files
Inside Solr’s Home, you’ll find these files:
• solr.xml specifies configuration options for your Solr server instance. For more information on
solr.xml see Solr Cores and solr.xml.
• Per Solr Core:
◦ core.properties defines specific properties for each core such as its name, the collection the core
belongs to, the location of the schema, and other parameters. For more details on core.properties,
see the section Defining core.properties.
◦ solrconfig.xml controls high-level behavior. You can, for example, specify an alternate location for
the data directory. For more information on solrconfig.xml, see Configuring solrconfig.xml.
◦ managed-schema (or schema.xml) describes the documents you will ask Solr to index. The
schema defines a document as a collection of fields. You get to define both the field types and the
fields themselves. Field type definitions are powerful and include information about how Solr
processes incoming field values and query values. For more information on Solr Schemas, see
Documents, Fields, and Schema Design and the Schema API.
◦ data/ The directory containing the low level index files.
Note that the SolrCloud example does not include a conf directory for each Solr Core (so there is no
solrconfig.xml or Schema file). This is because the configuration files usually found in the conf directory
are stored in ZooKeeper so they can be propagated across the cluster.
If you are using SolrCloud with the embedded ZooKeeper instance, you may also see zoo.cfg and zoo.data
which are ZooKeeper configuration and data files. However, if you are running your own ZooKeeper
ensemble, you would supply your own ZooKeeper configuration file when you start it and the copies in Solr
would be unused. For more information about SolrCloud, see the section SolrCloud.
Taking Solr to Production
This section provides guidance on how to set up Solr to run in production on *nix platforms, such as Ubuntu.
Specifically, we’ll walk through the process of setting up to run a single Solr instance on a Linux host and
then provide tips on how to support multiple Solr nodes running on the same host.
Service Installation Script
Solr includes a service installation script (bin/install_solr_service.sh) to help you install Solr as a service
on Linux. Currently, the script only supports CentOS, Debian, Red Hat, SUSE and Ubuntu Linux distributions.
Before running the script, you need to determine a few parameters about your setup. Specifically, you need
to decide where to install Solr and which system user should be the owner of the Solr files and process.
Planning Your Directory Structure
We recommend separating your live Solr files, such as logs and index files, from the files included in the Solr
distribution bundle, as that makes it easier to upgrade Solr and is considered a good practice to follow as a
system administrator.
Solr Installation Directory
By default, the service installation script will extract the distribution archive into /opt. You can change this
location using the -i option when running the installation script. The script will also create a symbolic link to
the versioned directory of Solr. For instance, if you run the installation script for Solr 7.7.0, then the following
directory structure will be used:
/opt/solr-7.7.0
/opt/solr -> /opt/solr-7.7.0
Using a symbolic link insulates any scripts from being dependent on the specific Solr version. If, down the
road, you need to upgrade to a later version of Solr, you can just update the symbolic link to point to the
upgraded version of Solr. We’ll use /opt/solr to refer to the Solr installation directory in the remaining
sections of this page.
Separate Directory for Writable Files
You should also separate writable Solr files into a different directory; by default, the installation script uses
/var/solr, but you can override this location using the -d option. With this approach, the files in /opt/solr
will remain untouched and all files that change while Solr is running will live under /var/solr.
Create the Solr User
Running Solr as root is not recommended for security reasons, and the control script start command will
refuse to do so. Consequently, you should determine the username of a system user that will own all of the
Solr files and the running Solr process. By default, the installation script will create the solr user, but you can
override this setting using the -u option. If your organization has specific requirements for creating new
user accounts, then you should create the user before running the script. The installation script will make
the Solr user the owner of the /opt/solr and /var/solr directories.
You are now ready to run the installation script.
Run the Solr Installation Script
To run the script, you’ll need to download the latest Solr distribution archive and then do the following:
tar xzf solr-7.7.0.tgz solr-7.7.0/bin/install_solr_service.sh --strip-components=2
The previous command extracts the install_solr_service.sh script from the archive into the current
directory. If installing on Red Hat, please make sure lsof is installed before running the Solr installation
script (sudo yum install lsof). The installation script must be run as root:
sudo bash ./install_solr_service.sh solr-7.7.0.tgz
By default, the script extracts the distribution archive into /opt, configures Solr to write files into /var/solr,
and runs Solr as the solr user. Consequently, the following command produces the same result as the
previous command:
sudo bash ./install_solr_service.sh solr-7.7.0.tgz -i /opt -d /var/solr -u solr -s solr -p 8983
You can customize the service name, installation directories, port, and owner using options passed to the
installation script. To see available options, simply do:
sudo bash ./install_solr_service.sh -help
Once the script completes, Solr will be installed as a service and running in the background on your server
(on port 8983). To verify, you can do:
sudo service solr status
If you do not want to start the service immediately, pass the -n option. You can then start the service
manually later, e.g., after completing the configuration setup.
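For instance, a sketch of installing without starting, adjusting the include file, and then starting the service manually (which settings you change first is up to you):

sudo bash ./install_solr_service.sh solr-7.7.0.tgz -n
sudo vi /etc/default/solr.in.sh    # e.g., set SOLR_JAVA_MEM or ZK_HOST before the first start
sudo service solr start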
We’ll cover some additional configuration settings you can make to fine-tune your Solr setup in a moment.
Before moving on, let’s take a closer look at the steps performed by the installation script. This gives you a
better overview and will help you understand important details about your Solr installation when reading
other pages in this guide; such as when a page refers to Solr home, you’ll know exactly where that is on your
system.
Solr Home Directory
The Solr home directory (not to be confused with the Solr installation directory) is where Solr manages core
directories with index files. By default, the installation script uses /var/solr/data. If the -d option is used on
the install script, then this will change to the data subdirectory in the location given to the -d option. Take a
moment to inspect the contents of the Solr home directory on your system. If you do not store solr.xml in
ZooKeeper, the home directory must contain a solr.xml file. When Solr starts up, the Solr Control Script
passes the location of the home directory using the -Dsolr.solr.home=… system property.
Environment Overrides Include File
The service installation script creates an environment specific include file that overrides defaults used by the
bin/solr script. The main advantage of using an include file is that it provides a single location where all of
your environment-specific overrides are defined. Take a moment to inspect the contents of the
/etc/default/solr.in.sh file, which is the default path set up by the installation script. If you used the -s
option on the install script to change the name of the service, then the first part of the filename will be
different. For a service named solr-demo, the file will be named /etc/default/solr-demo.in.sh. There are
many settings that you can override using this file. However, at a minimum, this script needs to define the
SOLR_PID_DIR and SOLR_HOME variables, such as:
SOLR_PID_DIR=/var/solr
SOLR_HOME=/var/solr/data
The SOLR_PID_DIR variable sets the directory where the control script will write out a file containing the Solr
server’s process ID.
Log Settings
Solr uses Apache Log4J for logging. The installation script copies /opt/solr/server/resources/log4j2.xml
to /var/solr/log4j2.xml. Take a moment to verify that the Solr include file is configured to send logs to the
correct location by checking the following settings in /etc/default/solr.in.sh:
LOG4J_PROPS=/var/solr/log4j2.xml
SOLR_LOGS_DIR=/var/solr/logs
For more information about Log4J configuration, please see: Configuring Logging
init.d Script
When running a service like Solr on Linux, it’s common to set up an init.d script so that system administrators
can control Solr using the service tool, such as: service solr start. The installation script creates a very
basic init.d script to help you get started. Take a moment to inspect the /etc/init.d/solr file, which is the
default script name set up by the installation script. If you used the -s option on the install script to change
the name of the service, then the filename will be different. Notice that the following variables are setup for
your environment based on the parameters passed to the installation script:
SOLR_INSTALL_DIR=/opt/solr
SOLR_ENV=/etc/default/solr.in.sh
RUNAS=solr
The SOLR_INSTALL_DIR and SOLR_ENV variables should be self-explanatory. The RUNAS variable sets the
owner of the Solr process, such as solr; if you don’t set this value, the script will run Solr as root, which is
not recommended for production. You can use the /etc/init.d/solr script to start Solr by doing the
following as root:
service solr start
The /etc/init.d/solr script also supports the stop, restart, and status commands. Please keep in mind
that the init script that ships with Solr is very basic and is intended to show you how to set up Solr as a
service. However, it’s also common to use more advanced tools like supervisord or upstart to control Solr
as a service on Linux. While showing how to integrate Solr with tools like supervisord is beyond the scope of
this guide, the init.d/solr script should provide enough guidance to help you get started. Also, the
installation script sets the Solr service to start automatically when the host machine initializes.
Progress Check
In the next section, we cover some additional environment settings to help you fine-tune your production
setup. However, before we move on, let’s review what we’ve achieved thus far. Specifically, you should be
able to control Solr using /etc/init.d/solr. Please verify the following commands work with your setup:
sudo service solr restart
sudo service solr status
The status command should give some basic information about the running Solr node that looks similar to:
Solr process PID running on port 8983
{
  "version":"5.0.0 - ubuntu - 2014-12-17 19:36:58",
  "startTime":"2014-12-19T19:25:46.853Z",
  "uptime":"0 days, 0 hours, 0 minutes, 8 seconds",
  "memory":"85.4 MB (%17.4) of 490.7 MB"}
If the status command is not successful, look for error messages in /var/solr/logs/solr.log.
Fine-Tune Your Production Setup
Dynamic Defaults for ConcurrentMergeScheduler
The Merge Scheduler is configured in solrconfig.xml and defaults to ConcurrentMergeScheduler. This
scheduler uses multiple threads to merge Lucene segments in the background.
By default, the ConcurrentMergeScheduler auto-detects whether the underlying disk drive is rotational or a
SSD and sets defaults for maxThreadCount and maxMergeCount accordingly. If the disk drive is determined to
be rotational then the maxThreadCount is set to 1 and maxMergeCount is set to 6. Otherwise, maxThreadCount
is set to 4 or half the number of processors available to the JVM, whichever is greater, and maxMergeCount is
set to maxThreadCount+5.
This auto-detection works only on Linux and even then it is not guaranteed to be correct. On all other
platforms, the disk is assumed to be rotational. Therefore, if the auto-detection fails or is incorrect then
indexing performance can suffer badly due to the wrong defaults.
The auto-detected value is exposed by the Metrics API with the key
solr.node:CONTAINER.fs.coreRoot.spins. A value of true denotes that the disk is detected to be a
rotational or spinning disk.
It is safer to explicitly set values for maxThreadCount and maxMergeCount in the IndexConfig section of
SolrConfig.xml so that values appropriate to your hardware are used.
Alternatively, the boolean system property lucene.cms.override_spins can be set in the SOLR_OPTS
variable in the include file to override the auto-detected value. Similarly, the system property
lucene.cms.override_core_count can be set to the number of CPU cores to override the auto-detected
processor count.
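For example, a sketch of overriding the auto-detection in the include file; the values are placeholders that should reflect your actual hardware (false here asserts the disks are SSDs, and 8 is an assumed core count):

SOLR_OPTS="$SOLR_OPTS -Dlucene.cms.override_spins=false -Dlucene.cms.override_core_count=8"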
Memory and GC Settings
By default, the bin/solr script sets the maximum Java heap size to 512M (-Xmx512m), which is fine for
getting started with Solr. For production, you’ll want to increase the maximum heap size based on the
memory requirements of your search application; values between 10 and 20 gigabytes are not uncommon
for production servers. When you need to change the memory settings for your Solr server, use the
SOLR_JAVA_MEM variable in the include file, such as:
SOLR_JAVA_MEM="-Xms10g -Xmx10g"
Also, the Solr Control Script comes with a set of pre-configured Java Garbage Collection settings that have
been shown to work well with Solr for a number of different workloads. However, these settings may not work
well for your specific use of Solr. Consequently, you may need to change the GC settings, which should also
be done with the GC_TUNE variable in the /etc/default/solr.in.sh include file. For more information about
tuning your memory and garbage collection settings, see: JVM Settings.
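For example, a hedged sketch of overriding the GC settings through GC_TUNE in the include file; the flags shown are illustrative JVM options only, not a recommendation for your workload:

GC_TUNE="-XX:+UseG1GC -XX:+ParallelRefProcEnabled -XX:MaxGCPauseMillis=250"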
Out-of-Memory Shutdown Hook
The bin/solr script registers the bin/oom_solr.sh script to be called by the JVM if an OutOfMemoryError
occurs. The oom_solr.sh script will issue a kill -9 to the Solr process that experiences the
OutOfMemoryError. This behavior is recommended when running in SolrCloud mode so that ZooKeeper is
immediately notified that a node has experienced a non-recoverable error. Take a moment to inspect the
contents of the /opt/solr/bin/oom_solr.sh script so that you are familiar with the actions the script will
perform if it is invoked by the JVM.
Going to Production with SolrCloud
To run Solr in SolrCloud mode, you need to set the ZK_HOST variable in the include file to point to your
ZooKeeper ensemble. Running the embedded ZooKeeper is not supported in production environments. For
instance, if you have a ZooKeeper ensemble hosted on the following three hosts on the default client port
2181 (zk1, zk2, and zk3), then you would set:
ZK_HOST=zk1,zk2,zk3
When the ZK_HOST variable is set, Solr will launch in "cloud" mode.
ZooKeeper chroot
If you’re using a ZooKeeper instance that is shared by other systems, it’s recommended to isolate the
SolrCloud znode tree using ZooKeeper’s chroot support. For instance, to ensure all znodes created by
SolrCloud are stored under /solr, you can put /solr on the end of your ZK_HOST connection string, such as:
ZK_HOST=zk1,zk2,zk3/solr
Before using a chroot for the first time, you need to create the root path (znode) in ZooKeeper by using the
Solr Control Script. We can use the mkroot command for that:
bin/solr zk mkroot /solr -z <host>:<port>
If you also want to bootstrap ZooKeeper with existing solr_home, you can instead use the
zkcli.sh / zkcli.bat bootstrap command, which will also create the chroot path if it does
not exist. See Command Line Utilities for more info.
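Putting the pieces together, a sketch of the one-time chroot setup followed by the include-file setting (the hostnames are placeholders for your ensemble):

# create the /solr chroot znode once
bin/solr zk mkroot /solr -z zk1:2181,zk2:2181,zk3:2181

# then point every Solr node at the chrooted ensemble in the include file
ZK_HOST=zk1:2181,zk2:2181,zk3:2181/solr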
Solr Hostname
Use the SOLR_HOST variable in the include file to set the hostname of the Solr server.
SOLR_HOST=solr1.example.com
Setting the hostname of the Solr server is recommended, especially when running in SolrCloud mode, as this
determines the address of the node when it registers with ZooKeeper.
Override Settings in solrconfig.xml
Solr allows configuration properties to be overridden using Java system properties passed at startup using
the -Dproperty=value syntax. For instance, in solrconfig.xml, the default auto soft commit settings are set
to:
<autoSoftCommit>
  <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
</autoSoftCommit>
In general, whenever you see a property in a Solr configuration file that uses the
${solr.PROPERTY:DEFAULT_VALUE} syntax, then you know it can be overridden using a Java system property.
For instance, to set the maxTime for soft-commits to be 10 seconds, then you can start Solr with
-Dsolr.autoSoftCommit.maxTime=10000, such as:
bin/solr start -Dsolr.autoSoftCommit.maxTime=10000
The bin/solr script simply passes options starting with -D on to the JVM during startup. For running in
production, we recommend setting these properties in the SOLR_OPTS variable defined in the include file.
Keeping with our soft-commit example, in /etc/default/solr.in.sh, you would do:
SOLR_OPTS="$SOLR_OPTS -Dsolr.autoSoftCommit.maxTime=10000"
File Handles and Processes (ulimit settings)
Two common settings that result in errors on *nix systems are file handles and user processes.
It is common for the default limits for the number of processes and file handles to be set to values that are too
low for a large Solr installation. The required number of each of these will increase based on a combination
of the number of replicas hosted per node and the number of segments in the index for each replica.
The usual recommendation is to make processes and file handles at least 65,000 each, unlimited if possible.
On most *nix systems, this command will show the currently-defined limits:
ulimit -a
It is strongly recommended that file handle and process limits be permanently raised as above. The exact
form of the command will vary per operating system, and some systems require editing configuration files
and restarting your server. Consult your system administrators for guidance in your particular environment.
If these limits are exceeded, the problems reported by Solr vary depending on the specific
operation responsible for exceeding the limit. Errors such as "too many open files",
"connection error", and "max processes exceeded" have been reported, as well as
SolrCloud recovery failures.
Since exceeding these limits can result in such varied symptoms it is strongly recommended
that these limits be permanently raised as recommended above.
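For example, a hedged sketch for a Linux distribution that reads /etc/security/limits.conf through pam_limits; whether this mechanism applies, and the exact file to edit, depends on your operating system:

# run as root; raise file handle and process limits for the solr user
cat >> /etc/security/limits.conf <<'EOF'
solr soft nofile 65000
solr hard nofile 65000
solr soft nproc  65000
solr hard nproc  65000
EOF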
Running Multiple Solr Nodes per Host
The bin/solr script is capable of running multiple instances on one machine, but for a typical installation,
this is not a recommended setup. Extra CPU and memory resources are required for each additional
instance. A single instance is easily capable of handling multiple indexes.
When to ignore the recommendation
For every recommendation, there are exceptions. For the recommendation above, that
exception is mostly applicable when discussing extreme scalability. The best reason for
running multiple Solr nodes on one host is decreasing the need for extremely large heaps.
When the Java heap gets very large, it can result in extremely long garbage collection
pauses, even with the GC tuning that the startup script provides by default. The exact point
at which the heap is considered "very large" will vary depending on how Solr is used. This
means that there is no hard number that can be given as a threshold, but if your heap is
reaching the neighborhood of 16 to 32 gigabytes, it might be time to consider splitting
nodes. Ideally this would mean more machines, but budget constraints might make that
impossible.
There is another issue once the heap reaches 32GB. Below 32GB, Java is able to use
compressed pointers, but above that point, larger pointers are required, which uses more
memory and slows down the JVM.
Because of the potential garbage collection issues and the particular issues that happen at
32GB, if a single instance would require a 64GB heap, performance is likely to improve
greatly if the machine is set up with two nodes that each have a 31GB heap.
If your use case requires multiple instances, at a minimum you will need unique Solr home directories for
each node you want to run; ideally, each home should be on a different physical disk so that multiple Solr
nodes don’t have to compete with each other when accessing files on disk. Having different Solr home
directories implies that you’ll need a different include file for each node. Moreover, if using the
/etc/init.d/solr script to control Solr as a service, then you’ll need a separate script for each node. The
easiest approach is to use the service installation script to add multiple services on the same host, such as:
sudo bash ./install_solr_service.sh solr-7.7.0.tgz -s solr2 -p 8984
The command shown above will add a service named solr2 running on port 8984 using /var/solr2 for
writable (aka "live") files; the second server will still be owned and run by the solr user and will use the Solr
distribution files in /opt. After installing the solr2 service, verify it works correctly by doing:
sudo service solr2 restart
sudo service solr2 status
Making and Restoring Backups
If you are worried about data loss, and of course you should be, you need a way to back up your Solr indexes
so that you can recover quickly in case of catastrophic failure.
Solr provides two approaches to backing up and restoring Solr cores or collections, depending on how you
are running Solr. If you run in SolrCloud mode, you will use the Collections API. If you run Solr in standalone
mode, you will use the replication handler.
SolrCloud Backups
Support for backups when running SolrCloud is provided with the Collections API. This allows the backups to
be generated across multiple shards, and restored to the same number of shards and replicas as the
original collection.
SolrCloud Backup/Restore requires a shared file system mounted at the same path on all
nodes, or HDFS.
Two commands are available:
• action=BACKUP: This command backs up Solr indexes and configurations. More information is available
in the section Backup Collection.
• action=RESTORE: This command restores Solr indexes and configurations. More information is available
in the section Restore Collection.
Standalone Mode Backups
Backups and restoration use Solr’s replication handler. Out of the box, Solr includes implicit support for
replication so this API can be used. Configuration of the replication handler can, however, be customized by
defining your own replication handler in solrconfig.xml. For details on configuring the replication handler,
see the section Configuring the ReplicationHandler.
Backup API
The backup API requires sending a command to the /replication handler to back up the system.
You can trigger a back-up with an HTTP command like this (replace "gettingstarted" with the name of the
core you are working with):
Backup API Example
http://localhost:8983/solr/gettingstarted/replication?command=backup
The backup command is an asynchronous call, and it will represent data from the latest index commit point.
All indexing and search operations will continue to be executed against the index as usual.
Only one backup call can be made against a core at any point in time. While a backup operation is in
progress, subsequent backup calls will throw an exception.
The backup request can also take the following additional parameters:
location
The path where the backup will be created. If the path is not absolute then the backup path will be
relative to Solr’s instance directory.
name
The snapshot will be created in a directory called snapshot.<name>. If a name is not specified then the
directory name would have the following format: snapshot.<timestamp>.
numberToKeep
The number of backups to keep. If maxNumberOfBackups has been specified on the replication handler in
solrconfig.xml, maxNumberOfBackups is always used and attempts to use numberToKeep will cause an
error. Also, this parameter is not taken into consideration if the backup name is specified. More
information about maxNumberOfBackups can be found in the section Configuring the ReplicationHandler.
repository
The name of the repository to be used for the backup. If no repository is specified then the local
filesystem repository will be used automatically.
commitName
The name of the commit which was used while taking a snapshot using the CREATESNAPSHOT command.
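For example, a sketch of a backup request that names the snapshot and writes it outside the instance directory (the location and name are placeholders):

curl "http://localhost:8983/solr/gettingstarted/replication?command=backup&location=/backups/solr&name=nightly"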
Backup Status
The backup operation can be monitored to see if it has completed by sending the details command to the
/replication handler, as in this example:
Status API Example
http://localhost:8983/solr/gettingstarted/replication?command=details&wt=xml
Output Snippet
<lst name="backup">
  <str name="startTime">Sun Apr 12 16:22:50 DAVT 2015</str>
  <int name="fileCount">10</int>
  <str name="status">success</str>
  <str name="snapshotCompletedAt">Sun Apr 12 16:22:50 DAVT 2015</str>
  <str name="snapshotName">my_backup</str>
</lst>
If it failed then a snapShootException will be sent in the response.
Restore API
Restoring the backup requires sending the restore command to the /replication handler, followed by the
name of the backup to restore.
You can restore from a backup with a command like this:
Example Usage
http://localhost:8983/solr/gettingstarted/replication?command=restore&name=backup_name
This will restore the named index snapshot into the current core. Searches will start reflecting the snapshot
data once the restore is complete.
The restore request can take these additional parameters:
location
The location of the backup snapshot file. If not specified, it looks for backups in Solr’s data directory.
name
The name of the backed up index snapshot to be restored. If the name is not provided, it looks for
backups with the snapshot.<timestamp> format in the location directory and picks the backup with the
latest timestamp in that case.
repository
The name of the repository to be used for the backup. If no repository is specified then the local
filesystem repository will be used automatically.
The restore command is an asynchronous call. Once the restore is complete the data reflected will be of the
backed up index which was restored.
Only one restore call can be made against a core at one point in time. While an ongoing restore
operation is happening, subsequent calls for restoring will throw an exception.
Restore Status API
You can also check the status of a restore operation by sending the restorestatus command to the
/replication handler, as in this example:
Status API Example
http://localhost:8983/solr/gettingstarted/replication?command=restorestatus&wt=xml
Status API Output
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
  </lst>
  <lst name="restorestatus">
    <str name="snapshotName">snapshot.<name></str>
    <str name="status">success</str>
  </lst>
</response>
The status value can be "In Progress", "success" or "failed". If it failed then an "exception" will also be sent
in the response.
Create Snapshot API
The snapshot functionality is different from the backup functionality as the index files aren’t copied
anywhere. The index files are snapshotted in the same index directory and can be referenced while taking
backups.
You can trigger a snapshot command with an HTTP command like this (replace "techproducts" with the
name of the core you are working with):
Create Snapshot API Example
http://localhost:8983/solr/admin/cores?action=CREATESNAPSHOT&core=techproducts&commitName=commit1
The CREATESNAPSHOT request parameters are:
commitName
The name to store the snapshot as.
core
The name of the core to perform the snapshot on.
async
Request ID to track this action which will be processed asynchronously.
List Snapshot API
The LISTSNAPSHOTS command lists all the taken snapshots for a particular core.
You can trigger a list snapshot command with an HTTP command like this (replace "techproducts" with the
name of the core you are working with):
List Snapshot API
http://localhost:8983/solr/admin/cores?action=LISTSNAPSHOTS&core=techproducts&commitName=commit1
The list snapshot request parameters are:
core
The name of the core whose snapshots we want to list.
async
Request ID to track this action which will be processed asynchronously.
Delete Snapshot API
The DELETESNAPSHOT command deletes a snapshot for a particular core.
You can trigger a delete snapshot with an HTTP command like this (replace "techproducts" with the name of
the core you are working with):
Delete Snapshot API Example
http://localhost:8983/solr/admin/cores?action=DELETESNAPSHOT&core=techproducts&commitName=commit1
The delete snapshot request parameters are:
commitName
Specify the commit name to be deleted.
core
The name of the core whose snapshot we want to delete.
async
Request ID to track this action which will be processed asynchronously.
Backup/Restore Storage Repositories
Solr provides interfaces to plug different storage systems for backing up and restoring. For example, you
can have a Solr cluster running on a local filesystem like EXT3 but back up the indexes to an HDFS
filesystem, or vice versa.
The repository interface needs to be configured in the solr.xml file. When running backup/restore
commands, we can specify the repository to be used.
If no repository is configured then the local filesystem repository will be used automatically.
Example solr.xml section to configure a repository like HDFS:
<backup>
  <repository name="hdfs" class="org.apache.solr.core.backup.repository.HdfsBackupRepository" default="false">
    <str name="location">${solr.hdfs.default.backup.path}</str>
    <str name="solr.hdfs.home">${solr.hdfs.home:}</str>
    <str name="solr.hdfs.confdir">${solr.hdfs.confdir:}</str>
  </repository>
</backup>
Better throughput might be achieved by increasing the buffer size, for example by setting
solr.hdfs.buffer.size to 262144. Buffer size is specified in bytes; by default it is 4096 bytes (4KB).
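For example, a sketch of a SolrCloud backup request that targets the repository defined above through the Collections API (the backup name and location are placeholders):

curl "http://localhost:8983/solr/admin/collections?action=BACKUP&name=weekly&collection=gettingstarted&repository=hdfs&location=/backups"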
Running Solr on HDFS
Solr has support for writing and reading its index and transaction log files to the HDFS distributed
filesystem.
This does not use Hadoop MapReduce to process Solr data; rather, it only uses the HDFS filesystem for index
and transaction log file storage.
To use HDFS rather than a local filesystem, you must be using Hadoop 2.x and you will need to instruct Solr
to use the HdfsDirectoryFactory. There are also several additional parameters to define. These can be set
in one of three ways:
• Pass JVM arguments to the bin/solr script. These would need to be passed every time you start Solr
with bin/solr.
• Modify solr.in.sh (or solr.in.cmd on Windows) to pass the JVM arguments automatically when using
bin/solr without having to set them manually.
• Define the properties in solrconfig.xml. These configuration changes would need to be repeated for
every collection, so this is a good option if you only want some of your collections stored in HDFS.
Starting Solr on HDFS
Standalone Solr Instances
For standalone Solr instances, there are a few parameters you should modify before starting Solr. These can
be set in solrconfig.xml (more on that below), or passed to the bin/solr script at startup.
• You need to use an HdfsDirectoryFactory and a data directory in the form hdfs://host:port/path
• You need to specify an updateLog location in the form hdfs://host:port/path
• You should specify a lock factory type of 'hdfs' or none.
If you do not modify solrconfig.xml, you can instead start Solr on HDFS with the following command:
bin/solr start -Dsolr.directoryFactory=HdfsDirectoryFactory
-Dsolr.lock.type=hdfs
-Dsolr.data.dir=hdfs://host:port/path
-Dsolr.updatelog=hdfs://host:port/path
This example will start Solr in standalone mode, using the defined JVM properties (explained in more detail
below).
SolrCloud Instances
In SolrCloud mode, it’s best to leave the data and update log directories as the defaults Solr comes with and
simply specify the solr.hdfs.home. All dynamically created collections will create the appropriate directories
automatically under the solr.hdfs.home root directory.
• Set solr.hdfs.home in the form hdfs://host:port/path
• You should specify a lock factory type of 'hdfs' or none.
bin/solr start -c -Dsolr.directoryFactory=HdfsDirectoryFactory
-Dsolr.lock.type=hdfs
-Dsolr.hdfs.home=hdfs://host:port/path
This command starts Solr in SolrCloud mode, using the defined JVM properties.
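For example (the collection name and counts below are placeholders), once a node has been started this way, collections can be created as usual and their index and transaction log directories will be created automatically under the solr.hdfs.home root:

bin/solr create -c mycollection -shards 2 -replicationFactor 2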
Modifying solr.in.sh (*nix) or solr.in.cmd (Windows)
The examples above assume you will pass JVM arguments as part of the start command every time you use
bin/solr to start Solr. However, bin/solr looks for an include file named solr.in.sh (solr.in.cmd on
Windows) to set environment variables. By default, this file is found in the bin directory, and you can modify
it to permanently add the HdfsDirectoryFactory settings and ensure they are used every time Solr is
started.
For example, to set JVM arguments to always use HDFS when running in SolrCloud mode (as shown above),
you would add a section such as this:
# Set HDFS DirectoryFactory & Settings
# (appending to SOLR_OPTS is one way to pass these; adjust to your include file's conventions)
SOLR_OPTS="$SOLR_OPTS \
-Dsolr.directoryFactory=HdfsDirectoryFactory \
-Dsolr.lock.type=hdfs \
-Dsolr.hdfs.home=hdfs://host:port/path"
The Block Cache
For performance, the HdfsDirectoryFactory uses a Directory that will cache HDFS blocks. This caching
mechanism replaces the standard file system cache that Solr utilizes. By default, this cache is allocated
off-heap. This cache will often need to be quite large and you may need to raise the off-heap memory limit
for the specific JVM you are running Solr in (a rough sizing sketch is shown after the Block Cache Settings
below). For the Oracle/OpenJDK JVMs, the following is an example command-line parameter that you can
use to raise the limit when starting Solr:
-XX:MaxDirectMemorySize=20g
HdfsDirectoryFactory Parameters
The HdfsDirectoryFactory has a number of settings defined as part of the directoryFactory
configuration.
Solr HDFS Settings
solr.hdfs.home
A root location in HDFS for Solr to write collection data to. Rather than specifying an HDFS location for the
data directory or update log directory, use this to specify one root location and have everything
automatically created within this HDFS location. The structure of this parameter is
hdfs://host:port/path/solr.
Block Cache Settings
solr.hdfs.blockcache.enabled
Enable the blockcache. The default is true.
solr.hdfs.blockcache.read.enabled
Enable the read cache. The default is true.
solr.hdfs.blockcache.direct.memory.allocation
Enable direct memory allocation. If this is false, heap is used. The default is true.
solr.hdfs.blockcache.slab.count
Number of memory slabs to allocate. Each slab is 128 MB in size. The default is 1.
solr.hdfs.blockcache.global
Enable/Disable using one global cache for all SolrCores. The settings used will be from the first
HdfsDirectoryFactory created. The default is true.
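As a rough sizing sketch (the numbers are illustrative, not recommendations): since each slab is 128 MB, the direct memory used by the block cache is approximately the slab count multiplied by 128 MB, and -XX:MaxDirectMemorySize should leave comfortable headroom above that figure.

# illustrative: 16 slabs x 128 MB = 2048 MB (~2 GB) of direct memory for the block cache
bin/solr start -c -Dsolr.directoryFactory=HdfsDirectoryFactory \
  -Dsolr.lock.type=hdfs \
  -Dsolr.hdfs.home=hdfs://host:port/path \
  -Dsolr.hdfs.blockcache.slab.count=16
# raise the off-heap limit well above that, e.g., via SOLR_OPTS in solr.in.sh:
#   -XX:MaxDirectMemorySize=4g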
NRTCachingDirectory Settings
solr.hdfs.nrtcachingdirectory.enable
Enable the use of NRTCachingDirectory. The default is true.
solr.hdfs.nrtcachingdirectory.maxmergesizemb
NRTCachingDirectory max segment size for merges. The default is 16.
solr.hdfs.nrtcachingdirectory.maxcachedmb
NRTCachingDirectory max cache size. The default is 192.
HDFS Client Configuration Settings
solr.hdfs.confdir
Pass the location of HDFS client configuration files - needed for HDFS HA for example.
Kerberos Authentication Settings
Hadoop can be configured to use the Kerberos protocol to verify user identity when trying to access core
services like HDFS. If your HDFS directories are protected using Kerberos, then you need to configure Solr’s
HdfsDirectoryFactory to authenticate using Kerberos in order to read and write to HDFS. To enable Kerberos
authentication from Solr, you need to set the following parameters:
solr.hdfs.security.kerberos.enabled
Set to true to enable Kerberos authentication. The default is false.
solr.hdfs.security.kerberos.keytabfile
A keytab file contains pairs of Kerberos principals and encrypted keys which allows for password-less
authentication when Solr attempts to authenticate with secure Hadoop.
This file will need to be present on all Solr servers at the same path provided in this parameter.
solr.hdfs.security.kerberos.principal
The Kerberos principal that Solr should use to authenticate to secure Hadoop; the format of a typical
Kerberos V5 principal is: primary/instance@realm.
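These Kerberos parameters can also be passed as JVM arguments when starting Solr, in the same way as the other HDFS settings shown earlier. A sketch (the keytab path and principal are placeholders matching the example below):

bin/solr start -c -Dsolr.directoryFactory=HdfsDirectoryFactory \
  -Dsolr.lock.type=hdfs \
  -Dsolr.hdfs.home=hdfs://host:port/path \
  -Dsolr.hdfs.security.kerberos.enabled=true \
  -Dsolr.hdfs.security.kerberos.keytabfile=/etc/krb5.keytab \
  -Dsolr.hdfs.security.kerberos.principal=solr/admin@KERBEROS.COM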
Example solrconfig.xml for HDFS
Here is a sample solrconfig.xml configuration for storing Solr indexes on HDFS:
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <str name="solr.hdfs.home">hdfs://host:port/solr</str>
  <bool name="solr.hdfs.blockcache.enabled">true</bool>
  <int name="solr.hdfs.blockcache.slab.count">1</int>
  <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
  <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
  <bool name="solr.hdfs.blockcache.read.enabled">true</bool>
  <bool name="solr.hdfs.nrtcachingdirectory.enable">true</bool>
  <int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">16</int>
  <int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">192</int>
</directoryFactory>
If using Kerberos, you will need to add the three Kerberos-related properties to the <directoryFactory>
element in solrconfig.xml, such as:
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  ...
  <bool name="solr.hdfs.security.kerberos.enabled">true</bool>
  <str name="solr.hdfs.security.kerberos.keytabfile">/etc/krb5.keytab</str>
  <str name="solr.hdfs.security.kerberos.principal">solr/admin@KERBEROS.COM</str>
</directoryFactory>
Automatically Add Replicas in SolrCloud
The ability to automatically add new replicas when the Overseer notices that a shard has gone down was
previously only available to users running Solr in HDFS, but it is now available to all users via Solr’s
autoscaling framework. See the section SolrCloud Autoscaling Automatically Adding Replicas for details on
how to enable and disable this feature.
The ability to enable or disable the autoAddReplicas feature with cluster properties has
been deprecated and will be removed in a future version. All users of this feature who have
previously used that approach are encouraged to change their configurations to use the
autoscaling framework to ensure continued operation of this feature in their Solr
installations.
For users using this feature with the deprecated configuration, you can temporarily disable
it cluster-wide by setting the cluster property autoAddReplicas to false, as in these
examples:
V1 API
http://localhost:8983/solr/admin/collections?action=CLUSTERPROP&name=autoAddReplicas&val=false
V2 API
curl -X POST -H 'Content-type: application/json' -d '{"set-property":
{"name":"autoAddReplicas", "val":false}}' http://localhost:8983/api/cluster
Re-enable the feature by unsetting the autoAddReplicas cluster property. When no val
parameter is provided, the cluster property is unset:
V1 API
http://localhost:8983/solr/admin/collections?action=CLUSTERPROP&name=autoAddReplicas
V2 API
curl -X POST -H 'Content-type: application/json' -d '{"set-property":
{"name":"autoAddReplicas"}}' http://localhost:8983/api/cluster
SolrCloud on AWS EC2
This guide is a tutorial on how to set up a multi-node SolrCloud cluster on Amazon Web Services (AWS) EC2
instances for early development and design.
This tutorial is not meant for production systems. For one, it uses Solr’s embedded ZooKeeper instance, and
for production you should have at least 3 ZooKeeper nodes in an ensemble. There are additional steps you
should take for a production installation; refer to Taking Solr to Production for how to deploy Solr in
production.
In this guide we are going to:
1. Launch multiple AWS EC2 instances
◦ Create new Security Group
◦ Configure instances and launch
2. Install, configure and start Solr on newly launched EC2 instances
◦ Install system prerequisites: Java 1.8 or later
◦ Download latest version of Solr
◦ Start the Solr nodes in SolrCloud mode
3. Create a collection, index documents and query the system
◦ Create collection with multiple shards and replicas
◦ Index documents to the newly created collection
◦ Verify documents presence by querying the collection
Before You Start
To use this guide, you must have the following:
• An AWS account.
• Familiarity with setting up a single-node SolrCloud on local machine. Refer to the Solr Tutorial if you have
never used Solr before.
Launch EC2 instances
Create new Security Group
1. Navigate to the AWS EC2 console and to the region of your choice.
2. Configure an AWS security group which will limit access to the installation and allow our launched EC2
instances to talk to each other without restrictions.
a. From the EC2 Dashboard, click [ Security Groups ] from the left-hand menu, under "Network &
Security".
b. Click [ Create Security Group ] under the Security Groups section. Give your security group a
descriptive name.
c. You can select one of the existing VPCs or create a new one.
d. We need two ports open for our cloud here:
i. Solr port. In this example we will use Solr’s default port 8983.
ii. ZooKeeper Port: We’ll use Solr’s embedded ZooKeeper, so we’ll use the default port 9983 (see
Deploying with External ZooKeeper below for configuring an external ZooKeeper).
e. Click [ Inbound ] to set inbound network rules, then select [ Add Rule ]. Select "Custom TCP" as the
type. Enter 8983 for the "Port Range" and choose "My IP" for the Source, then enter your public IP.
Create a second rule with the same type and source, but enter 9983 for the port.
This will limit access to your current machine. If you want wider access to the instance in order to
collaborate with others, you can specify that, but make sure you only allow as much access as
needed. A Solr instance should not be exposed to general Internet traffic.
f. Add another rule for SSH access. Choose "SSH" as the type, and again "My IP" for the source and
again enter your public IP. You need SSH access on all instances to install and configure Solr.
g. Review the details, your group configuration should look like this:
h. Click [ Create ] when finished.
i. We need to modify the rules so that instances that are part of the group can talk to all other instances
that are part of the same group. We could not do this while creating the group, so we need to edit the
group after creating it to add this.
i. Select the newly created group in the Security Group overview table. Under the "Inbound" tab,
click [ Edit ].
ii. Click [ Add rule ]. Choose All TCP from the pulldown list for the type, and enter 0-65535 for the
port range. For the Source, specify the name of the current Security Group (solr-sample).
j. Review the details, your group configuration should now look like this:
k. Click [ Save ] when finished.
Configure Instances and Launch
Once the security group is in place, you can choose [ Instances ] from the left-hand navigation menu.
Under Instances, click [ Launch Instance ] button and follow the wizard steps:
1. Choose your Amazon Machine Image (AMI): Choose Amazon Linux AMI, SSD Volume Type as the AMI.
There are both commercial AMIs and Community based AMIs available, e.g., Amazon Linux AMI (HVM),
SSD Volume Type, but this is a nice AMI to use for our purposes. Click [ Select ] next to the image you
choose.
2. The next screen asks you to choose the instance type, t2.medium is sufficient. Choose it from the list,
then click [ Configure Instance Details ].
3. Configure the instance. Enter 2 in the "Number of instances" field. Make sure the setting for "Auto-assign Public IP" is "Enabled".
4. When finished, click [ Add Storage ]. The default of 8 GB for size and General Purpose SSD for the
volume type is sufficient for running this quick start. Optionally select "Delete on termination" if you
know you won’t need the data stored in Solr indexes after you terminate the instances.
5. When finished, click [ Add Tags ]. You do not have to add any tags for this quick start, but you can add
them if you want.
6. Click [ Configure Security Group ]. Choose Select an existing security group and select the security
group you created earlier: solr-sample. You should see the expected inbound rules at the bottom of the
page.
7. Click [ Review ].
8. If everything looks correct, click [ Launch ].
9. Select an existing “private key file” or create a new one and download it to your local machine so you will
be able to log in to the instances via SSH.
10. On the instances list, you can watch the states change. You cannot use the instances until they become
“running”.
Install, Configure and Start
1. Locate the Public DNS record for the instance by selecting the instance from the list of instances, and log
on to each machine one by one.
Using SSH, if your AWS identity key file is aws-key.pem and the AMI uses ec2-user as login user, on each
AWS instance, do the following:
$ ssh-add aws-key.pem
$ ssh -A ec2-user@<instance-public-dns>
2. While logged in to each of the AWS EC2 instances, configure Java 1.8 and download Solr:
# verify default java version packaged with AWS instances is 1.7
$ java -version
$ sudo yum install java-1.8.0
$ sudo /usr/sbin/alternatives --config java
# select jdk-1.8
# verify default java version is now java-1.8
$ java -version
# download desired version of Solr
$ wget http://archive.apache.org/dist/lucene/solr/7.7.0/solr-7.7.0.tgz
# untar
$ tar -zxvf solr-7.7.0.tgz
# set SOLR_HOME
$ export SOLR_HOME=$PWD/solr-7.7.0
# put the env variable in .bashrc
# vim ~/.bashrc
export SOLR_HOME=/home/ec2-user/solr-7.7.0
3. Resolve the Public DNS to simpler hostnames.
Let’s assume AWS instances public DNS with IPv4 Public IP are as follows:
◦ ec2-54-1-2-3.us-east-2.compute.amazonaws.com: 54.1.2.3
◦ ec2-54-4-5-6.us-east-2.compute.amazonaws.com: 54.4.5.6
Edit /etc/hosts, and add entries for the above machines:
$ sudo vim /etc/hosts
54.1.2.3 solr-node-1
54.4.5.6 solr-node-2
4. Configure Solr in running EC2 instances.
In this case, one of the machines will host ZooKeeper embedded along with a Solr node, say, ec2-101-1-2-3.us-east-2.compute.amazonaws.com (aka solr-node-1).
See Deploying with External ZooKeeper for how to configure an external ZooKeeper.
Inside ec2-101-1-2-3.us-east-2.compute.amazonaws.com (solr-node-1):
$ cd $SOLR_HOME
# start Solr node on 8983; embedded ZooKeeper will start on 8983+1000 = 9983
$ bin/solr start -c -p 8983 -h solr-node-1
On the other node, ec2-101-4-5-6.us-east-2.compute.amazonaws.com (solr-node-2)
$ cd $SOLR_HOME
# start Solr node on 8983 and connect to ZooKeeper running on first node
$ bin/solr start -c -p 8983 -h solr-node-2 -z solr-node-1:9983
5. Inspect and Verify. Inspect the Solr nodes state from browser on local machine:
Go to:
http://ec2-101-1-2-3.us-east-2.compute.amazonaws.com:8983/solr (solr-node-1:8983/solr)
http://ec2-101-4-5-6.us-east-2.compute.amazonaws.com:8983/solr (solr-node-2:8983/solr)
You should be able to see the Solr Admin UI dashboard for both nodes.
Create Collection, Index and Query
You can refer to the Solr Tutorial for an extensive walkthrough on creating collections with multiple shards and
replicas, indexing data via different methods, and querying documents accordingly.
Deploying with External ZooKeeper
If you want to configure an external ZooKeeper ensemble to avoid using the embedded single-instance
ZooKeeper that runs in the same JVM as the Solr node, you need to make a few tweaks to the steps listed
above, as follows.
• When creating the security group, instead of opening port 9983 for ZooKeeper, you’ll open 2181 (or
whatever port you are using for ZooKeeper; its default is 2181).
• When configuring the number of instances to launch, choose 3 instances instead of 2.
• When modifying the /etc/hosts on each machine, add a third line for the 3rd instance and give it a
recognizable name:
$ sudo vim /etc/hosts
54.1.2.3 solr-node-1
54.4.5.6 solr-node-2
54.7.8.9 zookeeper-node
• You’ll need to install ZooKeeper manually, described in the next section.
Install ZooKeeper
These steps will help you install and configure a single instance of ZooKeeper on AWS. This is not sufficient
for production use, however, where a ZooKeeper ensemble of at least three nodes is recommended. See
the section Setting Up an External ZooKeeper Ensemble for information about how to change this single
instance into an ensemble.
1. Download a stable version of ZooKeeper. In this example we’re using ZooKeeper v3.4.13. On the node
you’re using to host ZooKeeper (zookeeper-node), download the package and untar it:
# download stable version of ZooKeeper, here 3.4.13
$ wget https://archive.apache.org/dist/zookeeper/zookeeper-3.4.13/zookeeper-3.4.13.tar.gz
# untar
$ tar -zxvf zookeeper-3.4.13.tar.gz
Add an environment variable for ZooKeeper’s home directory (ZOO_HOME) to the .bashrc for the user
who will be running the process. The rest of the instructions assume you have set this variable. Correct
the path to the ZooKeeper installation as appropriate if yours does not match the example below.
$ export ZOO_HOME=$PWD/zookeeper-3.4.13
# put the env variable in .bashrc
# vim ~/.bashrc
export ZOO_HOME=/home/ec2-user/zookeeper-3.4.13
2. Change directories to ZOO_HOME, and create the ZooKeeper configuration by using the template provided
by ZooKeeper.
$ cd $ZOO_HOME
# create ZooKeeper config by using zoo_sample.cfg
$ cp conf/zoo_sample.cfg conf/zoo.cfg
3. Create the ZooKeeper data directory in the filesystem, and edit the zoo.cfg file to uncomment the
autopurge parameters and define the location of the data directory.
# create data dir for ZooKeeper, edit zoo.cfg, uncomment autopurge parameters
$ mkdir data
$ vim conf/zoo.cfg
# -- uncomment --
autopurge.snapRetainCount=3
autopurge.purgeInterval=1
# -- edit --
dataDir=data
4. Start ZooKeeper.
$ cd $ZOO_HOME
# start ZooKeeper, default port: 2181
$ bin/zkServer.sh start
5. On the first node being used for Solr (solr-node-1), start Solr and tell it where to find ZooKeeper.
$ cd $SOLR_HOME
# start Solr node on 8983 and connect to ZooKeeper running on ZooKeeper node
$ bin/solr start -c -p 8983 -h solr-node-1 -z zookeeper-node:2181
6. On the second Solr node (solr-node-2), again start Solr and tell it where to find ZooKeeper.
$ cd $SOLR_HOME
# start Solr node on 8983 and connect to ZooKeeper running on ZooKeeper node
$ bin/solr start -c -p 8983 -h solr-node-2 -z zookeeper-node:2181
As noted earlier, a single ZooKeeper node is not sufficient for a production installation. See
these additional resources for more information about deploying Solr in production, which
can be used once you have the EC2 instances up and running:
• Taking Solr to Production
• Setting Up an External ZooKeeper Ensemble
Upgrading a Solr Cluster
This page covers how to upgrade an existing Solr cluster that was installed using the service installation
scripts.
The steps outlined on this page assume you use the default service name of solr. If you
use an alternate service name or Solr installation directory, some of the paths and
commands mentioned below will have to be modified accordingly.
Planning Your Upgrade
Here is a checklist of things you need to prepare before starting the upgrade process:
1. Examine the Solr Upgrade Notes to determine if any behavior changes in the new version of Solr will
affect your installation.
2. If not using replication (i.e., collections with replicationFactor less than 2), then you should make a
backup of each collection. If all of your collections use replication, then you don’t technically need to
make a backup since you will be upgrading and verifying each node individually.
3. Determine which Solr node is currently hosting the Overseer leader process in SolrCloud, as you should
upgrade this node last. To determine the Overseer, use the Overseer Status API, see: Collections API.
4. Plan to perform your upgrade during a system maintenance window if possible. You’ll be doing a rolling
restart of your cluster (each node, one-by-one), but we still recommend doing the upgrade when system
usage is minimal.
5. Verify the cluster is currently healthy and all replicas are active, as you should not perform an upgrade
on a degraded cluster.
6. Re-build and test all custom server-side components against the new Solr JAR files.
7. Determine the values of the following variables that are used by the Solr Control Scripts:
◦ ZK_HOST: The ZooKeeper connection string your current SolrCloud nodes use to connect to
ZooKeeper; this value will be the same for all nodes in the cluster.
◦ SOLR_HOST: The hostname each Solr node used to register with ZooKeeper when joining the
SolrCloud cluster; this value will be used to set the host Java system property when starting the new
Solr process.
◦ SOLR_PORT: The port each Solr node is listening on, such as 8983.
◦ SOLR_HOME: The absolute path to the Solr home directory for each Solr node; this directory must
contain a solr.xml file. This value will be passed to the new Solr process using the solr.solr.home
system property, see: Solr Cores and solr.xml.
If you are upgrading from an installation of Solr 5.x or later, these values can typically be found in
either /var/solr/solr.in.sh or /etc/default/solr.in.sh.
You should now be ready to upgrade your cluster. Please verify this process in a test or staging cluster
before doing it in production.
Upgrade Process
The approach we recommend is to perform the upgrade of each Solr node, one-by-one. In other words, you
will need to stop a node, upgrade it to the new version of Solr, and restart it before moving on to the next
node. This means that for a short period of time, there will be a mix of "Old Solr" and "New Solr" nodes
running in your cluster. We also assume that you will point the new Solr node to your existing Solr home
directory where the Lucene index files are managed for each collection on the node. This means that you
won’t need to move any index files around to perform the upgrade.
Step 1: Stop Solr
Begin by stopping the Solr node you want to upgrade. After stopping the node, if using replication (i.e.,
collections with replicationFactor greater than 1), verify that all leaders hosted on the downed node have
successfully migrated to other replicas; you can do this by visiting the Cloud panel in the Solr Admin UI. If
not using replication, then any collections with shards hosted on the downed node will be temporarily offline.
Step 2: Install Solr as a Service
Please follow the instructions to install Solr as a Service on Linux documented at Taking Solr to Production.
Use the -n parameter to avoid automatic start of Solr by the installer script. You need to update the
/etc/default/solr.in.sh include file in the next step to complete the upgrade process.
If you have a /var/solr/solr.in.sh file for your existing Solr install, running the
install_solr_service.sh script will move this file to its new location:
/etc/default/solr.in.sh (see SOLR-8101 for more details)
Step 3: Set Environment Variable Overrides
Open /etc/default/solr.in.sh with a text editor and verify that the following variables are set correctly,
or add them to the bottom of the include file as needed:
ZK_HOST=
SOLR_HOST=
SOLR_PORT=
SOLR_HOME=
Make sure the user you plan to own the Solr process is the owner of the SOLR_HOME directory. For instance, if
you plan to run Solr as the "solr" user and SOLR_HOME is /var/solr/data, then you would do: sudo chown -R
solr: /var/solr/data
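For example, a filled-in include file might look like this (all values below are placeholders; use the values you collected while planning your upgrade):

ZK_HOST=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181
SOLR_HOST=solr1.example.com
SOLR_PORT=8983
SOLR_HOME=/var/solr/data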
Step 4: Start Solr
You are now ready to start the upgraded Solr node by doing: sudo service solr start. The upgraded
instance will join the existing cluster because you’re using the same SOLR_HOME, SOLR_PORT, and SOLR_HOST
settings used by the old Solr node; thus, the new server will look like the old node to the running cluster. Be
sure to look in /var/solr/logs/solr.log for errors during startup.
Step 5: Run Healthcheck
You should run the Solr healthcheck command for all collections that are hosted on the upgraded node
before proceeding to upgrade the next node in your cluster. For instance, if the newly upgraded node hosts
a replica for the MyDocuments collection, then you can run the following command (replace ZK_HOST with
the ZooKeeper connection string):
/opt/solr/bin/solr healthcheck -c MyDocuments -z ZK_HOST
Look for any problems reported about any of the replicas for the collection.
Lastly, repeat Steps 1-5 for all nodes in your cluster.
IndexUpgrader Tool
The Lucene distribution includes a tool that upgrades an index from previous Lucene versions to the current
file format.
The tool can be used from command line, or it can be instantiated and executed in Java.
Indexes can only be upgraded from the previous major release version to the current
major release version.
This means that the IndexUpgrader Tool in any Solr 7.x release, for example, can only work
with indexes from 6.x releases, but cannot work with indexes from Solr 5.x or earlier.
If you are currently using an earlier release such as 5.x and want to move more than one
major version ahead, you need to first upgrade your indexes to the next major version
(6.x), then again to the major version after that (7.x), etc.
In a Solr distribution, the Lucene files are located in ./server/solr-webapp/webapp/WEB-INF/lib. You will
need to include the lucene-core-<version>.jar and lucene-backward-codecs-<version>.jar on the
classpath when running the tool.
java -cp lucene-core-7.7.0.jar:lucene-backward-codecs-7.7.0.jar
org.apache.lucene.index.IndexUpgrader [-delete-prior-commits] [-verbose] /path/to/index
This tool keeps only the last commit in an index. For this reason, if the incoming index has more than one
commit, the tool refuses to run by default. Specify -delete-prior-commits to override this, allowing the tool
to delete all but the last commit.
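For example, a concrete invocation might look like this (the installation path and index directory below are hypothetical; adjust them to your environment):

cd /opt/solr/server/solr-webapp/webapp/WEB-INF/lib
java -cp lucene-core-7.7.0.jar:lucene-backward-codecs-7.7.0.jar \
  org.apache.lucene.index.IndexUpgrader -verbose \
  /var/solr/data/mycollection_shard1_replica_n1/data/index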
Upgrading large indexes may take a long time. As a rule of thumb, the upgrade processes about 1 GB per
minute.
This tool may reorder documents if the index was partially upgraded before execution (e.g.,
documents were added). If your application relies on monotonicity of document IDs (i.e.,
the order in which the documents were added to the index is preserved), do a full optimize
instead.
Solr Upgrade Notes
The following notes describe changes to Solr in recent releases that you should be aware of before
upgrading.
These notes highlight the biggest changes that may impact the largest number of implementations. It is not
a comprehensive list of all changes to Solr in any release.
When planning your Solr upgrade, consider the customizations you have made to your system and review
the CHANGES.txt file found in your Solr package. That file includes all the changes and updates that may
affect your existing implementation.
Detailed steps for upgrading a Solr cluster are in the section Upgrading a Solr Cluster.
Upgrading to 7.x Releases
Solr 7.7
See the 7.7 Release Notes for an overview of the main new features in Solr 7.7.
When upgrading to Solr 7.7.x, users should be aware of the following major changes from v7.6:
Admin UI
• The Admin UI now presents a login screen for any users with authentication enabled on their cluster.
Clusters with Basic Authentication will prompt users to enter a username and password. On clusters
configured to use Kerberos Authentication, users will be directed to configure their browser to provide
an appropriate Kerberos ticket.
The login screen’s purpose is cosmetic only - Admin UI-triggered Solr requests were subject to
authentication prior to 7.7 and still are today. The login screen changes only the user experience of
providing this authentication.
Distributed Requests
• The shards parameter, used to manually select the shards and replicas that receive distributed requests,
now checks nodes against a whitelist of acceptable values for security reasons.
In SolrCloud mode this whitelist is automatically configured to contain all live nodes. In standalone mode
the whitelist is empty by default. Upgrading users who use the shards parameter in standalone mode
can correct this value by setting the shardsWhitelist property in any shardHandler configurations in
their solrconfig.xml file.
For more information, see the Distributed Request documentation.
Solr 7.6
See the 7.6 Release Notes for an overview of the main new features in Solr 7.6.
When upgrading to Solr 7.6, users should be aware of the following major changes from v7.5:
Collections
• The JSON parameter to set cluster-wide default cluster properties with the CLUSTERPROP command has
changed.
The old syntax nested the defaults into a property named clusterDefaults. The new syntax uses only
defaults. The command to use is still set-obj-property.
An example of the new syntax is:
{
  "set-obj-property": {
    "defaults": {
      "collection": {
        "numShards": 2,
        "nrtReplicas": 1,
        "tlogReplicas": 1,
        "pullReplicas": 1
      }
    }
  }
}
The old syntax will be supported until at least Solr 9, but users are advised to begin using the new syntax
as soon as possible.
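For example (a sketch assuming the V2 cluster endpoint shown elsewhere in this guide), the new syntax can be submitted like this:

curl -X POST -H 'Content-type: application/json' http://localhost:8983/api/cluster -d '{
  "set-obj-property": {
    "defaults": {
      "collection": {"numShards": 2, "nrtReplicas": 1, "tlogReplicas": 1, "pullReplicas": 1}
    }
  }
}'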
• The parameter min_rf has been deprecated and no longer needs to be provided in order to see the
achieved replication factor. This information will now always be returned to the client with the response.
Autoscaling
• An autoscaling policy is now used as the default strategy for selecting nodes on which new replicas or
replicas of new collections are created.
A default policy is now in place for all users, which sorts nodes by the number of cores and available
freedisk. This means that, by default, the node with the fewest cores already on it and the most available
freedisk will be selected for new core creation.
• The change described above has two additional impacts on the maxShardsPerNode parameter:
1. It removes the restriction against using maxShardsPerNode when an autoscaling policy is in place.
This parameter can now always be set when creating a collection.
2. It removes the default setting of maxShardsPerNode=1 when an autoscaling policy is in place. It will be
set correctly (if required) regardless of whether an autoscaling policy is in place or not.
The default value of maxShardsPerNode is still 1. It can be set to -1 if the old behavior of unlimited
maxShardsPerNode is desired.
DirectoryFactory
• Lucene has introduced the ByteBuffersDirectoryFactory as a replacement for the
RAMDirectoryFactory, which will be removed in Solr 9.
While most users are still encouraged to use the NRTCachingDirectoryFactory, which allows Lucene to
select the best directory factory to use, if you have explicitly configured Solr to use the
RAMDirectoryFactory, you are encouraged to switch to the new implementation as soon as possible
before Solr 9 is released.
For more information about the new directory factory, see the Jira issue LUCENE-8438.
For more information about the directory factory configuration in Solr, see the section DataDir and
DirectoryFactory in SolrConfig.
Solr 7.5
See the 7.5 Release Notes for an overview of the main new features in Solr 7.5.
When upgrading to Solr 7.5, users should be aware of the following major changes from v7.4:
Schema Changes
• Since Solr 7.0, Solr’s schema field-guessing has created _str fields for all _txt fields, and returned those
by default with queries. As of 7.5, _str fields will no longer be returned by default. They will still be
available and can be requested with the fl parameter on queries. See also the section on field guessing
for more information about how schema field guessing works.
• The Standard Filter, which has been non-operational since at least Solr v4, has been removed.
Index Merge Policy
• When using the TieredMergePolicy, the default merge policy for Solr, optimize and expungeDeletes
now respect the maxMergedSegmentMB configuration parameter, which defaults to 5000 (5GB).
If it is absolutely necessary to control the number of segments present after optimize, specify
maxSegments as a positive integer. Setting maxSegments higher than 1 is honored on a "best effort"
basis.
The TieredMergePolicy will also reclaim resources from segments that exceed maxMergedSegmentMB
more aggressively than in earlier versions.
UIMA Removed
• The UIMA contrib has been removed from Solr and is no longer available.
Logging
• Solr’s logging configuration file is now located in server/resources/log4j2.xml by default.
• A bug for Windows users has been corrected. When using Solr’s examples (bin/solr start -e) log files
will now be put in the correct location (example/ instead of server). See also Solr Examples and Solr
Control Script Reference for more information.
Solr 7.4
See the 7.4 Release Notes for an overview of the main new features in Solr 7.4.
When upgrading to Solr 7.4, users should be aware of the following major changes from v7.3:
Logging
• Solr now uses Log4j v2.11. The Log4j configuration is now in log4j2.xml rather than log4j.properties
files. This is a server side change only and clients using SolrJ won’t need any changes. Clients can still use
any logging implementation which is compatible with SLF4J. We now let Log4j handle rotation of Solr logs
at startup, and bin/solr start scripts will no longer attempt this nor move existing console or garbage
collection logs into logs/archived either. See Configuring Logging for more details about Solr logging.
• Configuring slowQueryThresholdMillis now logs slow requests to a separate file named
solr_slow_requests.log. Previously they would get logged in the solr.log file.
Legacy Scaling (non-SolrCloud)
• In the master-slave model of scaling Solr, a slave no longer commits an empty index when a completely
new index is detected on master during replication. To return to the previous behavior pass false to
skipCommitOnMasterVersionZero in the slave section of replication handler configuration, or pass it to
the fetchindex command.
If you are upgrading from a version earlier than Solr 7.3, please see previous version notes below.
Solr 7.3
See the 7.3 Release Notes for an overview of the main new features in Solr 7.3.
When upgrading to Solr 7.3, users should be aware of the following major changes from v7.2:
ConfigSets
• Collections created without specifying a configset name have used a copy of the _default configset
since Solr 7.0. Before 7.3, the copied configset was named the same as the collection name, but from 7.3
onwards it will be named with a new ".AUTOCREATED" suffix. This is to prevent overwriting custom
configset names.
Learning to Rank
• The rq parameter used with Learning to Rank rerank query parsing no longer considers the defType
parameter. See Running a Rerank Query for more information about this parameter.
Autoscaling & AutoAddReplicas
• The behaviour of the autoscaling system will now pause all triggers from execution between the start of
actions and the end of a cool down period. The triggers will resume after the cool down period expires.
Previously, the cool down period was a fixed period started after actions for a trigger event completed
and during this time all triggers continued to run but any events were rejected and tried later.
• The throttling mechanism used to limit the rate of autoscaling events processed has been removed. This
deprecates the actionThrottlePeriodSeconds setting in the set-properties Autoscaling API which is
now non-operational. Use the triggerCooldownPeriodSeconds parameter instead to pause event
processing.
• The default value of autoReplicaFailoverWaitAfterExpiration, used with the AutoAddReplicas
feature, has increased to 120 seconds from the previous default of 30 seconds. This affects how soon Solr
adds new replicas to replace the replicas on nodes which have either crashed or shutdown.
Logging
• The default Solr log file size and number of backups have been raised to 32MB and 10 respectively. See
the section Configuring Logging for more information about how to configure logging.
SolrCloud
• The old Leader-In-Recovery implementation (implemented in Solr 4.9) is now deprecated and replaced.
Solr will support rolling upgrades from old 7.x versions of Solr to future 7.x releases until the last release
of the 7.x major version.
This means to upgrade to Solr 8 in the future, you will need to be on Solr 7.3 or higher.
• Replicas which are not up-to-date are no longer allowed to become leader. Use the FORCELEADER
command of the Collections API to allow these replicas to become leader.
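For example (the collection and shard names are placeholders):

curl "http://localhost:8983/solr/admin/collections?action=FORCELEADER&collection=mycollection&shard=shard1"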
Spatial
• If you are using the spatial JTS library with Solr, you must upgrade to 1.15.0. This new version of JTS is
now dual-licensed to include a BSD style license. See the section on Spatial Search for more information.
Highlighting
• The top-level <highlighting> element in solrconfig.xml is now officially deprecated in favour of the
equivalent <searchComponent> syntax. This element has been out of use in default Solr installations for
several releases already.
If you are upgrading from a version earlier than Solr 7.2, please see previous version notes below.
Solr 7.2
See the 7.2 Release Notes for an overview of the main new features in Solr 7.2.
When upgrading to Solr 7.2, users should be aware of the following major changes from v7.1:
Local Parameters
• Starting a query string with local parameters {!myparser …} is used to switch from one query parser to
another, and is intended for use by Solr system developers, not end users doing searches. To reduce
negative side-effects of unintended hack-ability, Solr now limits the cases when local parameters will be
parsed to only contexts in which the default parser is "lucene" or "func".
So, if defType=edismax then q={!myparser …} won’t work. In that example, put the desired query parser
into the defType parameter.
Another example is if defType=edismax then hl.q={!myparser …} won’t work for the same reason. In
this example, either put the desired query parser into the hl.qparser parameter or set
hl.qparser=lucene. Most users won’t run into these cases but some will need to change.
If you must have full backwards compatibility, use luceneMatchVersion=7.1.0 or an earlier version.
eDisMax Parser
• The eDisMax parser by default no longer allows subqueries that specify a Solr parser using either local
parameters, or the older _query_ magic field trick.
For example, {!prefix f=myfield v=enterp} or _query_:"{!prefix f=myfield v=enterp}" are not
supported by default any longer. If you want to allow power-users to do this, set uf=* _query_ or some
other value that includes _query_.
If you need full backwards compatibility for the time being, use luceneMatchVersion=7.1.0 or
something earlier.
If you are upgrading from a version earlier than Solr 7.1, please see previous version notes below.
Solr 7.1
See the 7.1 Release Notes for an overview of the main new features of Solr 7.1.
When upgrading to Solr 7.1, users should be aware of the following major changes from v7.0:
AutoAddReplicas
• The feature to automatically add replicas if a replica goes down, previously available only when storing
indexes in HDFS, has been ported to the autoscaling framework. Due to this, autoAddReplicas is now
available to all users even if their indexes are on local disks.
Existing users of this feature should not have to change anything. However, they should note these
changes:
◦ Behavior: Changing the autoAddReplicas property from disabled (false) to enabled (true) using
MODIFYCOLLECTION API no longer replaces down replicas for the collection immediately. Instead,
replicas are only added if a node containing them went down while autoAddReplicas was enabled.
The parameters autoReplicaFailoverBadNodeExpiration and autoReplicaFailoverWorkLoopDelay
are no longer used.
◦ Deprecations: Enabling/disabling autoAddReplicas cluster-wide with the API will be deprecated; use
suspend/resume trigger APIs with name=".auto_add_replicas" instead.
More information about the changes to this feature can be found in the section SolrCloud
Automatically Adding Replicas.
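As a sketch of the suspend/resume trigger approach mentioned above (assuming the Autoscaling API endpoint and the trigger name shown above):

curl -X POST -H 'Content-type: application/json' \
  -d '{"suspend-trigger": {"name": ".auto_add_replicas"}}' \
  http://localhost:8983/solr/admin/autoscaling

curl -X POST -H 'Content-type: application/json' \
  -d '{"resume-trigger": {"name": ".auto_add_replicas"}}' \
  http://localhost:8983/solr/admin/autoscaling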
Metrics Reporters
• Shard and cluster metric reporter configurations now require a class attribute.
◦ If a reporter configures the group="shard" attribute then please also configure the
class="org.apache.solr.metrics.reporters.solr.SolrShardReporter" attribute.
◦ If a reporter configures the group="cluster" attribute then please also configure the
class="org.apache.solr.metrics.reporters.solr.SolrClusterReporter" attribute.
See the section Shard and Cluster Reporters for more information.
Streaming Expressions
• All Stream Evaluators in solrj.io.eval have been refactored to have a simpler and more robust
structure. This simplifies and condenses the code required to implement a new Evaluator and makes it
much easier for evaluators to handle differing data types (primitives, objects, arrays, lists, and so forth).
ReplicationHandler
• In the ReplicationHandler, the master.commitReserveDuration sub-element is deprecated. Instead
please configure a direct commitReserveDuration element for use in all modes (master, slave, cloud).
RunExecutableListener
• The RunExecutableListener was removed for security reasons. If you want to listen to events caused by
updates, commits, or optimize, write your own listener as a native Java class as part of a Solr plugin.
XML Query Parser
• In the XML query parser (defType=xmlparser or {!xmlparser … }) the resolving of external entities is
now disallowed by default.
If you are upgrading from a version earlier than Solr 7.0, please see Major Changes in Solr 7 before starting
your upgrade.
Upgrading to 7.x from Any 6.x Release
The upgrade from Solr 6.x to Solr 7.0 introduces several major changes that you should be aware of before
upgrading. Please do a thorough review of the section Major Changes in Solr 7 before starting your
upgrade.
Upgrading to 7.x from pre-6.x Versions of Solr
Users upgrading from versions of Solr prior to 6.x are strongly encouraged to consult CHANGES.txt for the
details of all changes since the version they are upgrading from.
A summary of the significant changes between Solr 5.x and Solr 6.0 is in the section Major Changes from Solr
5 to Solr 6.
Major Changes in Solr 7
Solr 7 is a major new release of Solr which introduces new features and a number of other changes that may
impact your existing installation.
Upgrade Planning
There are major changes in Solr 7 to consider before starting to migrate your configurations and indexes.
This page is designed to highlight the biggest changes - new features you may want to be aware of, but also
changes in default behavior and deprecated features that have been removed.
There are many hundreds of changes in Solr 7, however, so a thorough review of the Solr Upgrade Notes as
well as the CHANGES.txt file in your Solr instance will help you plan your migration to Solr 7. This section
attempts to highlight some of the major changes you should be aware of.
You should also consider all changes that have been made to Solr in any version you have not upgraded to
already. For example, if you are currently using Solr 6.2, you should review changes made in all subsequent
6.x releases in addition to changes for 7.0.
Re-indexing your data is considered the best practice and you should try to do so if possible. However, if re-indexing is not feasible, keep in mind you can only upgrade one major version at a time. Thus, Solr 6.x
indexes will be compatible with Solr 7 but Solr 5.x indexes will not be.
If you do not re-index now, keep in mind that you will need to either re-index your data or upgrade your
indexes before you will be able to move to Solr 8 when it is released in the future. See the section
IndexUpgrader Tool for more details on how to upgrade your indexes.
See also the section Upgrading a Solr Cluster for details on how to upgrade a SolrCloud cluster.
New Features & Enhancements
Replication Modes
Until Solr 7, the SolrCloud model for replicas has been to allow any replica to become a leader when a leader
is lost. This is highly effective for most users, providing reliable failover in case of issues in the cluster.
However, it comes at a cost in large clusters because all replicas must be in sync at all times.
To provide additional flexibility, two new types of replicas have been added, named TLOG & PULL. These new
types provide options to have replicas which only sync with the leader by copying index segments from the
leader. The TLOG type has an additional benefit of maintaining a transaction log (the "tlog" of its name),
which would allow it to recover and become a leader if necessary; the PULL type does not maintain a
transaction log, so cannot become a leader.
As part of this change, the traditional type of replica is now named NRT. If you do not explicitly define a
number of TLOG or PULL replicas, Solr defaults to creating NRT replicas. If this model is working for you, you
will not have to change anything.
See the section Types of Replicas for more details on the new replica modes, and how to define the replica type
in your cluster.
Autoscaling
Solr autoscaling is a new suite of features in Solr to make managing a SolrCloud cluster easier and more
automated.
At its core, Solr autoscaling provides users with a rule syntax to define preferences and policies for how to
distribute nodes and shards in a cluster, with the goal of maintaining a balance in the cluster. As of Solr 7,
Solr will take any policy or preference rules into account when determining where to place new shards and
replicas created or moved with various Collections API commands.
See the section SolrCloud Autoscaling for details on the options available in 7.0. Expect more features to be
released in subsequent 7.x releases in this area.
Other Features & Enhancements
• The Analytics Component has been refactored.
◦ The documentation for this component is in progress; until it is available, please refer to SOLR-11144
for more details.
• There were several other new features released in earlier 6.x releases, which you may have missed:
◦ Learning to Rank
◦ Unified Highlighter
◦ Metrics API. See also information about related deprecations in the section JMX Support and MBeans
below.
◦ Payload queries
◦ Streaming Evaluators
◦ /v2 API
◦ Graph streaming expressions
Configuration and Default Changes
New Default ConfigSet
Several changes have been made to configSets that ship with Solr; not only their content but how Solr
behaves in regard to them:
• The data_driven_configset and basic_configset have been removed, and replaced by the _default
configset. The sample_techproducts_configset also remains, and is designed for use with the example
documents shipped with Solr in the example/exampledocs directory.
• When creating a new collection, if you do not specify a configSet, the _default will be used.
◦ If you use SolrCloud, the _default configSet will be automatically uploaded to ZooKeeper.
◦ If you use standalone mode, the instanceDir will be created automatically, using the _default
configSet as its basis.
Schemaless Improvements
To improve the functionality of Schemaless Mode, Solr now behaves differently when it detects that data in
an incoming field should have a text-based field type.
• Incoming fields will be indexed as text_general by default (you can change this). The name of the field
will be the same as the field name defined in the document.
• A copy field rule will be inserted into your schema to copy the new text_general field to a new field with
the same name suffixed with _str. This field’s type will be a strings field (to allow for multiple values). The first 256
characters of the text field will be inserted into the new strings field.
This behavior can be customized if you wish to remove the copy field rule, or to change the number of
characters inserted to the string field, or the field type used. See the section Schemaless Mode for details.
Because copy field rules can slow indexing and increase index size, it’s recommended you
only use copy fields when you need to. If you do not need to sort or facet on a field, you
should remove the automatically-generated copy field rule.
Automatic field creation can be disabled with the update.autoCreateFields property. To do this, you can
use the Config API with a command such as:
V1 API
curl http://host:8983/solr/mycollection/config -d '{"set-user-property":
{"update.autoCreateFields":"false"}}'
V2 API
curl http://host:8983/api/collections/mycollection/config -d '{"set-user-property":
{"update.autoCreateFields":"false"}}'
Changes to Default Behaviors
• JSON is now the default response format. If you rely on XML responses, you must now define wt=xml in
your request. In addition, line indentation is enabled by default (indent=on).
• The sow parameter (short for "Split on Whitespace") now defaults to false, which allows support for
multi-word synonyms out of the box. This parameter is used with the eDismax and standard/"lucene"
query parsers. If this parameter is not explicitly specified as true, query text will not be split on
whitespace before analysis.
• The legacyCloud parameter now defaults to false. If an entry for a replica does not exist in state.json,
that replica will not get registered.
This may affect users who bring up replicas and they are automatically registered as a part of a shard. It
is possible to fall back to the old behavior by setting the property legacyCloud=true, in the cluster
properties using the following command:
./server/scripts/cloud-scripts/zkcli.sh -zkhost 127.0.0.1:2181 -cmd clusterprop -name
legacyCloud -val true
• The eDismax query parser parameter lowercaseOperators now defaults to false if the
luceneMatchVersion in solrconfig.xml is 7.0.0 or above. Behavior for luceneMatchVersion lower than
7.0.0 is unchanged (so, true). This means that clients must send boolean operators (such as AND, OR and
NOT) in upper case in order to be recognized, or you must explicitly set this parameter to true.
• The handleSelect parameter in solrconfig.xml now defaults to false if the luceneMatchVersion is
7.0.0 or above. This causes Solr to ignore the qt parameter if it is present in a request. If you have
request handlers without a leading '/', you can set handleSelect="true" or consider migrating your
configuration.
The qt parameter is still used as a SolrJ special parameter that specifies the request handler (tail URL
path) to use.
• The lucenePlusSort query parser (aka the "Old Lucene Query Parser") has been deprecated and is no
longer implicitly defined. If you wish to continue using this parser until Solr 8 (when it will be removed),
you must register it explicitly in your solrconfig.xml with a <queryParser name="lucenePlusSort" …/> element.
• The name of TemplateUpdateRequestProcessorFactory is changed to template from Template, and the
name of AtomicUpdateProcessorFactory is changed to atomic from Atomic.
◦ Also, TemplateUpdateRequestProcessorFactory now uses {} instead of ${} for template placeholders.
Deprecations and Removed Features
Point Fields Are Default Numeric Types
Solr has implemented *PointField types across the board, to replace Trie* based numeric fields. All Trie*
fields are now considered deprecated, and will be removed in Solr 8.
If you are using Trie* fields in your schema, you should consider moving to PointFields as soon as feasible.
Changing to the new PointField types will require you to re-index your data.
Spatial Fields
The following spatial-related fields have been deprecated:
• LatLonType
• GeoHashField
• SpatialVectorFieldType
• SpatialTermQueryPrefixTreeFieldType
Choose one of these field types instead:
• LatLonPointSpatialField
• SpatialRecursivePrefixTreeField
• RptWithGeometrySpatialField
See the section Spatial Search for more information.
JMX Support and MBeans
• The <jmx> element in solrconfig.xml has been removed in favor of <metrics><reporter> elements
defined in solr.xml.
Limited back-compatibility is offered by automatically adding a default instance of SolrJmxReporter if
it’s missing AND when a local MBean server is found. A local MBean server can be activated either via
ENABLE_REMOTE_JMX_OPTS in solr.in.sh or via system properties, e.g.,
-Dcom.sun.management.jmxremote. This default instance exports all Solr metrics from all registries as
hierarchical MBeans.
This behavior can be also disabled by specifying a SolrJmxReporter configuration with a boolean init
argument enabled set to false. For a more fine-grained control users should explicitly specify at least
one SolrJmxReporter configuration.
See also the section describing the <metrics> element in solr.xml, which explains how to set up Metrics
Reporters. Note that back-compatibility support may be removed in Solr 8.
• MBean names and attributes now follow the hierarchical names used in metrics. This is reflected also in
/admin/mbeans and /admin/plugins output, and can be observed in the UI Plugins tab, because now all
these APIs get their data from the metrics API. The old (mostly flat) JMX view has been removed.
SolrJ
The following changes were made in SolrJ.
• HttpClientInterceptorPlugin is now HttpClientBuilderPlugin and must work with a
SolrHttpClientBuilder rather than an HttpClientConfigurer.
• HttpClientUtil now allows configuring HttpClient instances via SolrHttpClientBuilder rather than
an HttpClientConfigurer. Use of the env variable SOLR_AUTHENTICATION_CLIENT_CONFIGURER no longer
works; please use SOLR_AUTHENTICATION_CLIENT_BUILDER instead.
• SolrClient implementations now use their own internal configuration for socket timeouts, connect
timeouts, and allowing redirects rather than what is set as the default when building the HttpClient
instance. Use the appropriate setters or builder options on the SolrClient instance; a sketch follows this list.
• HttpSolrClient#setAllowCompression has been removed and compression must be enabled as a
constructor parameter.
• HttpSolrClient#setDefaultMaxConnectionsPerHost and HttpSolrClient#setMaxTotalConnections
have been removed. These now default very high and can only be changed via parameter when creating
an HttpClient instance.
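As a sketch of the builder-based approach mentioned above (the URL and timeout values are illustrative, not defaults):

import org.apache.solr.client.solrj.impl.HttpSolrClient;

// Timeouts are configured on the SolrClient itself rather than on the HttpClient
HttpSolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/techproducts")
    .withConnectionTimeout(5000)   // connect timeout in milliseconds
    .withSocketTimeout(30000)      // socket (read) timeout in milliseconds
    .build();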
Other Deprecations and Removals
• The defaultOperator parameter in the schema is no longer supported. Use the q.op parameter instead.
This option had been deprecated for several releases. See the section Standard Query Parser Parameters
for more information.
• The defaultSearchField parameter in the schema is no longer supported. Use the df parameter
instead. This option had been deprecated for several releases. See the section Standard Query Parser
Parameters for more information.
• The mergePolicy, mergeFactor and maxMergeDocs parameters have been removed and are no longer
supported. You should define a mergePolicyFactory instead. See the section the mergePolicyFactory for
more information.
• The PostingsSolrHighlighter has been deprecated. It’s recommended that you move to using the
UnifiedHighlighter instead. See the section Unified Highlighter for more information about this
highlighter.
• Index-time boosts have been removed from Lucene, and are no longer available from Solr. If any boosts
are provided, they will be ignored by the indexing chain. As a replacement, index-time scoring factors
should be indexed in a separate field and combined with the query score using a function query. See the
section Function Queries for more information.
• The StandardRequestHandler is deprecated. Use SearchHandler instead.
• To improve parameter consistency in the Collections API, the parameter names fromNode for the
MOVEREPLICA command and source, target for the REPLACENODE command have been deprecated
and replaced with sourceNode and targetNode instead. The old names will continue to work for back-compatibility, but they will be removed in Solr 8.
• The unused valType option has been removed from ExternalFileField, if you have this in your schema
you can safely remove it.
Major Changes in Earlier 6.x Versions
The following summary of changes in earlier 6.x releases highlights significant changes released between
Solr 6.0 and 6.6 that were listed in earlier versions of this Guide. Mentions of deprecations are likely
superseded by removal in Solr 7, as noted in the above sections.
Note again that this is not a complete list of all changes that may impact your installation, so a thorough
review of CHANGES.txt is highly recommended if upgrading from any version earlier than 6.6.
• The Solr contribs map-reduce, morphlines-core and morphlines-cell have been removed.
• JSON Facet API now uses hyper-log-log for numBuckets cardinality calculation and calculates cardinality
before filtering buckets by any mincount greater than 1.
• If you use historical dates, specifically on or before the year 1582, you should re-index for better date
handling.
• If you use the JSON Facet API (json.facet) with method=stream, you must now set sort='index asc' to
get the streaming behavior; otherwise it won’t stream. Reminder: method is a hint that doesn’t change
defaults of other parameters.
• If you use the JSON Facet API (json.facet) to facet on a numeric field and if you use mincount=0 or if you
set the prefix, you will now get an error as these options are incompatible with numeric faceting.
• Solr’s logging verbosity at the INFO level has been greatly reduced, and you may need to update the log
configs to use the DEBUG level to see all the logging messages you used to see at INFO level before.
• We are no longer backing up solr.log and solr_gc.log files in date-stamped copies forever. If you
relied on the date-stamped solr_log_<date> or solr_gc_log_<date> files being in the logs folder, that will no longer be the
case. See the section Configuring Logging for details on how log rotation works as of Solr 6.3.
• The create/deleteCollection methods on MiniSolrCloudCluster have been deprecated. Clients should
instead use the CollectionAdminRequest API. In addition,
MiniSolrCloudCluster#uploadConfigDir(File, String) has been deprecated in favour of
#uploadConfigSet(Path, String).
• The bin/solr.in.sh (bin/solr.in.cmd on Windows) is now completely commented by default.
Previously, this wasn’t so, which had the effect of masking existing environment variables.
• The _version_ field is no longer indexed and is now defined with indexed=false by default, because the
field has DocValues enabled.
• The /export handler has been changed so it no longer returns zero (0) for numeric fields that are not in
the original document. One consequence of this change is that you must be aware that some tuples will
not have values if there were none in the original document.
• Metrics-related classes in org.apache.solr.util.stats have been removed in favor of the Dropwizard
metrics library. Any custom plugins using these classes should be changed to use the equivalent classes
from the metrics library. As part of this, the following changes were made to the output of Overseer
Status API:
◦ The "totalTime" metric has been removed because it is no longer supported.
◦ The metrics "75thPctlRequestTime", "95thPctlRequestTime", "99thPctlRequestTime" and
"999thPctlRequestTime" in Overseer Status API have been renamed to "75thPcRequestTime",
"95thPcRequestTime" and so on for consistency with stats output in other parts of Solr.
◦ The metrics "avgRequestsPerMinute", "5minRateRequestsPerMinute" and
"15minRateRequestsPerMinute" have been replaced by corresponding per-second rates viz.
"avgRequestsPerSecond", "5minRateRequestsPerSecond" and "15minRateRequestsPerSecond" for
consistency with stats output in other parts of Solr.
• A new highlighter named UnifiedHighlighter has been added. You are encouraged to try out the
UnifiedHighlighter by setting hl.method=unified and report feedback. It’s more efficient/faster than the
other highlighters, especially compared to the original Highlighter. See HighlightParams.java for a
listing of highlight parameters annotated with which highlighters use them.
hl.useFastVectorHighlighter is now considered deprecated in lieu of hl.method=fastVector.
• The maxWarmingSearchers parameter now defaults to 1, and more importantly commits will now block if
this limit is exceeded instead of throwing an exception (a good thing). Consequently there is no longer a
risk in overlapping commits. Nonetheless users should continue to avoid excessive committing. Users
are advised to remove any pre-existing maxWarmingSearchers entries from their solrconfig.xml files.
• The Complex Phrase query parser now supports leading wildcards. Beware of its possible heaviness;
users are encouraged to use ReversedWildcardFilter in index-time analysis.
• The JMX metric "avgTimePerRequest" (and the corresponding metric in the metrics API for each handler)
used to be a simple non-decaying average based on total cumulative time and the number of requests.
The Codahale Metrics implementation applies exponential decay to this value, which heavily biases the
average towards the last 5 minutes.
• Parallel SQL now uses Apache Calcite as its SQL framework. As part of this change the default
aggregation mode has been changed to facet rather than map_reduce. There have also been changes to
the SQL aggregate response and some SQL syntax changes. Consult the Parallel SQL Interface
documentation for full details.
Major Changes from Solr 5 to Solr 6
There are some major changes in Solr 6 to consider before starting to migrate your configurations and
indexes.
There are many hundreds of changes, so a thorough review of the Solr Upgrade Notes section as well as the
CHANGES.txt file in your Solr instance will help you plan your migration to Solr 6. This section attempts to
highlight some of the major changes you should be aware of.
Highlights of New Features in Solr 6
Some of the major improvements in Solr 6 include:
Streaming Expressions
Introduced in Solr 5, Streaming Expressions allow querying Solr and getting results as a stream of data,
sorted and aggregated as requested.
Several new expression types have been added in Solr 6:
• Parallel expressions using a MapReduce-like shuffling for faster throughput of high-cardinality fields.
• Daemon expressions to support continuous push or pull streaming.
• Advanced parallel relational algebra like distributed joins, intersections, unions and complements.
• Publish/Subscribe messaging.
• JDBC connections to pull data from other systems and join with documents in the Solr index.
Parallel SQL Interface
Built on streaming expressions, new in Solr 6 is a Parallel SQL interface to be able to send SQL queries to
Solr. SQL statements are compiled to streaming expressions on the fly, providing the full range of
aggregations available to streaming expression requests. A JDBC driver is included, which allows using SQL
clients and database visualization tools to query your Solr index and import data to other systems.
Cross Data Center Replication
Replication across data centers is now possible with Cross Data Center Replication. Using an active-passive
model, a SolrCloud cluster can be replicated to another data center, and monitored with a new API.
Graph QueryParser
A new graph query parser makes it possible to do graph traversal queries of Directed (Cyclic) Graphs
modelled using Solr documents.
DocValues
Most non-text field types in the Solr sample configsets now default to using DocValues.
Java 8 Required
The minimum supported version of Java for Solr 6 (and the SolrJ client libraries) is now Java 8.
Index Format Changes
Solr 6 has no support for reading Lucene/Solr 4.x and earlier indexes. Be sure to run the Lucene
IndexUpgrader included with Solr 5.5 if you might still have old 4x formatted segments in your index.
Alternatively: fully optimize your index with Solr 5.5 to make sure it consists only of one up-to-date index
segment.
Managed Schema is now the Default
Solr’s default behavior when a solrconfig.xml does not explicitly define a <schemaFactory/> is now
dependent on the luceneMatchVersion specified in that solrconfig.xml. When luceneMatchVersion <
6.0, ClassicIndexSchemaFactory will continue to be used for back compatibility, otherwise an instance of
ManagedIndexSchemaFactory will be used.
The most notable impacts of this change are:
• Existing solrconfig.xml files that are modified to use luceneMatchVersion >= 6.0, but do not have an
explicitly configured ClassicIndexSchemaFactory, will have their schema.xml file automatically
upgraded to a managed-schema file.
• Schema modifications via the Schema API will now be enabled by default.
Please review the Schema Factory Definition in SolrConfig section for more details.
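For example, to keep manually editing schema.xml under a newer luceneMatchVersion, the classic factory can be declared explicitly in solrconfig.xml (a minimal sketch):

<schemaFactory class="ClassicIndexSchemaFactory"/>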
Default Similarity Changes
Solr’s default behavior when a Schema does not explicitly define a global <similarity/> is now dependent
on the luceneMatchVersion specified in the solrconfig.xml. When luceneMatchVersion < 6.0, an
instance of ClassicSimilarityFactory will be used, otherwise an instance of SchemaSimilarityFactory
will be used. Most notably this change means that users can take advantage of per Field Type similarity
declarations, without needing to also explicitly declare a global usage of SchemaSimilarityFactory.
Regardless of whether it is explicitly declared or used as an implicit global default,
SchemaSimilarityFactory’s implicit behavior when a Field Type does not declare an explicit <similarity/>
has also been changed to depend on the luceneMatchVersion. When luceneMatchVersion < 6.0, an
instance of ClassicSimilarity will be used, otherwise an instance of BM25Similarity will be used. A
defaultSimFromFieldType init option may be specified on the SchemaSimilarityFactory declaration to
change this behavior. Please review the SchemaSimilarityFactory javadocs for more details.
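As a hedged sketch, a schema could declare the factory explicitly and point the default at a field type that declares its own similarity (the field type name text_with_sim is hypothetical):

<similarity class="solr.SchemaSimilarityFactory">
  <str name="defaultSimFromFieldType">text_with_sim</str>
</similarity>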
Replica & Shard Delete Command Changes
DELETESHARD and DELETEREPLICA now default to deleting the instance directory, data directory, and index
directory for any replica they delete. Please review the Collection API documentation for details on new
request parameters to prevent this behavior if you wish to keep all data on disk when using these
commands.
facet.date.* Parameters Removed
The facet.date parameter (and associated facet.date.* parameters) that were deprecated in Solr 3.x have
been removed completely. If you have not yet switched to using the equivalent facet.range functionality
you must do so now before upgrading.
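As an illustrative sketch of the replacement (the field name pubdate_dt is hypothetical; remember to URL-encode the + in the gap as %2B in a raw URL), an old facet.date request maps onto parameters such as:

&facet=true
&facet.range=pubdate_dt
&facet.range.start=NOW/YEAR-1YEAR
&facet.range.end=NOW
&facet.range.gap=+1MONTH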
Using the Solr Administration User Interface
This section discusses the Solr Administration User Interface ("Admin UI").
The Overview of the Solr Admin UI explains the basic features of the user interface, what’s on the initial
Admin UI page, and how to configure the interface. In addition, there are pages describing each screen of
the Admin UI:
• Logging shows recent messages logged by this Solr node and provides a way to change logging levels
for specific classes.
• Cloud Screens display information about nodes when running in SolrCloud mode.
• Collections / Core Admin explains how to get management information about each core.
• Java Properties shows the Java information about each core.
• Thread Dump lets you see detailed information about each thread, along with state information.
• Suggestions Screen displays the state of the system with regard to the autoscaling policies that are in
place.
• Collection-Specific Tools is a section explaining additional screens available for each collection.
◦ Analysis - lets you analyze the data found in specific fields.
◦ Dataimport - shows you information about the current status of the Data Import Handler.
◦ Documents - provides a simple form allowing you to execute various Solr indexing commands
directly from the browser.
◦ Files - shows the current core configuration files such as solrconfig.xml.
◦ Query - lets you submit a structured query about various elements of a core.
◦ Stream - allows you to submit streaming expressions and see results and parsing explanations.
◦ Schema Browser - displays schema data in a browser window.
• Core-Specific Tools is a section explaining additional screens available for each named core.
◦ Ping - lets you ping a named core and determine whether the core is active.
◦ Plugins/Stats - shows statistics for plugins and other installed components.
◦ Replication - shows you the current replication status for the core, and lets you enable/disable
replication.
◦ Segments Info - Provides a visualization of the underlying Lucene index segments.
Overview of the Solr Admin UI
Solr features a Web interface that makes it easy for Solr administrators and programmers to view Solr
configuration details, run queries and analyze document fields in order to fine-tune a Solr configuration and
access online documentation and other help.
Dashboard
Accessing the URL http://hostname:8983/solr/ will show the main dashboard, which is divided into two
parts.
Solr Dashboard
The left-side of the screen is a menu under the Solr logo that provides the navigation through the screens of
the UI.
The first set of links are for system-level information and configuration and provide access to Logging,
Collection/Core Administration, and Java Properties, among other things.
At the end of this information is at least one pulldown listing Solr cores configured for this instance. On
SolrCloud nodes, an additional pulldown list shows all collections in this cluster. Clicking on a collection or
core name shows secondary menus of information for the specified collection or core, such as a Schema
Browser, Config Files, Plugins & Statistics, and an ability to perform Queries on indexed data.
The center of the screen shows the detail of the option selected. This may include a sub-navigation for the
option or text or graphical representation of the requested data. See the sections in this guide for each
screen for more details.
Under the covers, the Solr Admin UI re-uses the same HTTP APIs available to all clients to access Solr-related
data to drive an external interface.
The path to the Solr Admin UI given above is http://hostname:port/solr, which redirects
to http://hostname:port/solr/#/ in the current version. A convenience redirect is also
supported, so simply accessing the Admin UI at http://hostname:port/ will also redirect
to http://hostname:port/solr/#/.
Login Screen
If authentication has been enabled, Solr will present a login screen to unauthenticated users before allowing
them further access to the Admin UI.
Login Screen
This login screen currently only works with Basic Authentication. See the section Basic Authentication Plugin
for details on how to configure Solr to use this method of authentication.
If Kerberos is enabled and the user has a valid ticket, the login screen will be skipped. However, if the user
does not have a valid ticket, they will see a message that they need to obtain a valid ticket before continuing.
Getting Assistance
At the bottom of each screen of the Admin UI is a set of links that can be used to get more assistance with
configuring and using Solr.
Assistance icons
These icons include the following links.
Link
Description
Documentation
Navigates to the Apache Solr documentation hosted on
https://lucene.apache.org/solr/.
Issue Tracker
Navigates to the JIRA issue tracking server for the Apache Solr project. This
server resides at https://issues.apache.org/jira/browse/SOLR.
IRC Channel
Navigates to Solr’s IRC live-chat room: http://webchat.freenode.net/?
channels=#solr.
Community forum
Navigates to the Apache Wiki page which has further information about ways to
engage in the Solr User community mailing lists: https://wiki.apache.org/solr/
UsingMailingLists.
Solr Query Syntax
Navigates to the section Query Syntax and Parsing in this Reference Guide.
These links cannot be modified without editing the index.html in the server/solr/solr-webapp directory
that contains the Admin UI files.
Logging
The Logging page shows recent messages logged by this Solr node.
When you click the link for "Logging", a page similar to the one below will be displayed:
The Main Logging Screen, including an example of an error due to a bad document sent by a client
While this example shows logged messages for only one core, if you have multiple cores in a single instance,
they will each be listed, with the level for each.
Selecting a Logging Level
When you select the Level link on the left, you see the hierarchy of classpaths and classnames for your
instance. A row highlighted in yellow indicates that the class has logging capabilities. Click on a highlighted
row, and a menu will appear to allow you to change the log level for that class. Characters in boldface
indicate that the class will not be affected by level changes to root.
Log level selection
For an explanation of the various logging levels, see Configuring Logging.
Cloud Screens
When running in SolrCloud mode, a "Cloud" option will appear in the Admin UI between Logging and
Collections.
This screen provides status information about each collection & node in your cluster, as well as access to the
low level data being stored in ZooKeeper.
Only Visible When using SolrCloud
The "Cloud" menu option is only available on Solr instances running in SolrCloud mode.
Single node or master/slave replication instances of Solr will not display this option.
Click on the "Cloud" option in the left-hand navigation, and a small sub-menu appears with options called
"Nodes", "Tree", "Graph" and "Graph (Radial)". The sub-view selected by default is "Graph".
Nodes View
The "Nodes" view shows a list of the hosts and nodes in the cluster along with key information for each:
"CPU", "Heap", "Disk usage", "Requests", "Collections" and "Replicas".
The example below shows the default "cloud" example with some documents added to the "gettingstarted"
collection. Details are expanded for node on port 7574, showing more metadata and more metrics details.
The screen provides links to navigate to nodes, collections and replicas. The table supports paging and
filtering on host/node names and collection names.
Tree View
The "Tree" view shows a directory structure of the data in ZooKeeper, including cluster wide information
regarding the live_nodes and overseer status, as well as collection specific information such as the
state.json, current shard leaders, and configuration files in use. In this example, we see part of the
state.json definition for the "tlog" collection:
As an aid to debugging, the data shown in the "Tree" view can be exported locally using the following
command: bin/solr zk ls -r /
ZK Status View
The "ZK Status" view gives an overview over the ZooKeeper servers or ensemble used by Solr. It lists
whether running in standalone or ensemble mode, shows how many zookeepers are configured, and then
displays a table listing detailed monitoring status for each of the zookeepers, including who is the leader,
configuration parameters and more.
Graph Views
The "Graph" view shows a graph of each collection, the shards that make up those collections, and the
addresses and type ("NRT", "TLOG" or "PULL") of each replica for each shard.
This example shows a simple cluster. In addition to the 2 shard, 2 replica "gettingstarted" collection, there is
an additional "tlog" collection consisting of mixed TLOG and PULL replica types.
Tooltips appear when hovering over each replica giving additional information.
The "Graph (Radial)" option provides a different visual view of each node. Using the same example cluster,
the radial graph view looks like:
Collections / Core Admin
The Collections screen provides some basic functionality for managing your Collections, powered by the
Collections API.
If you are running a single node Solr instance, you will not see a Collections option in the
left nav menu of the Admin UI.
You will instead see a "Core Admin" screen that supports some comparable Core level
information & manipulation via the CoreAdmin API instead.
The main display of this page provides a list of collections that exist in your cluster. Clicking on a collection
name provides some basic metadata about how the collection is defined, and its current shards & replicas,
with options for adding and deleting individual replicas.
The buttons at the top of the screen let you make various collection level changes to your cluster, from adding
new collections or aliases to reloading or deleting a single collection.
Replicas can be deleted by clicking the red "X" next to the replica name.
If the shard is inactive, for example after a SPLITSHARD action, an option to delete the shard will appear as a
red "X" next to the shard name.
Java Properties
The Java Properties screen provides easy access to one of the most essential components of a top-performing Solr system. With the Java Properties screen, you can see all the properties of the JVM running
Solr, including the class paths, file encodings, JVM memory settings, operating system, and more.
Java Properties Screen
Thread Dump
The Thread Dump screen lets you inspect the currently active threads on your server.
Each thread is listed and access to the stacktraces is available where applicable. Icons to the left indicate the
state of the thread: for example, threads with a green check-mark in a green circle are in a "RUNNABLE"
state. On the right of the thread name, a down-arrow means you can expand to see the stacktrace for that
thread.
List of Threads
When you move your cursor over a thread name, a box floats over the name with the state for that thread.
Thread states can be:
State
Meaning
NEW
A thread that has not yet started.
RUNNABLE
A thread executing in the Java virtual machine.
BLOCKED
A thread that is blocked waiting for a monitor lock.
WAITING
A thread that is waiting indefinitely for another thread to perform a particular
action.
TIMED_WAITING
A thread that is waiting for another thread to perform an action for up to a
specified waiting time.
TERMINATED
A thread that has exited.
When you click on one of the threads that can be expanded, you’ll see the stacktrace, as in the example
below:
Inspecting a Thread
You can also check the Show all Stacktraces button to automatically enable expansion for all threads.
Suggestions Screen
The Suggestions screen shows violations to an autoscaling policy that exist in the
system, and allows you to take action to correct the violations.
This screen is a visual representation of the output of the Suggestions API.
When there are no violations or other suggestions, the screen will appear somewhat blank:
When the system is in violation of an aspect of a policy, each violation will be shown, as in this screenshot:
A line is shown for each violation. In this case, we have defined a policy where no replica can exist on a node
that has less than 500GB of available disk space. In this example, 4 replicas in our sample cluster violate this
rule.
In the "Action" column, the green button allows you to execute the recommended change to allow the
system to return to compliance with the policy. If you hover your mouse over this button, you will see the
recommended Collections API command:
In this case, the recommendation is to issue a MOVEREPLICA command to move this replica to a node with
more available disk space.
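A hedged example of such a Collections API request (the collection, replica, and node names are placeholders):

http://localhost:8983/solr/admin/collections?action=MOVEREPLICA&collection=gettingstarted&replica=core_node6&targetNode=localhost:7574_solr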
Since autoscaling features are only available in SolrCloud mode, this screen will only appear
when running Solr in SolrCloud mode.
Collection-Specific Tools
In the left-hand navigation bar, you will see a pull-down menu titled "Collection Selector" that can be used to
access collection specific administration screens.
Only Visible When Using SolrCloud
The "Collection Selector" pull-down menu is only available on Solr instances running in
SolrCloud mode.
Single node or master/slave replication instances of Solr will not display this menu, instead
the Collection specific UI pages described in this section will be available in the Core
Selector pull-down menu.
Clicking on the Collection Selector pull-down menu will show a list of the collections in your Solr cluster, with
a search box that can be used to find a specific collection by name. When you select a collection from the
pull-down, the main display of the page will display some basic metadata about the collection, and a
secondary menu will appear in the left nav with links to additional collection specific administration screens.
The collection-specific UI screens are listed below, with a link to the section of this guide to find out more:
• Analysis - lets you analyze the data found in specific fields.
• Dataimport - shows you information about the current status of the Data Import Handler.
• Documents - provides a simple form allowing you to execute various Solr indexing commands directly
from the browser.
• Files - shows the current core configuration files such as solrconfig.xml.
• Query - lets you submit a structured query about various elements of a core.
• Stream - allows you to submit streaming expressions and see results and parsing explanations.
• Schema Browser - displays schema data in a browser window.
Analysis Screen
The Analysis screen lets you inspect how data will be handled according to the field, field type and dynamic
field configurations found in your Schema. You can analyze how content would be handled during indexing
or during query processing and view the results separately or at the same time. Ideally, you would want
content to be handled consistently, and this screen allows you to validate the settings in the field type or
field analysis chains.
Enter content in one or both boxes at the top of the screen, and then choose the field or field type
definitions to use for analysis.
If you click the Verbose Output check box, you see more information, including more details on the
transformations to the input (such as, convert to lower case, strip extra characters, etc.) including the raw
bytes, type and detailed position information at each stage. The information displayed will vary depending
on the settings of the field or field type. Each step of the process is displayed in a separate section, with an
abbreviation for the tokenizer or filter that is applied in that step. Hover or click on the abbreviation, and
you’ll see the name and path of the tokenizer or filter.
In the example screenshot above, several transformations are applied to the input "Running is a sport." The
words "is" and "a" have been removed and the word "running" has been changed to its basic form, "run".
This is because we are using the field type text_en in this scenario, which is configured to remove stop
words (small words that usually do not provide a great deal of context) and "stem" terms when possible to
find more possible matches (this is particularly helpful with plural forms of words). If you click the question
mark next to the Analyze Fieldname/Field Type pull-down menu, the Schema Browser window will open,
showing you the settings for the field specified.
The section Understanding Analyzers, Tokenizers, and Filters describes in detail what each option is and how
it may transform your data and the section Running Your Analyzer has specific examples for using the
Analysis screen.
Dataimport Screen
The Dataimport screen shows the configuration of the DataImportHandler (DIH) and allows you to start, and
monitor the status of, import commands as defined by the options selected on the screen and defined in the
configuration file.
The Dataimport Screen
This screen also lets you adjust various options to control how the data is imported to Solr, and view the
data import configuration file that controls the import.
For more information about data importing with DIH, see the section on Uploading Structured Data Store
Data with the Data Import Handler.
Documents Screen
The Documents screen provides a simple form allowing you to execute various Solr indexing commands in a
variety of formats directly from the browser.
The Documents Screen
The screen allows you to:
• Submit JSON, CSV or XML documents in Solr-specific format for indexing
• Upload documents (in JSON, CSV or XML) for indexing
• Construct documents by selecting fields and field values
There are other ways to load data, see also these sections:
• Uploading Data with Index Handlers
• Uploading Data with Solr Cell using Apache Tika
Common Fields
• Request-Handler: The first step is to define the RequestHandler. By default /update will be defined.
Change the request handler to /update/extract to use Solr Cell.
• Document Type: Select the Document Type to define the format of document to load. The remaining
parameters may change depending on the document type selected.
• Document(s): Enter a properly-formatted Solr document corresponding to the Document Type selected.
XML and JSON documents must be formatted in a Solr-specific format, a small illustrative document will
be shown. CSV files should have headers corresponding to fields defined in the schema. More details can
be found at: Uploading Data with Index Handlers.
• Commit Within: Specify the number of milliseconds between the time the document is submitted and
when it is available for searching.
• Overwrite: If true the new document will replace an existing document with the same value in the id
field. If false multiple documents with the same id can be added.
Setting Overwrite to false is very rare in production situations, the default is true.
CSV, JSON and XML Documents
When using these document types the functionality is similar to submitting documents via curl or similar.
The document structure must be in a Solr-specific format appropriate for the document type. Examples are
illustrated in the Document(s) text box when you select the various types.
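For example, a minimal Solr-style JSON document that could be pasted into the Document(s) box with the JSON document type (the field names are illustrative and assume matching fields or dynamic field rules exist in the schema):

[
  {"id": "book-001", "title_t": "A Sample Title", "price_f": 12.5}
]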
These options will only add or overwrite documents; for other update tasks, see the Solr Command option.
Document Builder
The Document Builder provides a wizard-like interface to enter fields of a document.
File Upload
The File Upload option allows choosing a prepared file and uploading it. If using /update for the RequestHandler option, you will be limited to XML, CSV, and JSON.
Other document types (e.g Word, PDF, etc.) can be indexed using the ExtractingRequestHandler (aka, Solr
Cell). You must modify the RequestHandler to /update/extract, which must be defined in your
solrconfig.xml file with your desired defaults. You should also add &literal.id, shown in the "Extracting
Request Handler Params" field, so the file chosen is given a unique id. More information can be found at:
Uploading Data with Solr Cell using Apache Tika
Solr Command
The Solr Command option allows you use the /update request handler with XML or JSON formatted
commands to perform specific actions. A few examples are:
• Deleting documents
• Updating only certain fields of documents
• Issuing commit commands on the index
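For instance, a hedged JSON command (the query value is illustrative) that deletes matching documents and commits in a single request:

{
  "delete": { "query": "genre:Discontinued" },
  "commit": {}
}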
Files Screen
The Files screen lets you browse & view the various configuration files (such as solrconfig.xml and the
schema file) for the collection you selected.
The Files Screen
If you are using SolrCloud, the files displayed are the configuration files for this collection stored in
ZooKeeper. In standalone Solr installations, all files in the conf directory are displayed.
While solrconfig.xml defines the behavior of Solr as it indexes content and responds to queries, the
Schema allows you to define the types of data in your content (field types), the fields your documents will be
broken into, and any dynamic fields that should be generated based on patterns of field names in the
incoming documents. Any other configuration files are used depending on how they are referenced in either
solrconfig.xml or your schema.
Configuration files cannot be edited with this screen, so a text editor of some kind must be used.
This screen is related to the Schema Browser Screen, in that they both can display information from the
schema, but the Schema Browser provides a way to drill into the analysis chain and displays linkages
between field types, fields, and dynamic field rules.
Many of the options defined in these configuration files are described throughout the rest of this Guide. In
particular, you will want to review these sections:
• Indexing and Basic Data Operations
• Searching
• The Well-Configured Solr Instance
• Documents, Fields, and Schema Design
Query Screen
You can use the Query screen to submit a search query to a Solr collection and analyze the results.
In the example in the screenshot, a query has been submitted, and the screen shows the query results sent
to the browser as JSON.
JSON Results of a Query
In this example, a query for genre:Fantasy was sent to a "films" collection. Defaults were used for all other
options in the form, which are explained briefly in the table below, and covered in detail in later parts of this
Guide.
The response is shown to the right of the form. Requests to Solr are simply HTTP requests, and the query
submitted is shown in light type above the results; if you click on this it will open a new browser window with
just this request and response (without the rest of the Solr Admin UI). The rest of the response is shown in
JSON, which is the default output format.
The response has at least two sections, but may have several more depending on the options chosen. The
two sections it always has are the responseHeader and the response. The responseHeader includes the
status of the search (status), the processing time (QTime), and the parameters (params) that were used to
process the query.
The response includes the documents that matched the query, in doc sub-sections. The fields returned depend
on the parameters of the query (and the defaults of the request handler used). The number of results is also
included in this section.
This screen allows you to experiment with different query options, and inspect how your documents were
indexed. The query parameters available on the form are some basic options that most users want to have
available, but there are dozens more available which could be simply added to the basic request by hand (if
opened in a browser). The following parameters are available:
Request-handler (qt)
Specifies the query handler for the request. If a query handler is not specified, Solr processes the
response with the standard query handler.
q
The query event. See Searching for an explanation of this parameter.
fq
The filter queries. See Common Query Parameters for more information on this parameter.
sort
Sorts the response to a query in either ascending or descending order based on the response’s score or
another specified characteristic.
start, rows
start is the offset into the query result starting at which documents should be returned. The default
value is 0, meaning that the query should return results starting with the first document that matches.
This field accepts the same syntax as the start query parameter, which is described in Searching. rows is
the number of rows to return.
fl
Defines the fields to return for each document. You can explicitly list the stored fields, functions, and doc
transformers you want to have returned by separating them with either a comma or a space.
wt
Specifies the Response Writer to be used to format the query response. Defaults to JSON if not specified.
indent
Click this button to request that the Response Writer use indentation to make the responses more
readable.
debugQuery
Click this button to augment the query response with debugging information, including "explain info" for
each document returned. This debugging information is intended to be intelligible to the administrator or
programmer.
dismax
Click this button to enable the Dismax query parser. See The DisMax Query Parser for further
information.
edismax
Click this button to enable the Extended query parser. See The Extended DisMax Query Parser for further
information.
hl
Click this button to enable highlighting in the query response. See Highlighting for more information.
facet
Enables faceting, the arrangement of search results into categories based on indexed terms. See Faceting
for more information.
spatial
Click to enable using location data for use in spatial or geospatial searches. See Spatial Search for more
information.
spellcheck
Click this button to enable the Spellchecker, which provides inline query suggestions based on other,
similar, terms. See Spell Checking for more information.
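Putting a few of these options together, the request the form builds could look like the following (based on the films example above; the exact parameter values are illustrative):

http://localhost:8983/solr/films/select?q=genre:Fantasy&fl=*,score&rows=10&wt=json&debugQuery=on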
Stream Screen
The Stream screen allows you to enter a streaming expression and see the results. It is very similar to the
Query Screen, except the input box is at the top and all options must be declared in the expression.
The screen will insert everything up to the streaming expression itself, so you do not need to enter the full
URI with the hostname, port, collection, etc. Simply input the expression after the expr= part, and the URL
will be constructed dynamically as appropriate.
Under the input box, the Execute button will run the expression. An option "with explanation" will show the
parts of the streaming expression that were executed. Under this, the streamed results are shown. A URL to
be able to view the output in a browser is also available.
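A minimal expression to try in the input box (the collection and field names are illustrative) might be:

search(gettingstarted, q="*:*", fl="id", sort="id asc", qt="/export")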
Stream Screen with query and results
Schema Browser Screen
The Schema Browser screen lets you review schema data in a browser window.
If you have accessed this window from the Analysis screen, it will be opened to a specific field, dynamic field
rule or field type. If there is nothing chosen, use the pull-down menu to choose the field or field type.
Schema Browser Screen
The screen provides a great deal of useful information about each particular field and fieldtype in the
Schema, and provides a quick UI for adding fields or fieldtypes using the Schema API (if enabled). In the
example above, we have chosen the cat field. On the left side of the main view window, we see the field
name, that it is copied to the _text_ field (because of a copyField rule) and that it uses the strings fieldtype. Click
on one of those field or fieldtype names, and you can see the corresponding definitions.
In the right part of the main view, we see the specific properties of how the cat field is defined – either
explicitly or implicitly via its fieldtype, as well as how many documents have populated this field. Then we see
the analyzer used for indexing and query processing. Click the icon to the left of either of those, and you’ll
see the definitions for the tokenizers and/or filters that are used. The output of these processes is the
information you see when testing how content is handled for a particular field with the Analysis Screen.
Under the analyzer information is a button to Load Term Info. Clicking that button will show the top N
terms that are in a sample shard for that field, as well as a histogram showing the number of terms with
various frequencies. Click on a term, and you will be taken to the Query Screen to see the results of a query
of that term in that field. If you want to always see the term information for a field, choose Autoload and it
will always appear when there are terms for a field. A histogram shows the number of terms with a given
frequency in the field.
Term Information is loaded from a single, arbitrarily selected core from the collection, to
provide a representative sample for the collection. Full Field Facet query results are needed
to see precise term counts across the entire collection.
Core-Specific Tools
The Core-Specific tools are a group of UI screens that allow you to see core-level information.
In the left-hand navigation bar, you will see a pull-down menu titled "Core Selector". Clicking on the menu
will show a list of Solr cores hosted on this Solr node, with a search box that can be used to find a specific
core by name.
When you select a core from the pull-down, the main display of the page will show some basic metadata
about the core, and a secondary menu will appear in the left nav with links to additional core specific
administration screens.
Core overview screen
The core-specific UI screens are listed below, with a link to the section of this guide to find out more:
• Ping - lets you ping a named core and determine whether the core is active.
• Plugins/Stats - shows statistics for plugins and other installed components.
• Replication - shows you the current replication status for the core, and lets you enable/disable
replication.
• Segments Info - Provides a visualization of the underlying Lucene index segments.
If you are running a single node instance of Solr, additional UI screens normally displayed on a per-collection basis will also be listed:
• Analysis - lets you analyze the data found in specific fields.
• Dataimport - shows you information about the current status of the Data Import Handler.
• Documents - provides a simple form allowing you to execute various Solr indexing commands directly
from the browser.
• Files - shows the current core configuration files such as solrconfig.xml.
• Query - lets you submit a structured query about various elements of a core.
• Stream - allows you to submit streaming expressions and see results and parsing explanations.
• Schema Browser - displays schema data in a browser window.
Ping
Choosing Ping under a core name issues a ping request to check whether the core is up and responding to
requests.
Ping Option in Core Dropdown
The search executed by a Ping is configured with the Request Parameters API. See Implicit RequestHandlers
for the paramset to use for the /admin/ping endpoint.
The Ping option doesn’t open a page, but the status of the request can be seen on the core overview page
shown when clicking on a collection name. The length of time the request has taken is displayed next to the
Ping option, in milliseconds.
Ping API Examples
While the UI screen makes it easy to see the ping response time, the underlying ping command can be more
useful when executed by remote monitoring tools:
Input
http://localhost:8983/solr/<core-name>/admin/ping
This command will ping the core name for a response.
Input
http://localhost:8983/solr/<collection-name>/admin/ping?distrib=true&wt=xml
This command will ping all replicas of the given collection name for a response:
Sample Output
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">13</int>
    <lst name="params">
      <str name="q">{!lucene}*:*</str>
      <str name="distrib">false</str>
      <str name="df">_text_</str>
      <str name="rows">10</str>
      <str name="echoParams">all</str>
    </lst>
  </lst>
  <str name="status">OK</str>
</response>
Both API calls have the same output. A status=OK indicates that the nodes are responding.
SolrJ Example
SolrPing ping = new SolrPing();
ping.getParams().add("distrib", "true"); // make it a distributed request against a collection
SolrPingResponse rsp = ping.process(solrClient, collectionName);
int status = rsp.getStatus();
Plugins & Stats Screen
The Plugins screen shows information and statistics about the status and performance of various plugins
running in each Solr core. You can find information about the performance of the Solr caches, the state of
Solr’s searchers, and the configuration of Request Handlers and Search Components.
Choose an area of interest on the right, and then drill down into more specifics by clicking on one of the
names that appear in the central part of the window. In this example, we’ve chosen to look at the Searcher
stats, from the Core area:
Searcher Statistics
The display is a snapshot taken when the page is loaded. You can get updated status by choosing to either
Watch Changes or Refresh Values. Watching the changes will highlight those areas that have changed,
while refreshing the values will reload the page with updated information.
Replication Screen
The Replication screen shows you the current replication state for the core you have specified. SolrCloud has
supplanted much of this functionality, but if you are still using Master-Slave index replication, you can use
this screen to:
1. View the replicatable index state (on a master node).
2. View the current replication status (on a slave node).
3. Disable replication (on a master node).
Caution When Using SolrCloud
When using SolrCloud, do not attempt to disable replication via this screen.
More details on how to configure replication is available in the section called Index Replication.
Segments Info
The Segments Info screen lets you see a visualization of the various segments in the underlying Lucene
index for this core, with information about the size of each segment – both bytes and in number of
documents – as well as other basic metadata about those segments. Most visible is the number of
deleted documents, but you can hover your mouse over the segments to see additional numeric details.
This information may be useful for people to help make decisions about the optimal merge settings for their
data.
Documents, Fields, and Schema Design
This section discusses how Solr organizes its data into documents and fields, as well as how to work with a
schema in Solr.
This section includes the following topics:
Overview of Documents, Fields, and Schema Design: An introduction to the concepts covered in this section.
Solr Field Types: Detailed information about field types in Solr, including the field types in the default Solr
schema.
Defining Fields: Describes how to define fields in Solr.
Copying Fields: Describes how to populate fields with data copied from another field.
Dynamic Fields: Information about using dynamic fields in order to catch and index fields that do not exactly
conform to other field definitions in your schema.
Schema API: Use curl commands to read various parts of a schema or create new fields and copyField rules.
Other Schema Elements: Describes other important elements in the Solr schema.
Putting the Pieces Together: A higher-level view of the Solr schema and how its elements work together.
DocValues: Describes how to create a docValues index for faster lookups.
Schemaless Mode: Automatically add previously unknown schema fields using value-based field type
guessing.
Overview of Documents, Fields, and Schema Design
The fundamental premise of Solr is simple. You give it a lot of information, then later you can ask it
questions and find the piece of information you want. The part where you feed in all the information is
called indexing or updating. When you ask a question, it’s called a query.
One way to understand how Solr works is to think of a loose-leaf book of recipes. Every time you add a
recipe to the book, you update the index at the back. You list each ingredient and the page number of the
recipe you just added. Suppose you add one hundred recipes. Using the index, you can very quickly find all
the recipes that use garbanzo beans, or artichokes, or coffee, as an ingredient. Using the index is much
faster than looking through each recipe one by one. Imagine a book of one thousand recipes, or one million.
Solr allows you to build an index with many different fields, or types of entries. The example above shows
how to build an index with just one field, ingredients. You could have other fields in the index for the
recipe’s cooking style, like Asian, Cajun, or vegan, and you could have an index field for preparation times.
Solr can answer questions like "What Cajun-style recipes that have blood oranges as an ingredient can be
prepared in fewer than 30 minutes?"
The schema is the place where you tell Solr how it should build indexes from input documents.
How Solr Sees the World
Solr’s basic unit of information is a document, which is a set of data that describes something. A recipe
document would contain the ingredients, the instructions, the preparation time, the cooking time, the tools
needed, and so on. A document about a person, for example, might contain the person’s name, biography,
favorite color, and shoe size. A document about a book could contain the title, author, year of publication,
number of pages, and so on.
In the Solr universe, documents are composed of fields, which are more specific pieces of information. Shoe
size could be a field. First name and last name could be fields.
Fields can contain different kinds of data. A name field, for example, is text (character data). A shoe size field
might be a floating point number so that it could contain values like 6 and 9.5. Obviously, the definition of
fields is flexible (you could define a shoe size field as a text field rather than a floating point number, for
example), but if you define your fields correctly, Solr will be able to interpret them correctly and your users
will get better results when they perform a query.
You can tell Solr about the kind of data a field contains by specifying its field type. The field type tells Solr how
to interpret the field and how it can be queried.
When you add a document, Solr takes the information in the document’s fields and adds that information to
an index. When you perform a query, Solr can quickly consult the index and return the matching documents.
Field Analysis
Field analysis tells Solr what to do with incoming data when building an index. A more accurate name for this
process would be processing or even digestion, but the official name is analysis.
Consider, for example, a biography field in a person document. Every word of the biography must be
indexed so that you can quickly find people whose lives have had anything to do with ketchup, or
dragonflies, or cryptography.
However, a biography will likely contain lots of words you don’t care about and don’t want clogging up
your index—words like "the", "a", "to", and so forth. Furthermore, suppose the biography contains the word
"Ketchup", capitalized at the beginning of a sentence. If a user makes a query for "ketchup", you want Solr
to tell you about the person even though the biography contains the capitalized word.
The solution to both these problems is field analysis. For the biography field, you can tell Solr how to break
apart the biography into words. You can tell Solr that you want to make all the words lower case, and you
can tell Solr to remove accent marks.
Field analysis is an important part of a field type. Understanding Analyzers, Tokenizers, and Filters is a
detailed description of field analysis.
Solr’s Schema File
Solr stores details about the field types and fields it is expected to understand in a schema file. The name
and location of this file may vary depending on how you initially configured Solr or if you modified it later.
• managed-schema is the name for the schema file Solr uses by default to support making Schema changes
at runtime via the Schema API, or Schemaless Mode features. You may explicitly configure the managed
schema features to use an alternative filename if you choose, but the contents of the files are still
updated automatically by Solr.
• schema.xml is the traditional name for a schema file which can be edited manually by users who use the
ClassicIndexSchemaFactory.
• If you are using SolrCloud you may not be able to find any file by these names on the local filesystem.
You will only be able to see the schema through the Schema API (if enabled) or through the Solr Admin
UI’s Cloud Screens.
Whichever filename is in use in your installation, the structure of the file is the same; only the way you
interact with the file changes. If you are using the managed schema, it is expected that you only interact
with the file via the Schema API, and never make manual edits. If you do not use the managed schema, you
will only be able to make manual edits to the file; the Schema API will not support any modifications.
Note that if you are using SolrCloud but are not using the Schema API, you will need to interact with
schema.xml through ZooKeeper, using the upconfig and downconfig commands to make a local copy and upload
your changes. The options for doing this are described in Solr Control Script Reference and Using ZooKeeper
to Manage Configuration Files.
Solr Field Types
The field type defines how Solr should interpret data in a field and how the field can be queried. There are
many field types included with Solr by default, and they can also be defined locally.
Topics covered in this section:
• Field Type Definitions and Properties
• Field Types Included with Solr
• Working with Currencies and Exchange Rates
• Working with Dates
• Working with Enum Fields
• Working with External Files and Processes
• Field Properties by Use Case
See also the FieldType Javadoc.
Field Type Definitions and Properties
A field type defines the analysis that will occur on a field when documents are indexed or queries are sent to
the index.
A field type definition can include four types of information:
• The name of the field type (mandatory).
• An implementation class name (mandatory).
• If the field type is TextField, a description of the field analysis for the field type.
• Field type properties - depending on the implementation class, some properties may be mandatory.
Field Type Definitions in schema.xml
Field types are defined in schema.xml. Each field type is defined in a fieldType element. They can
optionally be grouped within a types element. Here is an example of a field type definition for a type called
text_general:
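(The sketch below follows the text_general type shipped in Solr’s sample configsets; the exact analyzer chain in your own schema may differ.)

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100"> ①
  <analyzer type="index"> ②
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>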
① The first line in the example above contains the field type name, text_general, and the name of the
implementing class, solr.TextField.
② The rest of the definition is about field analysis, described in Understanding Analyzers, Tokenizers, and
Filters.
The implementing class is responsible for making sure the field is handled correctly. In the class names in
schema.xml, the string solr is shorthand for org.apache.solr.schema or org.apache.solr.analysis.
Therefore, solr.TextField is really org.apache.solr.schema.TextField.
Field Type Properties
The field type class determines most of the behavior of a field type, but optional properties can also be
defined. For example, the following definition of a date field type defines two properties, sortMissingLast
and omitNorms.
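A minimal sketch of such a definition (the field type name is illustrative):

<fieldType name="date" class="solr.DatePointField" sortMissingLast="true" omitNorms="true"/>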
The properties that can be specified for a given field type fall into three major categories:
• Properties specific to the field type’s class.
• General Properties Solr supports for any field type.
• Field Default Properties that can be specified on the field type that will be inherited by fields that use this
type instead of the default behavior.
General Properties
These are the general properties for fields:
name
The name of the fieldType. This value gets used in field definitions, in the "type" attribute. It is strongly
recommended that names consist of alphanumeric or underscore characters only and not start with a
digit. This is not currently strictly enforced.
class
The class name that gets used to store and index the data for this type. Note that you may prefix included
class names with "solr." and Solr will automatically figure out which packages to search for the class - so
solr.TextField will work.
If you are using a third-party class, you will probably need to have a fully qualified class name. The fully
qualified equivalent for solr.TextField is org.apache.solr.schema.TextField.
positionIncrementGap
For multivalued fields, specifies a distance between multiple values, which prevents spurious phrase
matches.
autoGeneratePhraseQueries
For text fields. If true, Solr automatically generates phrase queries for adjacent terms. If false, terms
must be enclosed in double-quotes to be treated as phrases.
synonymQueryStyle
Query used to combine scores of overlapping query terms (i.e., synonyms). Consider a search for "blue
tee" with query-time synonyms tshirt,tee.
Use as_same_term (default) to blend terms, i.e., SynonymQuery(tshirt,tee) where each term will be
treated as equally important. Use pick_best to select the most significant synonym when scoring
Dismax(tee,tshirt). Use as_distinct_terms to bias scoring towards the most significant synonym
(pants OR slacks).
as_same_term is appropriate when terms are true synonyms (television, tv). Use pick_best or
as_distinct_terms when synonyms are expanding to hyponyms (q=jeans w/ jeans=>jeans,pants)
and you want exact to come before parent and sibling concepts. See this blog article.
enableGraphQueries
For text fields, applicable when querying with sow=false (which is the default for the sow parameter). Use
true, the default, for field types with query analyzers including graph-aware filters, e.g., Synonym Graph
Filter and Word Delimiter Graph Filter.
Use false for field types with query analyzers including filters that can match docs when some tokens are
missing, e.g., Shingle Filter.
docValuesFormat
Defines a custom DocValuesFormat to use for fields of this type. This requires that a schema-aware codec,
such as the SchemaCodecFactory has been configured in solrconfig.xml.
postingsFormat
Defines a custom PostingsFormat to use for fields of this type. This requires that a schema-aware codec,
such as the SchemaCodecFactory has been configured in solrconfig.xml.
Lucene index back-compatibility is only supported for the default codec. If you choose to
customize the postingsFormat or docValuesFormat in your schema.xml, upgrading to a
future version of Solr may require you to either switch back to the default codec and
optimize your index to rewrite it into the default codec before upgrading, or re-build your
entire index from scratch after upgrading.
Field Default Properties
These are properties that can be specified either on the field types, or on individual fields to override the
values provided by the field types.
The default values for each property depend on the underlying FieldType class, which in turn may depend
on the version attribute of the <schema/> element. The table below includes the default value for most FieldType
implementations provided by Solr, assuming a schema.xml that declares version="1.6".
indexed
If true, the value of the field can be used in queries to retrieve matching documents. Values: true or false. Implicit default: true.

stored
If true, the actual value of the field can be retrieved by queries. Values: true or false. Implicit default: true.

docValues
If true, the value of the field will be put in a column-oriented DocValues structure. Values: true or false. Implicit default: false.

sortMissingFirst, sortMissingLast
Control the placement of documents when a sort field is not present. Values: true or false. Implicit default: false.

multiValued
If true, indicates that a single document might contain multiple values for this field type. Values: true or false. Implicit default: false.

uninvertible
If true, indicates that an indexed="true" docValues="false" field can be "uninverted" at query time to build up a large in-memory data structure to serve in place of DocValues. Defaults to true for historical reasons, but users are strongly encouraged to set this to false for stability and use docValues="true" as needed. Values: true or false. Implicit default: true.

omitNorms
If true, omits the norms associated with this field (this disables length normalization for the field, and saves some memory). Defaults to true for all primitive (non-analyzed) field types, such as int, float, date, bool, and string. Only full-text fields or fields that need an index-time boost need norms. Values: true or false. Implicit default: * (varies by field type, as described).

omitTermFreqAndPositions
If true, omits term frequency, positions, and payloads from postings for this field. This can be a performance boost for fields that don’t require that information. It also reduces the storage space required for the index. Queries that rely on position that are issued on a field with this option will silently fail to find documents. This property defaults to true for all field types that are not text fields. Values: true or false. Implicit default: * (varies by field type, as described).

omitPositions
Similar to omitTermFreqAndPositions but preserves term frequency information. Values: true or false. Implicit default: * (varies by field type, as described).

termVectors, termPositions, termOffsets, termPayloads
These options instruct Solr to maintain full term vectors for each document, optionally including position, offset and payload information for each term occurrence in those vectors. These can be used to accelerate highlighting and other ancillary functionality, but impose a substantial cost in terms of index size. They are not necessary for typical uses of Solr. Values: true or false. Implicit default: false.

required
Instructs Solr to reject any attempts to add a document which does not have a value for this field. This property defaults to false. Values: true or false. Implicit default: false.

useDocValuesAsStored
If the field has docValues enabled, setting this to true would allow the field to be returned as if it were a stored field (even if it has stored=false) when matching “*” in an fl parameter. Values: true or false. Implicit default: true.

large
Large fields are always lazy loaded and will only take up space in the document cache if the actual value is < 512KB. This option requires stored="true" and multiValued="false". It’s intended for fields that might have very large values so that they don’t get cached in memory. Values: true or false. Implicit default: false.
Field Type Similarity
A field type may optionally specify a <similarity/> that will be used when scoring documents that refer to
fields with this type, as long as the "global" similarity for the collection allows it.
By default, any field type which does not define a similarity, uses BM25Similarity. For more details, and
examples of configuring both global & per-type Similarities, please see Other Schema Elements.
Field Types Included with Solr
The following table lists the field types that are available in Solr. The org.apache.solr.schema package
includes all the classes listed in this table.
Class
Description
BinaryField
Binary data.
BoolField
Contains either true or false. Values of 1, t, or T in the first character are
interpreted as true. Any other values in the first character are interpreted as
false.
CollationField
Supports Unicode collation for sorting and range queries. The ICUCollationField
is a better choice if you can use ICU4J. See the section Unicode Collation for
more information.
CurrencyField
Deprecated. Use CurrencyFieldType instead.
CurrencyFieldType
Supports currencies and exchange rates. See the section Working with
Currencies and Exchange Rates for more information.
DateRangeField
Supports indexing date ranges, to include point in time date instances as well
(single-millisecond durations). See the section Working with Dates for more
detail on using this field type. Consider using this field type even if it’s just for
date instances, particularly when the queries typically fall on UTC
year/month/day/hour, etc., boundaries.
DatePointField
Date field. Represents a point in time with millisecond precision, encoded using
a "Dimensional Points" based data structure that allows for very efficient
searches for specific values, or ranges of values. See the section Working with
Dates for more details on the supported syntax. For single valued fields,
docValues="true" must be used to enable sorting.
DoublePointField
Double field (64-bit IEEE floating point). This class encodes double values using
a "Dimensional Points" based data structure that allows for very efficient
searches for specific values, or ranges of values. For single valued fields,
docValues="true" must be used to enable sorting.
ExternalFileField
Pulls values from a file on disk. See the section Working with External Files and
Processes for more information.
EnumField
Deprecated. Use EnumFieldType instead.
EnumFieldType
Allows defining an enumerated set of values which may not be easily sorted by
either alphabetic or numeric order (such as a list of severities, for example). This
field type takes a configuration file, which lists the proper order of the field
values. See the section Working with Enum Fields for more information.
FloatPointField
Floating point field (32-bit IEEE floating point). This class encodes float values
using a "Dimensional Points" based data structure that allows for very efficient
searches for specific values, or ranges of values. For single valued fields,
docValues="true" must be used to enable sorting.
ICUCollationField
Supports Unicode collation for sorting and range queries. See the section
Unicode Collation for more information.
IntPointField
Integer field (32-bit signed integer). This class encodes int values using a
"Dimensional Points" based data structure that allows for very efficient
searches for specific values, or ranges of values. For single valued fields,
docValues="true" must be used to enable sorting.
LatLonPointSpatialField
A latitude/longitude coordinate pair; possibly multi-valued for multiple points.
Usually it’s specified as "lat,lon" order with a comma. See the section Spatial
Search for more information.
LatLonType
Deprecated. Consider using the LatLonPointSpatialField instead. A single-valued
latitude/longitude coordinate pair. Usually it’s specified as "lat,lon" order
with a comma. See the section Spatial Search for more information.
LongPointField
Long field (64-bit signed integer). This class encodes long values using a
"Dimensional Points" based data structure that allows for very efficient
searches for specific values, or ranges of values. For single valued fields,
docValues="true" must be used to enable sorting.
PointType
A single-valued n-dimensional point. It’s both for sorting spatial data that is not
lat-lon, and for some more rare use-cases. (NOTE: this is not related to the
"Point" based numeric fields). See Spatial Search for more information.
PreAnalyzedField
Provides a way to send to Solr serialized token streams, optionally with
independent stored values of a field, and have this information stored and
indexed without any additional text processing.
Configuration and usage of PreAnalyzedField is documented in the section
Working with External Files and Processes.
RandomSortField
Does not contain a value. Queries that sort on this field type will return results in
random order. Use a dynamic field to use this feature.
SpatialRecursivePrefixTreeFieldType
(RPT for short) Accepts latitude comma longitude strings or other shapes in
WKT format. See Spatial Search for more information.
StrField
String (UTF-8 encoded string or Unicode). Strings are intended for small fields
and are not tokenized or analyzed in any way. They have a hard limit of slightly
less than 32K.
SortableTextField
A specialized version of TextField that allows (and defaults to)
docValues="true" for sorting on the first 1024 characters of the original string
prior to analysis. The number of characters used for sorting can be overridden
with the maxCharsForDocValues attribute.
TextField
Text, usually multiple words or tokens.
TrieDateField
Deprecated. Use DatePointField instead.
TrieDoubleField
Deprecated. Use DoublePointField instead.
TrieFloatField
Deprecated. Use FloatPointField instead.
TrieIntField
Deprecated. Use IntPointField instead.
TrieLongField
Deprecated. Use LongPointField instead.
TrieField
Deprecated. This field takes a type parameter to define the specific class of
Trie* field to use; Use an appropriate Point Field type instead.
UUIDField
Universally Unique Identifier (UUID). Pass in a value of NEW and Solr will create a
new UUID.
Note: configuring a UUIDField instance with a default value of NEW is not
advisable for most users when using SolrCloud (and not possible if the UUID
value is configured as the unique key field) since the result will be that each
replica of each document will get a unique UUID value. Using
UUIDUpdateProcessorFactory to generate UUID values when documents are
added is recommended instead.
All Trie* numeric and date field types have been deprecated in favor of *Point field types.
Point field types are better at range queries (speed, memory, disk), however simple
field:value queries underperform relative to Trie. Either accept this, or continue to use Trie
fields. This shortcoming may be addressed in a future release.
Working with Currencies and Exchange Rates
The currency FieldType provides support for monetary values to Solr/Lucene with query-time currency
conversion and exchange rates. The following features are supported:
• Point queries
• Range queries
• Function range queries
• Sorting
• Currency parsing by either currency code or symbol
• Symmetric & asymmetric exchange rates (asymmetric exchange rates are useful if there are fees
associated with exchanging the currency)
• Range faceting (using either facet.range or type:range in json.facet) as long as the start and end
values are specified in the same Currency.
Configuring Currencies
CurrencyField has been Deprecated
CurrencyField has been deprecated in favor of CurrencyFieldType; all configuration
examples below use CurrencyFieldType.
The currency field type is defined in schema.xml. This is the default configuration of this type.
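A typical definition, similar to the one in Solr’s sample configsets, is sketched below:

<fieldType name="currency" class="solr.CurrencyFieldType"
           amountLongSuffix="_l_ns" codeStrSuffix="_s_ns"
           defaultCurrency="USD" currencyConfig="currency.xml"/>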
In this example, we have defined the name and class of the field type, and defined the defaultCurrency as
"USD", for U.S. Dollars. We have also defined a currencyConfig to use a file called "currency.xml". This is a
file of exchange rates between our default currency and other currencies. There is an alternate
implementation that would allow regular downloading of currency data. See Exchange Rates below for
more.
Many of the example schemas that ship with Solr include a dynamic field that uses this type, such as this
example:
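(A sketch, assuming the currency field type defined above.)

<dynamicField name="*_c" type="currency" indexed="true" stored="true"/>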
This dynamic field would match any field that ends in _c and make it a currency typed field.
At indexing time, money fields can be indexed in a native currency. For example, if a product on an e-commerce site is listed in Euros, indexing the price field as "1000,EUR" will index it appropriately. The price
should be separated from the currency by a comma, and the price must be encoded with a floating point
value (a decimal point).
During query processing, range and point queries are both supported.
Sub-field Suffixes
You must specify parameters amountLongSuffix and codeStrSuffix, corresponding to dynamic fields to be
used for the raw amount and the currency dynamic sub-fields, for example:
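(One possible definition, matching the suffixes described in the next paragraph.)

<fieldType name="currency" class="solr.CurrencyFieldType"
           amountLongSuffix="_l_ns" codeStrSuffix="_s_ns"
           defaultCurrency="USD" currencyConfig="currency.xml"/>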
In the above example, the raw amount field will use the "*_l_ns" dynamic field, which must exist in the
schema and use a long field type, i.e., one that extends LongValueFieldType. The currency code field will
use the "*_s_ns" dynamic field, which must exist in the schema and use a string field type, i.e., one that is or
extends StrField.
Atomic Updates won’t work if dynamic sub-fields are stored
As noted on Updating Parts of Documents, stored dynamic sub-fields will cause indexing to
fail when you use Atomic Updates. To avoid this problem, specify stored="false" on those
dynamic fields.
Exchange Rates
You configure exchange rates by specifying a provider. Natively, two provider types are supported:
FileExchangeRateProvider or OpenExchangeRatesOrgProvider.
FileExchangeRateProvider
This provider requires you to provide a file of exchange rates. It is the default, meaning that to use this
provider you only need to specify the file path and name as a value for currencyConfig in the definition for
this type.
There is a sample currency.xml file included with Solr, found in the same directory as the schema.xml file.
Entries in this file are rate elements of the following form (the currency codes shown here are illustrative; the rate values are from the sample file):
<rate from="USD" to="EUR" rate="0.869914" />
<rate from="USD" to="HKD" rate="7.800095" />
<rate from="USD" to="NOK" rate="8.966508" />
OpenExchangeRatesOrgProvider
You can configure Solr to download exchange rates from OpenExchangeRates.Org, with rates between USD
and around 170 currencies updated hourly. These rates are symmetrical only.
In this case, you need to specify the providerClass in the definitions for the field type and sign up for an API
key. Here is an example:
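(A sketch; yourPersonalAppIdKey is a placeholder for your own OpenExchangeRates.Org API key.)

<fieldType name="currency" class="solr.CurrencyFieldType"
           amountLongSuffix="_l_ns" codeStrSuffix="_s_ns"
           providerClass="solr.OpenExchangeRatesOrgProvider"
           refreshInterval="60"
           ratesFileLocation="http://www.openexchangerates.org/api/latest.json?app_id=yourPersonalAppIdKey"/>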
The refreshInterval is specified in minutes, so the above example will download the newest rates every 60 minutes.
The refresh interval may be increased, but not decreased.
Working with Dates
Date Formatting
Solr’s date fields (DatePointField, DateRangeField and the deprecated TrieDateField) represent "dates"
as a point in time with millisecond precision. The format used is a restricted form of the canonical
representation of dateTime in the XML Schema specification – a restricted subset of ISO-8601. For those
familiar with Java date handling, Solr uses DateTimeFormatter.ISO_INSTANT for formatting, and parsing too
with "leniency".
YYYY-MM-DDThh:mm:ssZ
• YYYY is the year.
• MM is the month.
• DD is the day of the month.
• hh is the hour of the day as on a 24-hour clock.
• mm is minutes.
• ss is seconds.
• Z is a literal 'Z' character indicating that this string representation of the date is in UTC
Note that no time zone can be specified; the String representation of dates is always expressed in
Coordinated Universal Time (UTC). Here is an example value:
1972-05-20T17:33:18Z
You can optionally include fractional seconds if you wish, although any precision beyond milliseconds will be
ignored. Here are example values with sub-seconds:
• 1972-05-20T17:33:18.772Z
• 1972-05-20T17:33:18.77Z
• 1972-05-20T17:33:18.7Z
There must be a leading '-' for dates prior to year 0000, and Solr will format dates with a leading '+' for
years after 9999. Year 0000 is considered year 1 BC; there is no such thing as year 0 AD or BC.
Query escaping may be required
As you can see, the date format includes colon characters separating the hours, minutes,
and seconds. Because the colon is a special character to Solr’s most common query
parsers, escaping is sometimes required, depending on exactly what you are trying to do.
This is normally an invalid query: datefield:1972-05-20T17:33:18.772Z
These are valid queries:
datefield:1972-05-20T17\:33\:18.772Z
datefield:"1972-05-20T17:33:18.772Z"
datefield:[1972-05-20T17:33:18.772Z TO *]
Date Range Formatting
Solr’s DateRangeField supports the same point in time date syntax described above (with date math
described below) and more to express date ranges. One class of examples is truncated dates, which
represent the entire date span to the precision indicated. The other class uses the range syntax ([ TO ]).
Here are some examples:
• 2000-11 – The entire month of November, 2000.
• 1605-11-05 – The Fifth of November.
• 2000-11-05T13 – Likewise but for an hour of the day (1300 to before 1400, i.e., 1pm to 2pm).
• -0009 – The year 10 BC. A 0 in the year position is 0 AD, and is also considered 1 BC.
• [2000-11-01 TO 2014-12-01] – The specified date range at a day resolution.
• [2014 TO 2014-12-01] – From the start of 2014 till the end of the first day of December.
• [* TO 2014-12-01] – From the earliest representable time thru till the end of the day on 2014-12-01.
Limitations: The range syntax doesn’t support embedded date math. If you specify a date instance
supported by DatePointField with date math truncating it, like NOW/DAY, you still get the first millisecond of
that day, not the entire day’s range. Exclusive ranges (using { & }) work in queries but not for indexing
ranges.
Date Math
Solr’s date field types also support date math expressions, which make it easy to create times relative to
fixed moments in time, including the current time, which can be represented using the special value of “NOW”.
Date Math Syntax
Date math expressions consist of either adding some quantity of time in a specified unit, or rounding the
current time by a specified unit. Expressions can be chained and are evaluated left to right.
For example, this represents a point in time two months from now:
NOW+2MONTHS
This is one day ago:
NOW-1DAY
A slash is used to indicate rounding. This represents the beginning of the current hour:
NOW/HOUR
The following example computes (with millisecond precision) the point in time six months and three days
into the future and then rounds that time to the beginning of that day:
NOW+6MONTHS+3DAYS/DAY
Note that while date math is most commonly used relative to NOW it can be applied to any fixed moment in
time as well:
1972-05-20T17:33:18.772Z+6MONTHS+3DAYS/DAY
Request Parameters That Affect Date Math
NOW
The NOW parameter is used internally by Solr to ensure consistent date math expression parsing across
multiple nodes in a distributed request. But it can be specified to instruct Solr to use an arbitrary moment in
time (past or future) to override for all situations where the special value of “NOW” would impact date
math expressions.
It must be specified as (long-valued) milliseconds since epoch.
Example:
q=solr&fq=start_date:[* TO NOW]&NOW=1384387200000
TZ
By default, all date math expressions are evaluated relative to the UTC TimeZone, but the TZ parameter can
be specified to override this behaviour, by forcing all date based addition and rounding to be relative to the
specified time zone.
For example, the following request will use range faceting to facet over the current month, "per day"
relative to UTC:
http://localhost:8983/solr/my_collection/select?q=*:*&facet.range=my_date_field&facet=true&facet.
range.start=NOW/MONTH&facet.range.end=NOW/MONTH%2B1MONTH&facet.range.gap=%2B1DAY&wt=xml
<int name="2013-11-01T00:00:00Z">0</int>
<int name="2013-11-02T00:00:00Z">0</int>
<int name="2013-11-03T00:00:00Z">0</int>
<int name="2013-11-04T00:00:00Z">0</int>
<int name="2013-11-05T00:00:00Z">0</int>
<int name="2013-11-06T00:00:00Z">0</int>
<int name="2013-11-07T00:00:00Z">0</int>
While in this example, the "days" will be computed relative to the specified time zone - including any
applicable Daylight Savings Time adjustments:
http://localhost:8983/solr/my_collection/select?q=*:*&facet.range=my_date_field&facet=true&facet.
range.start=NOW/MONTH&facet.range.end=NOW/MONTH%2B1MONTH&facet.range.gap=%2B1DAY&TZ=America/Los_A
ngeles&wt=xml
<int name="2013-11-01T07:00:00Z">0</int>
<int name="2013-11-02T07:00:00Z">0</int>
<int name="2013-11-03T07:00:00Z">0</int>
<int name="2013-11-04T08:00:00Z">0</int>
<int name="2013-11-05T08:00:00Z">0</int>
<int name="2013-11-06T08:00:00Z">0</int>
<int name="2013-11-07T08:00:00Z">0</int>
More DateRangeField Details
DateRangeField is almost a drop-in replacement for places where DatePointField is used. The only
difference is that Solr’s XML or SolrJ response formats will expose the stored data as a String instead of a
Date. The underlying index data for this field will be a bit larger. Queries that align to units of time a second
on up should be faster than TrieDateField, especially if it’s in UTC.
The main point of DateRangeField, as its name suggests, is to allow indexing date ranges. To do that, simply
supply strings in the format shown above. It also supports specifying 3 different relational predicates
between the indexed data, and the query range:
• Intersects (default)
• Contains
• Within
You can specify the predicate by querying using the op local-params parameter like so:
fq={!field f=dateRange op=Contains}[2013 TO 2018]
Unlike most local parameters, op is actually not defined by any query parser (field); it is defined by the field
type, in this case DateRangeField. In the above example, it would find documents with indexed ranges that
contain (or equal) the range 2013 thru 2018. Multi-valued overlapping indexed ranges in a document are
effectively coalesced.
For a DateRangeField example use-case, see Solr’s community wiki.
Working with Enum Fields
EnumFieldType allows defining a field whose values are a closed set, and the sort order is pre-determined
but is neither alphabetic nor numeric. Examples of this are severity lists, or risk definitions.
EnumField has been Deprecated
EnumField has been deprecated in favor of EnumFieldType; all configuration examples
below use EnumFieldType.
Defining an EnumFieldType in schema.xml
The EnumFieldType type definition is quite simple, as in this example defining field types for "priorityLevel"
and "riskLevel" enumerations:
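(A sketch of two such definitions; the enumsConfig file name and enum names are illustrative and must match your configuration file.)

<fieldType name="priorityLevel" class="solr.EnumFieldType" docValues="true" enumsConfig="enumsConfig.xml" enumName="priority"/>
<fieldType name="riskLevel"     class="solr.EnumFieldType" docValues="true" enumsConfig="enumsConfig.xml" enumName="risk"/>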
Besides the name and the class, which are common to all field types, this type also takes two additional
parameters:
enumsConfig
the name of a configuration file that contains the list of field values and their order that you wish
to use with this field type. If a path to the file is not specified, the file should be in the conf
directory for the collection.
enumName
the name of the specific enumeration in the enumsConfig file to use for this type.
Note that docValues="true" must be specified either in the EnumFieldType fieldType or field specification.
Defining the EnumFieldType Configuration File
The file named with the enumsConfig parameter can contain multiple enumeration value lists with different
names if there are multiple uses for enumerations in your Solr schema.
In this example, there are two value lists defined. Each list is between enum opening and closing tags:
<?xml version="1.0" ?>
<enumsConfig>
  <enum name="priority">
    <value>Not Available</value>
    <value>Low</value>
    <value>Medium</value>
    <value>High</value>
    <value>Urgent</value>
  </enum>
  <enum name="risk">
    <value>Unknown</value>
    <value>Very Low</value>
    <value>Low</value>
    <value>Medium</value>
    <value>High</value>
    <value>Critical</value>
  </enum>
</enumsConfig>
Changing Values
You cannot change the order of, or remove, existing values in an <enum/> without reindexing.
You can however add new values to the end.
Working with External Files and Processes
The ExternalFileField Type
The ExternalFileField type makes it possible to specify the values for a field in a file outside the Solr index.
For such a field, the file contains mappings from a key field to the field value. Another way to think of this is
that, instead of specifying the field in documents as they are indexed, Solr finds values for this field in the
external file.
External fields are not searchable. They can be used only for function queries or display.
For more information on function queries, see the section on Function Queries.
The ExternalFileField type is handy for cases where you want to update a particular field in many
documents more often than you want to update the rest of the documents. For example, suppose you have
implemented a document rank based on the number of views. You might want to update the rank of all the
documents daily or hourly, while the rest of the contents of the documents might be updated much less
frequently. Without ExternalFileField, you would need to update each document just to change the rank.
Using ExternalFileField is much more efficient because all document values for a particular field are
stored in an external file that can be updated as frequently as you wish.
In schema.xml, the definition of this field type might look like this:
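(A sketch; the keyField value pkey is illustrative and should name the field that identifies your documents.)

<fieldType name="entryRankFile" keyField="pkey" defVal="0" stored="false" indexed="false" class="solr.ExternalFileField"/>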
The keyField attribute defines the key that will be defined in the external file. It is usually the unique key for
the index, but it doesn’t need to be as long as the keyField can be used to identify documents in the index.
A defVal defines a default value that will be used if there is no entry in the external file for a particular
document.
Format of the External File
The file itself is located in Solr’s index directory, which by default is $SOLR_HOME/data. The name of the file
should be external_fieldname or external_fieldname.*, where fieldname is the name of the field. For the example above, then, the file could be
named external_entryRankFile or external_entryRankFile.txt.
If any files using the name pattern .* (such as .txt) appear, the last (after being sorted by
name) will be used and previous versions will be deleted. This behavior supports
implementations on systems where one may not be able to overwrite a file (for example,
on Windows, if the file is in use).
The file contains entries that map a key field, on the left of the equals sign, to a value, on the right. Here are
a few example entries:
doc33=1.414
doc34=3.14159
doc40=42
The keys listed in this file do not need to be unique. The file does not need to be sorted, but Solr will be able
to perform the lookup faster if it is.
Reloading an External File
It’s possible to define an event listener to reload an external file when either a searcher is reloaded or when
a new searcher is started. See the section Query-Related Listeners for more information, but a sample
definition in solrconfig.xml might look like this:
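(One possible configuration, reloading the external file on both searcher events.)

<listener event="newSearcher" class="org.apache.solr.schema.ExternalFileFieldReloader"/>
<listener event="firstSearcher" class="org.apache.solr.schema.ExternalFileFieldReloader"/>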
The PreAnalyzedField Type
The PreAnalyzedField type provides a way to send to Solr serialized token streams, optionally with
independent stored values of a field, and have this information stored and indexed without any additional
text processing applied in Solr. This is useful if the user wants to submit field content that was already processed
by some existing external text processing pipeline (e.g., it has been tokenized, annotated, stemmed,
synonyms inserted, etc.), while using all the rich attributes that Lucene’s TokenStream provides (per-token
attributes).
The serialization format is pluggable using implementations of PreAnalyzedParser interface. There are two
out-of-the-box implementations:
• JsonPreAnalyzedParser: as the name suggests, it parses content that uses JSON to represent the field’s
content. This is the default parser to use if the field type is not configured otherwise.
• SimplePreAnalyzedParser: uses a simple strict plain text format, which in some situations may be easier
to create than JSON.
There is only one configuration parameter, parserImpl. The value of this parameter should be a fully
qualified class name of a class that implements PreAnalyzedParser interface. The default value of this
parameter is org.apache.solr.schema.JsonPreAnalyzedParser.
By default, the query-time analyzer for fields of this type will be the same as the index-time analyzer, which
expects serialized pre-analyzed text. You must add a query type analyzer to your fieldType in order to
perform analysis on non-pre-analyzed queries. In the example below, the index-time analyzer expects the
default JSON serialization format, and the query-time analyzer will employ
StandardTokenizer/LowerCaseFilter:
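(A sketch; the field type name is illustrative.)

<fieldType name="pre_with_query_analyzer" class="solr.PreAnalyzedField">
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>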
JsonPreAnalyzedParser
This is the default serialization format used by PreAnalyzedField type. It uses a top-level JSON map with the
following keys:
v
Version key. Currently the supported version is 1. (required)

str
Stored string value of a field. You can use at most one of str or bin. (optional)

bin
Stored binary value of a field. The binary value has to be Base64 encoded. (optional)

tokens
Serialized token stream. This is a JSON list. (optional)
Any other top-level key is silently ignored.
Token Stream Serialization
The token stream is expressed as a JSON list of JSON maps. The map for each token consists of the following
keys and values:
t
Token (Lucene attribute: CharTermAttribute). Value: UTF-8 string representing the current token. Required.

s
Start offset (OffsetAttribute). Value: non-negative integer. Optional.

e
End offset (OffsetAttribute). Value: non-negative integer. Optional.

i
Position increment (PositionIncrementAttribute). Value: non-negative integer, default is 1. Optional.

p
Payload (PayloadAttribute). Value: Base64 encoded payload. Optional.

y
Lexical type (TypeAttribute). Value: UTF-8 string. Optional.

f
Flags (FlagsAttribute). Value: string representing an integer value in hexadecimal format. Optional.
Any other key is silently ignored.
JsonPreAnalyzedParser Example
{
"v":"1",
"str":"test ąćęłńóśźż",
"tokens": [
{"t":"one","s":123,"e":128,"i":22,"p":"DQ4KDQsODg8=","y":"word"},
{"t":"two","s":5,"e":8,"i":1,"y":"word"},
{"t":"three","s":20,"e":22,"i":1,"y":"foobar"}
]
}
SimplePreAnalyzedParser
The fully qualified class name to use when specifying this format via the parserImpl configuration
parameter is org.apache.solr.schema.SimplePreAnalyzedParser.
SimplePreAnalyzedParser Syntax
The serialization format supported by this parser is as follows:
Serialization format
content ::= version (stored)? tokens
version ::= digit+ " "
; stored field value - any "=" inside must be escaped!
stored ::= "=" text "="
tokens ::= (token ((" ") + token)*)*
token ::= text ("," attrib)*
attrib ::= name '=' value
name ::= text
value ::= text
Special characters in "text" values can be escaped using the escape character \. The following escape
sequences are recognized:
EscapeSequence
Description
\ (a backslash followed by a space)
literal space character
\,
literal , character
\=
literal = character
\\
literal \ character
\n
newline
\r
carriage return
\t
horizontal tab
Please note that Unicode sequences (e.g., \u0001) are not supported.
Supported Attributes
The following token attributes are supported, and identified with short symbolic names:
i
position increment (PositionIncrementAttribute). Value format: integer.

s
start offset (OffsetAttribute). Value format: integer.

e
end offset (OffsetAttribute). Value format: integer.

y
lexical type (TypeAttribute). Value format: string.

f
flags (FlagsAttribute). Value format: hexadecimal integer.

p
payload (PayloadAttribute). Value format: bytes in hexadecimal format; whitespace is ignored.
Token positions are tracked and implicitly added to the token stream - the start and end offsets consider
only the term text and whitespace, and exclude the space taken by token attributes.
Example Token Streams
1 one two three
• version: 1
• stored: null
• token: (term=one,startOffset=0,endOffset=3)
• token: (term=two,startOffset=4,endOffset=7)
• token: (term=three,startOffset=8,endOffset=13)
1 one
two
three
• version: 1
• stored: null
• token: (term=one,startOffset=0,endOffset=3)
• token: (term=two,startOffset=5,endOffset=8)
• token: (term=three,startOffset=11,endOffset=16)
1 one,s=123,e=128,i=22 two three,s=20,e=22
• version: 1
• stored: null
• token: (term=one,positionIncrement=22,startOffset=123,endOffset=128)
• token: (term=two,positionIncrement=1,startOffset=5,endOffset=8)
• token: (term=three,positionIncrement=1,startOffset=20,endOffset=22)
1 \ one\ \,,i=22,a=\, two\=
\n,\ =\ \
• version: 1
• stored: null
• token: (term=one ,,positionIncrement=22,startOffset=0,endOffset=6)
• token: (term=two= ,positionIncrement=1,startOffset=7,endOffset=15)
• token: (term=\,positionIncrement=1,startOffset=17,endOffset=18)
Note that unknown attributes and their values are ignored, so in this example, the “a” attribute on the first
token and the " " (escaped space) attribute on the second token are ignored, along with their values,
because they are not among the supported attribute names.
1 ,i=22 ,i=33,s=2,e=20 ,
• version: 1
• stored: null
• token: (term=,positionIncrement=22,startOffset=0,endOffset=0)
• token: (term=,positionIncrement=33,startOffset=2,endOffset=20)
• token: (term=,positionIncrement=1,startOffset=2,endOffset=2)
1 =This is the stored part with \=
\n \t escapes.=one two three
• version: 1
• stored: This is the stored part with = \t escapes.
• token: (term=one,startOffset=0,endOffset=3)
• token: (term=two,startOffset=4,endOffset=7)
• token: (term=three,startOffset=8,endOffset=13)
Note that the \t in the above stored value is not literal; it’s shown that way to visually indicate the actual tab
char that is in the stored value.
1 ==
• version: 1
• stored: ""
• (no tokens)
1 =this is a test.=
• version: 1
• stored: this is a test.
• (no tokens)
Field Properties by Use Case
Here is a summary of common use cases, and the attributes the fields or field types should have to support
the case. An entry of true or false in the table indicates that the option must be set to the given value for the
use case to function correctly. If no entry is provided, the setting of that attribute has no impact on the case.
• search within field: indexed=true
• retrieve contents: stored=true (8) or docValues=true (8)
• use as unique key: indexed=true, multiValued=false
• sort on field: indexed=true (7) or docValues=true (7); multiValued=false (9); omitNorms=true (1)
• highlighting: indexed=true (4); stored=true (5); termVectors=true (2); termPositions=true (3)
• faceting: indexed=true (7) or docValues=true (7)
• add multiple values, maintaining order: multiValued=true
• field length affects doc score: indexed=true; omitNorms=false
• MoreLikeThis (5): termVectors=true (6)
Notes:
1. Recommended but not necessary.
2. Will be used if present, but not necessary.
3. (if termVectors=true)
4. A tokenizer must be defined for the field, but it doesn’t need to be indexed.
5. Described in Understanding Analyzers, Tokenizers, and Filters.
6. Term vectors are not mandatory here. If not true, then a stored field is analyzed. So term vectors are
recommended, but only required if stored=false.
7. For most field types, either indexed or docValues must be true, but both are not required. DocValues can
be more efficient in many cases. For [Int/Long/Float/Double/Date]PointFields, docValues=true is
required.
8. Stored content will be used by default, but docValues can alternatively be used. See DocValues.
9. Multi-valued sorting may be performed on docValues-enabled fields using the two-argument field()
function, e.g., field(myfield,min); see the field() function in Function Queries.
Defining Fields
Fields are defined in the fields element of schema.xml. Once you have the field types set up, defining the
fields themselves is simple.
Example Field Definition
The following example defines a field named price with a type named float and a default value of 0.0; the
indexed and stored properties are explicitly set to true, while any other properties specified on the float
field type are inherited.
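Such a definition might look like this (assuming a field type named float exists in the schema):

<field name="price" type="float" default="0.0" indexed="true" stored="true"/>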
Field Properties
Field definitions can have the following properties:
name
The name of the field. Field names should consist of alphanumeric or underscore characters only and not
start with a digit. This is not currently strictly enforced, but other field names will not have first class
support from all components and back compatibility is not guaranteed. Names with both leading and
trailing underscores (e.g., _version_) are reserved. Every field must have a name.
type
The name of the fieldType for this field. This will be found in the name attribute on the fieldType
definition. Every field must have a type.
default
A default value that will be added automatically to any document that does not have a value in this field
when it is indexed. If this property is not specified, there is no default.
Optional Field Type Override Properties
Fields can have many of the same properties as field types. Properties from the table below which are
specified on an individual field will override any explicit value for that property specified on the
fieldType of the field, or any implicit default property value provided by the underlying fieldType
implementation. The table below is reproduced from Field Type Definitions and Properties, which has more
details:
indexed
If true, the value of the field can be used in queries to retrieve matching documents. Values: true or false. Implicit default: true.

stored
If true, the actual value of the field can be retrieved by queries. Values: true or false. Implicit default: true.

docValues
If true, the value of the field will be put in a column-oriented DocValues structure. Values: true or false. Implicit default: false.

sortMissingFirst, sortMissingLast
Control the placement of documents when a sort field is not present. Values: true or false. Implicit default: false.

multiValued
If true, indicates that a single document might contain multiple values for this field type. Values: true or false. Implicit default: false.

uninvertible
If true, indicates that an indexed="true" docValues="false" field can be "uninverted" at query time to build up a large in-memory data structure to serve in place of DocValues. Defaults to true for historical reasons, but users are strongly encouraged to set this to false for stability and use docValues="true" as needed. Values: true or false. Implicit default: true.

omitNorms
If true, omits the norms associated with this field (this disables length normalization for the field, and saves some memory). Defaults to true for all primitive (non-analyzed) field types, such as int, float, date, bool, and string. Only full-text fields or fields that need an index-time boost need norms. Values: true or false. Implicit default: * (varies by field type, as described).

omitTermFreqAndPositions
If true, omits term frequency, positions, and payloads from postings for this field. This can be a performance boost for fields that don’t require that information. It also reduces the storage space required for the index. Queries that rely on position that are issued on a field with this option will silently fail to find documents. This property defaults to true for all field types that are not text fields. Values: true or false. Implicit default: * (varies by field type, as described).

omitPositions
Similar to omitTermFreqAndPositions but preserves term frequency information. Values: true or false. Implicit default: * (varies by field type, as described).

termVectors, termPositions, termOffsets, termPayloads
These options instruct Solr to maintain full term vectors for each document, optionally including position, offset and payload information for each term occurrence in those vectors. These can be used to accelerate highlighting and other ancillary functionality, but impose a substantial cost in terms of index size. They are not necessary for typical uses of Solr. Values: true or false. Implicit default: false.

required
Instructs Solr to reject any attempts to add a document which does not have a value for this field. This property defaults to false. Values: true or false. Implicit default: false.

useDocValuesAsStored
If the field has docValues enabled, setting this to true would allow the field to be returned as if it were a stored field (even if it has stored=false) when matching “*” in an fl parameter. Values: true or false. Implicit default: true.

large
Large fields are always lazy loaded and will only take up space in the document cache if the actual value is < 512KB. This option requires stored="true" and multiValued="false". It’s intended for fields that might have very large values so that they don’t get cached in memory. Values: true or false. Implicit default: false.
Copying Fields
You might want to interpret some document fields in more than one way. Solr has a mechanism for making
copies of fields so that you can apply several distinct field types to a single piece of incoming information.
The name of the field you want to copy is the source, and the name of the copy is the destination. In
schema.xml, it’s very simple to make copies of fields:
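(A rule of this form copies the cat field into the text field.)

<copyField source="cat" dest="text"/>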
In this example, we want Solr to copy the cat field to a field named text. Fields are copied before analysis is
done, meaning you can have two fields with identical original content, but which use different analysis
chains and are stored in the index differently.
In the example above, if the text destination field has data of its own in the input documents, the contents
of the cat field will be added as additional values – just as if all of the values had originally been specified by
the client. Remember to configure your fields as multiValued="true" if they will ultimately get multiple
values (either from a multivalued source or from multiple copyField directives).
A common usage for this functionality is to create a single "search" field that will serve as the default query
field when users or clients do not specify a field to query. For example, title, author, keywords, and body
may all be fields that should be searched by default, with copy field rules for each field to copy to a catchall
field (for example, it could be named anything). Later you can set a rule in solrconfig.xml to search the
catchall field by default. One caveat to this is your index will grow when using copy fields. However,
whether this becomes problematic for you and the final size will depend on the number of fields being
copied, the number of destination fields being copied to, the analysis in use, and the available disk space.
The maxChars parameter, an int parameter, establishes an upper limit for the number of characters to be
copied from the source value when constructing the value added to the destination field. This limit is useful
for situations in which you want to copy some data from the source field, but also control the size of index
files.
Both the source and the destination of copyField can contain either leading or trailing asterisks, which will
match anything. For example, the following line will copy the contents of all incoming fields that match the
wildcard pattern *_t to the text field:
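(One way to write it; the maxChars limit is optional and shown only for illustration.)

<copyField source="*_t" dest="text" maxChars="3000"/>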
The copyField command can use a wildcard (*) character in the dest parameter only if the
source parameter contains one as well. copyField uses the matching glob from the source
field for the dest field name into which the source content is copied.
Copying is done at the stream source level and no copy feeds into another copy. This means that copy fields
cannot be chained, i.e., you cannot copy from here to there and then from there to elsewhere. However, the
same source field can be copied to multiple destination fields:
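(A sketch; the field names are illustrative.)

<copyField source="title" dest="title_stemmed"/>
<copyField source="title" dest="title_exact"/>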
Dynamic Fields
Dynamic fields allow Solr to index fields that you did not explicitly define in your schema.
This is useful if you discover you have forgotten to define one or more fields. Dynamic fields can make your
application less brittle by providing some flexibility in the documents you can add to Solr.
A dynamic field is just like a regular field except it has a name with a wildcard in it. When you are indexing
documents, a field that does not match any explicitly defined fields can be matched with a dynamic field.
For example, suppose your schema includes a dynamic field with a name of *_i. If you attempt to index a
document with a cost_i field, but no explicit cost_i field is defined in the schema, then the cost_i field will
have the field type and analysis defined for *_i.
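A sketch of such a dynamic field (the int type name must match a field type defined in your schema):

<dynamicField name="*_i" type="int" indexed="true" stored="true"/>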
Like regular fields, dynamic fields have a name, a field type, and options.
It is recommended that you include basic dynamic field mappings (like that shown above) in your
schema.xml. The mappings can be very useful.
Other Schema Elements
This section describes several other important elements of schema.xml not covered in earlier sections.
Unique Key
The uniqueKey element specifies which field is a unique identifier for documents. Although uniqueKey is not
required, it is nearly always warranted by your application design. For example, uniqueKey should be used if
you will ever update a document in the index.
You can define the unique key field by naming it:
<uniqueKey>id</uniqueKey>
Schema defaults and copyFields cannot be used to populate the uniqueKey field. The fieldType of
uniqueKey must not be analyzed and must not be any of the *PointField types. You can use
UUIDUpdateProcessorFactory to have uniqueKey values generated automatically.
Further, the operation will fail if the uniqueKey field is used but is multivalued (or inherits multi-valuedness from the fieldType). However, uniqueKey will continue to work, as long as the field is properly used.
Similarity
Similarity is a Lucene class used to score a document in searching.
Each collection has one "global" Similarity, and by default Solr uses an implicit SchemaSimilarityFactory
which allows individual field types to be configured with a "per-type" specific Similarity and implicitly uses
BM25Similarity for any field type which does not have an explicit Similarity.
This default behavior can be overridden by declaring a top level <similarity/> element in your schema.xml,
outside of any single field type. This similarity declaration can either refer directly to the name of a class with
a no-argument constructor, such as in this example showing BM25Similarity:
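(A declaration of this kind.)

<similarity class="solr.BM25Similarity"/>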
or by referencing a SimilarityFactory implementation, which may take optional initialization parameters:
<similarity class="solr.DFRSimilarityFactory">
  <str name="basicModel">P</str>
  <str name="afterEffect">L</str>
  <str name="normalization">H2</str>
  <float name="c">7</float>
</similarity>
In most cases, specifying global level similarity like this will cause an error if your schema.xml also includes
field type specific declarations. One key exception to this is that you may explicitly declare a
SchemaSimilarityFactory and specify what that default behavior will be for all field types that do not
declare an explicit Similarity using the name of field type (specified by defaultSimFromFieldType) that is
configured with a specific similarity:
<similarity class="solr.SchemaSimilarityFactory">
  <str name="defaultSimFromFieldType">text_dfr</str>
</similarity>
<fieldType name="text_dfr" class="solr.TextField">
  <analyzer/>
  <similarity class="solr.DFRSimilarityFactory">
    <str name="basicModel">I(F)</str>
    <str name="afterEffect">B</str>
    <str name="normalization">H3</str>
    <float name="mu">900</float>
  </similarity>
</fieldType>
<fieldType name="text_ib" class="solr.TextField">
  <analyzer/>
  <similarity class="solr.IBSimilarityFactory">
    <str name="distribution">SPL</str>
    <str name="lambda">DF</str>
    <str name="normalization">H2</str>
  </similarity>
</fieldType>
In the example above IBSimilarityFactory (using the Information-Based model) will be used for any fields
of type text_ib, while DFRSimilarityFactory (divergence from random) will be used for any fields of type
text_dfr, as well as any fields using a type that does not explicitly specify a <similarity/>.
If SchemaSimilarityFactory is explicitly declared without configuring a defaultSimFromFieldType, then
BM25Similarity is implicitly used as the default.
In addition to the various factories mentioned on this page, there are several other similarity
implementations that can be used such as the SweetSpotSimilarityFactory, ClassicSimilarityFactory,
etc. For details, see the Solr Javadocs for the similarity factories.
Schema API
The Schema API allows you to use an HTTP API to manage many of the elements of your schema.
The Schema API utilizes the ManagedIndexSchemaFactory class, which is the default schema factory in
modern Solr versions. See the section Schema Factory Definition in SolrConfig for more information about
choosing a schema factory for your index.
This API provides read and write access to the Solr schema for each collection (or core, when using
standalone Solr). Read access to all schema elements is supported. Fields, dynamic fields, field types and
copyField rules may be added, removed or replaced. Future Solr releases will extend write access to allow
more schema elements to be modified.
Why is hand editing of the managed schema discouraged?
The file named "managed-schema" in the example configurations may include a note that
recommends never hand-editing the file. Before the Schema API existed, such edits were
the only way to make changes to the schema, and users may have a strong desire to
continue making changes this way.
The reason that this is discouraged is because hand-edits of the schema may be lost if the
Schema API described here is later used to make a change, unless the core or collection is
reloaded or Solr is restarted before using the Schema API. If care is taken to always reload
or restart after a manual edit, then there is no problem at all with doing those edits.
The API allows two output modes for all calls: JSON or XML. When requesting the complete schema, there is a third output mode whose XML is modeled on the managed-schema file itself.
When modifying the schema with the API, a core reload will automatically occur in order for the changes to
be available immediately for documents indexed thereafter. Previously indexed documents will not be
automatically updated - they must be re-indexed if existing index data uses schema elements that you
changed.
Re-index after schema modifications!
If you modify your schema, you will likely need to re-index all documents. If you do not, you
may lose access to documents, or not be able to interpret them properly, e.g., after
replacing a field type.
Modifying your schema will never modify any documents that are already indexed. You
must re-index documents in order to apply schema changes to them. Queries and updates
made after the change may encounter errors that were not present before the change.
Completely deleting the index and rebuilding it is usually the only option to fix such errors.
Modify the Schema
To add, remove or replace fields, dynamic field rules, copy field rules, or new field types, you can send a
POST request to the /collection/schema/ endpoint with a sequence of commands in JSON format to
perform the requested actions. The following commands are supported:
• add-field: add a new field with parameters you provide.
• delete-field: delete a field.
• replace-field: replace an existing field with one that is differently configured.
• add-dynamic-field: add a new dynamic field rule with parameters you provide.
• delete-dynamic-field: delete a dynamic field rule.
• replace-dynamic-field: replace an existing dynamic field rule with one that is differently configured.
• add-field-type: add a new field type with parameters you provide.
• delete-field-type: delete a field type.
• replace-field-type: replace an existing field type with one that is differently configured.
• add-copy-field: add a new copy field rule.
• delete-copy-field: delete a copy field rule.
These commands can be issued in separate POST requests or in the same POST request. Commands are
executed in the order in which they are specified.
In each case, the response will include the status and the time to process the request, but will not include
the entire schema.
When modifying the schema with the API, a core reload will automatically occur in order for the changes to
be available immediately for documents indexed thereafter. Previously indexed documents will not be
automatically handled - they must be re-indexed if they used schema elements that you changed.
Add a New Field
The add-field command adds a new field definition to your schema. If a field with the same name exists an
error is thrown.
All of the properties available when defining a field with manual schema.xml edits can be passed via the API.
These request attributes are described in detail in the section Defining Fields.
For example, to define a new stored field named "sell_by", of type "pdate", you would POST the following
request:
V1 API
curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-field":{
"name":"sell_by",
"type":"pdate",
"stored":true }
}' http://localhost:8983/solr/gettingstarted/schema
V2 API
curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-field":{
"name":"sell_by",
"type":"pdate",
"stored":true }
}' http://localhost:8983/api/cores/gettingstarted/schema
Delete a Field
The delete-field command removes a field definition from your schema. If the field does not exist in the
schema, or if the field is the source or destination of a copy field rule, an error is thrown.
For example, to delete a field named "sell_by", you would POST the following request:
V1 API
curl -X POST -H 'Content-type:application/json' --data-binary '{
"delete-field" : { "name":"sell_by" }
}' http://localhost:8983/solr/gettingstarted/schema
V2 API
curl -X POST -H 'Content-type:application/json' --data-binary '{
"delete-field" : { "name":"sell_by" }
}' http://localhost:8983/api/cores/gettingstarted/schema
Replace a Field
The replace-field command replaces a field’s definition. Note that you must supply the full definition for a
field - this command will not partially modify a field’s definition. If the field does not exist in the schema an
error is thrown.
All of the properties available when defining a field with manual schema.xml edits can be passed via the API.
These request attributes are described in detail in the section Defining Fields.
For example, to replace the definition of an existing field "sell_by", to make it be of type "date" and to not be
stored, you would POST the following request:
V1 API
curl -X POST -H 'Content-type:application/json' --data-binary '{
"replace-field":{
"name":"sell_by",
"type":"date",
"stored":false }
}' http://localhost:8983/solr/gettingstarted/schema
V2 API
curl -X POST -H 'Content-type:application/json' --data-binary '{
"replace-field":{
"name":"sell_by",
"type":"date",
"stored":false }
}' http://localhost:8983/api/cores/gettingstarted/schema
Add a Dynamic Field Rule
The add-dynamic-field command adds a new dynamic field rule to your schema.
All of the properties available when editing schema.xml can be passed with the POST request. The section
Dynamic Fields has details on all of the attributes that can be defined for a dynamic field rule.
For example, to create a new dynamic field rule where all incoming fields ending with "_s" would be stored
and have field type "string", you can POST a request like this:
V1 API
curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-dynamic-field":{
"name":"*_s",
"type":"string",
"stored":true }
}' http://localhost:8983/solr/gettingstarted/schema
V2 API
curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-dynamic-field":{
"name":"*_s",
"type":"string",
"stored":true }
}' http://localhost:8983/api/cores/gettingstarted/schema
Delete a Dynamic Field Rule
The delete-dynamic-field command deletes a dynamic field rule from your schema. If the dynamic field
rule does not exist in the schema, or if the schema contains a copy field rule with a source or destination that matches only this dynamic field rule, an error is thrown.
For example, to delete a dynamic field rule matching "*_s", you can POST a request like this:
V1 API
curl -X POST -H 'Content-type:application/json' --data-binary '{
"delete-dynamic-field":{ "name":"*_s" }
}' http://localhost:8983/solr/gettingstarted/schema
V2 API
curl -X POST -H 'Content-type:application/json' --data-binary '{
"delete-dynamic-field":{ "name":"*_s" }
}' http://localhost:8983/api/cores/gettingstarted/schema
Replace a Dynamic Field Rule
The replace-dynamic-field command replaces a dynamic field rule in your schema. Note that you must
supply the full definition for a dynamic field rule - this command will not partially modify a dynamic field
rule’s definition. If the dynamic field rule does not exist in the schema an error is thrown.
All of the properties available when editing schema.xml can be passed with the POST request. The section
Dynamic Fields has details on all of the attributes that can be defined for a dynamic field rule.
For example, to replace the definition of the "*_s" dynamic field rule with one where the field type is
"text_general" and it’s not stored, you can POST a request like this:
V1 API
curl -X POST -H 'Content-type:application/json' --data-binary '{
"replace-dynamic-field":{
"name":"*_s",
"type":"text_general",
"stored":false }
}' http://localhost:8983/solr/gettingstarted/schema
V2 API
curl -X POST -H 'Content-type:application/json' --data-binary '{
"replace-dynamic-field":{
"name":"*_s",
"type":"text_general",
"stored":false }
}' http://localhost:8983/api/cores/gettingstarted/schema
Add a New Field Type
The add-field-type command adds a new field type to your schema.
All of the field type properties available when editing schema.xml by hand are available for use in a POST
request. The structure of the command is a JSON mapping of the standard field type definition, including the
name, class, index and query analyzer definitions, etc. Details of all of the available options are described in
the section Solr Field Types.
For example, to create a new field type named "myNewTxtField", you can POST a request as follows:
V1 API with Single Analysis
curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-field-type" : {
"name":"myNewTxtField",
"class":"solr.TextField",
"positionIncrementGap":"100",
"analyzer" : {
"charFilters":[{
"class":"solr.PatternReplaceCharFilterFactory",
"replacement":"$1$1",
"pattern":"([a-zA-Z])\\\\1+" }],
"tokenizer":{
"class":"solr.WhitespaceTokenizerFactory" },
"filters":[{
"class":"solr.WordDelimiterFilterFactory",
"preserveOriginal":"0" }]}}
}' http://localhost:8983/solr/gettingstarted/schema
Note in this example that we have only defined a single analyzer section that will apply to index analysis
and query analysis.
V1 API with Two Analyzers
If we wanted to define separate analysis, we would replace the analyzer section in the above example
with separate sections for indexAnalyzer and queryAnalyzer. As in this example:
curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-field-type":{
"name":"myNewTextField",
"class":"solr.TextField",
"indexAnalyzer":{
"tokenizer":{
"class":"solr.PathHierarchyTokenizerFactory",
"delimiter":"/" }},
"queryAnalyzer":{
"tokenizer":{
"class":"solr.KeywordTokenizerFactory" }}}
}' http://localhost:8983/solr/gettingstarted/schema
V2 API with Two Analyzers
To define two analyzers with the V2 API, we just use a different endpoint:
curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-field-type":{
"name":"myNewTextField",
"class":"solr.TextField",
"indexAnalyzer":{
"tokenizer":{
"class":"solr.PathHierarchyTokenizerFactory",
"delimiter":"/" }},
"queryAnalyzer":{
"tokenizer":{
"class":"solr.KeywordTokenizerFactory" }}}
}' http://localhost:8983/api/cores/gettingstarted/schema
Delete a Field Type
The delete-field-type command removes a field type from your schema. If the field type does not exist in
the schema, or if any field or dynamic field rule in the schema uses the field type, an error is thrown.
For example, to delete the field type named "myNewTxtField", you can make a POST request as follows:
V1 API
curl -X POST -H 'Content-type:application/json' --data-binary '{
"delete-field-type":{ "name":"myNewTxtField" }
}' http://localhost:8983/solr/gettingstarted/schema
V2 API
curl -X POST -H 'Content-type:application/json' --data-binary '{
"delete-field-type":{ "name":"myNewTxtField" }
}' http://localhost:8983/api/cores/gettingstarted/schema
Replace a Field Type
The replace-field-type command replaces a field type in your schema. Note that you must supply the full
definition for a field type - this command will not partially modify a field type’s definition. If the field type
does not exist in the schema an error is thrown.
All of the field type properties available when editing schema.xml by hand are available for use in a POST
request. The structure of the command is a JSON mapping of the standard field type definition, including the
name, class, index and query analyzer definitions, etc. Details of all of the available options are described in
the section Solr Field Types.
For example, to replace the definition of a field type named "myNewTxtField", you can make a POST request
as follows:
V1 API
curl -X POST -H 'Content-type:application/json' --data-binary '{
"replace-field-type":{
"name":"myNewTxtField",
"class":"solr.TextField",
"positionIncrementGap":"100",
"analyzer":{
"tokenizer":{
"class":"solr.StandardTokenizerFactory" }}}
}' http://localhost:8983/solr/gettingstarted/schema
V2 API
curl -X POST -H 'Content-type:application/json' --data-binary '{
"replace-field-type":{
"name":"myNewTxtField",
"class":"solr.TextField",
"positionIncrementGap":"100",
"analyzer":{
"tokenizer":{
"class":"solr.StandardTokenizerFactory" }}}
}' http://localhost:8983/api/cores/gettingstarted/schema
Add a New Copy Field Rule
The add-copy-field command adds a new copy field rule to your schema.
The attributes supported by the command are the same as when creating copy field rules by manually
editing the schema.xml, as below:
source
The source field. This parameter is required.
dest
A field or an array of fields to which the source field will be copied. This parameter is required.
maxChars
The upper limit for the number of characters to be copied. The section Copying Fields has more details.
For example, to define a rule to copy the field "shelf" to the "location" and "catchall" fields, you would POST
the following request:
V1 API
curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-copy-field":{
"source":"shelf",
"dest":[ "location", "catchall" ]}
}' http://localhost:8983/solr/gettingstarted/schema
V2 API
curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-copy-field":{
"source":"shelf",
"dest":[ "location", "catchall" ]}
}' http://localhost:8983/api/cores/gettingstarted/schema
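The maxChars attribute described above can be included in the same command. For example, a sketch that copies at most 256 characters of "shelf" into "location" (the values here are illustrative):

curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-copy-field":{
    "source":"shelf",
    "dest":"location",
    "maxChars":256 }
}' http://localhost:8983/solr/gettingstarted/schema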
Delete a Copy Field Rule
The delete-copy-field command deletes a copy field rule from your schema. If the copy field rule does not
exist in the schema an error is thrown.
The source and dest attributes are required by this command.
For example, to delete a rule to copy the field "shelf" to the "location" field, you would POST the following
request:
V1 API
curl -X POST -H 'Content-type:application/json' --data-binary '{
"delete-copy-field":{ "source":"shelf", "dest":"location" }
}' http://localhost:8983/solr/gettingstarted/schema
V2 API
curl -X POST -H 'Content-type:application/json' --data-binary '{
"delete-copy-field":{ "source":"shelf", "dest":"location" }
}' http://localhost:8983/api/cores/gettingstarted/schema
Multiple Commands in a Single POST
It is possible to perform one or more add requests in a single command. The API is transactional and all
commands in a single call either succeed or fail together.
The commands are executed in the order in which they are specified. This means that if you want to create a
new field type and in the same request use the field type on a new field, the section of the request that
creates the field type must come before the section that creates the new field. Similarly, since a field must
exist for it to be used in a copy field rule, a request to add a field must come before a request for the field to
be used as either the source or the destination for a copy field rule.
The syntax for making multiple requests supports several approaches. First, the commands can simply be
made serially, as in this request to create a new field type and then a field that uses that type:
curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-field-type":{
"name":"myNewTxtField",
"class":"solr.TextField",
"positionIncrementGap":"100",
"analyzer":{
"charFilters":[{
"class":"solr.PatternReplaceCharFilterFactory",
"replacement":"$1$1",
"pattern":"([a-zA-Z])\\\\1+" }],
"tokenizer":{
"class":"solr.WhitespaceTokenizerFactory" },
"filters":[{
"class":"solr.WordDelimiterFilterFactory",
"preserveOriginal":"0" }]}},
"add-field" : {
"name":"sell_by",
"type":"myNewTxtField",
"stored":true }
}' http://localhost:8983/solr/gettingstarted/schema
Or, the same command can be repeated, as in this example:
curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-field":{
"name":"shelf",
"type":"myNewTxtField",
"stored":true },
"add-field":{
"name":"location",
"type":"myNewTxtField",
"stored":true },
"add-copy-field":{
"source":"shelf",
"dest":[ "location", "catchall" ]}
}' http://localhost:8983/solr/gettingstarted/schema
Finally, repeated commands can be sent as an array:
curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-field":[
{ "name":"shelf",
"type":"myNewTxtField",
"stored":true },
{ "name":"location",
"type":"myNewTxtField",
"stored":true }]
}' http://localhost:8983/solr/gettingstarted/schema
Schema Changes among Replicas
When running in SolrCloud mode, changes made to the schema on one node will propagate to all replicas in
the collection.
You can pass the updateTimeoutSecs parameter with your request to set the number of seconds to wait
until all replicas confirm they applied the schema updates. This helps your client application be more robust
in that you can be sure that all replicas have a given schema change within a defined amount of time.
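For example, a sketch of an add-field request that waits up to 30 seconds for all replicas to confirm the change (the field name, type, and timeout value are illustrative):

curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field":{
    "name":"on_sale",
    "type":"boolean",
    "stored":true }
}' "http://localhost:8983/solr/gettingstarted/schema?updateTimeoutSecs=30"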
If agreement is not reached by all replicas in the specified time, then the request fails and the error message
will include information about which replicas had trouble. In most cases, the only option is to re-try the
change after waiting a brief amount of time. If the problem persists, then you’ll likely need to investigate the
server logs on the replicas that had trouble applying the changes.
If you do not supply an updateTimeoutSecs parameter, the default behavior is for the receiving node to
return immediately after persisting the updates to ZooKeeper. All other replicas will apply the updates
asynchronously. Consequently, without supplying a timeout, your client application cannot be sure that all
replicas have applied the changes.
Retrieve Schema Information
The following endpoints allow you to read how your schema has been defined. You can GET the entire
schema, or only portions of it as needed.
To modify the schema, see the previous section Modify the Schema.
Retrieve the Entire Schema
GET /collection/schema
Retrieve Schema Parameters
Path Parameters
collection
The collection (or core) name.
Query Parameters
The query parameters should be added to the API request after '?'.
wt
Defines the format of the response. The options are json, xml or schema.xml. If not specified, JSON will
be returned by default.
Retrieve Schema Response
Output Content
The output will include all fields, field types, dynamic rules and copy field rules, in the format requested
(JSON or XML). The schema name and version are also included.
Retrieve Schema Examples
Get the entire schema in JSON.
curl http://localhost:8983/solr/gettingstarted/schema
{
"responseHeader":{
"status":0,
"QTime":5},
"schema":{
"name":"example",
"version":1.5,
"uniqueKey":"id",
"fieldTypes":[{
"name":"alphaOnlySort",
"class":"solr.TextField",
"sortMissingLast":true,
"omitNorms":true,
"analyzer":{
"tokenizer":{
"class":"solr.KeywordTokenizerFactory"},
"filters":[{
"class":"solr.LowerCaseFilterFactory"},
{
"class":"solr.TrimFilterFactory"},
{
"class":"solr.PatternReplaceFilterFactory",
"replace":"all",
"replacement":"",
"pattern":"([^a-z])"}]}}],
"fields":[{
"name":"_version_",
"type":"long",
"indexed":true,
"stored":true},
{
"name":"author",
"type":"text_general",
"indexed":true,
"stored":true},
{
"name":"cat",
"type":"string",
"multiValued":true,
"indexed":true,
"stored":true}],
"copyFields":[{
"source":"author",
"dest":"text"},
{
"source":"cat",
"dest":"text"},
{
"source":"content",
"dest":"text"},
{
"source":"author",
"dest":"author_s"}]}}
Get the entire schema in XML.
curl http://localhost:8983/solr/gettingstarted/schema?wt=xml
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">5</int>
  </lst>
  <lst name="schema">
    <str name="name">example</str>
    <float name="version">1.5</float>
    <str name="uniqueKey">id</str>
    <arr name="fieldTypes">
      <lst>
        <str name="name">alphaOnlySort</str>
        <str name="class">solr.TextField</str>
        <bool name="sortMissingLast">true</bool>
        <bool name="omitNorms">true</bool>
        <lst name="analyzer">
          <lst name="tokenizer">
            <str name="class">solr.KeywordTokenizerFactory</str>
          </lst>
          <arr name="filters">
            <lst>
              <str name="class">solr.LowerCaseFilterFactory</str>
            </lst>
            <lst>
              <str name="class">solr.TrimFilterFactory</str>
            </lst>
            <lst>
              <str name="class">solr.PatternReplaceFilterFactory</str>
              <str name="replace">all</str>
              <str name="pattern">([^a-z])</str>
            </lst>
          </arr>
        </lst>
      </lst>
      ...
    </arr>
    ...
    <arr name="copyFields">
      ...
      <lst>
        <str name="source">author</str>
        <str name="dest">author_s</str>
      </lst>
    </arr>
  </lst>
</response>
Get the entire schema in "schema.xml" format.
curl http://localhost:8983/solr/gettingstarted/schema?wt=schema.xml
<schema name="example" version="1.5">
  <uniqueKey>id</uniqueKey>
  ...
</schema>
List Fields
GET /collection/schema/fields
GET /collection/schema/fields/fieldname
List Fields Parameters
Path Parameters
collection
The collection (or core) name.
fieldname
The specific fieldname (if limiting the request to a single field).
Query Parameters
The query parameters can be added to the API request after a '?'.
wt
Defines the format of the response. The options are json or xml. If not specified, JSON will be returned by
default.
fl
Comma- or space-separated list of one or more fields to return. If not specified, all fields will be returned
by default.
includeDynamic
If true, and if the fl query parameter is specified or the fieldname path parameter is used, matching
dynamic fields are included in the response and identified with the dynamicBase property.
If neither the fl query parameter nor the fieldname path parameter is specified, the includeDynamic
query parameter is ignored.
If false, the default, matching dynamic fields will not be returned.
showDefaults
If true, all default field properties from each field’s field type will be included in the response (e.g.,
tokenized for solr.TextField). If false, the default, only explicitly specified field properties will be
included.
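For example, a sketch of a request that asks for a single field and also reports any dynamic field rule that would match it via includeDynamic (the cost_i field name is illustrative):

curl "http://localhost:8983/solr/gettingstarted/schema/fields/cost_i?includeDynamic=true"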
List Fields Response
The output will include each field and any defined configuration for each field. The defined configuration can
vary for each field, but will minimally include the field name, the type, if it is indexed and if it is stored.
If multiValued is defined as either true or false (most likely true), that will also be shown. See the section
Defining Fields for more information about each parameter.
List Fields Examples
Get a list of all fields.
curl http://localhost:8983/solr/gettingstarted/schema/fields
The sample output below has been truncated to only show a few fields.
{
"fields": [
{
"indexed": true,
"name": "_version_",
"stored": true,
"type": "long"
},
{
"indexed": true,
"name": "author",
"stored": true,
"type": "text_general"
},
{
"indexed": true,
"multiValued": true,
"name": "cat",
"stored": true,
"type": "string"
},
"..."
],
"responseHeader": {
"QTime": 1,
"status": 0
}
}
List Dynamic Fields
GET /collection/schema/dynamicfields
GET /collection/schema/dynamicfields/name
List Dynamic Field Parameters
Path Parameters
collection
The collection (or core) name.
name
The name of the dynamic field rule (if limiting request to a single dynamic field rule).
Query Parameters
The query parameters can be added to the API request after a '?'.
wt
Defines the format of the response. The options are json or xml. If not specified, JSON will be returned by
default.
showDefaults
If true, all default field properties from each dynamic field’s field type will be included in the response
(e.g., tokenized for solr.TextField). If false, the default, only explicitly specified field properties will be
included.
List Dynamic Field Response
The output will include each dynamic field rule and the defined configuration for each rule. The defined
configuration can vary for each rule, but will minimally include the dynamic field name, the type, if it is
indexed and if it is stored. See the section Dynamic Fields for more information about each parameter.
List Dynamic Field Examples
Get a list of all dynamic field declarations:
curl http://localhost:8983/solr/gettingstarted/schema/dynamicfields
The sample output below has been truncated.
{
"dynamicFields": [
{
"indexed": true,
"name": "*_coordinate",
"stored": false,
"type": "tdouble"
},
{
"multiValued": true,
"name": "ignored_*",
"type": "ignored"
},
{
"name": "random_*",
"type": "random"
},
{
"indexed": true,
"multiValued": true,
"name": "attr_*",
"stored": true,
"type": "text_general"
},
{
"indexed": true,
"multiValued": true,
"name": "*_txt",
"stored": true,
"type": "text_general"
}
"..."
],
"responseHeader": {
"QTime": 1,
"status": 0
}
}
List Field Types
GET /collection/schema/fieldtypes
GET /collection/schema/fieldtypes/name
List Field Type Parameters
Path Parameters
collection
The collection (or core) name.
name
The name of the field type (if limiting request to a single field type).
Query Parameters
The query parameters can be added to the API request after a '?'.
wt
Defines the format of the response. The options are json or xml. If not specified, JSON will be returned by
default.
showDefaults
If true, all default field properties from each field type will be included in the response (e.g., tokenized for solr.TextField). If false, the default, only explicitly specified field properties will be included.
List Field Type Response
The output will include each field type and any defined configuration for the type. The defined configuration
can vary for each type, but will minimally include the field type name and the class. If query or index
analyzers, tokenizers, or filters are defined, those will also be shown with other defined parameters. See the
section Solr Field Types for more information about how to configure various types of fields.
List Field Type Examples
Get a list of all field types.
curl http://localhost:8983/solr/gettingstarted/schema/fieldtypes
The sample output below has been truncated to show a few different field types from different parts of the
list.
{
"fieldTypes": [
{
"analyzer": {
"class": "solr.TokenizerChain",
"filters": [
{
"class": "solr.LowerCaseFilterFactory"
},
{
"class": "solr.TrimFilterFactory"
},
{
"class": "solr.PatternReplaceFilterFactory",
"pattern": "([^a-z])",
"replace": "all",
"replacement": ""
}
],
"tokenizer": {
"class": "solr.KeywordTokenizerFactory"
}
},
"class": "solr.TextField",
"dynamicFields": [],
"fields": [],
"name": "alphaOnlySort",
"omitNorms": true,
"sortMissingLast": true
},
{
"class": "solr.FloatPointField",
"dynamicFields": [
"*_fs",
"*_f"
],
"fields": [
"price",
"weight"
],
"name": "float",
"positionIncrementGap": "0",
}]
}
List Copy Fields
GET /collection/schema/copyfields
List Copy Field Parameters
Path Parameters
collection
The collection (or core) name.
Query Parameters
The query parameters can be added to the API request after a '?'.
wt
Defines the format of the response. The options are json or xml. If not specified, JSON will be returned by
default.
source.fl
Comma- or space-separated list of one or more copyField source fields to include in the response. copyField directives with all other source fields will be excluded from the response. If not specified, all copyFields will be included in the response.
dest.fl
Comma- or space-separated list of one or more copyField destination fields to include in the response. copyField directives with all other dest fields will be excluded. If not specified, all copyFields will be included in the response.
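For example, a sketch of a request that limits the listing to copyField rules whose destination is the text field (as in the sample output below):

curl "http://localhost:8983/solr/gettingstarted/schema/copyfields?dest.fl=text"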
List Copy Field Response
The output will include the source and dest (destination) of each copy field rule defined in schema.xml. For
more information about copying fields, see the section Copying Fields.
List Copy Field Examples
Get a list of all copyFields.
curl http://localhost:8983/solr/gettingstarted/schema/copyfields
The sample output below has been truncated to the first few copy definitions.
{
"copyFields": [
{
"dest": "text",
"source": "author"
},
{
"dest": "text",
"source": "cat"
},
{
"dest": "text",
"source": "content"
},
{
"dest": "text",
"source": "content_type"
},
],
"responseHeader": {
"QTime": 3,
"status": 0
}
}
Show Schema Name
GET /collection/schema/name
Show Schema Parameters
Path Parameters
collection
The collection (or core) name.
Query Parameters
The query parameters can be added to the API request after a '?'.
wt
Defines the format of the response. The options are json or xml. If not specified, JSON will be returned by
default.
Show Schema Response
The output will be simply the name given to the schema.
Show Schema Examples
Get the schema name.
curl http://localhost:8983/solr/gettingstarted/schema/name
{
"responseHeader":{
"status":0,
"QTime":1},
"name":"example"}
Show the Schema Version
GET /collection/schema/version
Show Schema Version Parameters
Path Parameters
collection
The collection (or core) name.
Query Parameters
The query parameters can be added to the API request after a '?'.
wt
Defines the format of the response. The options are json or xml. If not specified, JSON will be returned by
default.
Show Schema Version Response
The output will simply be the schema version in use.
Show Schema Version Example
Get the schema version:
curl http://localhost:8983/solr/gettingstarted/schema/version
{
"responseHeader":{
"status":0,
"QTime":2},
"version":1.5}
List UniqueKey
GET /collection/schema/uniquekey
List UniqueKey Parameters
Path Parameters
collection
The collection (or core) name.
Query Parameters
The query parameters can be added to the API request after a '?'.
wt
Defines the format of the response. The options are json or xml. If not specified, JSON will be returned by
default.
List UniqueKey Response
The output will include simply the field name that is defined as the uniqueKey for the index.
List UniqueKey Example
List the uniqueKey.
curl http://localhost:8983/solr/gettingstarted/schema/uniquekey
{
"responseHeader":{
"status":0,
"QTime":2},
"uniqueKey":"id"}
Show Global Similarity
GET /collection/schema/similarity
Show Global Similarity Parameters
Path Parameters
collection
The collection (or core) name.
Query Parameters
The query parameters can be added to the API request after a '?'.
wt
Defines the format of the response. The options are json or xml. If not specified, JSON will be returned by
default.
Show Global Similarity Response
The output will include the class name of the global similarity defined (if any).
Show Global Similarity Example
Get the similarity implementation.
curl http://localhost:8983/solr/gettingstarted/schema/similarity
{
"responseHeader":{
"status":0,
"QTime":1},
"similarity":{
"class":"org.apache.solr.search.similarities.DefaultSimilarityFactory"}}
Manage Resource Data
The Managed Resources REST API provides a mechanism for any Solr plugin to expose resources that should
support CRUD (Create, Read, Update, Delete) operations. Depending on what Field Types and Analyzers are
configured in your Schema, additional /schema/ REST API paths may exist. See the Managed Resources
section for more information and examples.
Putting the Pieces Together
At the highest level, schema.xml is structured as follows.

<schema>
  <types>
  <fields>
  <uniqueKey>
  <copyField>
</schema>

This example is not real XML, but it gives you an idea of the structure of the file.
Obviously, most of the excitement is in types and fields, where the field types and the actual field
definitions live.
These are supplemented by copyFields.
The uniqueKey must always be defined.
Types and fields are optional tags
Note that the types and fields sections are optional, meaning you are free to mix field,
dynamicField, copyField and fieldType definitions on the top level. This allows for a more
logical grouping of related tags in your schema.
Choosing Appropriate Numeric Types
For general numeric needs, consider using one of the IntPointField, LongPointField, FloatPointField, or
DoublePointField classes, depending on the specific values you expect. These "Dimensional Point" based
numeric classes use specially encoded data structures to support efficient range queries regardless of the
size of the ranges used. Enable DocValues on these fields as needed for sorting and/or faceting.
Some Solr features may not yet work with "Dimensional Points", in which case you may want to consider the
equivalent TrieIntField, TrieLongField, TrieFloatField, and TrieDoubleField classes. These field types
are deprecated and are likely to be removed in a future major Solr release, but they can still be used if
necessary. Configure a precisionStep="0" if you wish to minimize index size, but if you expect users to
make frequent range queries on numeric types, use the default precisionStep (by not specifying it) or
specify it as precisionStep="8" (which is the default). This offers faster speed for range queries at the
expense of increasing index size.
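As a sketch of what a Point-based definition might look like in schema.xml (the type and field names here are illustrative; the _default configset ships equivalent plong/pdouble types with docValues already enabled):

<fieldType name="plong" class="solr.LongPointField" docValues="true"/>
<field name="units_sold" type="plong" indexed="true" stored="true"/>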
Working With Text
Handling text properly will make your users happy by providing them with the best possible results for text
searches.
One technique is using a text field as a catch-all for keyword searching. Most users are not sophisticated
about their searches and the most common search is likely to be a simple keyword search. You can use
copyField to take a variety of fields and funnel them all into a single text field for keyword searches.
In the schema.xml file for the “techproducts” example included with Solr, copyField declarations are used
to dump the contents of cat, name, manu, features, and includes into a single field, text. In addition, it
could be a good idea to copy ID into text in case users wanted to search for a particular product by passing
its product number to a keyword search.
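A sketch of such declarations, modeled on the techproducts schema (the catch-all text field itself must be declared as an indexed, multiValued text field):

<copyField source="cat" dest="text"/>
<copyField source="name" dest="text"/>
<copyField source="manu" dest="text"/>
<copyField source="features" dest="text"/>
<copyField source="includes" dest="text"/>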
Another technique is using copyField to use the same field in different ways. Suppose you have a field that
is a list of authors, like this:
Schildt, Herbert; Wolpert, Lewis; Davies, P.
For searching by author, you could tokenize the field, convert to lower case, and strip out punctuation:
schildt / herbert / wolpert / lewis / davies / p
For sorting, just use an untokenized field, converted to lower case, with punctuation stripped:
schildt herbert wolpert lewis davies p
Finally, for faceting, use the primary author only via a StrField:
Schildt, Herbert
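A hedged sketch of how these three views of the same data could be declared (the field and type names are illustrative, not taken from the techproducts schema); the sort variant is populated with copyField, while the single-valued facet field would be filled at index time:

<field name="authors" type="text_general" indexed="true" stored="true"/>
<field name="authors_sort" type="alphaOnlySort" indexed="true" stored="false"/>
<field name="author_facet" type="string" indexed="true" stored="false"/>
<copyField source="authors" dest="authors_sort"/>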
DocValues
DocValues are a way of recording field values internally that is more efficient for some purposes, such as
sorting and faceting, than traditional indexing.
Why DocValues?
The standard way that Solr builds the index is with an inverted index. This style builds a list of terms found in
all the documents in the index and next to each term is a list of documents that the term appears in (as well
as how many times the term appears in that document). This makes search very fast - since users search by
terms, having a ready list of term-to-document values makes the query process faster.
For other features that we now commonly associate with search, such as sorting, faceting, and highlighting,
this approach is not very efficient. The faceting engine, for example, must look up each term that appears in
each document that will make up the result set and pull the document IDs in order to build the facet list. In
Solr, this is maintained in memory, and can be slow to load (depending on the number of documents, terms,
etc.).
In Lucene 4.0, a new approach was introduced. DocValue fields are now column-oriented fields with a
document-to-value mapping built at index time. This approach promises to relieve some of the memory
requirements of the fieldCache and make lookups for faceting, sorting, and grouping much faster.
Enabling DocValues
To use docValues, you only need to enable it for a field that you will use it with. As with all schema design,
you need to define a field type and then define fields of that type with docValues enabled. All of these
actions are done in schema.xml.
Enabling a field for docValues only requires adding docValues="true" to the field (or field type) definition, as in this example from the schema.xml of Solr’s sample_techproducts_configs configset:

<field name="manu_exact" type="string" indexed="false" stored="false" docValues="true" />
If you have already indexed data into your Solr index, you will need to completely re-index
your content after changing your field definitions in schema.xml in order to successfully use
docValues.
DocValues are only available for specific field types. The types chosen determine the underlying Lucene
docValue type that will be used. The available Solr field types are:
• StrField, and UUIDField:
◦ If the field is single-valued (i.e., multi-valued is false), Lucene will use the SORTED type.
◦ If the field is multi-valued, Lucene will use the SORTED_SET type. Entries are kept in sorted order and
duplicates are removed.
• BoolField:
◦ If the field is single-valued (i.e., multi-valued is false), Lucene will use the SORTED type.
◦ If the field is multi-valued, Lucene will use the SORTED_SET type. Entries are kept in sorted order and
duplicates are removed.
• Any *PointField Numeric or Date fields, EnumFieldType, and CurrencyFieldType:
◦ If the field is single-valued (i.e., multi-valued is false), Lucene will use the NUMERIC type.
◦ If the field is multi-valued, Lucene will use the SORTED_NUMERIC type. Entries are kept in sorted order
and duplicates are kept.
• Any of the deprecated Trie* Numeric or Date fields, EnumField and CurrencyField:
◦ If the field is single-valued (i.e., multi-valued is false), Lucene will use the NUMERIC type.
◦ If the field is multi-valued, Lucene will use the SORTED_SET type. Entries are kept in sorted order and
duplicates are removed.
These Lucene types are related to how the values are sorted and stored.
There is an additional configuration option available, which is to modify the docValuesFormat used by the
field type. The default implementation employs a mixture of loading some things into memory and keeping
some on disk. In some cases, however, you may choose to specify an alternative DocValuesFormat implementation. For example, you could choose to keep everything in memory by specifying docValuesFormat="Direct" on a field type:

<fieldType name="string_in_mem_dv" class="solr.StrField" docValues="true" docValuesFormat="Direct" />
Please note that the docValuesFormat option may change in future releases.
Lucene index back-compatibility is only supported for the default codec. If you choose to
customize the docValuesFormat in your schema.xml, upgrading to a future version of Solr
may require you to either switch back to the default codec and optimize your index to
rewrite it into the default codec before upgrading, or re-build your entire index from
scratch after upgrading.
Using DocValues
Sorting, Faceting & Functions
If docValues="true" for a field, then DocValues will automatically be used any time the field is used for
sorting, faceting or function queries.
Retrieving DocValues During Search
Field values retrieved during search queries are typically returned from stored values. However, non-stored docValues fields will also be returned along with other stored fields when all fields (or pattern-matching globs) are requested (e.g., fl=*), depending on the effective value of the useDocValuesAsStored parameter for each field. For schema versions >= 1.6, the implicit default is useDocValuesAsStored="true". See Field Type Definitions and Properties & Defining Fields for more details.
When useDocValuesAsStored="false", non-stored DocValues fields can still be explicitly requested by name
in the fl param, but will not match glob patterns ("*"). Note that returning DocValues along with "regular" stored fields at query time has performance implications that retrieving only stored fields does not, because DocValues are column-oriented and may therefore incur additional cost to retrieve for each returned document. Also note that when returning non-stored fields from DocValues, the values of a multi-valued field are returned in sorted order rather than insertion order and may have duplicates removed (see above). If you require multi-valued fields to be returned in the original insertion order, make the field stored (such a change requires re-indexing).
In cases where the query returns only docValues fields, performance may improve, since returning stored fields requires disk reads and decompression, whereas returning docValues fields in the fl list only requires memory access.
When retrieving fields from their docValues form (such as when using the /export handler, streaming
expressions or if the field is requested in the fl parameter), two important differences between regular
stored fields and docValues fields must be understood:
1. Order is not preserved. When retrieving stored fields, the insertion order is the return order. For
docValues, it is the sorted order.
2. For field types using SORTED_SET (see above), multiple identical entries are collapsed into a single value.
Thus if values 4, 5, 2, 4, 1 are inserted, the values returned will be 1, 2, 4, 5.
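For example, a sketch of a query that explicitly requests a non-stored docValues field by name in the fl parameter (the field name is illustrative):

curl "http://localhost:8983/solr/gettingstarted/select?q=*:*&fl=id,units_sold"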
Schemaless Mode
Schemaless Mode is a set of Solr features that, when used together, allow users to rapidly construct an
effective schema by simply indexing sample data, without having to manually edit the schema.
These Solr features, all controlled via solrconfig.xml, are:
1. Managed schema: Schema modifications are made at runtime through Solr APIs, which requires the use
of a schemaFactory that supports these changes. See the section Schema Factory Definition in SolrConfig
for more details.
2. Field value class guessing: Previously unseen fields are run through a cascading set of value-based
parsers, which guess the Java class of field values - parsers for Boolean, Integer, Long, Float, Double, and
Date are currently available.
3. Automatic schema field addition, based on field value class(es): Previously unseen fields are added to the
schema, based on field value Java classes, which are mapped to schema field types - see Solr Field Types.
Using the Schemaless Example
The three features of schemaless mode are pre-configured in the _default configset in the Solr distribution.
To start an example instance of Solr using these configs, run the following command:
bin/solr start -e schemaless
This will launch a single Solr server, and automatically create a collection (named “gettingstarted”) that
contains only three fields in the initial schema: id, _version_, and _text_.
You can use the /schema/fields Schema API to confirm this: curl
http://localhost:8983/solr/gettingstarted/schema/fields will output:
{
"responseHeader":{
"status":0,
"QTime":1},
"fields":[{
"name":"_text_",
"type":"text_general",
"multiValued":true,
"indexed":true,
"stored":false},
{
"name":"_version_",
"type":"long",
"indexed":true,
"stored":true},
{
"name":"id",
"type":"string",
"multiValued":false,
"indexed":true,
"required":true,
"stored":true,
"uniqueKey":true}]}
Configuring Schemaless Mode
As described above, there are three configuration elements that need to be in place to use Solr in
schemaless mode. In the _default configset included with Solr these are already configured. If, however,
you would like to implement schemaless on your own, you should make the following changes.
Enable Managed Schema
As described in the section Schema Factory Definition in SolrConfig, Managed Schema support is enabled by
default, unless your configuration specifies that ClassicIndexSchemaFactory should be used.
You can configure the ManagedIndexSchemaFactory (and control the resource file used, or disable future modifications) by adding an explicit <schemaFactory/> like the one below. Please see Schema Factory Definition in SolrConfig for more details on the options available.
<schemaFactory class="ManagedIndexSchemaFactory">
  <bool name="mutable">true</bool>
  <str name="managedSchemaResourceName">managed-schema</str>
</schemaFactory>
Enable Field Class Guessing
In Solr, an UpdateRequestProcessorChain defines a chain of plugins that are applied to documents before or
while they are indexed.
The field guessing aspect of Solr’s schemaless mode uses a specially-defined UpdateRequestProcessorChain
that allows Solr to guess field types. You can also define the default field type classes to use.
To start, you should define it as follows (see the javadoc links below for update processor factory
documentation):
<updateProcessor class="solr.UUIDUpdateProcessorFactory" name="uuid"/>
<updateProcessor class="solr.RemoveBlankFieldUpdateProcessorFactory" name="remove-blank"/>
<updateProcessor class="solr.FieldNameMutatingUpdateProcessorFactory" name="field-name-mutating"> ①
  <str name="pattern">[^\w-\.]</str>
  <str name="replacement">_</str>
</updateProcessor>
<updateProcessor class="solr.ParseBooleanFieldUpdateProcessorFactory" name="parse-boolean"/> ②
<updateProcessor class="solr.ParseLongFieldUpdateProcessorFactory" name="parse-long"/>
<updateProcessor class="solr.ParseDoubleFieldUpdateProcessorFactory" name="parse-double"/>
<updateProcessor class="solr.ParseDateFieldUpdateProcessorFactory" name="parse-date">
  <arr name="format">
    <str>yyyy-MM-dd'T'HH:mm:ss.SSSZ</str>
    <str>yyyy-MM-dd'T'HH:mm:ss,SSSZ</str>
    <str>yyyy-MM-dd'T'HH:mm:ss.SSS</str>
    <str>yyyy-MM-dd'T'HH:mm:ss,SSS</str>
    <str>yyyy-MM-dd'T'HH:mm:ssZ</str>
    <str>yyyy-MM-dd'T'HH:mm:ss</str>
    <str>yyyy-MM-dd'T'HH:mmZ</str>
    <str>yyyy-MM-dd'T'HH:mm</str>
    <str>yyyy-MM-dd HH:mm:ss.SSSZ</str>
    <str>yyyy-MM-dd HH:mm:ss,SSSZ</str>
    <str>yyyy-MM-dd HH:mm:ss.SSS</str>
    <str>yyyy-MM-dd HH:mm:ss,SSS</str>
    <str>yyyy-MM-dd HH:mm:ssZ</str>
    <str>yyyy-MM-dd HH:mm:ss</str>
    <str>yyyy-MM-dd HH:mmZ</str>
    <str>yyyy-MM-dd HH:mm</str>
    <str>yyyy-MM-dd</str>
  </arr>
</updateProcessor>
<updateProcessor class="solr.AddSchemaFieldsUpdateProcessorFactory" name="add-schema-fields"> ③
  <lst name="typeMapping">
    <str name="valueClass">java.lang.String</str> ④
    <str name="fieldType">text_general</str>
    <lst name="copyField"> ⑤
      <str name="dest">*_str</str>
      <int name="maxChars">256</int>
    </lst>
    <!-- Use as default mapping instead of defaultFieldType -->
    <bool name="default">true</bool>
  </lst>
  <lst name="typeMapping">
    <str name="valueClass">java.lang.Boolean</str>
    <str name="fieldType">booleans</str>
  </lst>
  <lst name="typeMapping">
    <str name="valueClass">java.util.Date</str>
    <str name="fieldType">pdates</str>
  </lst>
  <lst name="typeMapping">
    <str name="valueClass">java.lang.Long</str> ⑥
    <str name="valueClass">java.lang.Integer</str>
    <str name="fieldType">plongs</str>
  </lst>
  <lst name="typeMapping">
    <str name="valueClass">java.lang.Number</str>
    <str name="fieldType">pdoubles</str>
  </lst>
</updateProcessor>

<updateRequestProcessorChain name="add-unknown-fields-to-the-schema" default="${update.autoCreateFields:true}"
       processor="uuid,remove-blank,field-name-mutating,parse-boolean,parse-long,parse-double,parse-date,add-schema-fields"> ⑦
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.DistributedUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
There are many things defined in this chain. Let’s step through a few of them.
① First, we’re using the FieldNameMutatingUpdateProcessorFactory to sanitize incoming field names, replacing any character that is not a word character, hyphen, or period with an underscore. Note that this and every following <updateProcessor> element includes a name. These names will be used in the final chain definition at the end of this example.
② Next we add several update request processors to parse different field types. Note the ParseDateFieldUpdateProcessorFactory includes a long list of possible date formats that would be parsed into valid Solr dates. If you have a custom date format, you could add it to this list (see the link to the Javadocs below to get information on how).
③ Once the fields have been parsed, we define the field types that will be assigned to those fields. You can
modify any of these that you would like to change.
④ In this definition, if the parsing step decides the incoming data in a field is a string, we will put this into a
field in Solr with the field type text_general. This field type by default allows Solr to query on this field.
⑤ After we’ve added the text_general field, we have also defined a copy field rule that will copy all data
from the new text_general field to a field with the same name suffixed with _str. This is done by Solr’s
dynamic fields feature. By defining the target of the copy field rule as a dynamic field in this way, you
can control the field type used in your schema. The default selection allows Solr to facet, highlight, and
sort on these fields.
⑥ This is another example of a mapping rule. In this case we define that when either of the Long or
Integer field parsers identify a field, they should both map their fields to the plongs field type.
⑦ Finally, we add a chain definition that calls the list of plugins. These plugins are each called by the names
we gave to them when we defined them. We can also add other processors to the chain, as shown here.
Note we have also given the entire chain a name ("add-unknown-fields-to-the-schema"). We’ll use this
name in the next section to specify that our update request handler should use this chain definition.
This chain definition will make a number of copy field rules for string fields to be created
from corresponding text fields. If your data causes you to end up with a lot of copy field
rules, indexing may be slowed down noticeably, and your index size will be larger. To
control for these issues, it’s recommended that you review the copy field rules that are
created, and remove any which you do not need for faceting, sorting, highlighting, etc.
If you’re interested in more information about the classes used in this chain, here are links to the Javadocs
for update processor factories mentioned above:
• UUIDUpdateProcessorFactory
• RemoveBlankFieldUpdateProcessorFactory
• FieldNameMutatingUpdateProcessorFactory
• ParseBooleanFieldUpdateProcessorFactory
• ParseLongFieldUpdateProcessorFactory
• ParseDoubleFieldUpdateProcessorFactory
• ParseDateFieldUpdateProcessorFactory
• AddSchemaFieldsUpdateProcessorFactory
Set the Default UpdateRequestProcessorChain
Once the UpdateRequestProcessorChain has been defined, you must instruct your UpdateRequestHandlers
to use it when working with index updates (i.e., adding, removing, replacing documents).
There are two ways to do this. The update chain shown above has a default=true attribute which will use it
for any update handler.
An alternative, more explicit way is to use InitParams to set the defaults on all /update request handlers:
<initParams path="/update/**">
  <lst name="defaults">
    <str name="update.chain">add-unknown-fields-to-the-schema</str>
  </lst>
</initParams>
After all of these changes have been made, Solr should be restarted or the cores reloaded.
Disabling Automatic Field Guessing
Automatic field creation can be disabled with the update.autoCreateFields property. To do this, you can
use bin/solr config with a command such as:
bin/solr config -c mycollection -p 8983 -action set-user-property -property
update.autoCreateFields -value false
Examples of Indexed Documents
Once the schemaless mode has been enabled (whether you configured it manually or are using the
_default configset), documents that include fields that are not defined in your schema will be indexed,
using the guessed field types which are automatically added to the schema.
For example, adding a CSV document will cause unknown fields to be added, with fieldTypes based on
values:
curl "http://localhost:8983/solr/gettingstarted/update?commit=true&wt=xml" -H "Contenttype:application/csv" -d '
id,Artist,Album,Released,Rating,FromDistributor,Sold
44C,Old Shews,Mead for Walking,1988-08-13,0.01,14,0'
Output indicating success:
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">106</int>
  </lst>
</response>
The fields now in the schema (output from curl
http://localhost:8983/solr/gettingstarted/schema/fields ):
{
"responseHeader":{
"status":0,
"QTime":2},
"fields":[{
"name":"Album",
"type":"text_general"},
{
"name":"Artist",
"type":"text_general"},
{
"name":"FromDistributor",
"type":"plongs"},
{
"name":"Rating",
"type":"pdoubles"},
{
"name":"Released",
"type":"pdates"},
{
"name":"Sold",
"type":"plongs"},
{
"name":"_root_", ...},
{
"name":"_text_", ...},
{
"name":"_version_", ...},
{
"name":"id", ...}
]}
In addition string versions of the text fields are indexed, using copyFields to a *_str dynamic field: (output
from curl http://localhost:8983/solr/gettingstarted/schema/copyfields ):
{
"responseHeader":{
"status":0,
"QTime":0},
"copyFields":[{
"source":"Artist",
"dest":"Artist_str",
"maxChars":256},
{
"source":"Album",
"dest":"Album_str",
"maxChars":256}]}
You Can Still Be Explicit
Even if you want to use schemaless mode for most fields, you can still use the Schema API
to pre-emptively create some fields, with explicit types, before you index documents that
use them.
Internally, the Schema API and the Schemaless Update Processors both use the same
Managed Schema functionality.
Also, if you do not need the *_str version of a text field, you can simply remove the
copyField definition from the auto-generated schema and it will not be re-added since the
original field is now defined.
Once a field has been added to the schema, its field type is fixed. As a consequence, adding documents with
field value(s) that conflict with the previously guessed field type will fail. For example, after adding the above
document, the “Sold” field has the fieldType plongs, but the document below has a non-integral decimal
value in this field:
curl "http://localhost:8983/solr/gettingstarted/update?commit=true&wt=xml" -H "Contenttype:application/csv" -d '
id,Description,Sold
19F,Cassettes by the pound,4.93'
This document will fail, as shown in this output:
<response>
  <lst name="responseHeader">
    <int name="status">400</int>
    <int name="QTime">7</int>
  </lst>
  <lst name="error">
    <str name="msg">ERROR: [doc=19F] Error adding field 'Sold'='4.93' msg=For input string: "4.93"</str>
    <int name="code">400</int>
  </lst>
</response>
Understanding Analyzers, Tokenizers, and
Filters
The following sections describe how Solr breaks down and works with textual data. There are three main
concepts to understand: analyzers, tokenizers, and filters.
• Field analyzers are used both during ingestion, when a document is indexed, and at query time. An
analyzer examines the text of fields and generates a token stream. Analyzers may be a single class or
they may be composed of a series of tokenizer and filter classes.
• Tokenizers break field data into lexical units, or tokens.
• Filters examine a stream of tokens and keep them, transform or discard them, or create new ones.
Tokenizers and filters may be combined to form pipelines, or chains, where the output of one is input to
the next. Such a sequence of tokenizers and filters is called an analyzer and the resulting output of an
analyzer is used to match query results or build indices.
Using Analyzers, Tokenizers, and Filters
Although the analysis process is used for both indexing and querying, the same analysis process need not
be used for both operations. For indexing, you often want to simplify, or normalize, words. For example,
setting all letters to lowercase, eliminating punctuation and accents, mapping words to their stems, and so
on. Doing so can increase recall because, for example, "ram", "Ram" and "RAM" would all match a query for
"ram". To increase query-time precision, a filter could be employed to narrow the matches by, for example,
ignoring all-cap acronyms if you’re interested in male sheep, but not Random Access Memory.
The tokens output by the analysis process define the values, or terms, of that field and are used either to
build an index of those terms when a new document is added, or to identify which documents contain the
terms you are querying for.
For More Information
These sections will show you how to configure field analyzers and also serve as a reference for the details
of configuring each of the available tokenizer and filter classes. They also serve as a guide so that you can
configure your own analysis classes if you have special needs that cannot be met with the included filters or
tokenizers.
For Analyzers, see:
• Analyzers: Detailed conceptual information about Solr analyzers.
• Running Your Analyzer: Detailed information about testing and running your Solr analyzer.
For Tokenizers, see:
• About Tokenizers: Detailed conceptual information about Solr tokenizers.
• Tokenizers: Information about configuring tokenizers, and about the tokenizer factory classes included
in this distribution of Solr.
For Filters, see:
• About Filters: Detailed conceptual information about Solr filters.
• Filter Descriptions: Information about configuring filters, and about the filter factory classes included in
this distribution of Solr.
• CharFilterFactories: Information about filters for pre-processing input characters.
To find out how to use Tokenizers and Filters with various languages, see:
• Language Analysis: Information about tokenizers and filters for character set conversion or for use with
specific languages.
Analyzers
An analyzer examines the text of fields and generates a token stream.
Analyzers are specified as a child of the <fieldType> element in the schema.xml configuration file (in the
same conf/ directory as solrconfig.xml).
In normal usage, only fields of type solr.TextField or solr.SortableTextField will specify an analyzer.
The simplest way to configure an analyzer is with a single <analyzer> element whose class attribute is a fully
qualified Java class name. The named class must derive from org.apache.lucene.analysis.Analyzer. For
example:
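The definition referenced here would look something like this sketch (the "nametext" field type name is illustrative):
<fieldType name="nametext" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.core.WhitespaceAnalyzer"/>
</fieldType>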
In this case a single class, WhitespaceAnalyzer, is responsible for analyzing the content of the named text
field and emitting the corresponding tokens. For simple cases, such as plain English prose, a single analyzer
class like this may be sufficient. But it’s often necessary to do more complex analysis of the field content.
Even the most complex analysis requirements can usually be decomposed into a series of discrete, relatively
simple processing steps. As you will soon discover, the Solr distribution comes with a large selection of
tokenizers and filters that covers most scenarios you are likely to encounter. Setting up an analyzer chain is
very straightforward; you specify a simple <analyzer> element (no class attribute) with child elements that
name factory classes for the tokenizer and filters to use, in the order you want them to run.
For example:
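A sketch of such a chain, consistent with the factories discussed below (the field type name and the intermediate filters are illustrative):
<fieldType name="nametext" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"/>
  </analyzer>
</fieldType>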
Note that classes in the org.apache.solr.analysis package may be referred to here with the shorthand
solr. prefix.
In this case, no Analyzer class was specified on the <analyzer> element. Rather, a sequence of more
specialized classes are wired together and collectively act as the Analyzer for the field. The text of the field is
passed to the first item in the list (solr.StandardTokenizerFactory), and the tokens that emerge from the
last one (solr.EnglishPorterFilterFactory) are the terms that are used for indexing or querying any
fields that use the "nametext" fieldType.
Field Values versus Indexed Terms
The output of an Analyzer affects the terms indexed in a given field (and the terms used
when parsing queries against those fields) but it has no impact on the stored value for the
fields. For example: an analyzer might split "Brown Cow" into two indexed terms "brown"
and "cow", but the stored value will still be a single String: "Brown Cow"
Analysis Phases
Analysis takes place in two contexts. At index time, when a field is being created, the token stream that
results from analysis is added to an index and defines the set of terms (including positions, sizes, and so on)
for the field. At query time, the values being searched for are analyzed and the terms that result are
matched against those that are stored in the field’s index.
In many cases, the same analysis should be applied to both phases. This is desirable when you want to
query for exact string matches, possibly with case-insensitivity, for example. In other cases, you may want to
apply slightly different analysis steps during indexing than those used at query time.
If you provide a simple <analyzer> definition for a field type, as in the examples above, then it will be used
for both indexing and queries. If you want distinct analyzers for each phase, you may include two <analyzer>
definitions distinguished with a type attribute. For example:
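A sketch consistent with the description that follows (the field type name is illustrative; keepwords.txt and syns.txt are the files mentioned below):
<fieldType name="nametext" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeepWordFilterFactory" words="keepwords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="syns.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>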
In this theoretical example, at index time the text is tokenized, the tokens are set to lowercase, any that are
not listed in keepwords.txt are discarded and those that remain are mapped to alternate values as defined
by the synonym rules in the file syns.txt. This essentially builds an index from a restricted set of possible
values and then normalizes them to values that may not even occur in the original text.
At query time, the only normalization that happens is to convert the query terms to lowercase. The filtering
and mapping steps that occur at index time are not applied to the query terms. Queries must then, in this
example, be very precise, using only the normalized terms that were stored at index time.
Analysis for Multi-Term Expansion
In some types of queries (e.g., Prefix, Wildcard, Regex) the input provided by the user is not natural
language intended for Analysis. Things like Synonyms or Stop word filtering do not work in a logical way in
these types of Queries.
The analysis factories that can work in these types of queries (such as Lowercasing or Normalizing
Factories) are known as MultiTermAwareComponents. When Solr needs to perform analysis for a query that
results in Multi-Term expansion, only the MultiTermAwareComponents used in the query analyzer are used;
any Factory that is not Multi-Term aware will be skipped.
For most use cases, this provides the best possible behavior, but if you wish for absolute control over the
analysis performed on these types of queries, you may explicitly define a multiterm analyzer to use, such as
in the following example:
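A sketch of a field type that declares an explicit multiterm analyzer (the query analyzer shown is illustrative):
<fieldType name="nametext" class="solr.TextField">
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="multiterm">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>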
About Tokenizers
The job of a tokenizer is to break up a stream of text into tokens, where each token is (usually) a subsequence of the characters in the text. An analyzer is aware of the field it is configured for, but a tokenizer is
not. Tokenizers read from a character stream (a Reader) and produce a sequence of Token objects (a
TokenStream).
Characters in the input stream may be discarded, such as whitespace or other delimiters. They may also be
added to or replaced, such as mapping aliases or abbreviations to normalized forms. A token contains
various metadata in addition to its text value, such as the location at which the token occurs in the field.
Because a tokenizer may produce tokens that diverge from the input text, you should not assume that the
text of the token is the same text that occurs in the field, or that its length is the same as the original text.
It’s also possible for more than one token to have the same position or refer to the same offset in the
original text. Keep this in mind if you use token metadata for things like highlighting search results in the
field text.
The class named in the tokenizer element is not the actual tokenizer, but rather a class that implements the
TokenizerFactory API. This factory class will be called upon to create new tokenizer instances as needed.
Objects created by the factory must derive from Tokenizer, which indicates that they produce sequences of
tokens. If the tokenizer produces tokens that are usable as is, it may be the only component of the analyzer.
Otherwise, the tokenizer’s output tokens will serve as input to the first filter stage in the pipeline.
A TypeTokenFilterFactory is available that creates a TypeTokenFilter that filters tokens based on their
TypeAttribute, which is set in factory.getStopTypes.
For a complete list of the available TokenFilters, see the section Filter Descriptions.
When to Use a CharFilter vs. a TokenFilter
There are several pairs of CharFilters and TokenFilters that have related (e.g., MappingCharFilter and
ASCIIFoldingFilter) or nearly identical (e.g., PatternReplaceCharFilterFactory and
PatternReplaceFilterFactory) functionality, and it may not always be obvious which is the best choice.
The decision about which to use depends largely on which Tokenizer you are using, and whether you need
to preprocess the stream of characters.
For example, suppose you have a tokenizer such as StandardTokenizer and although you are pretty happy
with how it works overall, you want to customize how some specific characters behave. You could modify the
rules and re-build your own tokenizer with JFlex, but it might be easier to simply map some of the characters
before tokenization with a CharFilter.
About Filters
Like tokenizers, filters consume input and produce a stream of tokens. Filters also derive from
org.apache.lucene.analysis.TokenStream. Unlike tokenizers, a filter’s input is another TokenStream. The
job of a filter is usually easier than that of a tokenizer since in most cases a filter looks at each token in the
stream sequentially and decides whether to pass it along, replace it or discard it.
A filter may also do more complex analysis by looking ahead to consider multiple tokens at once, although
this is less common. One hypothetical use for such a filter might be to normalize state names that would be
tokenized as two words. For example, the single token "california" would be replaced with "CA", while the
token pair "rhode" followed by "island" would become the single token "RI".
Because filters consume one TokenStream and produce a new TokenStream, they can be chained one after
another indefinitely. Each filter in the chain in turn processes the tokens produced by its predecessor. The
order in which you specify the filters is therefore significant. Typically, the most general filtering is done first,
and later filtering stages are more specialized.
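The example discussed in the next paragraph is a chain along these lines (a sketch; the field type name is illustrative):
<fieldType name="text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"/>
  </analyzer>
</fieldType>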
This example starts with Solr’s standard tokenizer, which breaks the field’s text into tokens. All the tokens
are then set to lowercase, which will facilitate case-insensitive matching at query time.
The last filter in the above example is a stemmer filter that uses the Porter stemming algorithm. A stemmer
is basically a set of mapping rules that maps the various forms of a word back to the base, or stem, word
from which they derive. For example, in English the words "hugs", "hugging" and "hugged" are all forms of
the stem word "hug". The stemmer will replace all of these terms with "hug", which is what will be indexed.
This means that a query for "hug" will match the term "hugged", but not "huge".
Conversely, applying a stemmer to your query terms will allow queries containing non-stem terms, like
"hugging", to match documents with different variations of the same stem word, such as "hugged". This
works because both the indexer and the query will map to the same stem ("hug").
Word stemming is, obviously, very language specific. Solr includes several language-specific stemmers
created by the Snowball generator that are based on the Porter stemming algorithm. The generic Snowball
Porter Stemmer Filter can be used to configure any of these language stemmers. Solr also includes a
convenience wrapper for the English Snowball stemmer. There are also several purpose-built stemmers for
non-English languages. These stemmers are described in Language Analysis.
Tokenizers
Tokenizers are responsible for breaking field data into lexical units, or tokens.
You configure the tokenizer for a text field type in schema.xml with a <tokenizer> element, as a child of
<analyzer>:
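For instance, a sketch of such a definition (the field type name and the filter shown are illustrative):
<fieldType name="text" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>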
The class attribute names a factory class that will instantiate a tokenizer object when needed. Tokenizer
factory classes implement the org.apache.solr.analysis.TokenizerFactory interface. A TokenizerFactory’s
create() method accepts a Reader and returns a TokenStream. When Solr creates the tokenizer it passes a
Reader object that provides the content of the text field.
Arguments may be passed to tokenizer factories by setting attributes on the <tokenizer> element.
The following sections describe the tokenizer factory classes included in this release of Solr.
For user tips about Solr’s tokenizers, see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters.
Standard Tokenizer
This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Delimiter
characters are discarded, with the following exceptions:
• Periods (dots) that are not followed by whitespace are kept as part of the token, including Internet
domain names.
• The "@" character is among the set of token-splitting punctuation, so email addresses are not preserved
as single tokens.
Note that words are split at hyphens.
The Standard Tokenizer supports Unicode standard annex UAX#29 word boundaries with the following
token types: <ALPHANUM>, <NUM>, <SOUTHEAST_ASIAN>, <IDEOGRAPHIC>, and <HIRAGANA>.
Factory class: solr.StandardTokenizerFactory
Arguments:
maxTokenLength: (integer, default 255) Solr ignores tokens that exceed the number of characters specified
by maxTokenLength.
Example:
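A minimal analyzer using this tokenizer might be declared as follows:
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>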
In: "Please, email john.doe@foo.com by 03-09, re: m37-xq."
Out: "Please", "email", "john.doe", "foo.com", "by", "03", "09", "re", "m37", "xq"
Classic Tokenizer
The Classic Tokenizer preserves the same behavior as the Standard Tokenizer of Solr versions 3.1 and
previous. It does not use the Unicode standard annex UAX#29 word boundary rules that the Standard
Tokenizer uses. This tokenizer splits the text field into tokens, treating whitespace and punctuation as
delimiters. Delimiter characters are discarded, with the following exceptions:
• Periods (dots) that are not followed by whitespace are kept as part of the token.
• Words are split at hyphens, unless there is a number in the word, in which case the token is not split and
the numbers and hyphen(s) are preserved.
• Recognizes Internet domain names and email addresses and preserves them as a single token.
Factory class: solr.ClassicTokenizerFactory
Arguments:
maxTokenLength: (integer, default 255) Solr ignores tokens that exceed the number of characters specified
by maxTokenLength.
Example:
In: "Please, email john.doe@foo.com by 03-09, re: m37-xq."
Out: "Please", "email", "john.doe@foo.com", "by", "03-09", "re", "m37-xq"
Keyword Tokenizer
This tokenizer treats the entire text field as a single token.
Factory class: solr.KeywordTokenizerFactory
Arguments: None
Example:
In: "Please, email john.doe@foo.com by 03-09, re: m37-xq."
Out: "Please, email john.doe@foo.com by 03-09, re: m37-xq."
Letter Tokenizer
This tokenizer creates tokens from strings of contiguous letters, discarding all non-letter characters.
Factory class: solr.LetterTokenizerFactory
Arguments: None
Example:
In: "I can’t."
Out: "I", "can", "t"
Lower Case Tokenizer
Tokenizes the input stream by delimiting at non-letters and then converting all letters to lowercase.
Whitespace and non-letters are discarded.
Factory class: solr.LowerCaseTokenizerFactory
Arguments: None
Example:
In: "I just *LOVE* my iPhone!"
Out: "i", "just", "love", "my", "iphone"
N-Gram Tokenizer
Reads the field text and generates n-gram tokens of sizes in the given range.
Factory class: solr.NGramTokenizerFactory
Arguments:
minGramSize: (integer, default 1) The minimum n-gram size, must be > 0.
maxGramSize: (integer, default 2) The maximum n-gram size, must be >= minGramSize.
Example:
Default behavior. Note that this tokenizer operates over the whole field. It does not break the field at
whitespace. As a result, the space character is included in the encoding.
In: "hey man"
Out: "h", "e", "y", " ", "m", "a", "n", "he", "ey", "y ", " m", "ma", "an"
Example:
With an n-gram size range of 4 to 5:
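A sketch of the corresponding configuration:
<analyzer>
  <tokenizer class="solr.NGramTokenizerFactory" minGramSize="4" maxGramSize="5"/>
</analyzer>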
In: "bicycle"
Out: "bicy", "bicyc", "icyc", "icycl", "cycl", "cycle", "ycle"
Edge N-Gram Tokenizer
Reads the field text and generates edge n-gram tokens of sizes in the given range.
Factory class: solr.EdgeNGramTokenizerFactory
Arguments:
minGramSize: (integer, default is 1) The minimum n-gram size, must be > 0.
maxGramSize: (integer, default is 1) The maximum n-gram size, must be >= minGramSize.
Example:
Default behavior (min and max default to 1):
In: "babaloo"
Out: "b"
Example:
Edge n-gram range of 2 to 5
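A sketch of the corresponding configuration:
<analyzer>
  <tokenizer class="solr.EdgeNGramTokenizerFactory" minGramSize="2" maxGramSize="5"/>
</analyzer>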
In: "babaloo"
Out:"ba", "bab", "baba", "babal"
ICU Tokenizer
This tokenizer processes multilingual text and tokenizes it appropriately based on its script attribute.
You can customize this tokenizer’s behavior by specifying per-script rule files. To add per-script rules, add a
rulefiles argument, which should contain a comma-separated list of code:rulefile pairs in the following
format: four-letter ISO 15924 script code, followed by a colon, then a resource path. For example, to specify
rules for Latin (script code "Latn") and Cyrillic (script code "Cyrl"), you would enter
Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi.
The default configuration for solr.ICUTokenizerFactory provides UAX#29 word break rules tokenization
(like solr.StandardTokenizer), but also includes custom tailorings for Hebrew (specializing handling of
double and single quotation marks), for syllable tokenization for Khmer, Lao, and Myanmar, and dictionary-based word segmentation for CJK characters.
Factory class: solr.ICUTokenizerFactory
Arguments:
rulefiles: a comma-separated list of code:rulefile pairs in the following format: four-letter ISO 15924
script code, followed by a colon, then a resource path.
Example:
To use this tokenizer, you must add additional .jars to Solr’s classpath (as described in the
section Resources and Plugins on the Filesystem). See solr/contrib/analysis-extras/README.txt for information on which jars you need to add.
Path Hierarchy Tokenizer
This tokenizer creates synonyms from file path hierarchies.
Factory class: solr.PathHierarchyTokenizerFactory
Arguments:
delimiter: (character, no default) You can specify the file path delimiter and replace it with a delimiter you
provide. This can be useful for working with backslash delimiters.
replace: (character, no default) Specifies the delimiter character Solr uses in the tokenized output.
Example:
In: "c:\usr\local\apache"
Out: "c:", "c:/usr", "c:/usr/local", "c:/usr/local/apache"
Regular Expression Pattern Tokenizer
This tokenizer uses a Java regular expression to break the input text stream into tokens. The expression
provided by the pattern argument can be interpreted either as a delimiter that separates tokens, or to
match patterns that should be extracted from the text as tokens.
See the Javadocs for java.util.regex.Pattern for more information on Java regular expression syntax.
Factory class: solr.PatternTokenizerFactory
Arguments:
pattern: (Required) The regular expression, as defined in java.util.regex.Pattern.
group: (Optional, default -1) Specifies which regex group to extract as the token(s). The value -1 means the
regex should be treated as a delimiter that separates tokens. Non-negative group numbers (>= 0) indicate
that character sequences matching that regex group should be converted to tokens. Group zero refers to
the entire regex, groups greater than zero refer to parenthesized sub-expressions of the regex, counted
from left to right.
Example:
A comma-separated list. Tokens are separated by a sequence of zero or more spaces, a comma, and zero or
more spaces.
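One way to express this delimiter (a sketch; \s* is used here as an approximation of "zero or more spaces"):
<analyzer>
  <tokenizer class="solr.PatternTokenizerFactory" pattern="\s*,\s*"/>
</analyzer>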
In: "fee,fie, foe , fum, foo"
Out: "fee", "fie", "foe", "fum", "foo"
Example:
Extract simple, capitalized words. A sequence of at least one capital letter followed by zero or more letters of
either case is extracted as a token.
In: "Hello. My name is Inigo Montoya. You killed my father. Prepare to die."
Out: "Hello", "My", "Inigo", "Montoya", "You", "Prepare"
Example:
Extract part numbers which are preceded by "SKU", "Part" or "Part Number", case sensitive, with an
optional semi-colon separator. Part numbers must be all numeric digits, with an optional hyphen. Regex
capture groups are numbered by counting left parenthesis from left to right. Group 3 is the subexpression
"[0-9-]+", which matches one or more digits or hyphens.
In: "SKU: 1234, Part Number 5678, Part: 126-987"
Out: "1234", "5678", "126-987"
Simplified Regular Expression Pattern Tokenizer
This tokenizer is similar to the PatternTokenizerFactory described above, but uses Lucene RegExp pattern
matching to construct distinct tokens for the input stream. The syntax is more limited than
PatternTokenizerFactory, but the tokenization is quite a bit faster.
Factory class: solr.SimplePatternTokenizerFactory
Arguments:
pattern: (Required) The regular expression, as defined in the RegExp javadocs, identifying the characters
to include in tokens. The matching is greedy such that the longest token matching at a given point is
created. Empty tokens are never created.
maxDeterminizedStates: (Optional, default 10000) The limit on total state count for the determinized
automaton computed from the regexp.
Example:
To match tokens delimited by simple whitespace characters:
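A sketch of such a configuration, where the pattern names the characters to keep in tokens:
<analyzer>
  <tokenizer class="solr.SimplePatternTokenizerFactory" pattern="[^ \t\r\n]+"/>
</analyzer>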
Simplified Regular Expression Pattern Splitting Tokenizer
This tokenizer is similar to the SimplePatternTokenizerFactory described above, but uses Lucene RegExp
pattern matching to identify sequences of characters that should be used to split tokens. The syntax is more
limited than PatternTokenizerFactory, but the tokenization is quite a bit faster.
Factory class: solr.SimplePatternSplitTokenizerFactory
Arguments:
pattern: (Required) The regular expression, as defined in the RegExp javadocs, identifying the characters
that should split tokens. The matching is greedy such that the longest token separator matching at a given
point is matched. Empty tokens are never created.
maxDeterminizedStates: (Optional, default 10000) The limit on total state count for the determinized
automaton computed from the regexp.
Example:
To match tokens delimited by simple whitespace characters:
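A sketch of such a configuration, where the pattern names the characters that split tokens:
<analyzer>
  <tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ \t\r\n]+"/>
</analyzer>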
UAX29 URL Email Tokenizer
This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Delimiter
characters are discarded, with the following exceptions:
• Periods (dots) that are not followed by whitespace are kept as part of the token.
• Words are split at hyphens, unless there is a number in the word, in which case the token is not split and
the numbers and hyphen(s) are preserved.
• Recognizes and preserves as single tokens the following:
◦ Internet domain names containing top-level domains validated against the white list in the IANA Root
Zone Database when the tokenizer was generated
◦ email addresses
◦ file://, http(s)://, and ftp:// URLs
◦ IPv4 and IPv6 addresses
The UAX29 URL Email Tokenizer supports Unicode standard annex UAX#29 word boundaries with the
following token types: <ALPHANUM>, <NUM>, <URL>, <EMAIL>, <SOUTHEAST_ASIAN>, <IDEOGRAPHIC>, and <HIRAGANA>.
Factory class: solr.UAX29URLEmailTokenizerFactory
Arguments:
maxTokenLength: (integer, default 255) Solr ignores tokens that exceed the number of characters specified
by maxTokenLength.
Example:
In: "Visit http://accarol.com/contact.htm?from=external&a=10 or e-mail bob.cratchet@accarol.com"
Out: "Visit", "http://accarol.com/contact.htm?from=external&a=10", "or", "e", "mail",
"bob.cratchet@accarol.com"
White Space Tokenizer
Simple tokenizer that splits the text stream on whitespace and returns sequences of non-whitespace
characters as tokens. Note that any punctuation will be included in the tokens.
Factory class: solr.WhitespaceTokenizerFactory
Arguments:
rule
Specifies how to define whitespace for the purpose of tokenization. Valid values:
• java: (Default) Uses Character.isWhitespace(int)
• unicode: Uses Unicode’s WHITESPACE property
Example:
In: "To be, or what?"
Out: "To", "be,", "or", "what?"
OpenNLP Tokenizer and OpenNLP Filters
See OpenNLP Integration for information about using the OpenNLP Tokenizer, along with information about
available OpenNLP token filters.
Filter Descriptions
Filters examine a stream of tokens and keep them, transform them or discard them, depending on the filter
type being used.
You configure each filter with a <filter> element in schema.xml as a child of <analyzer>, following the
<tokenizer> element. Filter definitions should follow a tokenizer or another filter definition because they
take a TokenStream as input. For example:
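A sketch of such a definition (the field type name and the particular filters are illustrative):
<fieldType name="text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    ...
  </analyzer>
</fieldType>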
The class attribute names a factory class that will instantiate a filter object as needed. Filter factory classes
must implement the org.apache.solr.analysis.TokenFilterFactory interface. Like tokenizers, filters are
also instances of TokenStream and thus are producers of tokens. Unlike tokenizers, filters also consume
tokens from a TokenStream. This allows you to mix and match filters, in any order you prefer, downstream
of a tokenizer.
Arguments may be passed to filter factories to modify their behavior by setting attributes on the
<filter> element. For example:
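For instance, a filter whose limits are set via attributes (the values shown are illustrative):
<filter class="solr.LengthFilterFactory" min="2" max="7"/>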
The following sections describe the filter factories that are included in this release of Solr.
For user tips about Solr’s filters, see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters.
ASCII Folding Filter
This filter converts alphabetic, numeric, and symbolic Unicode characters which are not in the Basic Latin
Unicode block (the first 127 ASCII characters) to their ASCII equivalents, if one exists. This filter converts
characters from the following Unicode blocks:
• C1 Controls and Latin-1 Supplement (PDF)
• Latin Extended-A (PDF)
• Latin Extended-B (PDF)
• Latin Extended Additional (PDF)
• Latin Extended-C (PDF)
• Latin Extended-D (PDF)
• IPA Extensions (PDF)
• Phonetic Extensions (PDF)
• Phonetic Extensions Supplement (PDF)
• General Punctuation (PDF)
• Superscripts and Subscripts (PDF)
• Enclosed Alphanumerics (PDF)
• Dingbats (PDF)
• Supplemental Punctuation (PDF)
• Alphabetic Presentation Forms (PDF)
• Halfwidth and Fullwidth Forms (PDF)
Factory class: solr.ASCIIFoldingFilterFactory
Arguments:
preserveOriginal
(boolean, default false) If true, the original token is preserved: "thé" -> "the", "thé"
Example:
In: "á" (Unicode character 00E1)
Out: "a" (ASCII character 97)
Beider-Morse Filter
Implements the Beider-Morse Phonetic Matching (BMPM) algorithm, which allows identification of similar
names, even if they are spelled differently or in different languages. More information about how this works
is available in the section on Phonetic Matching.
BeiderMorseFilter changed its behavior in Solr 5.0 due to an update to version 3.04 of the
BMPM algorithm. Older versions of Solr implemented BMPM version 3.00 (see
http://stevemorse.org/phoneticinfo.htm). Any index built using this filter with earlier
versions of Solr will need to be rebuilt.
Factory class: solr.BeiderMorseFilterFactory
Arguments:
nameType
Types of names. Valid values are GENERIC, ASHKENAZI, or SEPHARDIC. If not processing Ashkenazi or
Sephardic names, use GENERIC.
ruleType
Types of rules to apply. Valid values are APPROX or EXACT.
concat
Defines if multiple possible matches should be combined with a pipe ("|").
languageSet
The language set to use. The value "auto" will allow the Filter to identify the language, or a comma-separated list can be supplied.
Example:
Classic Filter
This filter takes the output of the Classic Tokenizer and strips periods from acronyms and "'s" from
possessives.
Factory class: solr.ClassicFilterFactory
Arguments: None
Example:
In: "I.B.M. cat’s can’t"
Tokenizer to Filter: "I.B.M", "cat’s", "can’t"
Out: "IBM", "cat", "can’t"
Common Grams Filter
This filter creates word shingles by combining common tokens such as stop words with regular tokens. This
is useful for creating phrase queries containing common words, such as "the cat." Solr normally ignores
stop words in queried phrases, so searching for "the cat" would return all matches for the word "cat."
Factory class: solr.CommonGramsFilterFactory
Arguments:
words
(a common word file in .txt format) Provide the name of a common word file, such as stopwords.txt.
format
(optional) If the stopwords list has been formatted for Snowball, you can specify format="snowball" so
Solr can read the stopwords file.
ignoreCase
(boolean) If true, the filter ignores the case of words when comparing them to the common word file. The
default is false.
Example:
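A sketch of the corresponding configuration (the stopwords file name is illustrative):
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt" ignoreCase="true"/>
</analyzer>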
In: "the Cat"
Tokenizer to Filter: "the", "Cat"
Out: "the_cat"
Collation Key Filter
Collation allows sorting of text in a language-sensitive way. It is usually used for sorting, but can also be
used with advanced searches. We’ve covered this in much more detail in the section on Unicode Collation.
Daitch-Mokotoff Soundex Filter
Implements the Daitch-Mokotoff Soundex algorithm, which allows identification of similar names, even if
they are spelled differently. More information about how this works is available in the section on Phonetic
Matching.
Factory class: solr.DaitchMokotoffSoundexFilterFactory
Arguments:
inject
(true/false) If true (the default), then new phonetic tokens are added to the stream. Otherwise, tokens
are replaced with the phonetic equivalent. Setting this to false will enable phonetic matching, but the
exact spelling of the target word may not match.
Example:
Double Metaphone Filter
This filter creates tokens using the DoubleMetaphone encoding algorithm from commons-codec. For more
information, see the Phonetic Matching section.
Factory class: solr.DoubleMetaphoneFilterFactory
Arguments:
inject
(true/false) If true (the default), then new phonetic tokens are added to the stream. Otherwise, tokens
are replaced with the phonetic equivalent. Setting this to false will enable phonetic matching, but the
exact spelling of the target word may not match.
maxCodeLength
(integer) The maximum length of the code to be generated.
Example:
Default behavior for inject (true): keep the original token and add phonetic token(s) at the same position.
In: "four score and Kuczewski"
Tokenizer to Filter: "four"(1), "score"(2), "and"(3), "Kuczewski"(4)
Out: "four"(1), "FR"(1), "score"(2), "SKR"(2), "and"(3), "ANT"(3), "Kuczewski"(4), "KSSK"(4), "KXFS"(4)
The phonetic tokens have a position increment of 0, which indicates that they are at the same position as the
token they were derived from (immediately preceding). Note that "Kuczewski" has two encodings, which are
added at the same position.
Example:
Discard original token (inject="false").
In: "four score and Kuczewski"
Tokenizer to Filter: "four"(1), "score"(2), "and"(3), "Kuczewski"(4)
Out: "FR"(1), "SKR"(2), "ANT"(3), "KSSK"(4), "KXFS"(4)
Note that "Kuczewski" has two encodings, which are added at the same position.
Edge N-Gram Filter
This filter generates edge n-gram tokens of sizes within the given range.
Factory class: solr.EdgeNGramFilterFactory
Arguments:
minGramSize
(integer, default 1) The minimum gram size.
maxGramSize
(integer, default 1) The maximum gram size.
Example:
Default behavior.
In: "four score and twenty"
Tokenizer to Filter: "four", "score", "and", "twenty"
Out: "f", "s", "a", "t"
Example:
A range of 1 to 4.
In: "four score"
Tokenizer to Filter: "four", "score"
Out: "f", "fo", "fou", "four", "s", "sc", "sco", "scor"
Example:
A range of 4 to 6.
In: "four score and twenty"
Tokenizer to Filter: "four", "score", "and", "twenty"
Out: "four", "scor", "score", "twen", "twent", "twenty"
English Minimal Stem Filter
This filter stems plural English words to their singular form.
Factory class: solr.EnglishMinimalStemFilterFactory
Arguments: None
Example:
In: "dogs cats"
Tokenizer to Filter: "dogs", "cats"
Out: "dog", "cat"
English Possessive Filter
This filter removes singular possessives (trailing 's) from words. Note that plural possessives, e.g., the s' in
"divers' snorkels", are not removed by this filter.
Factory class: solr.EnglishPossessiveFilterFactory
Arguments: None
Example:
In: "Man’s dog bites dogs' man"
Tokenizer to Filter: "Man’s", "dog", "bites", "dogs'", "man"
Out: "Man", "dog", "bites", "dogs'", "man"
Fingerprint Filter
This filter outputs a single token which is a concatenation of the sorted and de-duplicated set of input
tokens. This can be useful for clustering/linking use cases.
Factory class: solr.FingerprintFilterFactory
Arguments:
separator
The character used to separate tokens combined into the single output token. Defaults to " " (a space
character).
maxOutputTokenSize
The maximum length of the summarized output token. If exceeded, no output token is emitted. Defaults
to 1024.
Example:
In: "the quick brown fox jumped over the lazy dog"
Tokenizer to Filter: "the", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"
Out: "brown_dog_fox_jumped_lazy_over_quick_the"
Flatten Graph Filter
This filter must be included on index-time analyzer specifications that include at least one graph-aware filter,
including Synonym Graph Filter and Word Delimiter Graph Filter.
Factory class: solr.FlattenGraphFilterFactory
Arguments: None
See the examples below for Synonym Graph Filter and Word Delimiter Graph Filter.
Hunspell Stem Filter
The Hunspell Stem Filter provides support for several languages. You must provide the dictionary (.dic)
and rules (.aff) files for each language you wish to use with the Hunspell Stem Filter. You can download
those language files here.
Be aware that your results will vary widely based on the quality of the provided dictionary and rules files. For
example, some languages have only a minimal word list with no morphological information. On the other
hand, for languages that have no stemmer but do have an extensive dictionary file, the Hunspell stemmer
may be a good choice.
Factory class: solr.HunspellStemFilterFactory
Arguments:
dictionary
(required) The path of a dictionary file.
affix
(required) The path of a rules file.
ignoreCase
(boolean) controls whether matching is case sensitive or not. The default is false.
strictAffixParsing
(boolean) controls whether the affix parsing is strict or not. If true, an error while reading an affix rule
causes a ParseException, otherwise it is ignored. The default is true.
Example:
In: "jump jumping jumped"
Tokenizer to Filter: "jump", "jumping", "jumped"
Out: "jump", "jump", "jump"
Hyphenated Words Filter
This filter reconstructs hyphenated words that have been tokenized as two tokens because of a line break or
other intervening whitespace in the field text. If a token ends with a hyphen, it is joined with the following
token and the hyphen is discarded.
Note that for this filter to work properly, the upstream tokenizer must not remove trailing hyphen
characters. This filter is generally only useful at index time.
Factory class: solr.HyphenatedWordsFilterFactory
Arguments: None
Example:
In: "A hyphen- ated word"
Tokenizer to Filter: "A", "hyphen-", "ated", "word"
Out: "A", "hyphenated", "word"
ICU Folding Filter
This filter is a custom Unicode normalization form that applies the foldings specified in Unicode TR #30:
Character Foldings in addition to the NFKC_Casefold normalization form as described in ICU Normalizer 2
Filter. This filter is a better substitute for the combined behavior of the ASCII Folding Filter, Lower Case Filter,
and ICU Normalizer 2 Filter.
To use this filter, you must add additional .jars to Solr’s classpath (as described in the section Resources and
Plugins on the Filesystem). See solr/contrib/analysis-extras/README.txt for instructions on which jars
you need to add.
Factory class: solr.ICUFoldingFilterFactory
Arguments:
filter
(string, optional) A Unicode set filter that can be used to e.g., exclude a set of characters from being
processed. See the UnicodeSet javadocs for more information.
Example without a filter:
Example with a filter to exclude Swedish/Finnish characters:
For detailed information on this normalization form, see Unicode TR #30: Character Foldings.
ICU Normalizer 2 Filter
This filter factory normalizes text according to one of five Unicode Normalization Forms as described in
Unicode Standard Annex #15:
• NFC: (name="nfc" mode="compose") Normalization Form C, canonical decomposition
• NFD: (name="nfc" mode="decompose") Normalization Form D, canonical decomposition, followed by
canonical composition
• NFKC: (name="nfkc" mode="compose") Normalization Form KC, compatibility decomposition
• NFKD: (name="nfkc" mode="decompose") Normalization Form KD, compatibility decomposition, followed
by canonical composition
• NFKC_Casefold: (name="nfkc_cf" mode="compose") Normalization Form KC, with additional Unicode case
folding. Using the ICU Normalizer 2 Filter is a better-performing substitution for the Lower Case Filter
and NFKC normalization.
Factory class: solr.ICUNormalizer2FilterFactory
Arguments:
name
The name of the normalization form. Valid options are nfc, nfd, nfkc, nfkd, or nfkc_cf (the default).
Required.
mode
The mode of Unicode character composition and decomposition. Valid options are: compose (the default)
or decompose. Required.
filter
A Unicode set filter that can be used to e.g., exclude a set of characters from being processed. See the
UnicodeSet javadocs for more information. Optional.
Example with NFKC_Casefold:
Example with a filter to exclude Swedish/Finnish characters:
For detailed information about these normalization forms, see Unicode Normalization Forms.
To use this filter, you must add additional .jars to Solr’s classpath (as described in the section Resources and
Plugins on the Filesystem). See solr/contrib/analysis-extras/README.txt for instructions on which jars
you need to add.
ICU Transform Filter
This filter applies ICU Transforms to text. This filter supports only ICU System Transforms. Custom rule sets
are not supported.
Factory class: solr.ICUTransformFilterFactory
Arguments:
id
(string) The identifier for the ICU System Transform you wish to apply with this filter. For a full list of ICU
System Transforms, see http://demo.icu-project.org/icu-bin/translit?TEMPLATE_FILE=data/
translit_rule_main.html.
Example:
For detailed information about ICU Transforms, see http://userguide.icu-project.org/transforms/general.
To use this filter, you must add additional .jars to Solr’s classpath (as described in the section Resources and
Plugins on the Filesystem). See solr/contrib/analysis-extras/README.txt for instructions on which jars
you need to add.
Keep Word Filter
This filter discards all tokens except those that are listed in the given word list. This is the inverse of the Stop
Words Filter. This filter can be useful for building specialized indices for a constrained set of terms.
Factory class: solr.KeepWordFilterFactory
Arguments:
words
(required) Path of a text file containing the list of keep words, one per line. Blank lines and lines that
begin with "#" are ignored. This may be an absolute path, or a simple filename in the Solr conf directory.
ignoreCase
(true/false) If true then comparisons are done case-insensitively. If this argument is true, then the words
file is assumed to contain only lowercase words. The default is false.
enablePositionIncrements
if luceneMatchVersion is 4.3 or earlier and enablePositionIncrements="false", no position holes will
be left by this filter when it removes tokens. This argument is invalid if luceneMatchVersion is 5.0 or
later.
Example:
Where keepwords.txt contains:
happy funny silly
In: "Happy, sad or funny"
Tokenizer to Filter: "Happy", "sad", "or", "funny"
Out: "funny"
Example:
Same keepwords.txt, case insensitive:
In: "Happy, sad or funny"
Tokenizer to Filter: "Happy", "sad", "or", "funny"
Out: "Happy", "funny"
Example:
Using LowerCaseFilterFactory before filtering for keep words, no ignoreCase flag.
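A sketch of the corresponding chain (using the same keepwords.txt file as above):
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.KeepWordFilterFactory" words="keepwords.txt"/>
</analyzer>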
In: "Happy, sad or funny"
Tokenizer to Filter: "Happy", "sad", "or", "funny"
Filter to Filter: "happy", "sad", "or", "funny"
Out: "happy", "funny"
KStem Filter
KStem is an alternative to the Porter Stem Filter for developers looking for a less aggressive stemmer. KStem
was written by Bob Krovetz, ported to Lucene by Sergio Guzman-Lara (UMASS Amherst). This stemmer is
only appropriate for English language text.
Factory class: solr.KStemFilterFactory
Arguments: None
Example:
In: "jump jumping jumped"
Tokenizer to Filter: "jump", "jumping", "jumped"
Out: "jump", "jump", "jump"
Length Filter
This filter passes tokens whose length falls within the min/max limit specified. All other tokens are
discarded.
Factory class: solr.LengthFilterFactory
Arguments:
min
(integer, required) Minimum token length. Tokens shorter than this are discarded.
max
(integer, required, must be >= min) Maximum token length. Tokens longer than this are discarded.
enablePositionIncrements
if luceneMatchVersion is 4.3 or earlier and enablePositionIncrements="false", no position holes will
be left by this filter when it removes tokens. This argument is invalid if luceneMatchVersion is 5.0 or
later.
Example:
In: "turn right at Albuquerque"
Tokenizer to Filter: "turn", "right", "at", "Albuquerque"
Out: "turn", "right"
Limit Token Count Filter
This filter limits the number of accepted tokens, typically useful for index analysis.
By default, this filter ignores any tokens in the wrapped TokenStream once the limit has been reached, which
can result in reset() being called prior to incrementToken() returning false. For most TokenStream
implementations this should be acceptable, and faster than consuming the full stream. If you are wrapping
a TokenStream which requires that the full stream of tokens be exhausted in order to function properly, use
the consumeAllTokens="true" option.
Factory class: solr.LimitTokenCountFilterFactory
Arguments:
maxTokenCount
(integer, required) Maximum token count. After this limit has been reached, tokens are discarded.
consumeAllTokens
(boolean, defaults to false) Whether to consume (and discard) previous token filters' tokens after the
maximum token count has been reached. See description above.
Example:
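A sketch of a configuration consistent with the output below:
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="10"/>
</analyzer>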
In: "1 2 3 4 5 6 7 8 9 10 11 12"
Tokenizer to Filter: "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"
Out: "1", "2", "3", "4", "5", "6", "7", "8", "9", "10"
Limit Token Offset Filter
This filter limits tokens to those before a configured maximum start character offset. This can be useful to
limit highlighting, for example.
By default, this filter ignores any tokens in the wrapped TokenStream once the limit has been reached, which
can result in reset() being called prior to incrementToken() returning false. For most TokenStream
implementations this should be acceptable, and faster than consuming the full stream. If you are wrapping
a TokenStream which requires that the full stream of tokens be exhausted in order to function properly, use
the consumeAllTokens="true" option.
Factory class: solr.LimitTokenOffsetFilterFactory
Arguments:
maxStartOffset
(integer, required) Maximum token start character offset. After this limit has been reached, tokens are
discarded.
consumeAllTokens
(boolean, defaults to false) Whether to consume (and discard) previous token filters' tokens after the
maximum start offset has been reached. See description above.
Example:
In: "0 2 4 6 8 A C E"
Tokenizer to Filter: "0", "2", "4", "6", "8", "A", "C", "E"
Out: "0", "2", "4", "6", "8", "A"
Limit Token Position Filter
This filter limits tokens to those before a configured maximum token position.
By default, this filter ignores any tokens in the wrapped TokenStream once the limit has been reached, which
can result in reset() being called prior to incrementToken() returning false. For most TokenStream
implementations this should be acceptable, and faster than consuming the full stream. If you are wrapping
a TokenStream which requires that the full stream of tokens be exhausted in order to function properly, use
the consumeAllTokens="true" option.
Factory class: solr.LimitTokenPositionFilterFactory
Arguments:
maxTokenPosition
(integer, required) Maximum token position. After this limit has been reached, tokens are discarded.
consumeAllTokens
(boolean, defaults to false) Whether to consume (and discard) previous token filters' tokens after the
maximum token position has been reached. See description above.
Example:
In: "1 2 3 4 5"
Tokenizer to Filter: "1", "2", "3", "4", "5"
Out: "1", "2", "3"
Lower Case Filter
Converts any uppercase letters in a token to the equivalent lowercase token. All other characters are left
unchanged.
Factory class: solr.LowerCaseFilterFactory
Arguments: None
Example:
In: "Down With CamelCase"
Tokenizer to Filter: "Down", "With", "CamelCase"
Out: "down", "with", "camelcase"
Managed Stop Filter
This is a specialized version of the Stop Words Filter Factory that uses a set of stop words that are managed
from a REST API.
Arguments:
managed
The name that should be used for this set of stop words in the managed REST API.
Example: With this configuration the set of words is named "english" and can be managed via
/solr/collection_name/schema/analysis/stopwords/english
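A sketch of such a configuration:
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.ManagedStopFilterFactory" managed="english"/>
</analyzer>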
See Stop Filter for example input/output.
Managed Synonym Filter
This is a specialized version of the Synonym Filter that uses a mapping of synonyms that is managed from a
REST API.
Managed Synonym Filter has been Deprecated
Managed Synonym Filter has been deprecated in favor of Managed Synonym Graph Filter,
which is required for multi-term synonym support.
Factory class: solr.ManagedSynonymFilterFactory
For arguments and examples, see the Synonym Graph Filter below.
Managed Synonym Graph Filter
This is a specialized version of the Synonym Graph Filter that uses a mapping of synonyms that is managed
from a REST API.
This filter maps single- or multi-token synonyms, producing a fully correct graph output. This filter is a
replacement for the Managed Synonym Filter, which produces incorrect graphs for multi-token synonyms.
Although this filter produces correct token graphs, it cannot consume an input token graph
correctly.
Arguments:
managed
The name that should be used for this mapping on synonyms in the managed REST API.
Example: With this configuration the set of mappings is named "english" and can be managed via
/solr/collection_name/schema/analysis/synonyms/english
See Synonym Graph Filter below for example input/output.
N-Gram Filter
Generates n-gram tokens of sizes in the given range. Note that tokens are ordered by position and then by
gram size.
Factory class: solr.NGramFilterFactory
Arguments:
minGramSize
(integer, default 1) The minimum gram size.
maxGramSize
(integer, default 2) The maximum gram size.
Example:
Default behavior.
In: "four score"
Tokenizer to Filter: "four", "score"
Out: "f", "o", "u", "r", "fo", "ou", "ur", "s", "c", "o", "r", "e", "sc", "co", "or", "re"
Example:
A range of 1 to 4.
In: "four score"
Tokenizer to Filter: "four", "score"
Out: "f", "fo", "fou", "four", "o", "ou", "our", "u", "ur", "r", "s", "sc", "sco", "scor", "c", "co", "cor", "core",
"o", "or", "ore", "r", "re", "e"
Example:
A range of 3 to 5.
In: "four score"
Tokenizer to Filter: "four", "score"
Out: "fou", "four", "our", "sco", "scor", "score", "cor", "core", "ore"
Numeric Payload Token Filter
This filter adds a numeric floating point payload value to tokens that match a given type. Refer to the
Javadoc for the org.apache.lucene.analysis.Token class for more information about token types and
payloads.
Factory class: solr.NumericPayloadTokenFilterFactory
Arguments:
payload
(required) A floating point value that will be added to all matching tokens.
typeMatch
(required) A token type name string. Tokens with a matching type name will have their payload set to the
above floating point value.
Example:
In: "bing bang boom"
Tokenizer to Filter: "bing", "bang", "boom"
Out: "bing"[0.75], "bang"[0.75], "boom"[0.75]
Pattern Replace Filter
This filter applies a regular expression to each token and, for those that match, substitutes the given
replacement string in place of the matched pattern. Tokens which do not match are passed through
unchanged.
Factory class: solr.PatternReplaceFilterFactory
Arguments:
pattern
(required) The regular expression to test against each token, as per java.util.regex.Pattern.
replacement
(required) A string to substitute in place of the matched pattern. This string may contain references to
capture groups in the regex pattern. See the Javadoc for java.util.regex.Matcher.
replace
("all" or "first", default "all") Indicates whether all occurrences of the pattern in the token should be
replaced, or only the first.
Example:
Simple string replace:
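A sketch of the corresponding configuration:
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.PatternReplaceFilterFactory" pattern="cat" replacement="dog"/>
</analyzer>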
In: "cat concatenate catycat"
Tokenizer to Filter: "cat", "concatenate", "catycat"
Out: "dog", "condogenate", "dogydog"
Example:
String replacement, first occurrence only:
In: "cat concatenate catycat"
Tokenizer to Filter: "cat", "concatenate", "catycat"
Out: "dog", "condogenate", "dogycat"
Example:
More complex pattern with capture group reference in the replacement. Tokens that start with non-numeric
characters and end with digits will have an underscore inserted before the numbers. Otherwise the token is
passed through.
In: "cat foo1234 9987 blah1234foo"
Tokenizer to Filter: "cat", "foo1234", "9987", "blah1234foo"
Out: "cat", "foo_1234", "9987", "blah1234foo"
Phonetic Filter
This filter creates tokens using one of the phonetic encoding algorithms in the
org.apache.commons.codec.language package. For more information, see the section on Phonetic
Matching.
Factory class: solr.PhoneticFilterFactory
Arguments:
encoder
(required) The name of the encoder to use. The encoder name must be one of the following (case
insensitive): DoubleMetaphone, Metaphone, Soundex, RefinedSoundex, Caverphone (v2.0),
ColognePhonetic, or Nysiis.
inject
(true/false) If true (the default), then new phonetic tokens are added to the stream. Otherwise, tokens
are replaced with the phonetic equivalent. Setting this to false will enable phonetic matching, but the
exact spelling of the target word may not match.
maxCodeLength
(integer) The maximum length of the code to be generated by the Metaphone or Double Metaphone
encoders.
Example:
Default behavior for DoubleMetaphone encoding.
In: "four score and twenty"
Tokenizer to Filter: "four"(1), "score"(2), "and"(3), "twenty"(4)
Out: "four"(1), "FR"(1), "score"(2), "SKR"(2), "and"(3), "ANT"(3), "twenty"(4), "TNT"(4)
The phonetic tokens have a position increment of 0, which indicates that they are at the same position as the
token they were derived from (immediately preceding).
Example:
Discard original token.
In: "four score and twenty"
Tokenizer to Filter: "four"(1), "score"(2), "and"(3), "twenty"(4)
Out: "FR"(1), "SKR"(2), "ANT"(3), "TWNT"(4)
Example:
Default Soundex encoder.
In: "four score and twenty"
Tokenizer to Filter: "four"(1), "score"(2), "and"(3), "twenty"(4)
Out: "four"(1), "F600"(1), "score"(2), "S600"(2), "and"(3), "A530"(3), "twenty"(4), "T530"(4)
Porter Stem Filter
This filter applies the Porter Stemming Algorithm for English. The results are similar to using the Snowball
Porter Stemmer with the language="English" argument. But this stemmer is coded directly in Java and is
not based on Snowball. It does not accept a list of protected words and is only appropriate for English
language text. However, it has been benchmarked as four times faster than the English Snowball stemmer,
so can provide a performance enhancement.
Factory class: solr.PorterStemFilterFactory
Arguments: None
Example:
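A minimal analyzer sketch for this example (tokenizer choice is illustrative):

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.PorterStemFilterFactory"/>
</analyzer>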
In: "jump jumping jumped"
Tokenizer to Filter: "jump", "jumping", "jumped"
Out: "jump", "jump", "jump"
Protected Term Filter
This filter enables a form of conditional filtering: it only applies its wrapped filters to terms that are not
contained in a protected set.
Factory class: solr.ProtectedTermFilterFactory
Arguments:
protected
(required) Comma-separated list of files containing protected terms, one per line.
wrappedFilters
(required) Case-insensitive comma-separated list of TokenFilterFactory SPI names (strip trailing
(Token)FilterFactory from the factory name - see the java.util.ServiceLoader interface). Each filter
name must be unique, so if you need to specify the same filter more than once, you must add case-insensitive unique -id suffixes to each same-SPI-named filter (note that the -id suffix is stripped prior to
SPI lookup).
ignoreCase
(true/false, default false) Ignore case when testing for protected words. If true, the protected list should
contain lowercase words.
Example:
All terms except those in protectedTerms.txt are truncated at 4 characters and lowercased:
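A configuration sketch along these lines (the field type name is arbitrary; truncate.prefixLength passes an init parameter to the wrapped truncate filter):

<fieldType name="prot_trunc_lower" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ProtectedTermFilterFactory"
            ignoreCase="true" protected="protectedTerms.txt"
            wrappedFilters="truncate,lowercase"
            truncate.prefixLength="4"/>
  </analyzer>
</fieldType>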
Example:
This example includes multiple same-named wrapped filters with unique -id suffixes. Note that both the
filter SPI names and -id suffixes are treated case-insensitively.
For all terms except those in protectedTerms.txt, synonyms are added, terms are reversed, and then
synonyms are added for the reversed terms:
Remove Duplicates Token Filter
The filter removes duplicate tokens in the stream. Tokens are considered to be duplicates ONLY if they have
the same text and position values.
Because positions must be the same, this filter might not do what a user expects it to do based on its name.
It is a very specialized filter that is only useful in very specific circumstances. It has been so named for
brevity, even though it is potentially misleading.
Factory class: solr.RemoveDuplicatesTokenFilterFactory
Arguments: None
Example:
One example of where RemoveDuplicatesTokenFilterFactory is useful is in situations where a synonym file is
being used in conjunction with a stemmer. In these situations, both the stemmer and the synonym filter can
cause completely identical terms with the same positions to end up in the stream, increasing index size with
no benefit.
Consider the following entry from a synonyms.txt file:
Television, Televisions, TV, TVs
When used in the following configuration:
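A query-time analyzer sketch consistent with the token flow shown below (the choice of solr.EnglishMinimalStemFilterFactory as the stemmer is an assumption; any plural-stripping stemmer gives the same effect):

<analyzer type="query">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"/>
  <filter class="solr.EnglishMinimalStemFilterFactory"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>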
In: "Watch TV"
Tokenizer to Synonym Filter: "Watch"(1) "TV"(2)
Synonym Filter to Stem Filter: "Watch"(1) "Television"(2) "Televisions"(2) "TV"(2) "TVs"(2)
Stem Filter to Remove Dups Filter: "Watch"(1) "Television"(2) "Television"(2) "TV"(2) "TV"(2)
Out: "Watch"(1) "Television"(2) "TV"(2)
Reversed Wildcard Filter
This filter reverses tokens to provide faster leading wildcard and prefix queries. Tokens without wildcards
are not reversed.
Factory class: solr.ReversedWildcardFilterFactory
Arguments:
withOriginal
(boolean) If true, the filter produces both original and reversed tokens at the same positions. If false,
produces only reversed tokens.
maxPosAsterisk
(integer, default = 2) The maximum position of the asterisk wildcard ('*') that triggers the reversal of the
query term. Terms with asterisks at positions above this value are not reversed.
maxPosQuestion
(integer, default = 1) The maximum position of the question mark wildcard ('?') that triggers the reversal
of query term. To reverse only pure suffix queries (queries with a single leading asterisk), set this to 0 and
maxPosAsterisk to 1.
maxFractionAsterisk
(float, default = 0.0) An additional parameter that triggers the reversal if asterisk ('*') position is less than
this fraction of the query token length.
minTrailing
(integer, default = 2) The minimum number of trailing characters in a query token after the last wildcard
character. For good performance this should be set to a value larger than 1.
Example:
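An index-time analyzer sketch (the parameter values shown are illustrative):

<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.ReversedWildcardFilterFactory"
          maxPosAsterisk="2" maxPosQuestion="1" minTrailing="2"/>
</analyzer>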
In: "*foo *bar"
Tokenizer to Filter: "*foo", "*bar"
Out: "oof*", "rab*"
Shingle Filter
This filter constructs shingles, which are token n-grams, from the token stream. It combines runs of tokens
into a single token.
Factory class: solr.ShingleFilterFactory
Arguments:
minShingleSize
(integer, must be >= 2, default 2) The minimum number of tokens per shingle.
maxShingleSize
(integer, must be >= minShingleSize, default 2) The maximum number of tokens per shingle.
outputUnigrams
(boolean, default true) If true, then each individual token is also included at its original position.
outputUnigramsIfNoShingles
(boolean, default false) If true, then individual tokens will be output if no shingles are possible.
tokenSeparator
(string, default is " ") The string to use when joining adjacent tokens to form a shingle.
Example:
Default behavior.
In: "To be, or what?"
Tokenizer to Filter: "To"(1), "be"(2), "or"(3), "what"(4)
Out: "To"(1), "To be"(1), "be"(2), "be or"(2), "or"(3), "or what"(3), "what"(4)
Example:
A shingle size of four, do not include original token.
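A configuration sketch for this example:

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.ShingleFilterFactory" maxShingleSize="4" outputUnigrams="false"/>
</analyzer>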
In: "To be, or not to be."
Tokenizer to Filter: "To"(1), "be"(2), "or"(3), "not"(4), "to"(5), "be"(6)
Out: "To be"(1), "To be or"(1), "To be or not"(1), "be or"(2), "be or not"(2), "be or not to"(2), "or not"(3), "or
not to"(3), "or not to be"(3), "not to"(4), "not to be"(4), "to be"(5)
Snowball Porter Stemmer Filter
This filter factory instantiates a language-specific stemmer generated by Snowball. Snowball is a software
package that generates pattern-based word stemmers. This type of stemmer is not as accurate as a table-based stemmer, but is faster and less complex. Table-driven stemmers are labor-intensive to create and
maintain and so are typically commercial products.
Solr contains Snowball stemmers for Armenian, Basque, Catalan, Danish, Dutch, English, Finnish, French,
German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish and Turkish. For
more information on Snowball, visit http://snowball.tartarus.org/.
StopFilterFactory, CommonGramsFilterFactory, and CommonGramsQueryFilterFactory can optionally read
stopwords in Snowball format (specify format="snowball" in the configuration of those FilterFactories).
Factory class: solr.SnowballPorterFilterFactory
Arguments:
language
(default "English") The name of a language, used to select the appropriate Porter stemmer to use. Case is
significant. This string is used to select a package name in the org.tartarus.snowball.ext class
hierarchy.
protected
Path of a text file containing a list of protected words, one per line. Protected words will not be stemmed.
Blank lines and lines that begin with "#" are ignored. This may be an absolute path, or a simple file name
in the Solr conf directory.
Example:
Default behavior:
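A minimal sketch (the language argument defaults to English):

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.SnowballPorterFilterFactory"/>
</analyzer>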
In: "flip flipped flipping"
Tokenizer to Filter: "flip", "flipped", "flipping"
Out: "flip", "flip", "flip"
Example:
French stemmer, English words:
In: "flip flipped flipping"
Tokenizer to Filter: "flip", "flipped", "flipping"
Out: "flip", "flipped", "flipping"
Example:
Spanish stemmer, Spanish words:
In: "cante canta"
Tokenizer to Filter: "cante", "canta"
Out: "cant", "cant"
Stop Filter
This filter discards, or stops analysis of, tokens that are on the given stop words list. A standard stop words
list is included in the Solr conf directory, named stopwords.txt, which is appropriate for typical English
language text.
Factory class: solr.StopFilterFactory
Arguments:
words
(optional) The path to a file that contains a list of stop words, one per line. Blank lines and lines that begin
with "#" are ignored. This may be an absolute path, or path relative to the Solr conf directory.
format
(optional) If the stopwords list has been formatted for Snowball, you can specify format="snowball" so
Solr can read the stopwords file.
ignoreCase
(true/false, default false) Ignore case when testing for stop words. If true, the stop list should contain
lowercase words.
enablePositionIncrements
if luceneMatchVersion is 4.4 or earlier and enablePositionIncrements="false", no position holes will
be left by this filter when it removes tokens. This argument is invalid if luceneMatchVersion is 5.0 or
later.
Example:
Case-sensitive matching, capitalized words not stopped. Token positions skip stopped words.
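A configuration sketch for this example (case-sensitive matching is the default):

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
</analyzer>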
In: "To be or what?"
Tokenizer to Filter: "To"(1), "be"(2), "or"(3), "what"(4)
Out: "To"(1), "what"(4)
Example:
In: "To be or what?"
Tokenizer to Filter: "To"(1), "be"(2), "or"(3), "what"(4)
Out: "what"(4)
Suggest Stop Filter
Like Stop Filter, this filter discards, or stops analysis of, tokens that are on the given stop words list.
Suggest Stop Filter differs from Stop Filter in that it will not remove the last token unless it is followed by a
token separator. For example, a query "find the" would preserve the 'the' since it was not followed by a
space, punctuation, etc., and mark it as a KEYWORD so that following filters will not change or remove it.
By contrast, a query like “find the popsicle” would remove ‘the’ as a stopword, since it’s followed by a space.
When using one of the analyzing suggesters, you would normally use the ordinary StopFilterFactory in
your index analyzer and then SuggestStopFilter in your query analyzer.
Factory class: solr.SuggestStopFilterFactory
Arguments:
words
(optional; default: StopAnalyzer#ENGLISH_STOP_WORDS_SET ) The name of a stopwords file to parse.
format
(optional; default: wordset) Defines how the words file will be parsed. If words is not specified, then
format must not be specified. The valid values for the format option are:
wordset
This is the default format, which supports one word per line (including any intra-word whitespace) and
allows whole line comments beginning with the # character. Blank lines are ignored.
snowball
This format allows for multiple words specified on each line, and trailing comments may be specified
using the vertical line (|). Blank lines are ignored.
ignoreCase
(optional; default: false) If true, matching is case-insensitive.
Example:
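A query-time analyzer sketch consistent with the example below (the lowercase filter is assumed so that "The" matches the stopword "the"):

<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SuggestStopFilterFactory" ignoreCase="true"
          words="stopwords.txt" format="wordset"/>
</analyzer>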
In: "The The"
Tokenizer to Filter: "the"(1), "the"(2)
Out: "the"(2)
Synonym Filter
This filter does synonym mapping. Each token is looked up in the list of synonyms and if a match is found,
then the synonym is emitted in place of the token. The position values of the new tokens are set such that they all
occur at the same position as the original token.
Synonym Filter has been Deprecated
Synonym Filter has been deprecated in favor of Synonym Graph Filter, which is required for
multi-term synonym support.
Factory class: solr.SynonymFilterFactory
For arguments and examples, see the Synonym Graph Filter below.
Synonym Graph Filter
This filter maps single- or multi-token synonyms, producing a fully correct graph output. This filter is a
replacement for the Synonym Filter, which produces incorrect graphs for multi-token synonyms.
If you use this filter during indexing, you must follow it with a Flatten Graph Filter to squash tokens on top of
one another like the Synonym Filter, because the indexer can’t directly consume a graph. To get fully correct
positional queries when your synonym replacements are multiple tokens, you should instead apply
synonyms using this filter at query time.
Although this filter produces correct token graphs, it cannot consume an input token graph
correctly.
Factory class: solr.SynonymGraphFilterFactory
Arguments:
synonyms
(required) The path of a file that contains a list of synonyms, one per line. In the (default) solr format (see the format argument below for alternatives), blank lines and lines that begin with "#" are ignored.
This may be a comma-separated list of paths. See Resource and Plugin Loading for more information.
There are two ways to specify synonym mappings:
• A comma-separated list of words. If the token matches any of the words, then all the words in the list
are substituted, which will include the original token.
• Two comma-separated lists of words with the symbol "=>" between them. If the token matches any
word on the left, then the list on the right is substituted. The original token will not be included unless
it is also in the list on the right.
ignoreCase
(optional; default: false) If true, synonyms will be matched case-insensitively.
expand
(optional; default: true) If true, a synonym will be expanded to all equivalent synonyms. If false, all
equivalent synonyms will be reduced to the first in the list.
format
(optional; default: solr) Controls how the synonyms will be parsed. The short names solr (for
SolrSynonymParser) and wordnet (for WordnetSynonymParser ) are supported, or you may alternatively
supply the name of your own SynonymMap.Builder subclass.
tokenizerFactory
(optional; default: WhitespaceTokenizerFactory) The name of the tokenizer factory to use when parsing
the synonyms file. Arguments with the name prefix tokenizerFactory.* will be supplied as init params
to the specified tokenizer factory.
Any arguments not consumed by the synonym filter factory, including those without the
tokenizerFactory.* prefix, will also be supplied as init params to the tokenizer factory.
If tokenizerFactory is specified, then analyzer may not be, and vice versa.
analyzer
(optional; default: WhitespaceTokenizerFactory) The name of the analyzer class to use when parsing the
synonyms file. If analyzer is specified, then tokenizerFactory may not be, and vice versa.
For the following examples, assume a synonyms file named mysynonyms.txt:
couch,sofa,divan
teh => the
huge,ginormous,humungous => large
small => tiny,teeny,weeny
Example:
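A query-time analyzer sketch using the file above:

<analyzer type="query">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.SynonymGraphFilterFactory" synonyms="mysynonyms.txt"/>
</analyzer>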
In: "teh small couch"
Tokenizer to Filter: "teh"(1), "small"(2), "couch"(3)
Out: "the"(1), "tiny"(2), "teeny"(2), "weeny"(2), "couch"(3), "sofa"(3), "divan"(3)
Example:
In: "teh ginormous, humungous sofa"
Tokenizer to Filter: "teh"(1), "ginormous"(2), "humungous"(3), "sofa"(4)
Out: "the"(1), "large"(2), "large"(3), "couch"(4), "sofa"(4), "divan"(4)
Token Offset Payload Filter
This filter adds the numeric character offsets of the token as a payload value for that token.
Factory class: solr.TokenOffsetPayloadTokenFilterFactory
Arguments: None
Example:
In: "bing bang boom"
Tokenizer to Filter: "bing", "bang", "boom"
Out: "bing"[0,4], "bang"[5,9], "boom"[10,14]
Trim Filter
This filter trims leading and/or trailing whitespace from tokens. Most tokenizers break tokens at whitespace,
so this filter is most often used for special situations.
Factory class: solr.TrimFilterFactory
Arguments:
updateOffsets
if luceneMatchVersion is 4.3 or earlier and updateOffsets="true", trimmed tokens' start and end
offsets will be updated to those of the first and last characters (plus one) remaining in the token. This
argument is invalid if luceneMatchVersion is 5.0 or later.
Example:
The PatternTokenizerFactory configuration used here splits the input on simple commas; it does not remove
whitespace.
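A configuration sketch matching that description:

<analyzer>
  <!-- split on commas only; surrounding whitespace is kept in the tokens -->
  <tokenizer class="solr.PatternTokenizerFactory" pattern=","/>
  <filter class="solr.TrimFilterFactory"/>
</analyzer>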
In: "one, two , three ,four "
Tokenizer to Filter: "one", " two ", " three ", "four "
Out: "one", "two", "three", "four"
Type As Payload Filter
This filter adds the token’s type, as an encoded byte sequence, as its payload.
Factory class: solr.TypeAsPayloadTokenFilterFactory
Arguments: None
Example:
In: "Pay Bob’s I.O.U."
Tokenizer to Filter: "Pay", "Bob’s", "I.O.U."
Out: "Pay"[], "Bob’s"[], "I.O.U."[]
Type As Synonym Filter
This filter adds the token’s type, as a token at the same position as the token, optionally with a configurable
prefix prepended.
Factory class: solr.TypeAsSynonymFilterFactory
Arguments:
prefix
(optional) The prefix to prepend to the token’s type.
Examples:
With the example below, each token’s type will be emitted verbatim at the same position:
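A minimal sketch for emitting the raw type:

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.TypeAsSynonymFilterFactory"/>
</analyzer>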
With the example below, for a token "example.com" with a type such as <URL>, the token emitted at the same
position will be that type with the configured prefix prepended, for example "_type_<URL>":
Type Token Filter
This filter blacklists or whitelists a specified list of token types, assuming the tokens have type metadata
associated with them. For example, the UAX29 URL Email Tokenizer emits "<URL>" and "<EMAIL>" typed
tokens, as well as other types. This filter would allow you to pull out only e-mail addresses from text as
tokens, if you wish.
Factory class: solr.TypeTokenFilterFactory
Arguments:
types
Defines the location of a file of types to filter.
useWhitelist
If true, the file defined in types should be used as include list. If false, or undefined, the file defined in
types is used as a blacklist.
enablePositionIncrements
if luceneMatchVersion is 4.3 or earlier and enablePositionIncrements="false", no position holes will
be left by this filter when it removes tokens. This argument is invalid if luceneMatchVersion is 5.0 or
later.
Example:
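A configuration sketch (the types file name is hypothetical; useWhitelist="true" keeps only the listed types):

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.TypeTokenFilterFactory" types="keeptypes.txt" useWhitelist="true"/>
</analyzer>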
Word Delimiter Filter
This filter splits tokens at word delimiters.
Word Delimiter Filter has been Deprecated
Word Delimiter Filter has been deprecated in favor of Word Delimiter Graph Filter, which is
required to produce a correct token graph so that e.g., phrase queries can work correctly.
Factory class: solr.WordDelimiterFilterFactory
For a full description, including arguments and examples, see the Word Delimiter Graph Filter below.
Word Delimiter Graph Filter
This filter splits tokens at word delimiters.
If you use this filter during indexing, you must follow it with a Flatten Graph Filter to squash tokens on top of
one another like the Word Delimiter Filter, because the indexer can’t directly consume a graph. To get fully
correct positional queries when tokens are split, you should instead use this filter at query time.
Note: although this filter produces correct token graphs, it cannot consume an input token graph correctly.
The rules for determining delimiters are as follows:
• A change in case within a word: "CamelCase" -> "Camel", "Case". This can be disabled by setting
splitOnCaseChange="0".
• A transition from alpha to numeric characters or vice versa: "Gonzo5000" -> "Gonzo", "5000"; "4500XL" ->
"4500", "XL". This can be disabled by setting splitOnNumerics="0".
• Non-alphanumeric characters (discarded): "hot-spot" -> "hot", "spot"
• A trailing "'s" is removed: "O’Reilly’s" -> "O", "Reilly"
• Any leading or trailing delimiters are discarded: "--hot-spot--" -> "hot", "spot"
Factory class: solr.WordDelimiterGraphFilterFactory
Arguments:
generateWordParts
(integer, default 1) If non-zero, splits words at delimiters. For example: "CamelCase", "hot-spot" ->
"Camel", "Case", "hot", "spot"
generateNumberParts
(integer, default 1) If non-zero, splits numeric strings at delimiters: "1947-32" -> "1947", "32"
splitOnCaseChange
(integer, default 1) If 0, words are not split on camel-case changes: "BugBlaster-XL" -> "BugBlaster", "XL".
Example 1 below illustrates the default (non-zero) splitting behavior.
splitOnNumerics
(integer, default 1) If 0, don’t split words on transitions from alpha to numeric: "FemBot3000" -> "Fem",
"Bot3000"
catenateWords
(integer, default 0) If non-zero, maximal runs of word parts will be joined: "hot-spot-sensor’s" ->
"hotspotsensor"
catenateNumbers
(integer, default 0) If non-zero, maximal runs of number parts will be joined: "1947-32" -> "194732"
catenateAll
(0/1, default 0) If non-zero, runs of word and number parts will be joined: "Zap-Master-9000" ->
"ZapMaster9000"
preserveOriginal
(integer, default 0) If non-zero, the original token is preserved: "Zap-Master-9000" -> "Zap-Master-9000",
"Zap", "Master", "9000"
protected
(optional) The pathname of a file that contains a list of protected words that should be passed through
without splitting.
stemEnglishPossessive
(integer, default 1) If 1, strips the possessive 's from each subword.
types
(optional) The pathname of a file that contains character => type mappings, which enable customization
of this filter’s splitting behavior. Recognized character types: LOWER, UPPER, ALPHA, DIGIT, ALPHANUM, and
SUBWORD_DELIM.
The default for any character without a customized mapping is computed from Unicode character
properties. Blank lines and comment lines starting with '#' are ignored. An example file:
# Don't split numbers at '$', '.' or ','
$ => DIGIT
. => DIGIT
\u002C => DIGIT
# Don't split on ZWJ: http://en.wikipedia.org/wiki/Zero-width_joiner
\u200D => ALPHANUM
Example:
Default behavior. The whitespace tokenizer is used here to preserve non-alphanumeric characters.
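An index-time analyzer sketch for the default behavior (the Flatten Graph Filter is included because, as noted above, the indexer cannot consume a graph directly):

<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterGraphFilterFactory"/>
  <filter class="solr.FlattenGraphFilterFactory"/>
</analyzer>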
In: "hot-spot RoboBlaster/9000 100XL"
Tokenizer to Filter: "hot-spot", "RoboBlaster/9000", "100XL"
Out: "hot", "spot", "Robo", "Blaster", "9000", "100", "XL"
Example:
Do not split on case changes, and do not generate number parts. Note that by not generating number parts,
tokens containing only numeric parts are ultimately discarded.
In: "hot-spot RoboBlaster/9000 100-42"
Tokenizer to Filter: "hot-spot", "RoboBlaster/9000", "100-42"
Out: "hot", "spot", "RoboBlaster", "9000"
Example:
Concatenate word parts and number parts, but not word and number parts that occur in the same token.
In: "hot-spot 100+42 XL40"
Tokenizer to Filter: "hot-spot"(1), "100+42"(2), "XL40"(3)
Out: "hot"(1), "spot"(2), "hotspot"(2), "100"(3), "42"(4), "10042"(4), "XL"(5), "40"(6)
Example:
Concatenate all. Word and/or number parts are joined together.
In: "XL-4000/ES"
Tokenizer to Filter: "XL-4000/ES"(1)
Out: "XL"(1), "4000"(2), "ES"(3), "XL4000ES"(3)
Example:
Using a protected words list that contains "AstroBlaster" and "XL-5000" (among others).
In: "FooBar AstroBlaster XL-5000 ==ES-34-"
Tokenizer to Filter: "FooBar", "AstroBlaster", "XL-5000", "==ES-34-"
Out: "FooBar", "FooBar", "AstroBlaster", "XL-5000", "ES", "34"
CharFilterFactories
CharFilter is a component that pre-processes input characters.
CharFilters can be chained like Token Filters and placed in front of a Tokenizer. CharFilters can add, change,
or remove characters while preserving the original character offsets to support features like highlighting.
solr.MappingCharFilterFactory
This filter creates org.apache.lucene.analysis.MappingCharFilter, which can be used for changing one
string to another (for example, for normalizing é to e.).
This filter requires specifying a mapping argument, which is the path and name of a file containing the
mappings to perform.
Example:
[...]
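A sketch of a typical configuration (the mapping file name is illustrative; Solr's example configsets ship a similar mapping-ISOLatin1Accent.txt):

<analyzer>
  <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
  <tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>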
Mapping file syntax:
• Comment lines beginning with a hash mark (#), as well as blank lines, are ignored.
• Each non-comment, non-blank line consists of a mapping of the form: "source" => "target"
◦ Double-quoted source string, optional whitespace, an arrow (=>), optional whitespace, double-quoted
target string.
• Trailing comments on mapping lines are not allowed.
• The source string must contain at least one character, but the target string may be empty.
• The following character escape sequences are recognized within source and target strings:
Escape Sequence | Resulting Character (ECMA-48 alias) | Unicode Character | Example Mapping Line
\\ | \ | U+005C | "\\" => "/"
\" | " | U+0022 | "\"and\"" => "'and'"
\b | backspace (BS) | U+0008 | "\b" => " "
\t | tab (HT) | U+0009 | "\t" => ","
\n | newline (LF) | U+000A | "\n" => "<newline>"
\f | form feed (FF) | U+000C | "\f" => "\n"
\r | carriage return (CR) | U+000D | "\r" => "/carriagereturn/"
\uXXXX | Unicode char referenced by the 4 hex digits | U+XXXX | "\uFEFF" => ""
◦ A backslash followed by any other character is interpreted as if the character were present without
the backslash.
solr.HTMLStripCharFilterFactory
This filter creates org.apache.solr.analysis.HTMLStripCharFilter. This CharFilter strips HTML from the
input stream and passes the result to another CharFilter or a Tokenizer.
This filter:
• Removes HTML/XML tags while preserving other content.
• Removes attributes within tags and supports optional attribute quoting.
• Removes XML processing instructions, such as: <?foo bar?>
• Removes XML comments.
• Removes XML elements starting with <!>.
• Removes contents of <script> and <style> elements.
[...]
solr.ICUNormalizer2CharFilterFactory
This filter performs pre-tokenization Unicode normalization using ICU4J.
Arguments:
name
A Unicode Normalization Form, one of nfc, nfkc, nfkc_cf. Default is nfkc_cf.
mode
Either compose or decompose. Default is compose. Use decompose with name="nfc" or name="nfkc" to get
NFD or NFKD, respectively.
filter
A UnicodeSet pattern. Codepoints outside the set are always left unchanged. Default is [] (the null set, no
filtering - all codepoints are subject to normalization).
Example:
[...]
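A minimal sketch using the defaults (name="nfkc_cf", mode="compose"):

<analyzer>
  <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
  <tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>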
solr.PatternReplaceCharFilterFactory
This filter uses regular expressions to replace or change character patterns.
Arguments:
pattern
the regular expression pattern to apply to the incoming text.
replacement
the text to use to replace matching patterns.
You can configure this filter in schema.xml like this:
[...]
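A configuration sketch (the pattern and replacement shown are illustrative; they collapse the whitespace after "No."-style prefixes):

<analyzer>
  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="([nN][oO]\.)\s*(\d+)" replacement="$1$2"/>
  <tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>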
The table below presents examples of regex-based pattern replacement:
Input | Pattern | Replacement | Output | Description
see-ing looking | (\w+)(ing) | $1 | see-ing look | Removes "ing" from the end of a word.
see-ing looking | (\w+)ing | $1 | see-ing look | Same as above; the second parentheses can be omitted.
No.1 NO. no. 543 | [nN][oO]\.\s*(\d+) | #$1 | #1 NO. #543 | Replaces some string literals.
abc=1234=5678 | (\w+)=(\d+)=(\d+) | $3=$1=$2 | 5678=abc=1234 | Changes the order of the groups.
Language Analysis
This section contains information about tokenizers and filters related to character set conversion or for use
with specific languages.
For the European languages, tokenization is fairly straightforward. Tokens are delimited by white space
and/or a relatively small set of punctuation characters.
In other languages the tokenization rules are often not so simple. Some European languages may also
require special tokenization rules, such as rules for decompounding German words.
For information about language detection at index time, see Detecting Languages During Indexing.
KeywordMarkerFilterFactory
Protects words from being modified by stemmers. A customized protected word list may be specified with
the "protected" attribute in the schema. Any words in the protected word list will not be modified by any
stemmer in Solr.
A sample Solr protwords.txt with comments can be found in the sample_techproducts_configs configset
directory:
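A fieldType sketch showing the filter placed before a stemmer (the type name is arbitrary; the Porter stemmer is just one possible stemmer):

<fieldType name="text_protected" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>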
KeywordRepeatFilterFactory
Emits each token twice, once with the KEYWORD attribute and once without.
If placed before a stemmer, the result will be that you will get the unstemmed token preserved on the same
position as the stemmed one. Queries matching the original exact term will get a better score while still
maintaining the recall benefit of stemming. Another advantage of keeping the original token is that wildcard
truncation will work as expected.
To configure, add the KeywordRepeatFilterFactory early in the analysis chain. It is recommended to also
include RemoveDuplicatesTokenFilterFactory to avoid duplicates when tokens are not stemmed.
A sample fieldType configuration could look like this:
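A sketch along those lines (the type name is arbitrary; the Porter stemmer is only one possible choice):

<fieldType name="english_stem_preserve_original" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.KeywordRepeatFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>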
When adding the same token twice, it will also score twice (double), so you may have to retune your ranking rules.
StemmerOverrideFilterFactory
Overrides stemming algorithms by applying a custom mapping, then protecting these terms from being
modified by stemmers.
A customized mapping of words to stems, in a tab-separated file, can be specified to the dictionary
attribute in the schema. Words in this mapping will be stemmed to the stems from the file, and will not be
further changed by any stemmer.
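A configuration sketch using such a dictionary (the file name matches the sample below):

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict.txt"/>
  <filter class="solr.PorterStemFilterFactory"/>
</analyzer>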
A sample stemdict.txt file is shown below:
# these must be tab-separated
monkeys monkey
otters otter
# some crazy ones that a stemmer would never do
dogs cat
If you have a checkout of Solr’s source code locally, you can also find this example in Solr’s test resources at
solr/core/src/test-files/solr/collection1/conf/stemdict.txt.
Dictionary Compound Word Token Filter
This filter splits, or decompounds, compound words into individual words using a dictionary of the
component words. Each input token is passed through unchanged. If it can also be decompounded into
subwords, each subword is also added to the stream at the same logical position.
Compound words are most commonly found in Germanic languages.
Factory class: solr.DictionaryCompoundWordTokenFilterFactory
Arguments:
dictionary
(required) The path of a file that contains a list of simple words, one per line. Blank lines and lines that
begin with "#" are ignored. See Resource and Plugin Loading for more information.
minWordSize
(integer, default 5) Any token shorter than this is not decompounded.
minSubwordSize
(integer, default 2) Subwords shorter than this are not emitted as tokens.
maxSubwordSize
(integer, default 15) Subwords longer than this are not emitted as tokens.
onlyLongestMatch
(true/false) If true (the default), only the longest matching subwords will generate new tokens.
Example:
Assume that germanwords.txt contains at least the following words: dumm kopf donau dampf schiff
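A configuration sketch for this example:

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="germanwords.txt"/>
</analyzer>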
In: "Donaudampfschiff dummkopf"
Tokenizer to Filter: "Donaudampfschiff"(1), "dummkopf"(2),
Out: "Donaudampfschiff"(1), "Donau"(1), "dampf"(1), "schiff"(1), "dummkopf"(2), "dumm"(2), "kopf"(2)
Unicode Collation
Unicode Collation is a language-sensitive method of sorting text that can also be used for advanced search
purposes.
Unicode Collation in Solr is fast, because all the work is done at index time.
Rather than specifying an analyzer within <fieldType ... class="solr.TextField">, the
solr.CollationField and solr.ICUCollationField field type classes provide this functionality.
solr.ICUCollationField, which is backed by the ICU4J library, provides more flexible configuration, has
more locales, is significantly faster, and requires less memory and less index space, since its keys are smaller
than those produced by the JDK implementation that backs solr.CollationField.
To use solr.ICUCollationField, you must add additional .jars to Solr’s classpath (as described in the
section Resources and Plugins on the Filesystem). See solr/contrib/analysis-extras/README.txt for
instructions on which jars you need to add.
solr.ICUCollationField and solr.CollationField fields can be created in two ways:
• Based upon a system collator associated with a Locale.
• Based upon a tailored RuleBasedCollator ruleset.
Arguments for solr.ICUCollationField, specified as attributes within the <fieldType> element:
Using a System collator:
locale
(required) RFC 3066 locale ID. See the ICU locale explorer for a list of supported locales.
strength
Valid values are primary, secondary, tertiary, quaternary, or identical. See Comparison Levels in ICU
Collation Concepts for more information.
decomposition
Valid values are no or canonical. See Normalization in ICU Collation Concepts for more information.
Using a Tailored ruleset:
custom
(required) Path to a UTF-8 text file containing rules supported by the ICU RuleBasedCollator
strength
Valid values are primary, secondary, tertiary, quaternary, or identical. See Comparison Levels in ICU
Collation Concepts for more information.
decomposition
Valid values are no or canonical. See Normalization in ICU Collation Concepts for more information.
Expert options:
alternate
Valid values are shifted or non-ignorable. Can be used to ignore punctuation/whitespace.
caseLevel
(true/false) If true, in combination with strength="primary", accents are ignored but case is taken into
account. The default is false. See CaseLevel in ICU Collation Concepts for more information.
caseFirst
Valid values are lower or upper. Useful to control which is sorted first when case is not ignored.
numeric
(true/false) If true, digits are sorted according to numeric value, e.g., foobar-9 sorts before foobar-10. The
default is false.
variableTop
Single character or contraction. Controls what is variable for alternate.
Sorting Text for a Specific Language
In this example, text is sorted according to the default German rules provided by ICU4J.
Locales are typically defined as a combination of language and country, but you can specify just the
language if you want. For example, if you specify "de" as the language, you will get sorting that works well
for the German language. If you specify "de" as the language and "CH" as the country, you will get German
sorting specifically tailored for Switzerland.
...
...
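A field type and field definition sketch matching this description (the field and type names are illustrative):

<fieldType name="collatedGERMAN" class="solr.ICUCollationField" locale="de" strength="primary"/>
<field name="manuGERMAN" type="collatedGERMAN" indexed="false" stored="false" docValues="true"/>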
In the example above, we defined the strength as "primary". The strength of the collation determines how
strict the sort order will be, but it also depends upon the language. For example, in English, "primary"
strength ignores differences in case and accents.
Another example:
...
...
...
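A sketch consistent with the description and with the "city_sort" field used below (names other than city_sort are illustrative):

<fieldType name="polishCaseInsensitive" class="solr.ICUCollationField" locale="pl_PL" strength="secondary"/>
<field name="city_sort" type="polishCaseInsensitive" indexed="true" stored="false"/>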
The type will be used for the fields where the data contains Polish text. The "secondary" strength will ignore
case differences, but, unlike "primary" strength, a letter with diacritic(s) will be sorted differently from the
same base letter without diacritics.
An example using the "city_sort" field to sort:
q=*:*&fl=city&sort=city_sort+asc
Sorting Text for Multiple Languages
There are two approaches to supporting multiple languages: if there is a small list of languages you wish to
support, consider defining collated fields for each language and using copyField. However, adding a large
number of sort fields can increase disk and indexing costs. An alternative approach is to use the Unicode
default collator.
The Unicode default or ROOT locale has rules that are designed to work well for most languages. To use the
default locale, simply define the locale as the empty string. This Unicode default sort is still significantly
more advanced than the standard Solr sort.
Sorting Text with Custom Rules
You can define your own set of sorting rules. It’s easiest to take existing rules that are close to what you
want and customize them.
In the example below, we create a custom rule set for German called DIN 5007-2. This rule set treats umlauts
in German differently: it treats ö as equivalent to oe, ä as equivalent to ae, and ü as equivalent to ue. For
more information, see the ICU RuleBasedCollator javadocs.
This example shows how to create a custom rule set for solr.ICUCollationField and dump it to a file:
// get the default rules for Germany
// these are called DIN 5007-1 sorting
RuleBasedCollator baseCollator = (RuleBasedCollator) Collator.getInstance(new ULocale("de", "DE"
));
// define some tailorings, to make it DIN 5007-2 sorting.
// For example, this makes ö equivalent to oe
String DIN5007_2_tailorings =
"& ae , a\u0308 & AE , A\u0308"+
"& oe , o\u0308 & OE , O\u0308"+
"& ue , u\u0308 & UE , u\u0308";
// concatenate the default rules to the tailorings, and dump it to a String
RuleBasedCollator tailoredCollator = new RuleBasedCollator(baseCollator.getRules() +
DIN5007_2_tailorings);
String tailoredRules = tailoredCollator.getRules();
// write these to a file, be sure to use UTF-8 encoding!!!
FileOutputStream os = new FileOutputStream(new File("/solr_home/conf/customRules.dat"));
IOUtils.write(tailoredRules, os, "UTF-8");
This rule set can now be used for custom collation in Solr:
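A field type sketch that points at the dumped rules file (the type name is illustrative):

<fieldType name="customCollation" class="solr.ICUCollationField"
           custom="customRules.dat" strength="primary"/>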
JDK Collation
As mentioned above, ICU Unicode Collation is better in several ways than JDK Collation, but if you cannot use
ICU4J for some reason, you can use solr.CollationField.
The principles of JDK Collation are the same as those of ICU Collation; you just specify language, country
and variant arguments instead of the combined locale argument.
Arguments for solr.CollationField, specified as attributes within the <fieldType> element:
Using a System collator (see Oracle’s list of locales supported in Java):
language
(required) ISO-639 language code
country
ISO-3166 country code
variant
Vendor or browser-specific code
strength
Valid values are primary, secondary, tertiary or identical. See Java Collator javadocs for more
information.
decomposition
Valid values are no, canonical, or full. See Java Collator javadocs for more information.
Using a Tailored ruleset:
custom
(required) Path to a UTF-8 text file containing rules supported by the JDK RuleBasedCollator
strength
Valid values are primary, secondary, tertiary or identical. See Java Collator javadocs for more
information.
decomposition
Valid values are no, canonical, or full. See Java Collator javadocs for more information.
A solr.CollationField example:
...
...
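A sketch of an equivalent definition using the JDK-backed field type (names are illustrative):

<fieldType name="collatedGERMAN" class="solr.CollationField"
           language="de" country="DE" strength="primary"/>
<field name="manuGERMAN" type="collatedGERMAN" indexed="false" stored="false" docValues="true"/>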
ASCII & Decimal Folding Filters
ASCII Folding
This filter converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII
characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if one exists. Only those characters
with reasonable ASCII alternatives are converted.
This can increase recall by causing more matches. On the other hand, it can reduce precision because
language-specific character differences may be lost.
Factory class: solr.ASCIIFoldingFilterFactory
Arguments: None
Example:
In: "Björn Ångström"
Tokenizer to Filter: "Björn", "Ångström"
Out: "Bjorn", "Angstrom"
Decimal Digit Folding
This filter converts any character in the Unicode "Decimal Number" general category (Nd) into its
equivalent Basic Latin digit (0-9).
This can increase recall by causing more matches. On the other hand, it can reduce precision because
language-specific character differences may be lost.
Factory class: solr.DecimalDigitFilterFactory
Arguments: None
Example:
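A minimal sketch:

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.DecimalDigitFilterFactory"/>
</analyzer>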
OpenNLP Integration
The lucene/analysis/opennlp module provides OpenNLP integration via several analysis components: a
tokenizer, a part-of-speech tagging filter, a phrase chunking filter, and a lemmatization filter. In addition to
these analysis components, Solr also provides an update request processor to extract named entities - see
Update Processor Factories That Can Be Loaded as Plugins.
The OpenNLP Tokenizer must be used with all other OpenNLP analysis components, for
two reasons: first, the OpenNLP Tokenizer detects and marks the sentence boundaries
required by all the OpenNLP filters; and second, since the pre-trained OpenNLP models
used by these filters were trained using the corresponding language-specific sentence-detection/tokenization models, the same tokenization, using the same models, must be
used at runtime for optimal performance.
To use the OpenNLP components, you must add additional .jars to Solr’s classpath (as described in the
section Resources and Plugins on the Filesystem). See solr/contrib/analysis-extras/README.txt for
instructions on which jars you need to add.
OpenNLP Tokenizer
The OpenNLP Tokenizer takes two language-specific binary model files as parameters: a sentence detector
model and a tokenizer model. The last token in each sentence is flagged, so that following OpenNLP-based
filters can use this information to apply operations to tokens one sentence at a time. See the OpenNLP
website for information on downloading pre-trained models.
Factory class: solr.OpenNLPTokenizerFactory
Arguments:
sentenceModel
(required) The path of a language-specific OpenNLP sentence detection model file. See Resource and
Plugin Loading for more information.
tokenizerModel
(required) The path of a language-specific OpenNLP tokenization model file. See Resource and Plugin
Loading for more information.
Example:
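A configuration sketch (the model file names are assumptions; use the models you have downloaded):

<analyzer>
  <tokenizer class="solr.OpenNLPTokenizerFactory"
             sentenceModel="en-sent.bin"
             tokenizerModel="en-tokenizer.bin"/>
</analyzer>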
OpenNLP Part-Of-Speech Filter
This filter sets each token’s type attribute to the part of speech (POS) assigned by the configured model. See
the OpenNLP website for information on downloading pre-trained models.
Lucene currently does not index token types, so if you want to keep this information, you
have to preserve it either in a payload or as a synonym; see the examples below.
Factory class: solr.OpenNLPPOSFilterFactory
Arguments:
posTaggerModel
(required) The path of a language-specific OpenNLP POS tagger model file. See Resource and Plugin
Loading for more information.
Examples:
The OpenNLP tokenizer will tokenize punctuation, which is useful for following token filters, but ordinarily
you don’t want to include punctuation in your index, so the TypeTokenFilter (described here) is included in
the examples below, with stop.pos.txt containing the following:
stop.pos.txt
#
$
''
``
,
-LRB-
-RRB-
:
.
Index the POS for each token as a payload:
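A sketch of such a chain (the model file names are assumptions; the TypeTokenFilter drops the punctuation types listed in stop.pos.txt):

<analyzer>
  <tokenizer class="solr.OpenNLPTokenizerFactory"
             sentenceModel="en-sent.bin" tokenizerModel="en-tokenizer.bin"/>
  <filter class="solr.OpenNLPPOSFilterFactory" posTaggerModel="en-pos-maxent.bin"/>
  <filter class="solr.TypeAsPayloadTokenFilterFactory"/>
  <filter class="solr.TypeTokenFilterFactory" types="stop.pos.txt"/>
</analyzer>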
Index the POS for each token as a synonym, after prefixing the POS with "@" (see the TypeAsSynonymFilter
description):
Only index nouns - the keep.pos.txt file contains lines NN, NNS, NNP and NNPS:
OpenNLP Phrase Chunking Filter
This filter sets each token’s type attribute based on the output of an OpenNLP phrase chunking model. The
chunk labels replace the POS tags that previously were in each token’s type attribute. See the OpenNLP
website for information on downloading pre-trained models.
Prerequisite: the OpenNLP Tokenizer and the OpenNLP Part-Of-Speech Filter must precede this filter.
Lucene currently does not index token types, so if you want to keep this information, you
have to preserve it either in a payload or as a synonym; see the examples below.
Factory class: solr.OpenNLPChunkerFilterFactory
Arguments:
chunkerModel
(required) The path of a language-specific OpenNLP phrase chunker model file. See Resource and Plugin
Loading for more information.
Examples:
Index the phrase chunk label for each token as a payload:
Index the phrase chunk label for each token as a synonym, after prefixing it with "#" (see the
TypeAsSynonymFilter description):
OpenNLP Lemmatizer Filter
This filter replaces the text of each token with its lemma. Both a dictionary-based lemmatizer and a model-based lemmatizer are supported. If both are configured, the dictionary-based lemmatizer is tried first, and
then the model-based lemmatizer is consulted for out-of-vocabulary tokens. See the OpenNLP website for
information on downloading pre-trained models.
Factory class: solr.OpenNLPLemmatizerFilterFactory
Arguments:
Either dictionary or lemmatizerModel must be provided, and both may be provided - see the examples
below:
dictionary
(optional) The path of a lemmatization dictionary file. See Resource and Plugin Loading for more
information. The dictionary file must be encoded as UTF-8, with one entry per line, in the form
word[tab]lemma[tab]part-of-speech, e.g., wrote[tab]write[tab]VBD.
lemmatizerModel
(optional) The path of a language-specific OpenNLP lemmatizer model file. See Resource and Plugin
Loading for more information.
Examples:
Perform dictionary-based lemmatization, and fall back to model-based lemmatization for out-of-vocabulary
tokens (see the OpenNLP Part-Of-Speech Filter section above for information about using TypeTokenFilter
to avoid indexing punctuation):
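A sketch of such a chain (the model and dictionary file names are assumptions):

<analyzer>
  <tokenizer class="solr.OpenNLPTokenizerFactory"
             sentenceModel="en-sent.bin" tokenizerModel="en-tokenizer.bin"/>
  <filter class="solr.OpenNLPPOSFilterFactory" posTaggerModel="en-pos-maxent.bin"/>
  <filter class="solr.OpenNLPLemmatizerFilterFactory"
          dictionary="lemmas.txt" lemmatizerModel="en-lemmatizer.bin"/>
</analyzer>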
Perform dictionary-based lemmatization only:
Perform model-based lemmatization only, preserving the original token and emitting the lemma as a
synonym (see the KeywordRepeatFilterFactory description):
Language-Specific Factories
These factories are each designed to work with specific languages. The languages covered here are:
• Arabic
• Brazilian Portuguese
• Bulgarian
• Catalan
• Traditional Chinese
• Simplified Chinese
• Czech
• Danish
• Dutch
• Finnish
• French
• Galician
• German
• Greek
• Hebrew, Lao, Myanmar, Khmer
• Hindi
• Indonesian
• Italian
• Irish
• Japanese
• Latvian
• Norwegian
• Persian
• Polish
• Portuguese
• Romanian
• Russian
• Scandinavian
• Serbian
• Spanish
• Swedish
• Thai
• Turkish
• Ukrainian
Arabic
Solr provides support for the Light-10 (PDF) stemming algorithm, and Lucene includes an example stopword
list.
This algorithm defines both character normalization and stemming, so these are split into two filters to
provide more flexibility.
Factory classes: solr.ArabicStemFilterFactory, solr.ArabicNormalizationFilterFactory
Arguments: None
Example:
Brazilian Portuguese
This is a Java filter written specifically for stemming the Brazilian dialect of the Portuguese language. It uses
the Lucene class org.apache.lucene.analysis.br.BrazilianStemmer. Although that stemmer can be
configured to use a list of protected words (which should not be stemmed), this factory does not accept any
arguments to specify such a list.
Factory class: solr.BrazilianStemFilterFactory
Arguments: None
Example:
In: "praia praias"
Tokenizer to Filter: "praia", "praias"
Out: "pra", "pra"
Bulgarian
Solr includes a light stemmer for Bulgarian, following this algorithm (PDF), and Lucene includes an example
stopword list.
Factory class: solr.BulgarianStemFilterFactory
Arguments: None
Example:
Catalan
Solr can stem Catalan using the Snowball Porter Stemmer with an argument of language="Catalan". Solr
includes a set of contractions for Catalan, which can be stripped using solr.ElisionFilterFactory.
Factory class: solr.SnowballPorterFilterFactory
Arguments:
language
(required) stemmer language, "Catalan" in this case
Example:
In: "llengües llengua"
Tokenizer to Filter: "llengües"(1) "llengua"(2),
Out: "llengu"(1), "llengu"(2)
Traditional Chinese
The default configuration of the ICU Tokenizer is suitable for Traditional Chinese text. It follows the Word
Break rules from the Unicode Text Segmentation algorithm for non-Chinese text, and uses a dictionary to
segment Chinese words. To use this tokenizer, you must add additional .jars to Solr’s classpath (as described
in the section Resources and Plugins on the Filesystem). See the solr/contrib/analysis-extras/README.txt for information on which jars you need to add.
Standard Tokenizer can also be used to tokenize Traditional Chinese text. Following the Word Break rules
from the Unicode Text Segmentation algorithm, it produces one token per Chinese character. When
combined with CJK Bigram Filter, overlapping bigrams of Chinese characters are formed.
CJK Width Filter folds fullwidth ASCII variants into the equivalent Basic Latin forms.
Examples:
CJK Bigram Filter
Forms bigrams (overlapping 2-character sequences) of CJK characters that are generated from Standard
Tokenizer or ICU Tokenizer.
By default, all CJK characters produce bigrams, but finer grained control is available by specifying
orthographic type arguments han, hiragana, katakana, and hangul. When set to false, characters of the
corresponding type will be passed through as unigrams, and will not be included in any bigrams.
When a CJK character has no adjacent characters to form a bigram, it is output in unigram form. If you want
to always output both unigrams and bigrams, set the outputUnigrams argument to true.
In all cases, all non-CJK input is passed through unmodified.
Arguments:
han
(true/false) If false, Han (Chinese) characters will not form bigrams. Default is true.
hiragana
(true/false) If false, Hiragana (Japanese) characters will not form bigrams. Default is true.
katakana
(true/false) If false, Katakana (Japanese) characters will not form bigrams. Default is true.
hangul
(true/false) If false, Hangul (Korean) characters will not form bigrams. Default is true.
outputUnigrams
(true/false) If true, in addition to forming bigrams, all characters are also passed through as unigrams.
Default is false.
See the example under Traditional Chinese.
Simplified Chinese
For Simplified Chinese, Solr provides support for Chinese sentence and word segmentation with the HMM
Chinese Tokenizer. This component includes a large dictionary and segments Chinese text into words with
the Hidden Markov Model. To use this tokenizer, you must add additional .jars to Solr’s classpath (as
described in the section Resources and Plugins on the Filesystem). See the solr/contrib/analysis-extras/README.txt for information on which jars you need to add.
The default configuration of the ICU Tokenizer is also suitable for Simplified Chinese text. It follows the Word
Break rules from the Unicode Text Segmentation algorithm for non-Chinese text, and uses a dictionary to
segment Chinese words. To use this tokenizer, you must add additional .jars to Solr’s classpath (as described
in the section Resources and Plugins on the Filesystem). See the solr/contrib/analysis-extras/README.txt for information on which jars you need to add.
Also useful for Chinese analysis:
CJK Width Filter folds fullwidth ASCII variants into the equivalent Basic Latin forms, and folds halfwidth
Katakana variants into their equivalent fullwidth forms.
Examples:
HMM Chinese Tokenizer
For Simplified Chinese, Solr provides support for Chinese sentence and word segmentation with the
solr.HMMChineseTokenizerFactory in the analysis-extras contrib module. This component includes a
large dictionary and segments Chinese text into words with the Hidden Markov Model. To use this tokenizer,
you must add additional .jars to Solr’s classpath (as described in the section Resources and Plugins on the
Filesystem). See solr/contrib/analysis-extras/README.txt for instructions on which jars you need to
add.
Factory class: solr.HMMChineseTokenizerFactory
Arguments: None
Examples:
To use the default setup with fallback to English Porter stemmer for English words, use:
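A sketch of that default setup (this references the bundled analyzer class directly rather than building a tokenizer/filter chain):

<analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>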
Or to configure your own analysis setup, use the solr.HMMChineseTokenizerFactory along with your
custom filter setup. See an example of this in the Simplified Chinese section.
Czech
Solr includes a light stemmer for Czech, following this algorithm, and Lucene includes an example stopword
list.
Factory class: solr.CzechStemFilterFactory
Arguments: None
Example:
In: "prezidenští, prezidenta, prezidentského"
Tokenizer to Filter: "prezidenští", "prezidenta", "prezidentského"
Out: "preziden", "preziden", "preziden"
Danish
Solr can stem Danish using the Snowball Porter Stemmer with an argument of language="Danish".
Also relevant are the Scandinavian normalization filters.
Factory class: solr.SnowballPorterFilterFactory
Arguments:
language
(required) stemmer language, "Danish" in this case
Example:
In: "undersøg undersøgelse"
Tokenizer to Filter: "undersøg"(1) "undersøgelse"(2),
Out: "undersøg"(1), "undersøg"(2)
Dutch
Solr can stem Dutch using the Snowball Porter Stemmer with an argument of language="Dutch".
Factory class: solr.SnowballPorterFilterFactory
Arguments:
language
(required) stemmer language, "Dutch" in this case
Example:
In: "kanaal kanalen"
Tokenizer to Filter: "kanaal", "kanalen"
Out: "kanal", "kanal"
Finnish
Solr includes support for stemming Finnish, and Lucene includes an example stopword list.
Factory class: solr.FinnishLightStemFilterFactory
Arguments: None
Example:
In: "kala kalat"
Tokenizer to Filter: "kala", "kalat"
Out: "kala", "kala"
French
Elision Filter
Removes article elisions from a token stream. This filter can be useful for languages such as French, Catalan,
Italian, and Irish.
Factory class: solr.ElisionFilterFactory
Arguments:
articles
The pathname of a file that contains a list of articles, one per line, to be stripped. Articles are words such
as "le", which are commonly abbreviated, such as in l’avion (the plane). This file should include the
abbreviated form, which precedes the apostrophe. In this case, simply "l". If no articles attribute is
specified, a default set of French articles is used.
ignoreCase
(boolean) If true, the filter ignores the case of words when comparing them to the common word file.
Defaults to false
Example:
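A configuration sketch (the articles file name is an assumption; Solr's example configsets ship a similar lang/contractions_fr.txt):

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.ElisionFilterFactory" ignoreCase="true" articles="lang/contractions_fr.txt"/>
</analyzer>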
In: "L’histoire d’art"
Tokenizer to Filter: "L’histoire", "d’art"
Out: "histoire", "art"
French Light Stem Filter
Solr includes three stemmers for French: one in the solr.SnowballPorterFilterFactory, a lighter stemmer
called solr.FrenchLightStemFilterFactory, and an even less aggressive stemmer called
solr.FrenchMinimalStemFilterFactory. Lucene includes an example stopword list.
Factory classes: solr.FrenchLightStemFilterFactory, solr.FrenchMinimalStemFilterFactory
Arguments: None
Examples:
In: "le chat, les chats"
Tokenizer to Filter: "le", "chat", "les", "chats"
Out: "le", "chat", "le", "chat"
Galician
Solr includes a stemmer for Galician following this algorithm, and Lucene includes an example stopword list.
Factory class: solr.GalicianStemFilterFactory
Arguments: None
Example:
In: "felizmente Luzes"
Tokenizer to Filter: "felizmente", "luzes"
Out: "feliz", "luz"
German
Solr includes four stemmers for German: one in the solr.SnowballPorterFilterFactory
language="German", a stemmer called solr.GermanStemFilterFactory, a lighter stemmer called
solr.GermanLightStemFilterFactory, and an even less aggressive stemmer called
solr.GermanMinimalStemFilterFactory. Lucene includes an example stopword list.
Factory classes: solr.GermanStemFilterFactory, solr.GermanLightStemFilterFactory,
solr.GermanMinimalStemFilterFactory
Arguments: None
Examples:
In: "haus häuser"
Tokenizer to Filter: "haus", "häuser"
Out: "haus", "haus"
Greek
This filter converts uppercase letters in the Greek character set to the equivalent lowercase character.
Factory class: solr.GreekLowerCaseFilterFactory
Arguments: None
Use of custom charsets is no longer supported as of Solr 3.1. If you need to index text in
these encodings, please use Java’s character set conversion facilities (InputStreamReader,
etc.) during I/O, so that Lucene can analyze this text as Unicode instead.
Example:
Hindi
Solr includes support for stemming Hindi following this algorithm (PDF), support for common spelling
differences through the solr.HindiNormalizationFilterFactory, support for encoding differences
through the solr.IndicNormalizationFilterFactory following this algorithm, and Lucene includes an
example stopword list.
Factory classes: solr.IndicNormalizationFilterFactory, solr.HindiNormalizationFilterFactory,
solr.HindiStemFilterFactory
Arguments: None
Example:
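A sketch combining the three factories in their usual order (lowercasing first is a typical but not required choice):

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.IndicNormalizationFilterFactory"/>
  <filter class="solr.HindiNormalizationFilterFactory"/>
  <filter class="solr.HindiStemFilterFactory"/>
</analyzer>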
Indonesian
Solr includes support for stemming Indonesian (Bahasa Indonesia) following this algorithm (PDF), and
Lucene includes an example stopword list.
Factory class: solr.IndonesianStemFilterFactory
Arguments: None
Example:
In: "sebagai sebagainya"
Tokenizer to Filter: "sebagai", "sebagainya"
Out: "bagai", "bagai"
Italian
Solr includes two stemmers for Italian: one in the solr.SnowballPorterFilterFactory
language="Italian", and a lighter stemmer called solr.ItalianLightStemFilterFactory. Lucene includes
an example stopword list.
Factory class: solr.ItalianLightStemFilterFactory
Arguments: None
Example:
In: "propaga propagare propagamento"
Tokenizer to Filter: "propaga", "propagare", "propagamento"
Out: "propag", "propag", "propag"
Irish
Solr can stem Irish using the Snowball Porter Stemmer with an argument of language="Irish". Solr
includes solr.IrishLowerCaseFilterFactory, which can handle Irish-specific constructs. Solr also includes
a set of contractions for Irish which can be stripped using solr.ElisionFilterFactory.
Factory class: solr.SnowballPorterFilterFactory
Arguments:
language
(required) stemmer language, "Irish" in this case
Example:
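A sketch of an Irish analysis chain; the contractions file path assumes the lang/contractions_ga.txt resource from the default configsets:

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.ElisionFilterFactory" articles="lang/contractions_ga.txt"/>
  <filter class="solr.IrishLowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Irish"/>
</analyzer>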
In: "siopadóireacht síceapatacha b’fhearr m’athair"
Tokenizer to Filter: "siopadóireacht", "síceapatacha", "b’fhearr", "m’athair"
Out: "siopadóir", "síceapaite", "fearr", "athair"
Japanese
Solr includes support for analyzing Japanese, via the Lucene Kuromoji morphological analyzer, which
includes several analysis components - more details on each below:
• JapaneseIterationMarkCharFilter normalizes Japanese horizontal iteration marks (odoriji) to their
expanded form.
• JapaneseTokenizer tokenizes Japanese using morphological analysis, and annotates each term with
part-of-speech, base form (a.k.a. lemma), reading and pronunciation.
• JapaneseBaseFormFilter replaces original terms with their base forms (a.k.a. lemmas).
• JapanesePartOfSpeechStopFilter removes terms that have one of the configured parts-of-speech.
• JapaneseKatakanaStemFilter normalizes common katakana spelling variations ending in a long sound
character (U+30FC) by removing the long sound character.
Also useful for Japanese analysis, from lucene-analyzers-common:
• CJKWidthFilter folds fullwidth ASCII variants into the equivalent Basic Latin forms, and folds halfwidth
Katakana variants into their equivalent fullwidth forms.
Japanese Iteration Mark CharFilter
Normalizes horizontal Japanese iteration marks (odoriji) to their expanded form. Vertical iteration marks are
not supported.
Factory class: JapaneseIterationMarkCharFilterFactory
Arguments:
normalizeKanji
set to false to not normalize kanji iteration marks (default is true)
normalizeKana
set to false to not normalize kana iteration marks (default is true)
Japanese Tokenizer
Tokenizer for Japanese that uses morphological analysis, and annotates each term with part-of-speech, base
form (a.k.a. lemma), reading and pronunciation.
JapaneseTokenizer has a search mode (the default) that does segmentation useful for search: a heuristic is
used to segment compound terms into their constituent parts while also keeping the original compound
terms as synonyms.
Factory class: solr.JapaneseTokenizerFactory
Arguments:
mode
Use search mode to get a noun-decompounding effect useful for search. search mode improves
segmentation for search at the expense of part-of-speech accuracy. Valid values for mode are:
• normal: default segmentation
• search: segmentation useful for search (extra compound splitting)
• extended: search mode plus unigramming of unknown words (experimental)
For some applications it might be good to use search mode for indexing and normal mode for
queries to increase precision and prevent parts of compounds from being matched and highlighted.
userDictionary
filename for a user dictionary, which allows overriding the statistical model with your own entries for
segmentation, part-of-speech tags and readings without a need to specify weights. See
lang/userdict_ja.txt for a sample user dictionary file.
userDictionaryEncoding
user dictionary encoding (default is UTF-8)
discardPunctuation
set to false to keep punctuation, true to discard (the default)
Japanese Base Form Filter
Replaces original terms' text with the corresponding base form (lemma). (JapaneseTokenizer annotates
each term with its base form.)
Factory class: JapaneseBaseFormFilterFactory
(no arguments)
Japanese Part Of Speech Stop Filter
Removes terms with one of the configured parts-of-speech. JapaneseTokenizer annotates terms with parts-of-speech.
Factory class : JapanesePartOfSpeechStopFilterFactory
Arguments:
tags
filename for a list of parts-of-speech for which to remove terms; see conf/lang/stoptags_ja.txt in the
sample_techproducts_config configset for an example.
enablePositionIncrements
if luceneMatchVersion is 4.3 or earlier and enablePositionIncrements="false", no position holes will
be left by this filter when it removes tokens. This argument is invalid if luceneMatchVersion is 5.0 or
later.
Japanese Katakana Stem Filter
Normalizes common katakana spelling variations ending in a long sound character (U+30FC) by removing
the long sound character.
solr.CJKWidthFilterFactory should be specified prior to this filter to normalize half-width katakana to fullwidth.
Factory class: JapaneseKatakanaStemFilterFactory
Arguments:
minimumLength
terms below this length will not be stemmed. Default is 4, value must be 2 or more.
CJK Width Filter
Folds fullwidth ASCII variants into the equivalent Basic Latin forms, and folds halfwidth Katakana variants
into their equivalent fullwidth forms.
Factory class: CJKWidthFilterFactory
(no arguments)
Example:
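A sketch of a Japanese field type combining the components described above; the field type name and the lang/*.txt resource paths are those used by the default configsets and may differ in your setup:

<fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
  <analyzer>
    <tokenizer class="solr.JapaneseTokenizerFactory" mode="search" userDictionary="lang/userdict_ja.txt"/>
    <filter class="solr.JapaneseBaseFormFilterFactory"/>
    <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ja.txt"/>
    <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>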
Hebrew, Lao, Myanmar, Khmer
Lucene provides support, in addition to UAX#29 word break rules, for Hebrew’s use of the double and single
quote characters, and for segmenting Lao, Myanmar, and Khmer into syllables with the
solr.ICUTokenizerFactory in the analysis-extras contrib module. To use this tokenizer, you must add
additional .jars to Solr’s classpath (as described in the section Resources and Plugins on the Filesystem). See
solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add.
See the ICUTokenizer for more information.
Latvian
Solr includes support for stemming Latvian, and Lucene includes an example stopword list.
Factory class: solr.LatvianStemFilterFactory
Arguments: None
Example:
In: "tirgiem tirgus"
Tokenizer to Filter: "tirgiem", "tirgus"
Out: "tirg", "tirg"
Norwegian
Solr includes two classes for stemming Norwegian, NorwegianLightStemFilterFactory and
NorwegianMinimalStemFilterFactory. Lucene includes an example stopword list.
Another option is to use the Snowball Porter Stemmer with an argument of language="Norwegian".
Also relevant are the Scandinavian normalization filters.
Norwegian Light Stemmer
The NorwegianLightStemFilterFactory requires a "two-pass" sort for the -dom and -het endings. This
means that in the first pass the word "kristendom" is stemmed to "kristen", and then all the general rules
apply so it will be further stemmed to "krist". The effect of this is that "kristen," "kristendom,"
"kristendommen," and "kristendommens" will all be stemmed to "krist."
The second pass is to pick up -dom and -het endings. Consider this example:
One pass                        Two passes
Before            After         Before            After
forlegen          forleg        forlegen          forleg
forlegenhet       forlegen      forlegenhet       forleg
forlegenheten     forlegen      forlegenheten     forleg
forlegenhetens    forlegen      forlegenhetens    forleg
firkantet         firkant       firkantet         firkant
firkantethet      firkantet     firkantethet      firkant
firkantetheten    firkantet     firkantetheten    firkant
Factory class: solr.NorwegianLightStemFilterFactory
Arguments:
variant
Choose the Norwegian language variant to use. Valid values are:
• nb: Bokmål (default)
• nn: Nynorsk
• no: both
Example:
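A minimal sketch using the Bokmål variant:

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.NorwegianLightStemFilterFactory" variant="nb"/>
</analyzer>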
In: "Forelskelsen"
Tokenizer to Filter: "forelskelsen"
Out: "forelske"
Norwegian Minimal Stemmer
The NorwegianMinimalStemFilterFactory stems plural forms of Norwegian nouns only.
Factory class: solr.NorwegianMinimalStemFilterFactory
Arguments:
variant
Choose the Norwegian language variant to use. Valid values are:
• nb: Bokmål (default)
• nn: Nynorsk
• no: both
Example:
In: "Bilens"
Tokenizer to Filter: "bilens"
Out: "bil"
Persian
Persian Filter Factories
Solr includes support for normalizing Persian, and Lucene includes an example stopword list.
Factory class: solr.PersianNormalizationFilterFactory
Arguments: None
Example:
Polish
Solr provides support for Polish stemming with the solr.StempelPolishStemFilterFactory, and
solr.MorfologikFilterFactory for lemmatization, in the contrib/analysis-extras module. The
solr.StempelPolishStemFilterFactory component includes an algorithmic stemmer with tables for Polish.
To use either of these filters, you must add additional .jars to Solr’s classpath (as described in the section
Resources and Plugins on the Filesystem). See solr/contrib/analysis-extras/README.txt for instructions
on which jars you need to add.
Factory classes: solr.StempelPolishStemFilterFactory and solr.MorfologikFilterFactory
Arguments: None
Example:
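Two minimal sketches, one per factory; the Morfologik filter falls back to its built-in Polish dictionary when no dictionary attribute is given:

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.StempelPolishStemFilterFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.MorfologikFilterFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>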
In: ""studenta studenci"
Tokenizer to Filter: "studenta", "studenci"
Out: "student", "student"
More information about the Stempel stemmer is available in the Lucene javadocs.
Note the lower case filter is applied after the Morfologik stemmer; this is because the Polish dictionary
contains proper names and then proper term case may be important to resolve disambiguities (or even
lookup the correct lemma at all).
The Morfologik dictionary parameter value is a constant specifying which dictionary to choose. The
dictionary resource must be named path/to/language.dict and have an associated .info metadata file.
See the Morfologik project for details. If the dictionary attribute is not provided, the Polish dictionary is
loaded and used by default.
Portuguese
Solr includes four stemmers for Portuguese: one in the solr.SnowballPorterFilterFactory, an alternative
stemmer called solr.PortugueseStemFilterFactory, a lighter stemmer called
solr.PortugueseLightStemFilterFactory, and an even less aggressive stemmer called
solr.PortugueseMinimalStemFilterFactory. Lucene includes an example stopword list.
Factory classes: solr.PortugueseStemFilterFactory, solr.PortugueseLightStemFilterFactory,
solr.PortugueseMinimalStemFilterFactory
Arguments: None
Example:
In: "praia praias"
Tokenizer to Filter: "praia", "praias"
Out: "pra", "pra"
Romanian
Solr can stem Romanian using the Snowball Porter Stemmer with an argument of language="Romanian".
Factory class: solr.SnowballPorterFilterFactory
Arguments:
language
(required) stemmer language, "Romanian" in this case
Example:
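A minimal sketch:

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Romanian"/>
</analyzer>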
Russian
Russian Stem Filter
Solr includes two stemmers for Russian: one in the solr.SnowballPorterFilterFactory
language="Russian", and a lighter stemmer called solr.RussianLightStemFilterFactory. Lucene includes
an example stopword list.
Factory class: solr.RussianLightStemFilterFactory
Arguments: None
Example:
Scandinavian
Scandinavian is a language group spanning three closely related languages: Norwegian, Swedish, and Danish.
Swedish å, ä, ö are in fact the same letters as Norwegian and Danish å, æ, ø and thus interchangeable when
used between these languages. They are however folded differently when people type them on a keyboard
lacking these characters.
In that situation almost all Swedish people use a, a, o instead of å, ä, ö. Norwegians and Danes on the other
hand usually type aa, ae and oe instead of å, æ and ø. Some do however use a, a, o, oo, ao and sometimes
permutations of everything above.
There are two filters for helping with normalization between Scandinavian languages: one is
solr.ScandinavianNormalizationFilterFactory trying to preserve the special characters (æäöå) and
another solr.ScandinavianFoldingFilterFactory which folds these to the more broad ø/ö->o, etc.
See also each language section for other relevant filters.
Scandinavian Normalization Filter
This filter normalizes use of the interchangeable Scandinavian characters æÆäÄöÖøØ and folded variants
(aa, ao, ae, oe and oo) by transforming them to åÅæÆøØ.
It’s a semantically less destructive solution than ScandinavianFoldingFilter, most useful when a person
with a Norwegian or Danish keyboard queries a Swedish index and vice versa. This filter does not perform
the common Swedish folds of å and ä to a nor ö to o.
Factory class: solr.ScandinavianNormalizationFilterFactory
Arguments: None
Example:
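A minimal sketch:

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.ScandinavianNormalizationFilterFactory"/>
</analyzer>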
In: "blåbærsyltetøj blåbärsyltetöj blaabaarsyltetoej blabarsyltetoj"
Tokenizer to Filter: "blåbærsyltetøj", "blåbärsyltetöj", "blaabaersyltetoej", "blabarsyltetoj"
Out: "blåbærsyltetøj", "blåbærsyltetøj", "blåbærsyltetøj", "blabarsyltetoj"
Scandinavian Folding Filter
This filter folds Scandinavian characters åÅäæÄÆ->a and öÖøØ->o. It also discriminates against use of double
vowels aa, ae, ao, oe and oo, leaving just the first one.
It’s a semantically more destructive solution than ScandinavianNormalizationFilter, but can in addition
help with matching raksmorgas as räksmörgås.
Factory class: solr.ScandinavianFoldingFilterFactory
Arguments: None
Example:
In: "blåbærsyltetøj blåbärsyltetöj blaabaarsyltetoej blabarsyltetoj"
Tokenizer to Filter: "blåbærsyltetøj", "blåbärsyltetöj", "blaabaersyltetoej", "blabarsyltetoj"
Out: "blabarsyltetoj", "blabarsyltetoj", "blabarsyltetoj", "blabarsyltetoj"
Serbian
Serbian Normalization Filter
Solr includes a filter that normalizes Serbian Cyrillic and Latin characters. Note that this filter only works with
lowercased input.
See the Solr wiki for tips & advice on using this filter: https://wiki.apache.org/solr/SerbianLanguageSupport
Factory class: solr.SerbianNormalizationFilterFactory
Arguments:
haircut
Select the extent of normalization. Valid values are:
• bald: (Default behavior) Cyrillic characters are first converted to Latin; then, Latin characters have
their diacritics removed, with the exception of LATIN SMALL LETTER D WITH STROKE (U+0111) which is
converted to “dj”
• regular: Only Cyrillic to Latin normalization will be applied, preserving the Latin diacritics
Example:
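A minimal sketch; lowercasing comes first because, as noted above, this filter only works with lowercased input:

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SerbianNormalizationFilterFactory" haircut="bald"/>
</analyzer>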
Spanish
Solr includes two stemmers for Spanish: one in the solr.SnowballPorterFilterFactory
language="Spanish", and a lighter stemmer called solr.SpanishLightStemFilterFactory. Lucene includes
an example stopword list.
Factory class: solr.SpanishLightStemFilterFactory
Arguments: None
Example:
In: "torear toreara torearlo"
Tokenizer to Filter: "torear", "toreara", "torearlo"
Out: "tor", "tor", "tor"
Swedish
Swedish Stem Filter
Solr includes two stemmers for Swedish: one in the solr.SnowballPorterFilterFactory
language="Swedish", and a lighter stemmer called solr.SwedishLightStemFilterFactory. Lucene includes
an example stopword list.
Also relevant are the Scandinavian normalization filters.
Factory class: solr.SwedishLightStemFilterFactory
Arguments: None
Example:
In: "kloke klokhet klokheten"
Tokenizer to Filter: "kloke", "klokhet", "klokheten"
Out: "klok", "klok", "klok"
Thai
This filter converts sequences of Thai characters into individual Thai words. Unlike European languages, Thai
does not use whitespace to delimit words.
Factory class: solr.ThaiTokenizerFactory
Arguments: None
Example:
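A minimal sketch:

<analyzer>
  <tokenizer class="solr.ThaiTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>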
Turkish
Solr includes support for stemming Turkish with the solr.SnowballPorterFilterFactory; support for case-insensitive search with the solr.TurkishLowerCaseFilterFactory; support for stripping apostrophes and following suffixes with solr.ApostropheFilterFactory (see Role of Apostrophes in Turkish Information Retrieval); support for a form of stemming that truncates tokens at a configurable maximum length through the solr.TruncateTokenFilterFactory (see Information Retrieval on Turkish Texts); and Lucene includes an example stopword list.
Factory class: solr.TurkishLowerCaseFilterFactory
Arguments: None
Example:
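A sketch of a Turkish chain combining the filters listed above:

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.ApostropheFilterFactory"/>
  <filter class="solr.TurkishLowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Turkish"/>
</analyzer>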
Another example, illustrating diacritics-insensitive search:
Ukrainian
Solr provides support for Ukrainian lemmatization with the solr.MorfologikFilterFactory, in the contrib/analysis-extras module. To use this filter, you must add additional .jars to Solr’s classpath (as described in the section Resources and Plugins on the Filesystem). See solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add.
Lucene also includes an example Ukrainian stopword list, in the lucene-analyzers-morfologik jar.
Factory class: solr.MorfologikFilterFactory
Arguments:
dictionary
(required) lemmatizer dictionary - the lucene-analyzers-morfologik jar contains a Ukrainian dictionary
at org/apache/lucene/analysis/uk/ukrainian.dict.
Example:
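A minimal sketch using the dictionary bundled in the lucene-analyzers-morfologik jar:

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.MorfologikFilterFactory" dictionary="org/apache/lucene/analysis/uk/ukrainian.dict"/>
</analyzer>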
The Morfologik dictionary parameter value is a constant specifying which dictionary to choose. The
dictionary resource must be named path/to/language.dict and have an associated .info metadata file.
See the Morfologik project for details. If the dictionary attribute is not provided, the Polish dictionary is
loaded and used by default.
Phonetic Matching
Phonetic matching algorithms may be used to encode tokens so that two different spellings that are
pronounced similarly will match.
For overviews of and comparisons between algorithms, see http://en.wikipedia.org/wiki/Phonetic_algorithm
and http://ntz-develop.blogspot.com/2011/03/phonetic-algorithms.html
Beider-Morse Phonetic Matching (BMPM)
For examples of how to use this encoding in your analyzer, see Beider Morse Filter in the Filter Descriptions
section.
Beider-Morse Phonetic Matching (BMPM) is a "soundalike" tool that lets you search using a new phonetic
matching system. BMPM helps you search for personal names (or just surnames) in a Solr/Lucene index,
and is far superior to the existing phonetic codecs, such as regular soundex, metaphone, caverphone, etc.
In general, phonetic matching lets you search a name list for names that are phonetically equivalent to the
desired name. BMPM is similar to a soundex search in that an exact spelling is not required. Unlike soundex,
it does not generate a large quantity of false hits.
From the spelling of the name, BMPM attempts to determine the language. It then applies phonetic rules for
that particular language to transliterate the name into a phonetic alphabet. If it is not possible to determine
the language with a fair degree of certainty, it uses generic phonetic rules instead. Finally, it applies language-independent rules regarding such things as voiced and unvoiced consonants and vowels to further ensure the reliability of the matches.
For example, assume that the matches found when searching for Stephen in a database are "Stefan",
"Steph", "Stephen", "Steve", "Steven", "Stove", and "Stuffin". "Stefan", "Stephen", and "Steven" are
probably relevant, and are names that you want to see. "Stuffin", however, is probably not relevant. Also
rejected were "Steph", "Steve", and "Stove". Of those, "Stove" is probably not one that we would have
wanted. But "Steph" and "Steve" are possibly ones that you might be interested in.
For Solr, BMPM searching is available for the following languages:
• English
• French
• German
• Greek
• Hebrew written in Hebrew letters
• Hungarian
• Italian
• Polish
• Romanian
• Russian written in Cyrillic letters
• Russian transliterated into English letters
• Spanish
• Turkish
The name matching is also applicable to non-Jewish surnames from the countries in which those languages
are spoken.
For more information, see here: http://stevemorse.org/phoneticinfo.htm and
http://stevemorse.org/phonetics/bmpm.htm.
Daitch-Mokotoff Soundex
To use this encoding in your analyzer, see Daitch-Mokotoff Soundex Filter in the Filter Descriptions section.
The Daitch-Mokotoff Soundex algorithm is a refinement of the Russel and American Soundex algorithms,
yielding greater accuracy in matching especially Slavic and Yiddish surnames with similar pronunciation but
differences in spelling.
The main differences compared to the other soundex variants are:
• coded names are 6 digits long
• initial character of the name is coded
• rules to encode multi-character n-grams
• multiple possible encodings for the same name (branching)
Note: the implementation used by Solr (commons-codec’s DaitchMokotoffSoundex) has additional
branching rules compared to the original description of the algorithm.
For more information, see http://en.wikipedia.org/wiki/Daitch%E2%80%93Mokotoff_Soundex and
http://www.avotaynu.com/soundex.htm
Double Metaphone
To use this encoding in your analyzer, see Double Metaphone Filter in the Filter Descriptions section.
Alternatively, you may specify encoder="DoubleMetaphone" with the Phonetic Filter, but note that the
Phonetic Filter version will not provide the second ("alternate") encoding that is generated by the Double
Metaphone Filter for some tokens.
Encodes tokens using the double metaphone algorithm by Lawrence Philips. See the original article at
http://www.drdobbs.com/the-double-metaphone-search-algorithm/184401251?pgno=2
Metaphone
To use this encoding in your analyzer, specify encoder="Metaphone" with the Phonetic Filter.
Encodes tokens using the Metaphone algorithm by Lawrence Philips, described in "Hanging on the
Metaphone" in Computer Language, Dec. 1990.
Another reference for more information is Double Metaphone Search Algorithm, by Lawrence Philips.
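As a sketch, a phonetic field type might pair the Phonetic Filter with a standard tokenizer; inject="true" keeps the original token alongside its encoding (the field type name is illustrative):

<fieldType name="text_phonetic" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PhoneticFilterFactory" encoder="Metaphone" inject="true"/>
  </analyzer>
</fieldType>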
Soundex
To use this encoding in your analyzer, specify encoder="Soundex" with the Phonetic Filter.
Encodes tokens using the Soundex algorithm, which is used to relate similar names, but can also be used as
a general purpose scheme to find words with similar phonemes.
See also http://en.wikipedia.org/wiki/Soundex.
Refined Soundex
To use this encoding in your analyzer, specify encoder="RefinedSoundex" with the Phonetic Filter.
Encodes tokens using an improved version of the Soundex algorithm.
See http://en.wikipedia.org/wiki/Soundex.
Caverphone
To use this encoding in your analyzer, specify encoder="Caverphone" with the Phonetic Filter.
Caverphone is an algorithm created by the Caversham Project at the University of Otago. The algorithm is
optimised for accents present in the southern part of the city of Dunedin, New Zealand.
See http://en.wikipedia.org/wiki/Caverphone and the Caverphone 2.0 specification at
http://caversham.otago.ac.nz/files/working/ctp150804.pdf
Kölner Phonetik a.k.a. Cologne Phonetic
To use this encoding in your analyzer, specify encoder="ColognePhonetic" with the Phonetic Filter.
The Kölner Phonetik, an algorithm published by Hans Joachim Postel in 1969, is optimized for the German
language.
See http://de.wikipedia.org/wiki/K%C3%B6lner_Phonetik
NYSIIS
To use this encoding in your analyzer, specify encoder="Nysiis" with the Phonetic Filter.
NYSIIS is an encoding used to relate similar names, but can also be used as a general purpose scheme to
find words with similar phonemes.
See http://en.wikipedia.org/wiki/NYSIIS and http://www.dropby.com/NYSIIS.html
Running Your Analyzer
Once you’ve defined a field type in your Schema, and specified the analysis steps that you want applied to it,
you should test it out to make sure that it behaves the way you expect it to.
Luckily, there is a very handy page in the Solr admin interface that lets you do just that. You can invoke the
analyzer for any text field, provide sample input, and display the resulting token stream.
For example, let’s look at some of the "Text" field types available in the bin/solr -e techproducts
example configuration, and use the Analysis Screen
(http://localhost:8983/solr/#/techproducts/analysis) to compare how the tokens produced at index
time for the sentence "Running an Analyzer" match up with a slightly different query text of "run my
analyzer"
We can begin with “text_ws” - one of the most simplified Text field types available:
By looking at the start and end positions for each term, we can see that the only thing this field type does is
tokenize text on whitespace. Notice in this image that the term "Running" has a start position of 0 and an
end position of 7, while "an" has a start position of 8 and an end position of 10, and "Analyzer" starts at 11
and ends at 19. The gaps in those offsets (7 to 8 and 10 to 11) correspond to the whitespace characters, which the tokenizer discards rather than emitting as terms.
Note also that the indexed terms and the query terms are still very different. "Running" doesn’t match
"run", "Analyzer" doesn’t match "analyzer" (to a computer), and obviously "an" and "my" are totally
different words. If our objective is to allow queries like "run my analyzer" to match indexed text like
"Running an Analyzer" then we will evidently need to pick a different field type with index & query time text
analysis that does more processing of the inputs.
In particular we will want:
• Case insensitivity, so "Analyzer" and "analyzer" match.
• Stemming, so words like "Run" and "Running" are considered equivalent terms.
• Stop Word Pruning, so small words like "an" and "my" don’t affect the query.
For our next attempt, let’s try the “text_general” field type:
With the verbose output enabled, we can see how each stage of our new analyzers modifies the tokens it receives before passing them on to the next stage. As we scroll down to the final output, we can see that we
do start to get a match on "analyzer" from each input string, thanks to the "LCF" stage — which if you hover
over with your mouse, you’ll see is the “LowerCaseFilter”:
The “text_general” field type is designed to be generally useful for any language, and it has definitely
gotten us closer to our objective than “text_ws” from our first example by solving the problem of case
sensitivity. It’s still not quite what we are looking for because we don’t see stemming or stopword rules
being applied. So now let us try the “text_en” field type:
Now we can see the "SF" (StopFilter) stage of the analyzers solving the problem of removing Stop Words
("an"), and as we scroll down, we also see the "PSF" (PorterStemFilter) stage apply stemming rules
suitable for our English language input, such that the terms produced by our "index analyzer" and the terms
produced by our "query analyzer" match the way we expect.
At this point, we can continue to experiment with additional inputs, verifying that our analyzers produce
matching tokens when we expect them to match, and disparate tokens when we do not expect them to
match, as we iterate and tweak our field type configuration.
Indexing and Basic Data Operations
This section describes how Solr adds data to its index. It covers the following topics:
• Introduction to Solr Indexing: An overview of Solr’s indexing process.
• Post Tool: Information about using post.jar to quickly upload some content to your system.
• Uploading Data with Index Handlers: Information about using Solr’s Index Handlers to upload
XML/XSLT, JSON and CSV data.
• Transforming and Indexing Custom JSON: Index any JSON of your choice
• Uploading Data with Solr Cell using Apache Tika: Information about using the Solr Cell framework to
upload data for indexing.
• Uploading Structured Data Store Data with the Data Import Handler: Information about uploading
and indexing data from a structured data store.
• Updating Parts of Documents: Information about how to use atomic updates and optimistic
concurrency with Solr.
• Detecting Languages During Indexing: Information about using language identification during the
indexing process.
• De-Duplication: Information about configuring Solr to mark duplicate documents as they are indexed.
• Content Streams: Information about streaming content to Solr Request Handlers.
Indexing Using Client APIs
Using client APIs, such as SolrJ, from your applications is an important option for updating Solr indexes. See
the Client APIs section for more information.
Introduction to Solr Indexing
This section describes the process of indexing: adding content to a Solr index and, if necessary, modifying
that content or deleting it.
By adding content to an index, we make it searchable by Solr.
A Solr index can accept data from many different sources, including XML files, comma-separated value (CSV)
files, data extracted from tables in a database, and files in common file formats such as Microsoft Word or
PDF.
Here are the three most common ways of loading data into a Solr index:
• Using the Solr Cell framework built on Apache Tika for ingesting binary files or structured files such as
Office, Word, PDF, and other proprietary formats.
• Uploading XML files by sending HTTP requests to the Solr server from any environment where such
requests can be generated.
• Writing a custom Java application to ingest data through Solr’s Java Client API (which is described in
more detail in Client APIs). Using the Java API may be the best choice if you’re working with an
application, such as a Content Management System (CMS), that offers a Java API.
Regardless of the method used to ingest data, there is a common basic data structure for data being fed
into a Solr index: a document containing multiple fields, each with a name and containing content, which may
be empty. One of the fields is usually designated as a unique ID field (analogous to a primary key in a
database), although the use of a unique ID field is not strictly required by Solr.
If the field name is defined in the Schema that is associated with the index, then the analysis steps
associated with that field will be applied to its content when the content is tokenized. Fields that are not
explicitly defined in the Schema will either be ignored or mapped to a dynamic field definition (see
Documents, Fields, and Schema Design), if one matching the field name exists.
For more information on indexing in Solr, see the Solr Wiki.
The Solr Example Directory
When starting Solr with the "-e" option, the example/ directory will be used as base directory for the
example Solr instances that are created. This directory also includes an example/exampledocs/ subdirectory
containing sample documents in a variety of formats that you can use to experiment with indexing into the
various examples.
The curl Utility for Transferring Files
Many of the instructions and examples in this section make use of the curl utility for transferring content
through a URL. curl posts and retrieves data over HTTP, FTP, and many other protocols. Most Linux
distributions include a copy of curl. You’ll find curl downloads for Linux, Windows, and many other
operating systems at http://curl.haxx.se/download.html. Documentation for curl is available here:
http://curl.haxx.se/docs/manpage.html.
Using curl or other command line tools for posting data is just fine for examples or tests,
but it’s not the recommended method for achieving the best performance for updates in
production environments. You will achieve better performance with Solr Cell or the other
methods described in this section.
Instead of curl, you can use utilities such as GNU wget (http://www.gnu.org/software/
wget/) or manage GETs and POSTS with Perl, although the command line options will
differ.
Post Tool
Solr includes a simple command line tool for POSTing various types of content to a Solr server.
The tool is bin/post. The bin/post tool is a Unix shell script; for Windows (non-Cygwin) usage, see the
section Post Tool Windows Support below.
To run it, open a window and enter:
bin/post -c gettingstarted example/films/films.json
This will contact the server at localhost:8983. Specifying the collection/core name is mandatory. The
-help (or simply -h) option will output information on its usage (i.e., bin/post -help).
Using the bin/post Tool
Specifying either the collection/core name or the full update url is mandatory when using bin/post.
The basic usage of bin/post is:
$ bin/post -h
Usage: post -c <collection> [OPTIONS] <files|directories|urls|-d ["...",...]>
    or post -help

   collection name defaults to DEFAULT_SOLR_COLLECTION if not specified

OPTIONS
=======
  Solr options:
    -url <base Solr update URL> (overrides collection, host, and port)
    -host <host> (default: localhost)
    -p or -port <port> (default: 8983)
    -commit yes|no (default: yes)
    -u or -user <user:pass> (sets BasicAuth credentials)

  Web crawl options:
    -recursive <depth> (default: 1)
    -delay <seconds> (default: 10)

  Directory crawl options:
    -delay <seconds> (default: 0)

  stdin/args options:
    -type <content/type> (default: application/xml)

  Other options:
    -filetypes <type>[,<type>,...] (default:
xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log)
    -params "<key>=<value>[&<key>=<value>...]" (values must be URL-encoded; these pass through to
Solr update request)
    -out yes|no (default: no; yes outputs Solr response to console)
    ...
Examples Using bin/post
There are several ways to use bin/post. This section presents several examples.
Indexing XML
Add all documents with file extension .xml to collection or core named gettingstarted.
bin/post -c gettingstarted *.xml
Add all documents with file extension .xml to the gettingstarted collection/core on Solr running on port
8984.
bin/post -c gettingstarted -p 8984 *.xml
Send XML arguments to delete a document from gettingstarted.
bin/post -c gettingstarted -d '<delete><id>42</id></delete>'
Indexing CSV
Index all CSV files into gettingstarted:
bin/post -c gettingstarted *.csv
Index a tab-separated file into gettingstarted:
bin/post -c signals -params "separator=%09" -type text/csv data.tsv
The content type (-type) parameter is required to treat the file as the proper type, otherwise it will be
ignored and a WARNING logged as it does not know what type of content a .tsv file is. The CSV handler
supports the separator parameter, and is passed through using the -params setting.
Indexing JSON
Index all JSON files into gettingstarted.
bin/post -c gettingstarted *.json
Indexing Rich Documents (PDF, Word, HTML, etc.)
Index a PDF file into gettingstarted.
bin/post -c gettingstarted a.pdf
Automatically detect content types in a folder, and recursively scan it for documents for indexing into
gettingstarted.
bin/post -c gettingstarted afolder/
Automatically detect content types in a folder, but limit it to PPT and HTML files and index into
gettingstarted.
bin/post -c gettingstarted -filetypes ppt,html afolder/
Indexing to a Password Protected Solr (Basic Auth)
Index a PDF as the user "solr" with password "SolrRocks":
bin/post -u solr:SolrRocks -c gettingstarted a.pdf
Post Tool Windows Support
bin/post currently exists only as a Unix shell script; however, it delegates its work to a cross-platform Java program. The SimplePostTool can be run directly in supported environments, including Windows.
SimplePostTool
The bin/post script currently delegates to a standalone Java program called SimplePostTool.
This tool, bundled into an executable JAR, can be run directly using java -jar
example/exampledocs/post.jar. See the help output and take it from there to post files, recurse a website
or file system folder, or send direct commands to a Solr server.
$ java -jar example/exampledocs/post.jar -h
SimplePostTool version 5.0.0
Usage: java [SystemProperties] -jar post.jar [-h|-] [<file|folder|url|arg> [<file|folder|url|arg>...]]
.
.
.
Uploading Data with Index Handlers
Index Handlers are Request Handlers designed to add, delete, and update documents in the index. In
addition to having plugins for importing rich documents using Tika or from structured data sources using
the Data Import Handler, Solr natively supports indexing structured documents in XML, CSV and JSON.
The recommended way to configure and use request handlers is with path based names that map to paths
in the request url. However, request handlers can also be specified with the qt (query type) parameter if the
requestDispatcher is appropriately configured. It is possible to access the same handler using more than
one name, which can be useful if you wish to specify different sets of default options.
A single unified update request handler supports XML, CSV, JSON, and javabin update requests, delegating
to the appropriate ContentStreamLoader based on the Content-Type of the ContentStream.
UpdateRequestHandler Configuration
The default configuration file has the update request handler configured by default.
XML Formatted Index Updates
Index update commands can be sent as XML message to the update handler using Content-type:
application/xml or Content-type: text/xml.
Adding Documents
The XML schema recognized by the update handler for adding documents is very straightforward:
• The <add> element introduces one or more documents to be added.
• The <doc> element introduces the fields making up a document.
• The <field> element presents the content for a specific field.
For example:
<add>
  <doc>
    <!-- field names here are illustrative; use fields defined in your schema -->
    <field name="authors">Patrick Eagar</field>
    <field name="subject">Sports</field>
    <field name="dd">796.35</field>
    <field name="numpages">128</field>
    <field name="price">12.40</field>
    <field name="title">Summer of the all-rounder: Test and championship cricket in England 1982</field>
    <field name="isbn">0002166313</field>
    <field name="yearpub">1982</field>
    <field name="publisher">Collins</field>
  </doc>
  ...
</add>
The add command supports some optional attributes which may be specified.
commitWithin
Add the document within the specified number of milliseconds.
overwrite
Default is true. Indicates if the unique key constraints should be checked to overwrite previous versions
of the same document (see below).
If the document schema defines a unique key, then by default an /update operation to add a document will
overwrite (i.e., replace) any document in the index with the same unique key. If no unique key has been
defined, indexing performance is somewhat faster, as no check has to be made for an existing documents to
replace.
If you have a unique key field, but you feel confident that you can safely bypass the uniqueness check (e.g.,
you build your indexes in batch, and your indexing code guarantees it never adds the same document more
than once) you can specify the overwrite="false" option when adding your documents.
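For example, a sketch of an add command that applies both attributes (the field name is illustrative):

<add commitWithin="10000" overwrite="false">
  <doc>
    <field name="id">mydoc-1</field>
  </doc>
</add>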
XML Update Commands
Commit and Optimize During Updates
The <commit> operation writes all documents loaded since the last commit to one or more segment files on the disk. Before a commit has been issued, newly indexed content is not visible to searches. The commit operation opens a new searcher, and triggers any event listeners that have been configured.

Commits may be issued explicitly with a <commit/> message, and can also be triggered from <autoCommit> parameters in solrconfig.xml.

The <optimize> operation requests Solr to merge internal data structures. For a large index, optimization will take some time to complete, but by merging many small segment files into larger segments, search
performance may improve. If you are using Solr’s replication mechanism to distribute searches across many
systems, be aware that after an optimize, a complete index will need to be transferred.
You should only consider using optimize on static indexes, i.e., indexes that can be
optimized as part of the regular update process (say once-a-day updates). Applications
requiring NRT functionality should not use optimize.
The <commit> and <optimize> elements accept these optional attributes:
waitSearcher
Default is true. Blocks until a new searcher is opened and registered as the main query searcher, making
the changes visible.
expungeDeletes
(commit only) Default is false. Merges segments that have more than 10% deleted docs, expunging the
deleted documents in the process. Resulting segments will respect maxMergedSegmentMB.
expungeDeletes is "less expensive" than optimize, but the same warnings apply.
maxSegments
(optimize only) Default is unlimited, resulting segments respect the maxMergedSegmentMB setting. Makes a
best effort attempt to merge the segments down to no more than this number of segments but does not
guarantee that the goal will be achieved. Unless there is tangible evidence that optimizing to a small
number of segments is beneficial, this parameter should be omitted and the default behavior accepted.
Here are examples of <commit> and <optimize> using optional attributes:
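For instance (attribute values are illustrative):

<commit waitSearcher="false"/>
<commit waitSearcher="false" expungeDeletes="true"/>
<optimize waitSearcher="false" maxSegments="10"/>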
Delete Operations
Documents can be deleted from the index in two ways. "Delete by ID" deletes the document with the
specified ID, and can be used only if a UniqueID field has been defined in the schema. "Delete by Query"
deletes all documents matching a specified query, although commitWithin is ignored for a Delete by Query.
A single delete message can contain multiple delete operations.
<delete>
  <id>0002166313</id>
  <id>0031745983</id>
  <query>subject:sport</query>
  <query>publisher:penguin</query>
</delete>
When using the Join query parser in a Delete By Query, you should use the score
parameter with a value of "none" to avoid a ClassCastException. See the section on the
Join Query Parser for more details on the score parameter.
Rollback Operations
The rollback command rolls back all adds and deletes made to the index since the last commit. It neither calls any event listeners nor creates a new searcher. Its syntax is simple: <rollback/>.
Grouping Operations
You can post several commands in a single XML file by grouping them with the surrounding <update> element. For example:

<update>
  <add>
    <doc><!-- document fields here --></doc>
  </add>
  <delete>
    <id>0002166313</id>
  </delete>
</update>
Using curl to Perform Updates
You can use the curl utility to perform any of the above commands, using its --data-binary option to
append the XML message to the curl command, and generating a HTTP POST request. For example:
curl http://localhost:8983/solr/my_collection/update -H "Content-Type: text/xml" --data-binary '
<add>
  <doc>
    <!-- field names here are illustrative; use fields defined in your schema -->
    <field name="authors">Patrick Eagar</field>
    <field name="subject">Sports</field>
    <field name="dd">796.35</field>
    <field name="isbn">0002166313</field>
    <field name="yearpub">1982</field>
    <field name="publisher">Collins</field>
  </doc>
</add>'
For posting XML messages contained in a file, you can use the alternative form:
curl http://localhost:8983/solr/my_collection/update -H "Content-Type: text/xml" --data-binary
@myfile.xml
The approach above works well, but using the --data-binary option causes curl to load the whole myfile.xml into memory before posting it to the server. This may be problematic when dealing with multi-gigabyte files. This alternative curl command performs equivalent operations but with minimal curl memory usage:
curl http://localhost:8983/solr/my_collection/update -H "Content-Type: text/xml" -T "myfile.xml"
-X POST
Short requests can also be sent using an HTTP GET command, if enabled in the requestDispatcher element of solrconfig.xml, by URL-encoding the request, as in the following. Note the escaping of "<" and ">":
curl http://localhost:8983/solr/my_collection/update?stream.body=%3Ccommit/%3E&wt=xml
Responses from Solr take the form shown here:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">127</int>
  </lst>
</response>

The status field will be non-zero in case of failure.
Using XSLT to Transform XML Index Updates
The UpdateRequestHandler allows you to index any arbitrary XML by applying an XSL transformation. You must have an XSLT stylesheet in the conf/xslt directory of your configset that can transform the incoming data to the expected format, and use the tr parameter to specify the name of that stylesheet.
Here is an example XSLT stylesheet:
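A simplified sketch of such a stylesheet (the full updateXml.xsl that ships with the sample configsets handles more cases, such as multi-valued fields) might look like this:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <!-- Wrap all result docs in an add command -->
  <xsl:template match="/">
    <add>
      <xsl:apply-templates select="response/result/doc"/>
    </add>
  </xsl:template>
  <!-- Each Solr result doc becomes an update doc -->
  <xsl:template match="doc">
    <doc>
      <xsl:apply-templates select="*[@name]"/>
    </doc>
  </xsl:template>
  <!-- Copy each named value into a field element -->
  <xsl:template match="*[@name]">
    <field name="{@name}"><xsl:value-of select="."/></field>
  </xsl:template>
</xsl:stylesheet>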
This stylesheet transforms Solr’s XML search result format into Solr’s Update XML syntax. One example
usage would be to copy a Solr 1.3 index (which does not have a CSV response writer) into a format which can be indexed into another Solr index (provided that all fields are stored):
http://localhost:8983/solr/my_collection/select?q=*:*&wt=xslt&tr=updateXml.xsl&rows=1000
You can also use the stylesheet in XsltUpdateRequestHandler to transform an index when updating:
curl "http://localhost:8983/solr/my_collection/update?commit=true&tr=updateXml.xsl" -H "ContentType: text/xml" --data-binary @myexporteddata.xml
JSON Formatted Index Updates
Solr can accept JSON that conforms to a defined structure, or can accept arbitrary JSON-formatted
documents. If sending arbitrarily formatted JSON, there are some additional parameters that need to be
sent with the update request, described below in the section Transforming and Indexing Custom JSON.
Solr-Style JSON
JSON formatted update requests may be sent to Solr’s /update handler using Content-Type:
application/json or Content-Type: text/json.
JSON formatted updates can take 3 basic forms, described in depth below:
• A single document to add, expressed as a top level JSON Object. To differentiate this from a set of
commands, the json.command=false request parameter is required.
• A list of documents to add, expressed as a top level JSON Array containing a JSON Object per document.
• A sequence of update commands, expressed as a top level JSON Object (aka: Map).
Adding a Single JSON Document
The simplest way to add Documents via JSON is to send each document individually as a JSON Object, using
the /update/json/docs path:
curl -X POST -H 'Content-Type: application/json'
'http://localhost:8983/solr/my_collection/update/json/docs' --data-binary '
{
"id": "1",
"title": "Doc 1"
}'
Adding Multiple JSON Documents
Adding multiple documents at one time via JSON can be done via a JSON Array of JSON Objects, where each
object represents a document:
curl -X POST -H 'Content-Type: application/json'
'http://localhost:8983/solr/my_collection/update' --data-binary '
[
{
"id": "1",
"title": "Doc 1"
},
{
"id": "2",
"title": "Doc 2"
}
]'
A sample JSON file is provided at example/exampledocs/books.json and contains an array of objects that
you can add to the Solr techproducts example:
curl 'http://localhost:8983/solr/techproducts/update?commit=true' --data-binary
@example/exampledocs/books.json -H 'Content-type:application/json'
Sending JSON Update Commands
In general, the JSON update syntax supports all of the update commands that the XML update handler
supports, through a straightforward mapping. Multiple commands, adding and deleting documents, may be
contained in one message:
curl -X POST -H 'Content-Type: application/json'
'http://localhost:8983/solr/my_collection/update' --data-binary '
{
"add": {
"doc": {
"id": "DOC1",
"my_field": 2.3,
"my_multivalued_field": [ "aaa", "bbb" ]
①
}
},
"add": {
"commitWithin": 5000, ②
"overwrite": false, ③
"doc": {
"f1": "v1", ④
"f1": "v2"
}
},
"commit": {},
"optimize": { "waitSearcher":false },
"delete": { "id":"ID" }, ⑤
"delete": { "query":"QUERY" } ⑥
}'
① Can use an array for a multi-valued field
② Commit this document within 5 seconds
③ Don’t check for existing documents with the same uniqueKey
④ Can use repeated keys for a multi-valued field
⑤ Delete by ID (uniqueKey field)
⑥ Delete by Query
As with other update handlers, parameters such as commit, commitWithin, optimize, and overwrite may be
specified in the URL instead of in the body of the message.
The JSON update format allows for a simple delete-by-id. The value of a delete can be an array which
contains a list of zero or more specific document id’s (not a range) to be deleted. For example, a single
document:
{ "delete":"myid" }
Or a list of document IDs:
{ "delete":["id1","id2"] }
The value of a "delete" can be an array which contains a list of zero or more id’s to be deleted. It is not a
range (start and end).
You can also specify _version_ with each "delete":
{
  "delete": { "id":50, "_version_":12345 }
}
You can specify the version of deletes in the body of the update request as well.
JSON Update Convenience Paths
In addition to the /update handler, there are a few additional JSON-specific request handler paths available by default in Solr that implicitly override the behavior of some request parameters:
Path                 Default Parameters
/update/json         stream.contentType=application/json
/update/json/docs    stream.contentType=application/json
                     json.command=false
The /update/json path may be useful for clients sending in JSON formatted update commands from
applications where setting the Content-Type proves difficult, while the /update/json/docs path can be
particularly convenient for clients that always want to send in documents – either individually or as a list –
without needing to worry about the full JSON command syntax.
Custom JSON Documents
Solr can support custom JSON. This is covered in the section Transforming and Indexing Custom JSON.
CSV Formatted Index Updates
CSV formatted update requests may be sent to Solr’s /update handler using Content-Type:
application/csv or Content-Type: text/csv.
A sample CSV file is provided at example/exampledocs/books.csv that you can use to add some documents
to the Solr techproducts example:
curl 'http://localhost:8983/solr/my_collection/update?commit=true' --data-binary
@example/exampledocs/books.csv -H 'Content-type:application/csv'
CSV Update Parameters
The CSV handler allows the specification of many parameters in the URL, either globally (parameter=value) or per-field (f.fieldname.parameter=value).
The table below describes the parameters for the update handler.
separator
Character used as field separator; default is ",". This parameter is global; for per-field usage, see the
split parameter.
Example: separator=%09
trim
If true, remove leading and trailing whitespace from values. The default is false. This parameter can be
either global or per-field.
Examples: f.isbn.trim=true or trim=false
header
Set to true if first line of input contains field names. These will be used if the fieldnames parameter is
absent. This parameter is global.
fieldnames
Comma-separated list of field names to use when adding documents. This parameter is global.
Example: fieldnames=isbn,price,title
literal.field_name
A literal value for a specified field name. This parameter is global.
Example: literal.color=red
skip
Comma separated list of field names to skip. This parameter is global.
Example: skip=uninteresting,shoesize
skipLines
Number of lines to discard in the input stream before the CSV data starts, including the header, if
present. Default=0. This parameter is global.
Example: skipLines=5
encapsulator
The character optionally used to surround values to preserve characters such as the CSV separator or
whitespace. This standard CSV format handles the encapsulator itself appearing in an encapsulated value
by doubling the encapsulator.
This parameter is global; for per-field usage, see split.
Example: encapsulator="
escape
The character used for escaping CSV separators or other reserved characters. If an escape is specified, the encapsulator is not used unless also explicitly specified, since most formats use either encapsulation or escaping, not both. This parameter is global.
Example: escape=\
keepEmpty
Keep and index zero length (empty) fields. The default is false. This parameter can be global or per-field.
Example: f.price.keepEmpty=true
map
Map one value to another. Format is value:replacement (which can be empty). This parameter can be
global or per-field.
Example: map=left:right or f.subject.map=history:bunk
split
If true, split a field into multiple values by a separate parser. This parameter is used on a per-field basis.
overwrite
If true (the default), check for and overwrite duplicate documents, based on the uniqueKey field declared
in the Solr schema. If you know the documents you are indexing do not contain any duplicates then you
may see a considerable speed up setting this to false.
This parameter is global.
commit
Issues a commit after the data has been ingested. This parameter is global.
commitWithin
Add the document within the specified number of milliseconds. This parameter is global.
Example: commitWithin=10000
rowid
Map the rowid (line number) to a field specified by the value of the parameter, for instance if your CSV
doesn’t have a unique key and you want to use the row id as such. This parameter is global.
Example: rowid=id
rowidOffset
Add the given offset (as an integer) to the rowid before adding it to the document. Default is 0. This
parameter is global.
Example: rowidOffset=10
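To see how several of these parameters combine, the sketch below trims one field, skips another, attaches a
literal value, and asks for a commit within ten seconds. The field names are illustrative only and assume the
collection's schema (or its dynamic fields, such as *_s) can accept them; the sample books.csv file shipped
with Solr is used as input.
curl 'http://localhost:8983/solr/my_collection/update?header=true&f.name.trim=true&skip=price&literal.source_s=csv_import&commitWithin=10000' --data-binary @example/exampledocs/books.csv -H 'Content-type:application/csv'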
Indexing Tab-Delimited files
The same feature used to index CSV documents can also be easily used to index tab-delimited files (TSV files)
and even handle backslash escaping rather than CSV encapsulation.
For example, one can dump a MySQL table to a tab delimited file with:
SELECT * INTO OUTFILE '/tmp/result.txt' FROM mytable;
This file could then be imported into Solr by setting the separator to tab (%09) and the escape to backslash
(%5c).
curl 'http://localhost:8983/solr/my_collection/update/csv?commit=true&separator=%09&escape=%5c'
--data-binary @/tmp/result.txt
CSV Update Convenience Paths
In addition to the /update handler, a CSV-specific request handler path is available by default in Solr that
implicitly overrides the behavior of some request parameters:
Path: /update/csv
Default Parameters: stream.contentType=application/csv
The /update/csv path may be useful for clients sending in CSV formatted update commands from
applications where setting the Content-Type proves difficult.
Nested Child Documents
Solr supports indexing nested documents such as a blog post parent document and comments as child
documents — or products as parent documents and sizes, colors, or other variations as child documents.
The parent together with all of its children is referred to as a "block", which explains some of the
nomenclature of related features. At query time, the Block Join Query Parsers can search these relationships,
and the [child] Document Transformer can attach child documents to the result documents. In terms of
performance, indexing the relationships between documents usually yields much faster queries than an
equivalent "query time join", since the relationships are already stored in the index and do not need to be
computed. However, nested documents are less flexible than query-time joins because they impose rules that
some applications may not be able to accept.
Note
A big limitation is that the whole block of parent-children documents must be updated or
deleted together, not separately. In other words, even if a single child document or the
parent document is changed, the whole block of parent-child documents must be indexed
together. Solr does not enforce this rule; if it’s violated, you may get sporadic query failures
or incorrect results.
Nested documents may be indexed via either the XML or JSON data syntax, and are also supported by SolrJ
with javabin.
Schema Notes
• The schema must include an indexed, non-stored field _root_. The value of that field is populated
automatically and is the same for all documents in the block, regardless of the inheritance depth.
• Nested documents are very much documents in their own right even if certain nested documents hold
different information from the parent. Therefore:
◦ the schema must be able to represent the fields of any document
◦ it may be infeasible to use required
◦ even child documents need a unique id
• You must include a field that identifies the parent document as a parent; it can be any field that suits this
purpose, and it will be used as input for the block join query parsers. A sketch of creating such a field
appears after these notes.
• If you associate a child document as a field (e.g., comment), that field need not be defined in the schema,
and probably shouldn’t be as it would be confusing. There is no child document field type.
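As a sketch of the parent-flag note above, such a field could be added with the Schema API before indexing.
The field name content_type and the collection name my_collection are only examples, and the field may
already exist in your configset:
curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field": {
    "name": "content_type",
    "type": "string",
    "indexed": true,
    "stored": true
  }
}' http://localhost:8983/solr/my_collection/schema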
XML Examples
For example, here are two documents and their child documents. The example illustrates two styles of adding
child documents: the first is associated via a field "comment" (preferred), and the second is done in the
classic way, now referred to as an "anonymous" or "unlabelled" child document. The field-label relationship is
available to the URP chain in Solr but is ultimately discarded; Solr 8 will save the relationship.
<add>
  <doc>
    <field name="id">1</field>
    <field name="title">Solr adds block join support</field>
    <field name="content_type">parentDocument</field>
    <field name="comment">
      <doc>
        <field name="id">2</field>
        <field name="comments">SolrCloud supports it too!</field>
      </doc>
    </field>
  </doc>
  <doc>
    <field name="id">3</field>
    <field name="title">New Lucene and Solr release is out</field>
    <field name="content_type">parentDocument</field>
    <doc>
      <field name="id">4</field>
      <field name="comments">Lots of new features</field>
    </doc>
  </doc>
</add>
In this example, we have indexed the parent documents with the field content_type, which has the value
"parentDocument". We could have also used a boolean field, such as isParent, with a value of "true", or any
other similar approach.
JSON Examples
This example is equivalent to the XML example above. Again, the field labelled relationship is preferred. The
labelled relationship here is one child document but could have been wrapped in array brackets. For the
anonymous relationship, note the special _childDocuments_ key whose contents must be an array of child
documents.
[
{
"id": "1",
"title": "Solr adds block join support",
"content_type": "parentDocument",
"comment": {
"id": "2",
"comments": "SolrCloud supports it too!"
}
},
{
"id": "3",
"title": "New Lucene and Solr release is out",
"content_type": "parentDocument",
"_childDocuments_": [
{
"id": "4",
"comments": "Lots of new features"
}
]
}
]
Transforming and Indexing Custom JSON
If you have JSON documents that you would like to index without transforming them into Solr’s structure,
you can add them to Solr by including some parameters with the update request.
These parameters provide information on how to split a single JSON file into multiple Solr documents and
how to map fields to Solr’s schema. One or more valid JSON documents can be sent to the
/update/json/docs path with the configuration params.
Mapping Parameters
These parameters allow you to define how a JSON file should be read for multiple Solr documents.
split
Defines the path at which to split the input JSON into multiple Solr documents and is required if you have
multiple documents in a single JSON file. If the entire JSON makes a single Solr document, the path must
be “/”.
It is possible to pass multiple split paths by separating them with a pipe (|), for example:
split=/|/foo|/foo/bar. If one path is a child of another, they automatically become a child document.
f
Provides multivalued mapping to map document field names to Solr field names. The format of the
parameter is target-field-name:json-path, as in f=first:/first. The json-path is required. The
target-field-name is the Solr document field name, and is optional. If not specified, it is automatically
derived from the input JSON. The default target field name is the fully qualified name of the field.
Wildcards can be used here, see Using Wildcards for Field Names below for more information.
mapUniqueKeyOnly
(boolean) This parameter is particularly convenient when the fields in the input JSON are not available in
the schema and schemaless mode is not enabled. It indexes all the fields into the default search field
(defined by the df parameter, below), and only the uniqueKey field is mapped to the corresponding field in
the schema. If the input JSON does not contain a value for the uniqueKey field, a UUID is generated and
used as the value.
df
If the mapUniqueKeyOnly flag is used, the update handler needs a field into which the data should be
indexed. This is the same field that other handlers use as a default search field.
srcField
The name of the field in which the JSON source document will be stored. This can only be used if split=/
(i.e., you want your JSON input file to be indexed as a single Solr document). Note that atomic updates
will cause the field to be out of sync with the document.
echo
This is for debugging purposes only. Set it to true if you want the documents to be returned as a response;
nothing will be indexed.
For example, if we have a JSON file that includes two documents, we could define an update request like this:
V1 API
curl 'http://localhost:8983/solr/techproducts/update/json/docs'\
'?split=/exams'\
'&f=first:/first'\
'&f=last:/last'\
'&f=grade:/grade'\
'&f=subject:/exams/subject'\
'&f=test:/exams/test'\
'&f=marks:/exams/marks'\
-H 'Content-type:application/json' -d '
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    {
      "subject": "Maths",
      "test": "term1",
      "marks": 90
    },
    {
      "subject": "Biology",
      "test": "term1",
      "marks": 86
    }
  ]
}'
V2 API Standalone Solr
curl 'http://localhost:8983/api/cores/techproducts/update/json/docs'\
'?split=/exams'\
'&f=first:/first'\
'&f=last:/last'\
'&f=grade:/grade'\
'&f=subject:/exams/subject'\
'&f=test:/exams/test'\
'&f=marks:/exams/marks'\
-H 'Content-type:application/json' -d '
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    {
      "subject": "Maths",
      "test": "term1",
      "marks": 90
    },
    {
      "subject": "Biology",
      "test": "term1",
      "marks": 86
    }
  ]
}'
V2 API SolrCloud
curl 'http://localhost:8983/api/collections/techproducts/update/json/docs'\
'?split=/exams'\
'&f=first:/first'\
'&f=last:/last'\
'&f=grade:/grade'\
'&f=subject:/exams/subject'\
'&f=test:/exams/test'\
'&f=marks:/exams/marks'\
-H 'Content-type:application/json' -d '
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    {
      "subject": "Maths",
      "test": "term1",
      "marks": 90
    },
    {
      "subject": "Biology",
      "test": "term1",
      "marks": 86
    }
  ]
}'
With this request, we have defined that "exams" contains multiple documents. In addition, we have mapped
several fields from the input document to Solr fields.
When the update request is complete, the following two documents will be added to the index:
{
"first":"John",
"last":"Doe",
"marks":90,
"test":"term1",
"subject":"Maths",
"grade":8
}
{
"first":"John",
"last":"Doe",
"marks":86,
"test":"term1",
"subject":"Biology",
"grade":8
}
In the prior example, all of the fields we wanted to use in Solr had the same names as they did in the input
JSON. When that is the case, we can simplify the request by only specifying the json-path portion of the f
parameter, as in this example:
V1 API
curl 'http://localhost:8983/solr/techproducts/update/json/docs'\
'?split=/exams'\
'&f=/first'\
'&f=/last'\
'&f=/grade'\
'&f=/exams/subject'\
'&f=/exams/test'\
'&f=/exams/marks'\
-H 'Content-type:application/json' -d '
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    {
      "subject": "Maths",
      "test": "term1",
      "marks": 90
    },
    {
      "subject": "Biology",
      "test": "term1",
      "marks": 86
    }
  ]
}'
V2 API Standalone Solr
curl 'http://localhost:8983/api/cores/techproducts/update/json/docs'\
'?split=/exams'\
'&f=/first'\
'&f=/last'\
'&f=/grade'\
'&f=/exams/subject'\
'&f=/exams/test'\
'&f=/exams/marks'\
-H 'Content-type:application/json' -d '
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    {
      "subject": "Maths",
      "test": "term1",
      "marks": 90
    },
    {
      "subject": "Biology",
      "test": "term1",
      "marks": 86
    }
  ]
}'
V2 API SolrCloud
curl 'http://localhost:8983/api/collections/techproducts/update/json/docs'\
'?split=/exams'\
'&f=/first'\
'&f=/last'\
'&f=/grade'\
'&f=/exams/subject'\
'&f=/exams/test'\
'&f=/exams/marks'\
-H 'Content-type:application/json' -d '
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    {
      "subject": "Maths",
      "test": "term1",
      "marks": 90
    },
    {
      "subject": "Biology",
      "test": "term1",
      "marks": 86
    }
  ]
}'
In this example, we simply named the field paths (such as /exams/test). Solr will automatically attempt to
add the content of the field from the JSON input to the index in a field with the same name.
Documents will be rejected during indexing if the fields do not exist in the schema before
indexing. So, if you are NOT using schemaless mode, you must pre-create all fields. If you
are working in Schemaless Mode, however, fields that don’t exist will be created on the fly
with Solr’s best guess for the field type.
Reusing Parameters in Multiple Requests
You can store and re-use parameters with Solr’s Request Parameters API.
Say we wanted to define parameters to split documents at the exams field, and map several other fields. We
could make an API request such as:
V1 API
curl http://localhost:8983/solr/techproducts/config/params -H 'Content-type:application/json' -d '{
"set": {
"my_params": {
"split": "/exams",
"f":
["first:/first","last:/last","grade:/grade","subject:/exams/subject","test:/exams/test"]
}}}'
V2 API Standalone Solr
curl http://localhost:8983/api/cores/techproducts/config/params -H 'Content-type:application/json' -d '{
"set": {
"my_params": {
"split": "/exams",
"f":
["first:/first","last:/last","grade:/grade","subject:/exams/subject","test:/exams/test"]
}}}'
V2 API SolrCloud
curl http://localhost:8983/api/collections/techproducts/config/params -H 'Content-type:application/json' -d '{
"set": {
"my_params": {
"split": "/exams",
"f":
["first:/first","last:/last","grade:/grade","subject:/exams/subject","test:/exams/test"]
}}}'
When we send the documents, we’d use the useParams parameter with the name of the parameter set we
defined:
V1 API
curl 'http://localhost:8983/solr/techproducts/update/json/docs?useParams=my_params' -H
'Content-type:application/json' -d '{
"first": "John",
"last": "Doe",
"grade": 8,
"exams": [{
"subject": "Maths",
"test": "term1",
"marks": 90
},
{
"subject": "Biology",
"test": "term1",
"marks": 86
}
]
}'
V2 API Standalone Solr
curl 'http://localhost:8983/api/cores/techproducts/update/json?useParams=my_params' -H
'Content-type:application/json' -d '{
"first": "John",
"last": "Doe",
"grade": 8,
"exams": [{
"subject": "Maths",
"test": "term1",
"marks": 90
},
{
"subject": "Biology",
"test": "term1",
"marks": 86
}
]
}'
V2 API SolrCloud
curl 'http://localhost:8983/api/collections/techproducts/update/json?useParams=my_params' -H
'Content-type:application/json' -d '{
"first": "John",
"last": "Doe",
"grade": 8,
"exams": [{
"subject": "Maths",
"test": "term1",
"marks": 90
},
{
"subject": "Biology",
"test": "term1",
"marks": 86
}
]
}'
Using Wildcards for Field Names
Instead of specifying all the field names explicitly, it is possible to specify wildcards to map fields
automatically.
There are two restrictions: wildcards can only be used at the end of the json-path, and the split path cannot
use wildcards.
A single asterisk * maps only to direct children, and a double asterisk ** maps recursively to all descendants.
The following are example wildcard path mappings:
• f=$FQN:/**: maps all fields to the fully qualified name ($FQN) of the JSON field. The fully qualified name is
obtained by concatenating all the keys in the hierarchy with a period (.) as a delimiter. This is the default
behavior if no f path mappings are specified.
• f=/docs/*: maps all the fields directly under docs, using the field names as given in the JSON
• f=/docs/**: maps all the fields under docs and its descendants, using the field names as given in the JSON
• f=searchField:/docs/*: maps all fields under /docs to a single field called ‘searchField’
• f=searchField:/docs/**: maps all fields under /docs and its children to searchField
With wildcards we can further simplify our previous example as follows:
V1 API
curl 'http://localhost:8983/solr/techproducts/update/json/docs'\
'?split=/exams'\
'&f=/**'\
-H 'Content-type:application/json' -d '
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    {
      "subject": "Maths",
      "test": "term1",
      "marks": 90
    },
    {
      "subject": "Biology",
      "test": "term1",
      "marks": 86
    }
  ]
}'
V2 API Standalone Solr
curl 'http://localhost:8983/api/cores/techproducts/update/json'\
'?split=/exams'\
'&f=/**'\
-H 'Content-type:application/json' -d '
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    {
      "subject": "Maths",
      "test": "term1",
      "marks": 90
    },
    {
      "subject": "Biology",
      "test": "term1",
      "marks": 86
    }
  ]
}'
V2 API SolrCloud
curl 'http://localhost:8983/api/collections/techproducts/update/json'\
'?split=/exams'\
'&f=/**'\
-H 'Content-type:application/json' -d '
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    {
      "subject": "Maths",
      "test": "term1",
      "marks": 90
    },
    {
      "subject": "Biology",
      "test": "term1",
      "marks": 86
    }
  ]
}'
Because we want the fields to be indexed with the field names as they are found in the JSON input, the
double wildcard in f=/** will map all fields and their descendants to the same fields in Solr.
It is also possible to send all the values to a single field and do a full text search on that. This is a good option
to blindly index and query JSON documents without worrying about fields and schema.
V1 API
curl 'http://localhost:8983/solr/techproducts/update/json/docs'\
'?split=/'\
'&f=txt:/**'\
-H 'Content-type:application/json' -d '
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    {
      "subject": "Maths",
      "test": "term1",
      "marks": 90
    },
    {
      "subject": "Biology",
      "test": "term1",
      "marks": 86
    }
  ]
}'
V2 API Standalone Solr
curl 'http://localhost:8983/api/cores/techproducts/update/json'\
'?split=/'\
'&f=txt:/**'\
-H 'Content-type:application/json' -d '
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    {
      "subject": "Maths",
      "test": "term1",
      "marks": 90
    },
    {
      "subject": "Biology",
      "test": "term1",
      "marks": 86
    }
  ]
}'
V2 API SolrCloud
curl 'http://localhost:8983/api/collections/techproducts/update/json'\
'?split=/'\
'&f=txt:/**'\
-H 'Content-type:application/json' -d '
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    {
      "subject": "Maths",
      "test": "term1",
      "marks": 90
    },
    {
      "subject": "Biology",
      "test": "term1",
      "marks": 86
    }
  ]
}'
In the above example, we’ve said all of the fields should be added to a field in Solr named 'txt'. This will add
multiple fields to a single field, so whatever field you choose should be multi-valued.
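If such a catch-all field does not already exist, it could be created along the following lines with the Schema
API. This is only a sketch; the field name txt, the text_general field type, and the collection name are
assumptions and should be adjusted to your own schema:
curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field": {
    "name": "txt",
    "type": "text_general",
    "multiValued": true,
    "indexed": true,
    "stored": true
  }
}' http://localhost:8983/solr/techproducts/schema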
The default behavior is to use the fully qualified name (FQN) of the node. So, if we don’t define any field
mappings, like this:
V1 API
curl 'http://localhost:8983/solr/techproducts/update/json/docs?split=/exams'\
-H 'Content-type:application/json' -d '
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    {
      "subject": "Maths",
      "test": "term1",
      "marks": 90
    },
    {
      "subject": "Biology",
      "test": "term1",
      "marks": 86
    }
  ]
}'
V2 API Standalone Solr
curl 'http://localhost:8983/api/cores/techproducts/update/json?split=/exams'\
-H 'Content-type:application/json' -d '
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    {
      "subject": "Maths",
      "test": "term1",
      "marks": 90
    },
    {
      "subject": "Biology",
      "test": "term1",
      "marks": 86
    }
  ]
}'
V2 API SolrCloud
curl 'http://localhost:8983/api/collections/techproducts/update/json?split=/exams'\
-H 'Content-type:application/json' -d '
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    {
      "subject": "Maths",
      "test": "term1",
      "marks": 90
    },
    {
      "subject": "Biology",
      "test": "term1",
      "marks": 86
    }
  ]
}'
The indexed documents would be added to the index with fields that look like this:
{
"first":"John",
"last":"Doe",
"grade":8,
"exams.subject":"Maths",
"exams.test":"term1",
"exams.marks":90},
{
"first":"John",
"last":"Doe",
"grade":8,
"exams.subject":"Biology",
"exams.test":"term1",
"exams.marks":86}
Multiple Documents in a Single Payload
This functionality supports documents in the JSON Lines format (.jsonl), which specifies one document per
line.
For example:
V1 API
curl 'http://localhost:8983/solr/techproducts/update/json/docs' -H 'Content-type:application/json' -d '
{ "first":"Steve", "last":"Jobs", "grade":1, "subject":"Social Science", "test":"term1",
"marks":90}
{ "first":"Steve", "last":"Woz", "grade":1, "subject":"Political Science", "test":"term1",
"marks":86}'
V2 API Standalone Solr
curl 'http://localhost:8983/api/cores/techproducts/update/json' -H 'Content-type:application/json' -d '
{ "first":"Steve", "last":"Jobs", "grade":1, "subject":"Social Science", "test":"term1",
"marks":90}
{ "first":"Steve", "last":"Woz", "grade":1, "subject":"Political Science", "test":"term1",
"marks":86}'
V2 API SolrCloud
curl 'http://localhost:8983/api/collections/techproducts/update/json' -H 'Content-type:application/json' -d '
{ "first":"Steve", "last":"Jobs", "grade":1, "subject":"Social Science", "test":"term1",
"marks":90}
{ "first":"Steve", "last":"Woz", "grade":1, "subject":"Political Science", "test":"term1",
"marks":86}'
Or even an array of documents, as in this example:
V1 API
curl 'http://localhost:8983/solr/techproducts/update/json/docs' -H 'Content-type:application/json' -d '[
{"first":"Steve", "last":"Jobs", "grade":1, "subject":"Computer Science", "test":"term1",
"marks":90},
{"first":"Steve", "last":"Woz", "grade":1, "subject":"Calculus", "test":"term1",
"marks":86}]'
V2 API Standalone Solr
curl 'http://localhost:8983/api/cores/techproducts/update/json' -H 'Content-type:application/json' -d '[
{"first":"Steve", "last":"Jobs", "grade":1, "subject":"Computer Science", "test":"term1",
"marks":90},
{"first":"Steve", "last":"Woz", "grade":1, "subject":"Calculus", "test":"term1",
"marks":86}]'
V2 API SolrCloud
curl 'http://localhost:8983/api/collections/techproducts/update/json' -H 'Content-type:application/json' -d '[
{"first":"Steve", "last":"Jobs", "grade":1, "subject":"Computer Science", "test":"term1",
"marks":90},
{"first":"Steve", "last":"Woz", "grade":1, "subject":"Calculus", "test":"term1",
"marks":86}]'
Indexing Nested Documents
The following is an example of indexing nested documents:
V1 API
curl 'http://localhost:8983/solr/techproducts/update/json/docs?split=/|/orgs'\
-H 'Content-type:application/json' -d '{
"name": "Joe Smith",
"phone": 876876687,
"orgs": [
{
"name": "Microsoft",
"city": "Seattle",
"zip": 98052
},
{
"name": "Apple",
"city": "Cupertino",
"zip": 95014
}
]
}'
V2 API Standalone Solr
curl 'http://localhost:8983/api/cores/techproducts/update/json?split=/|/orgs'\
-H 'Content-type:application/json' -d '{
"name": "Joe Smith",
"phone": 876876687,
"orgs": [
{
"name": "Microsoft",
"city": "Seattle",
"zip": 98052
},
{
"name": "Apple",
"city": "Cupertino",
"zip": 95014
}
]
}'
V2 API SolrCloud
curl 'http://localhost:8983/api/collections/techproducts/update/json?split=/|/orgs'\
-H 'Content-type:application/json' -d '{
"name": "Joe Smith",
"phone": 876876687,
"orgs": [
{
"name": "Microsoft",
"city": "Seattle",
"zip": 98052
},
{
"name": "Apple",
"city": "Cupertino",
"zip": 95014
}
]
}'
With this example, the documents indexed would be as follows:
{
"name":"Joe Smith",
"phone":876876687,
"_childDocuments_":[
{
"name":"Microsoft",
"city":"Seattle",
"zip":98052},
{
"name":"Apple",
"city":"Cupertino",
"zip":95014}]}
Tips for Custom JSON Indexing
1. Schemaless mode: This handles field creation automatically. The field guessing may not be exactly what you
expect, but it works. The best approach is to set up a local server in schemaless mode, index a few sample
docs, and then create those fields in your real setup with proper field types before indexing.
2. Pre-created schema: Post your docs to the /update/json/docs endpoint with echo=true. This gives you the
list of field names you need to create (see the sketch after this list). Create the fields before you
actually index.
3. No schema, only full-text search: If all you need is full-text search over your JSON, set the
configuration as given in the Setting JSON Defaults section.
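For the second tip, a quick way to preview the field names a document would produce, without indexing
anything, is a sketch like the following; the split and f parameters should match whatever you plan to use for
real indexing:
curl 'http://localhost:8983/solr/techproducts/update/json/docs?split=/exams&f=/**&echo=true' \
  -H 'Content-type:application/json' -d '
{
  "first": "John",
  "last": "Doe",
  "exams": [{"subject": "Maths", "test": "term1", "marks": 90}]
}'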
Setting JSON Defaults
It is possible to send any JSON to the /update/json/docs endpoint and the default configuration of the
component is as follows:
<initParams path="/update/json/docs">
  <lst name="defaults">
    <str name="srcField">_src_</str>
    <str name="mapUniqueKeyOnly">true</str>
    <str name="df">text</str>
  </lst>
</initParams>
So, if no parameters are passed, the entire JSON file is indexed into the _src_ field and all the values in the
input JSON go to a field named text. If the input JSON contains a value for the uniqueKey field it is stored;
if no value can be obtained from the input JSON, a UUID is created and used as the uniqueKey field value.
Alternately, use the Request Parameters feature to set these parameters, as shown earlier in the section
Reusing Parameters in Multiple Requests.
V1 API
curl http://localhost:8983/solr/techproducts/config/params -H 'Content-type:application/json' -d '{
"set": {
"full_txt": {
"srcField": "_src_",
"mapUniqueKeyOnly" : true,
"df": "text"
}}}'
V2 API Standalone Solr
curl http://localhost:8983/api/cores/techproducts/config/params -H 'Content-type:application/json' -d '{
"set": {
"full_txt": {
"srcField": "_src_",
"mapUniqueKeyOnly" : true,
"df": "text"
}}}'
V2 API SolrCloud
curl http://localhost:8983/api/collections/techproducts/config/params -H 'Content-type:application/json' -d '{
"set": {
"full_txt": {
"srcField": "_src_",
"mapUniqueKeyOnly" : true,
"df": "text"
}}}'
To use these parameters, send the parameter useParams=full_txt with each request.
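For example, once the full_txt parameter set above has been created, any JSON document can be posted
without further mapping parameters. This is only a sketch; the document content is arbitrary:
curl 'http://localhost:8983/solr/techproducts/update/json/docs?useParams=full_txt&commit=true' \
  -H 'Content-type:application/json' -d '{"id": "doc-1", "first": "John", "exams": [{"subject": "Maths", "marks": 90}]}'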
Uploading Data with Solr Cell using Apache Tika
If the documents you need to index are in a binary format, such as Word, Excel, PDFs, etc., Solr includes a
request handler which uses Apache Tika to extract text for indexing to Solr.
Solr uses code from the Tika project to provide a framework for incorporating many different file-format
parsers such as Apache PDFBox and Apache POI into Solr itself.
Working with this framework, Solr’s ExtractingRequestHandler uses Tika internally to support uploading
binary files for data extraction and indexing. Downloading Tika is not required to use Solr Cell.
When this framework was under development, it was called the Solr Content Extraction Library, or CEL; from
that abbreviation came this framework’s name: Solr Cell. The names Solr Cell and
ExtractingRequestHandler are used interchangeably for this feature.
Key Solr Cell Concepts
When using the Solr Cell framework, it is helpful to keep the following in mind:
• Tika will automatically attempt to determine the input document type (e.g., Word, PDF, HTML) and
extract the content appropriately. If you like, you can explicitly specify a MIME type for Tika with the
stream.type parameter. See http://tika.apache.org/1.19.1/formats.html for the file types supported.
• Briefly, Tika internally works by synthesizing an XHTML document from the core content of the parsed
document which is passed to a configured SAX ContentHandler provided by Solr Cell. Solr responds to
Tika’s SAX events to create one or more text fields from the content. Tika exposes document metadata as
well (apart from the XHTML).
• Tika produces metadata such as Title, Subject, and Author according to specifications such as the
DublinCore. The metadata available is highly dependent on the file types and what they in turn contain.
Some of the general metadata created is described in the section Metadata Created by Tika below. Solr
Cell supplies some metadata of its own too.
• Solr Cell concatenates text from the internal XHTML into a content field. You can configure which
elements should be included/ignored, and which should map to another field.
• Solr Cell maps each piece of metadata onto a field. By default it maps to the same name but several
parameters control how this is done.
• When Solr Cell finishes creating the internal SolrInputDocument, the rest of the Lucene/Solr indexing
stack takes over. The next step after any update handler is the Update Request Processor chain.
Solr Cell is a contrib module, which means it is not loaded automatically with Solr but must be configured.
The example configsets have Solr Cell configured, but if you are not using those, you will want to pay
attention to the section Configuring the ExtractingRequestHandler in solrconfig.xml below.
Solr Cell Performance Implications
Rich document formats are frequently not well documented, and even in cases where there is
documentation for the format, not everyone who creates documents will follow the specifications faithfully.
This creates a situation where Tika may encounter something that it is simply not able to handle gracefully,
despite taking great pains to support as many formats as possible. PDF files are particularly problematic,
mostly due to the PDF format itself.
In case of a failure processing any file, the ExtractingRequestHandler does not have a secondary
mechanism to try to extract some text from the file; it will throw an exception and fail.
If any exceptions cause the ExtractingRequestHandler and/or Tika to crash, Solr as a whole will also crash
because the request handler is running in the same JVM that Solr uses for other operations.
Indexing can also consume all available Solr resources, particularly with large PDFs, presentations, or other
files that have a lot of rich media embedded in them.
For these reasons, Solr Cell is not recommended for use in a production system.
It is a best practice to use Solr Cell as a proof-of-concept tool during development and then run Tika as an
external process that sends the extracted documents to Solr (via SolrJ) for indexing. This way, any extraction
failures that occur are isolated from Solr itself and can be handled gracefully.
For a few examples of how this could be done, see this blog post by Erick Erickson, Indexing with SolrJ.
Trying out Solr Cell
You can try out the Tika framework using the techproducts example included in Solr.
This command starts Solr and creates a core named "techproducts" loaded with the sample configuration and
example documents:
bin/solr -e techproducts
Once Solr is started, you can use curl to send a sample PDF included with Solr via HTTP POST:
curl 'http://localhost:8983/solr/techproducts/update/extract?literal.id=doc1&commit=true' -F
"myfile=@example/exampledocs/solr-word.pdf"
The URL above calls the ExtractingRequestHandler, uploads the file solr-word.pdf, and assigns it the
unique ID doc1. Here’s a closer look at the components of this command:
• The literal.id=doc1 parameter provides a unique ID for the document being indexed. Without this, the
ID would be set to the absolute path to the file.
There are alternatives to this, such as mapping a metadata field to the ID, generating a new UUID, or
generating an ID from a signature (hash) of the content.
• The commit=true parameter causes Solr to perform a commit after indexing the document, making it
immediately searchable. For optimum performance when loading many documents, don’t call the
commit command until you are done.
• The -F flag instructs curl to POST data using the Content-Type multipart/form-data and supports the
uploading of binary files. The @ symbol instructs curl to upload the attached file.
• The argument myfile=@example/exampledocs/solr-word.pdf uploads the sample file. Note this includes
the path, so if you upload a different file, always be sure to include either the relative or absolute path to
the file.
You can also use bin/post to do the same thing:
bin/post -c techproducts example/exampledocs/solr-word.pdf -params "literal.id=a"
Now you can execute a query and find that document with a request like
http://localhost:8983/solr/techproducts/select?q=pdf. The response will include the indexed document along
with its extracted content and metadata fields.
You may notice there are many metadata fields associated with this document. Solr’s configuration is by
default in "schemaless" (data driven) mode, and thus all metadata fields extracted get their own field.
You might instead want to ignore them generally except for a few you specify. To do that, use the uprefix
parameter to map unknown (to the schema) metadata field names to a schema field name that is effectively
ignored. The dynamic field ignored_* is good for this purpose.
For the fields you do want to map, explicitly set them using fmap.IN=OUT and/or ensure the field is defined in
the schema. Here’s an example:
bin/post -c techproducts example/exampledocs/solr-word.pdf -params
"literal.id=doc1&uprefix=attr_"
The above example won’t work as expected if you run it after you’ve already indexed the
document one or more times.
Previously we added the document without these parameters so all fields were added to
the index at that time. The uprefix parameter only applies to fields that are undefined, so
these won’t be prefixed if the document is reindexed later. However, you would see the
new last_modified_dt field.
The easiest way to try this parameter is to start over with a fresh collection.
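One way to start over, sketched under the assumption that you are running the techproducts example locally
and are happy to discard its data, is to delete and recreate the core before posting the document again:
bin/solr delete -c techproducts
bin/solr create -c techproducts -d sample_techproducts_configs
bin/post -c techproducts example/exampledocs/solr-word.pdf -params "literal.id=doc1&uprefix=attr_"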
ExtractingRequestHandler Parameters and Configuration
Solr Cell Parameters
The following parameters are accepted by the ExtractingRequestHandler.
These parameters can be set for each indexing request (as request parameters), or they can be set for all
requests to the request handler generally by defining them in solrconfig.xml, as described in Configuring
the ExtractingRequestHandler in solrconfig.xml.
capture
Captures XHTML elements with the specified name for a supplementary addition to the Solr document.
This parameter can be useful for copying chunks of the XHTML into a separate field. For instance, it could
be used to grab paragraphs (