RIM Laboratoire 3 Lab 1 Instructions
Méthodes d’accès aux données 2018 - 2019
Indexing and Search with Apache Lucene
Lab Nº 1
A. Objectives
Lucene
Lucene
!"
# $
! !
B. Organization
% &
Report'("
()(!
Deadline'*Moodle
C. Import the project
+",-
+.(""
D. Understanding the Lucene API
‐Lucene ( "
' 5,6 3
-5&6() ‐
"(
" Lucene + '
'33334747,33"0
lucene-6.6.1/docsSearchFiles
) ch.heigvd.iict.dmg.demo "
- + " ) ") "
"('
, +stopword8. "
"
& +8. "
"
9 8/"
"8
: +"stopword
82(
E. Using Luke
Luke;"!""
Lucene ! !(
;Luke
indexDemo
"-
'<"'333+=3)3
544>6?#00@
"
F. Indexing and Searching the CACM collection
A" Lucene
%
.
"!' publication id!
authors 56! title summary 56
?$@"
"
<) !"(!
"
Indexing
BLucene
, ;StandardAnalyzer
2. -
"(author!
titlesummary
9 <
Lucene
: =publication id "
(
5. Lucene))Lucene
'
'33334747,33333
3 " id! title!
summaryauthor
4 A
8/ ) Lucene
5) FieldType6; Luke )
')TODO student
)""(
Using different Analyzers
Lucene%"
"%
, "%'
WhitespaceAnalyzer
EnglishAnalyzer
ShingleAnalyzerWrapper5%&6
ShingleAnalyzerWrapper5%96
StopAnalyzer""
common_words.txt
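To make the two ShingleAnalyzerWrapper variants more concrete, the following is a minimal sketch of what word shingles look like, assuming the two configurations use shingle sizes 2 and 3. It is a stand-alone illustration with no Lucene dependency: the class `ShingleSketch` and its `shingles` method are hypothetical helpers, not part of the Lucene API, and the real analyzer may also emit the single tokens alongside the shingles.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not Lucene code): builds word shingles of a fixed size.
final class ShingleSketch {

    // Returns every run of `size` consecutive tokens, joined by a single space.
    static List<String> shingles(List<String> tokens, int size) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + size <= tokens.size(); i++) {
            out.add(String.join(" ", tokens.subList(i, i + size)));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("information", "retrieval", "with", "lucene");
        System.out.println(shingles(tokens, 2)); // shingles of size 2
        System.out.println(shingles(tokens, 3)); // shingles of size 3
    }
}
```

Larger shingle sizes produce more distinct terms, which is one reason the index statistics are expected to differ between the two configurations.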
;"
& #) Luke
"'
,>(
% )
(
Reading Index
Luke"
< Lucene
!HighFreqTerms"
""('
, A "8 /"
38
& #,>"(
Searching
B EnglishAnalyzer 2
( "
" ; QueryParser
%("Lucene
! ( compiler program
"'
*'
9,CD'2EBB5,&::>:&D6
,:FD'B(B0#5,,FF4F4F6
&4F&'B22#25,,&>&9>46
,,C9';+2+5,>D4D:4F6
,:4F'1"+5>DDF&9:&F6
,DCC'5>DDF&9:&F6
,4:G'AEB0;AEBB125>DD>G4D,6
,&9G'2+25>D&:F&F&6
&D::'*2E25>D&:F&F&6
&D&9'/0#+"5>D&>>,,D6
.' publication id HI'IH title HI5IH Lucene
scoreHI6I
A " (
summary'
, IBI
& IIIBI
9 IBI !
III+I
: "II
F II IBI
5 F6
( ( QueryParser!
,>
Lucene( '
'33334747,3(3333
(33)0J)7
Tuning the Lucene Score
%
Lucene, 4!
Okapi BM251. <&Lucene
0+(*LuceneK
'
'33334747,333333
3+*
Lucene"'
"L(L(!LL!LL!'
'("5'L
\( \sqrt{\text{freq}} \)
6$
:(""
5'L
\( \log\left(\frac{\text{numDocs}}{\text{docFreq} + 1}\right) + 1 \)
6$
'!(("?L?
5!6
'(
+'L
\( \frac{\text{overlap}}{\text{maxOverlap}} \)
(:(%$
(K"!
K"
'L '
oL0LM56L
oL0"
")!
*
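As a rough illustration of the default factors quoted above (the square root of freq, log(numDocs / (docFreq + 1)) + 1, and overlap / maxOverlap), here is a minimal stand-alone sketch. The method names mirror the tf, idf and coord methods of ClassicSimilarity, but `ClassicFactorsSketch` is a hypothetical helper class with no Lucene dependency, and the formulas are taken from this handout rather than from the Lucene source.

```java
// Hypothetical helper (not Lucene code): the default scoring factors as
// written in this handout, for experimenting with concrete numbers.
final class ClassicFactorsSketch {

    // Term frequency factor: square root of the raw frequency.
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Inverse document frequency: log(numDocs / (docFreq + 1)) + 1.
    static float idf(long docFreq, long numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // Coordination factor: fraction of query terms matched by the document.
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    public static void main(String[] args) {
        System.out.println("tf(4)        = " + tf(4f));
        System.out.println("idf(9, 1000) = " + idf(9, 1000));
        System.out.println("coord(1, 2)  = " + coord(1, 2));
    }
}
```

Plugging in small values this way makes it easier to see how a rare term (low docFreq) raises idf while repeated occurrences raise tf only sub-linearly.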
1 M&F''33")3")3E)7M&F
2 See
https://lucene.apache.org/core/6_6_1/core/org/apache/lucene/index/IndexWriterConfig.html#setSimilarity-org.apache.lucene.search.similarities.Similarity-
and
https://lucene.apache.org/core/6_6_1/core/org/apache/lucene/search/IndexSearcher.html#setSimilarity-org.apache.lucene.search.similarities.Similarity-
"0+"'
, 2'
org.apache.lucene.search.ClassicSimilarity
& E'
LL5L(6
LL5L(!LL+6
LL5L!LL E6
Note that search time is too late to modify this norm part of scoring.
You need to re-index the documents using your specialized similarity
class that implements computeNorm().
9 ; "
'
'
\( 1 + \log(\text{freq}) \)
'5
\( \left(\frac{\text{numDocs}}{\text{docFreq} + 1}\right) + 1 \)
',
'
\( \sqrt{\frac{\text{overlap}}{\text{maxOverlap}}} \)
: * *
*5*6 IndexWriterIndexSearcher ;
EnglishAnalyzer
F 2("ClassicSimilarity
" "
ClassicSimilarity " )3 ,>
+"
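The two modified factors from item 3 that read unambiguously (1 + log freq, and the square root of overlap / maxOverlap) can be sketched as below. This is a stand-alone approximation, not the expected solution: `ModifiedFactorsSketch` is a hypothetical class with no Lucene dependency, the zero-frequency guard in `tf` is an added assumption, and in the lab itself these bodies would go into methods overridden in a subclass of ClassicSimilarity.

```java
// Hypothetical helper (not Lucene code): two of the modified scoring factors.
final class ModifiedFactorsSketch {

    // Modified term frequency: 1 + log(freq).
    // Assumption: return 0 for a non-positive frequency to avoid log(0).
    static float tf(float freq) {
        return freq > 0 ? (float) (1.0 + Math.log(freq)) : 0f;
    }

    // Modified coordination factor: sqrt(overlap / maxOverlap).
    static float coord(int overlap, int maxOverlap) {
        return (float) Math.sqrt(overlap / (double) maxOverlap);
    }

    public static void main(String[] args) {
        System.out.println("tf(1)       = " + tf(1f));
        System.out.println("tf(10)      = " + tf(10f));
        System.out.println("coord(1, 4) = " + coord(1, 4));
    }
}
```

Compared with the default factors, the logarithmic tf dampens high term counts more strongly, while the square root in coord penalises partial matches less.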