RIM Laboratoire 3 Lab 1 Instructions

Méthodes d’accès aux données 2018 - 2019
Prof. Nastaran Fatemi
Gary Marigliano

Indexing and Search with Apache Lucene
Lab Nº 1
A. Objectives

The goal of this lab is to discover the Lucene platform and to learn its
functionalities by using its Java API. Lucene is a library for indexing and
searching text files, written in Java and available as open source under the
Apache License. It is not a standalone application; it is designed to be
integrated easily into applications that have to search text in local files or
on the Internet. It attempts to balance efficiency, flexibility, and
conceptual simplicity at the API level.

B. Organization

The lab should be carried out in groups of at most 2 students.
Report: You must submit a report containing both the answers to the
questions asked in the different sections and the source code of your
implementation.
Deadline: See deadline on Moodle.

C. Import the project

Download the lab 1 Maven project from Moodle and import it into your
favorite IDE. The required libraries will be downloaded automatically.

D. Understanding the Lucene API

HEIG-VD - TIC

Implementing a full-text search application using Lucene requires two
steps: (1) creating a Lucene index on the documents and/or database
objects and (2) parsing the user query and looking up the pre-built index
to answer the query.
Follow the Lucene Demo API at:
http://lucene.apache.org/core/6_6_1/demo/overview-summary.html
to build an index of the directory lucene-6.6.1/docs. The SearchFiles class
is already in the package ch.heigvd.iict.dmg.demo in the downloaded
project. Do a few simple searches to make sure it works. Answer the
following questions:
1. Does the command line demo use stopword removal? Explain how
you find out the answer.
2. Does the command line demo use stemming? Explain how you find
out the answer.
3. Is the search of the command line demo case insensitive? How did
you find out the answer?
4. Does it matter whether stemming occurs before or after stopword
removal? Consider this as a general question.

E. Using Luke
Luke is a GUI tool written in Java, which lets you browse the contents of a
Lucene index, examine individual documents, and run queries over the
index. Use Luke to examine the content of the index created in the
previous section. This index is located in the folder called indexDemo in the
downloaded project.
Note: You can download Luke from https://github.com/DmitryKey/luke/
(release 6.6.0). The file “Lucene-Installation-Instructions” also contains
some information on how to use it.

F. Indexing and Searching the CACM collection

We are now going to employ the full text search provided by Lucene to
analyze a publication list. A text file containing a list of publications is
provided to you in the file cacm.txt. Each line in the text file contains the
following information, separated by tabs: the publication id, the
authors (if any), the title and the summary (if any). The authors of a
publication are separated by “;”. There might be publications without any
author or without the summary field.
Your task is to index the publication list, to perform a few queries, and to
report the results in the following subsections.
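
As a starting point, the line format described above could be parsed as in the following minimal sketch. It uses only the Java standard library. The class name CacmLineParser and the Publication holder are hypothetical and not part of the provided project. The sketch also assumes that a missing authors field is left as an empty column and that a missing summary may be omitted entirely.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits one line of cacm.txt into its fields. */
final class CacmLineParser {

    /** Simple value holder for one publication (illustrative only). */
    static final class Publication {
        final String id;
        final List<String> authors;
        final String title;
        final String summary; // empty string when the field is absent

        Publication(String id, List<String> authors, String title, String summary) {
            this.id = id;
            this.authors = authors;
            this.title = title;
            this.summary = summary;
        }
    }

    static Publication parse(String line) {
        // A limit of -1 keeps trailing empty fields instead of dropping them.
        String[] parts = line.split("\t", -1);
        String id = parts[0].trim();

        // Authors are separated by ";"; the field may be empty.
        List<String> authors = new ArrayList<>();
        if (parts.length > 1) {
            for (String author : parts[1].split(";")) {
                String name = author.trim();
                if (!name.isEmpty()) {
                    authors.add(name);
                }
            }
        }

        String title = parts.length > 2 ? parts[2].trim() : "";
        String summary = parts.length > 3 ? parts[3].trim() : "";
        return new Publication(id, authors, title, summary);
    }
}
```

Each parsed author can then be added as a separate value of the author field, since a Lucene document may hold several values for the same field.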

Indexing
Refer to the slides of the Lucene course for an example of indexing.
1. Use StandardAnalyzer for this part.
2. Import the Maven project and implement a program that creates an
index of the publication list and allows search queries on the author,
title and summary attributes.
3. You need to define one field for each attribute. Note that a
document in Lucene can have multiple values for a given field. For
example the author field accepts multiple author values.
4. Keep the publication id in the index and show it in the result set of
your queries.
5. Lucene provides different types of fields. Take a look at the Lucene
Field class:
http://lucene.apache.org/core/6_6_1/core/org/apache/lucene/document/Field.html
and explain which field type can be used for id, title, summary and
author.
6. What should be added to the code to have access to the term vector
in the index? Have a look at the Lucene slides presented in your
course (look at different methods of FieldType). Use Luke to check
that the term vector is included in the index.
Attach the code of indexing to your report. Tip: look at the TODO student
comments to know where to implement the requested features.

Using different Analyzers
Lucene provides different analyzers to process a document. In this lab we
will use the same analyzer for indexing and for searching.
1. Index the publication list using each of the following analyzers:
• WhitespaceAnalyzer
• EnglishAnalyzer
• ShingleAnalyzerWrapper (using shingle size 2)
• ShingleAnalyzerWrapper (using shingle size 3)
• StopAnalyzer with a custom stop list. A list of common words is
provided in the file common_words.txt of the publication
dataset. Use this list as stopwords.
2. Look at the index using Luke and for each created index find out the
following information:
a. The number of indexed documents and indexed terms.
b. The number of indexed terms in the summary field.
c. The top 10 frequent terms of the summary field in the index.
d. The size of the index on disk.
e. The required time for indexing.
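
To build intuition for what a shingle size of 2 or 3 means, the following sketch produces word shingles (runs of consecutive tokens) from a token list using plain Java. It is not Lucene's implementation, and the class name WordShingles is hypothetical. Depending on its configuration, Lucene's ShingleAnalyzerWrapper may also emit the single tokens, which this sketch does not do.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of word shingles; not Lucene's ShingleFilter. */
final class WordShingles {

    /** Returns every run of exactly `size` consecutive tokens, joined by a space. */
    static List<String> shingles(List<String> tokens, int size) {
        if (size < 1) {
            throw new IllegalArgumentException("size must be >= 1");
        }
        List<String> result = new ArrayList<>();
        // Slide a window of `size` tokens over the input.
        for (int start = 0; start + size <= tokens.size(); start++) {
            result.add(String.join(" ", tokens.subList(start, start + size)));
        }
        return result;
    }
}
```

For the tokens "information", "retrieval", "system", size 2 gives "information retrieval" and "retrieval system", while size 3 gives the single shingle "information retrieval system". This is why the number of indexed terms can be expected to grow with the shingle size.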

Reading Index
Luke is a practical way of getting info on your index that you can use for
verification and debugging. You can also read the index via the Lucene API
to get basic statistics. For this, use the class HighFreqTerms and write the
code to answer the following questions:
1. Which author has the highest number of publications? How many
publications does he/she have?
2. List the top 10 terms in the title field with their frequency.
Attach your code to the report.

Searching
Rebuild your index using EnglishAnalyzer. Complete your program to
perform a search based on a given query and show the results in the
specific format as given in the following example. Use QueryParser for
analyzing the query as shown in the Lucene slides.
For example, the query "compiler program" could return the following
results:
Searching for: compiler program
3189: An Algebraic Compiler for the FORTRAN Assembly Program (1.2440429)
1459: Requirements for Real-Time Languages (1.1556565)
2652: Reduction of Compilation Costs Through Language Contraction (1.1202306)
1183: A Note on the Use of a Digital Computerfor Doing Tedious Algebra and Programming (1.0969465)
1465: Program Translation Viewed as a General Data Processing Problem (0.99523425)
1988: A Formalism for Translator Interactions (0.99523425)
1647: WATFOR-The University of Waterloo FORTRAN IV Compiler (0.9907691)
1237: Conversion of Decision Tables To Computer Programs (0.9245252)
2944: Shifting Garbage Collection Overhead to Compile Time (0.9245252)
2923: High-Level Data Flow Analysis (0.9200119)

Each result is formatted as:
publication id + ":" + title + "(" + Lucene score + ")".
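
The result format above can be sketched in plain Java as follows. The class name ResultFormatter is hypothetical. The spaces after the colon and before the opening parenthesis follow the example output rather than the literal format description, which is an assumption.

```java
/** Hypothetical helper that renders one search hit as a result line. */
final class ResultFormatter {

    /** Builds "id: title (score)", matching the spacing of the example output. */
    static String format(String publicationId, String title, float score) {
        return publicationId + ": " + title + " (" + score + ")";
    }
}
```

In the search program, the id and title would be read from the stored fields of each hit and the score from the hit itself.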

Write the necessary code to perform the following queries on the
summary field:
1. Publications containing the term "Information Retrieval".
2. Publications containing both "Information" and "Retrieval".
3. Publications containing at least the term "Retrieval" and, possibly
"Information" but not "Database".
4. Publications containing a term starting with "Info".
5. Publications containing the term "Information" close to "Retrieval"
(max distance 5).
For each query provide the text of the query used by QueryParser, the
total number of results and the top 10 results. Attach your code to the
report.

The Lucene querying syntax can be found at:
http://lucene.apache.org/core/6_6_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description.

Tuning the Lucene Score
The goal of this part of the lab is to use a customized formula for the
calculation of the similarity score.
Lucene, starting from version 6, computes a similarity score based on
Okapi BM25 [1]. You can force [2] Lucene to compute a similarity score
based on the TF-IDF of document and query terms. Lucene’s similarity
formula is described here:
http://lucene.apache.org/core/6_6_1/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html.
The scoring function used in Lucene is the following:

$$\mathrm{score}(q,d) = \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q) \cdot \sum_{t \in q} \left( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^2 \cdot \mathrm{boost}(t) \cdot \mathrm{norm}(t,d) \right)$$

where q is the query, d a document, t a term, and:
• tf: a function of the term frequency within the document
(default: $\sqrt{\mathit{freq}}$);
• idf: inverse document frequency of t within the whole collection
(default: $\log\left(\frac{\mathit{numDocs}}{\mathit{docFreq}+1}\right) + 1$);
• boost: the boosting factor, if required in the query with the “^” operator
on a given field (if not specified, the default boost is used);
• coord: overlapping rate of terms of the query in the given document.
Default: $\frac{\mathit{overlap}}{\mathit{maxOverlap}}$;
• queryNorm: query normalization according to its length; it corresponds
to the sum of the squared term weights, and the global value is
multiplied by each term’s weight;
• norm: encapsulates indexing-time boost and length factors:
o Field boost - set by calling field.setBoost() before adding the field
to a document.
o lengthNorm - computed when the document is added to the index in
accordance with the number of tokens of this field in the document,
so that shorter fields contribute more to the score. lengthNorm is
computed by the Similarity class in effect at indexing.

[1] BM25: https://en.wikipedia.org/wiki/Okapi_BM25
[2] See https://lucene.apache.org/core/6_6_1/core/org/apache/lucene/index/IndexWriterConfig.html#setSimilarity-org.apache.lucene.search.similarities.Similarity- and
https://lucene.apache.org/core/6_6_1/core/org/apache/lucene/search/IndexSearcher.html#setSimilarity-org.apache.lucene.search.similarities.Similarity-

To define the new score based on TF-IDF, proceed as follows:
1. Create a custom similarity class that inherits from the class:
• org.apache.lucene.search.similarities.ClassicSimilarity

2. Override and implement default similarity functions:
• public float tf(float freq)
• public float idf(long docFreq, long numDocs)
• public float coord(int overlap, int maxOverlap)
Note that search time is too late to modify this norm part of scoring.
You need to re-index the documents using your specialized similarity
class that implements computeNorm().
3. Use the following functions to implement the similarity functions
mentioned above:
• tf: $1 + \log(\mathit{freq})$
• idf: $\log\left(\frac{\mathit{numDocs}}{\mathit{docFreq}+1}\right) + 1$
• lengthNorm: $1$
• coord: $\sqrt{\frac{\mathit{overlap}}{\mathit{maxOverlap}}}$

4. Set the custom similarity in Indexer and Searcher using the method
setSimilarity(mySimilarity) of IndexWriterConfig and IndexSearcher. Use
the EnglishAnalyzer as before.
5. Run the query "compiler program" with both ClassicSimilarity and the
new similarity function using the above parameters, and show the
ClassicSimilarity and new rankings/scores for the top 10 publications.
Describe the effect of using the new parameters.
Attach your code to the report.
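
The custom formulas listed in step 3 can be checked numerically with the plain-Java sketch below before they are placed in a similarity subclass. The class name CustomScoring is hypothetical, and the methods are simple static functions rather than overrides of Lucene methods. The sketch assumes that log denotes the natural logarithm, and the guard returning 0 for a non-positive frequency is an added assumption that is not part of the formulas above.

```java
/** Hypothetical stand-alone version of the custom scoring formulas. */
final class CustomScoring {

    /** tf = 1 + log(freq); returns 0 for freq <= 0 (assumed guard). */
    static float tf(float freq) {
        return freq > 0 ? (float) (1.0 + Math.log(freq)) : 0f;
    }

    /** idf = log(numDocs / (docFreq + 1)) + 1, with the natural logarithm. */
    static float idf(long docFreq, long numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    /** lengthNorm = 1: the field length no longer influences the score. */
    static float lengthNorm() {
        return 1f;
    }

    /** coord = sqrt(overlap / maxOverlap). */
    static float coord(int overlap, int maxOverlap) {
        return (float) Math.sqrt(overlap / (double) maxOverlap);
    }
}
```

Compared with the defaults described earlier, the logarithmic tf grows more slowly than the square root for frequent terms, and the square-root coord penalizes partial matches less. Both effects are worth looking for when comparing the two rankings.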


