The Structure of the SearchEngine System Tutorial: Instructions

CAB431 Tutorial (Week 3): Pre-Processing: Parsing, Tokenizing and Stopping Words Removal
********************************************************
TASK 1: Parsing - read files from RCV1v2, find the documentID and record it to a collection of BowDocument objects.
• The documentID is simply assigned by the ‘itemid’ attribute of the document’s <newsitem> element.
• In this step, the created BowDocument can be initialised with the found documentID and an empty Map (e.g., dictionary or HashMap) of key-value pairs of (String term : int frequency).
• Build up a collection of BowDocument objects for the given dataset; this collection should be a map structure with documentID as key and BowDocument object as value.
• Create a method (or function) to print out all documentIDs by iterating over the above collection and calling BowDocument’s method getDocId() (a sketch follows below).
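One possible way to sketch this step in Java is shown below. It assumes the RCV1v2 documents sit as individual .xml files in one folder and that each document's ID appears as the itemid attribute of its <newsitem> element; the class name Task1Parser and its method names are illustrative only, and BowDocument is the wrapper class described in the appendix.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Task1Parser {

    // Matches the itemid attribute of the <newsitem> element, e.g. <newsitem itemid="741200" ...>
    private static final Pattern ITEM_ID = Pattern.compile("<newsitem[^>]*\\bitemid=\"(\\d+)\"");

    /** Parse every .xml file in dataDir and build the documentID -> BowDocument map. */
    public static Map<String, BowDocument> parseCollection(String dataDir) throws IOException {
        Map<String, BowDocument> collection = new HashMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(Paths.get(dataDir), "*.xml")) {
            for (Path file : files) {
                String xml = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
                Matcher m = ITEM_ID.matcher(xml);
                if (m.find()) {
                    String docId = m.group(1);
                    // The new BowDocument starts with the found documentID and an empty term map
                    collection.put(docId, new BowDocument(docId));
                }
            }
        }
        return collection;
    }

    /** Print all documentIDs by iterating the collection and calling getDocId(). */
    public static void printAllDocIds(Map<String, BowDocument> collection) {
        for (BowDocument doc : collection.values()) {
            System.out.println(doc.getDocId());
        }
    }
}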

TASK 2: Tokenizing – update the Task 1 program to fill the term:freq map for every document.
• You only need to tokenize the ‘<text>’ part of the document; exclude all tags, and discard punctuation and numbers.
• Use addTerm() of BowDocument to add a new term to the term map, or to increase the term frequency when the term occurs again.
• Create a method displayDocInfo(int aDocId) to display the term list for a given documentID, by searching the collection of BowDocument from Task 1 and calling getTermFreqMap() of the found document (see the sketch after this task). The output should look like:

Doc docId has termCount different terms:
Term1, 3
Term2, 1
Term3, 4
….


• Please think about the terms with high frequency, which may be useful words for describing the document in the future.
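The sketch below shows one possible way to do the tokenizing and the displayDocInfo() output. It assumes the body text of each RCV1v2 file sits inside a <text> element and that getTermFreqMap() returns a term-to-frequency map as in the appendix skeleton; the task declares displayDocInfo(int aDocId), but this sketch takes the ID as a String so that it matches BowDocument's String docId. The class name Task2Tokenizer is illustrative.

import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Task2Tokenizer {

    // Grabs everything between <text> and </text>
    private static final Pattern TEXT_BLOCK = Pattern.compile("<text>(.*?)</text>", Pattern.DOTALL);

    /** Tokenize the <text> part of one raw XML document and add every token to the BowDocument. */
    public static void tokenize(String xml, BowDocument doc) {
        Matcher m = TEXT_BLOCK.matcher(xml);
        if (!m.find()) {
            return;
        }
        String text = m.group(1)
                .replaceAll("<[^>]+>", " ")      // exclude remaining tags such as <p> ... </p>
                .replaceAll("[^A-Za-z]+", " ");  // discard punctuation and numbers
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                doc.addTerm(token);              // new term, or frequency + 1 if seen before
            }
        }
    }

    /** Print one document's term list in the "Doc docId has termCount different terms" format. */
    public static void displayDocInfo(String aDocId, Map<String, BowDocument> collection) {
        BowDocument doc = collection.get(aDocId);
        if (doc == null) {
            System.out.println("No document with ID " + aDocId);
            return;
        }
        Map<String, Integer> termFreqMap = doc.getTermFreqMap();
        System.out.println("Doc " + aDocId + " has " + termFreqMap.size() + " different terms:");
        for (Map.Entry<String, Integer> entry : termFreqMap.entrySet()) {
            System.out.println(entry.getKey() + ", " + entry.getValue());
        }
    }
}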

TASK 3: Stopping words – use the given stopping words list to ignore/remove all stopping words from the term lists of documents.
• Download the stopping words list from QUT Blackboard, read through it first, and compare it with your notes of high-frequency terms.
• Update your program to read in the given stopping words list and store it in a list stopWordsList.
• Update your program so that, when adding a term to the term map, it checks whether the term exists in the stopping words list and ignores the term if it does (see the sketch after this list).
• Call the method displayDocInfo() again and compare the output with Task 2.
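A sketch of the stopping-words step follows, assuming the Blackboard file lists one stopping word per line (adjust the parsing if the file is comma-separated instead); the class and method names here are illustrative.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class Task3Stopping {

    /** Read the stopping words file into a set for fast lookup. */
    public static Set<String> loadStopWords(String stopWordsFile) throws IOException {
        Set<String> stopWordsList = new HashSet<>();
        for (String line : Files.readAllLines(Paths.get(stopWordsFile), StandardCharsets.UTF_8)) {
            String word = line.trim().toLowerCase();
            if (!word.isEmpty()) {
                stopWordsList.add(word);
            }
        }
        return stopWordsList;
    }

    /** Add the token to the document only when it is not in the stopping words list. */
    public static void addTermIfNotStopped(String token, BowDocument doc, Set<String> stopWordsList) {
        if (!stopWordsList.contains(token)) {
            doc.addTerm(token);
        }
    }
}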

TASK 4: Sort and display the document term:freq list by term and/or by frequency. You may do this after you have finished and passed all of the above three tasks in the week 3 tutorial (a sketch follows below).
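One way to sketch the sorting is shown below: it orders the term:freq pairs by descending frequency, breaking ties alphabetically, and prints them in the same format as the example output that follows. The class name Task4Sorting is illustrative.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class Task4Sorting {

    /** Print one document's term:freq pairs, most frequent first, ties broken alphabetically. */
    public static void displaySortedByFreq(BowDocument doc) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(doc.getTermFreqMap().entrySet());
        entries.sort((a, b) -> {
            int byFreq = b.getValue().compareTo(a.getValue());              // higher frequency first
            return byFreq != 0 ? byFreq : a.getKey().compareTo(b.getKey()); // then by term
        });

        int totalWords = 0;
        for (Map.Entry<String, Integer> entry : entries) {
            totalWords += entry.getValue();
        }
        System.out.println("Document " + doc.getDocId() + " contains " + entries.size()
                + " terms and has a total of " + totalWords + " words.");

        for (Map.Entry<String, Integer> entry : entries) {
            System.out.println(entry.getKey() + ":" + entry.getValue());
        }
    }
}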


Examples of output
Document 741200 contains 50 terms and has a total of 100 words.
quot:4
lead:3
car:3
german:2
over:2
soper:2
victori:2
dalma:1
merced:1
steve:1
austrian:1
fifth:1
han:1
downpour:1
handl:1
…

Document 741000 contains 42 terms and has a total of 110 words.
under:7
over:4
tee:2
peter:2
trinidad:1
par:1
lehman:1
scotland:1
sunday:1
darren:1
mark:1
…


APPENDIX
Create a “Wrapper Class” of the Bag-of-Words representation of a document.

The BowDocument class should have properties of documentID and a HashMap in which terms are keys and their frequencies are values. The BowDocument class should have the following methods (functions) besides a constructor of the BowDocument class:
private String docId;
private HashMap<String, Integer> termFreqMap;

/**
 * Constructor
 * Set the ID of the document, and initiate an empty term:frequency map.
 * call addTerm to add terms to map
 * @param docId
 */
public BowDocument(String docId){
    //your code here
}

/**
 * Add a term occurrence to the BOW representation
 * @param term
 */
public void addTerm(String term){
    //your code here
}

/**
 * @param term
 * @return the term occurrence count for the given term
 * return 0 if the term does not appear in the document
 */
public int getTermCount(String term){
    //your code here
}

/**
 * @return sorted list of all terms occurring in the document
 */
public ArrayList<String> getSortedTermList(){
    //your code here
}

/**
 * @return map of term:freq pairs.
 */
public HashMap<String, Integer> getTermFreqMap(){
    //your code here
}

/**
 * @return the ID of this document
 */
public String getDocId(){
    //your code here
}
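For reference, one possible completion of the skeleton is sketched below; it is not the only reasonable implementation. getSortedTermList() is assumed here to mean an alphabetically sorted list of the terms, and getOrDefault() is just one convenient way to write addTerm() and getTermCount().

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;

public class BowDocument {

    private String docId;
    private HashMap<String, Integer> termFreqMap;

    /** Set the ID of the document and initiate an empty term:frequency map. */
    public BowDocument(String docId) {
        this.docId = docId;
        this.termFreqMap = new HashMap<>();
    }

    /** Add a term occurrence: insert with count 1, or increase the existing count by one. */
    public void addTerm(String term) {
        termFreqMap.put(term, termFreqMap.getOrDefault(term, 0) + 1);
    }

    /** Return the occurrence count for the given term, or 0 if it does not appear. */
    public int getTermCount(String term) {
        return termFreqMap.getOrDefault(term, 0);
    }

    /** Return an alphabetically sorted list of all terms occurring in the document. */
    public ArrayList<String> getSortedTermList() {
        ArrayList<String> terms = new ArrayList<>(termFreqMap.keySet());
        Collections.sort(terms);
        return terms;
    }

    /** Return the map of term:freq pairs. */
    public HashMap<String, Integer> getTermFreqMap() {
        return termFreqMap;
    }

    /** Return the ID of this document. */
    public String getDocId() {
        return docId;
    }
}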
