RIF Data Loader Manual

User Manual: Pdf

Open the PDF directly: View PDF PDF.
Page Count: 19

DownloadRIF Data Loader Manual
Open PDF In BrowserView PDF
The Rapid Inquiry Facility (RIF)
Version 4.0
Data Loader user guide

Authors (2017):
Parkes, B….
Small Area Health Statistics Unit (SAHSU)
MRC-PHE Centre for Environment and Health
Department of Epidemiology and Biostatistics
School of Public Health
Imperial College London
Medical Faculty Building
St Mary's Campus, Norfolk Place
LONDON W2 1PG
Website www.sahsu.org

Contents
The Rapid Inquiry Facility (RIF)................................................................................................................ 1
Version 4.0 .............................................................................................................................................. 1
Data Loader user guide ........................................................................................................................... 1
1. Introduction to the Data Loader ......................................................................................................... 3
1.1 Purpose ......................................................................................................................................... 3
1.2 Overview Schematic...................................................................................................................... 3
2. Requirements ...................................................................................................................................... 3
3. How to use .......................................................................................................................................... 4
3.1 Configuring properties .................................................................................................................. 4
3.2 Starting up ..................................................................................................................................... 5
3.3. Step 1: Define Geographies ......................................................................................................... 6
3.4 Step 2: Define Health Themes ...................................................................................................... 7
3.5 Step 3: There is no step 3! ............................................................................................................ 7
3.6 Step 4: Define Custom Data Types (Optional) .............................................................................. 7
3.7 Step 5: Define Data Importing Hints (Optional) ............................................................................ 8
3.8 Step 6: Define Denominators ...................................................................................................... 10
3.8 Step 7: Define Numerators ......................................................................................................... 11
3.9 Step 8: Define Covariates ............................................................................................................ 12
3.10 Configuration editor dialog ....................................................................................................... 13
3.10.1 Data Field Properties. 1. Extract Phase .............................................................................. 14
3.10.2 Data Field Properties. 2. Clean Phase ................................................................................ 15
3.10.3 Data Field Properties. 3. Convert Phase ............................................................................ 16
3.10.3 Data Field Properties. 4. Optimise Phase........................................................................... 16
3.10.3 Data Field Properties. 5. Check Phase ............................................................................... 17
3.11 Running a data load .................................................................................................................. 17
3.12 Database functions for tmp_sahsu_db ..................................................................................... 18

1. Introduction to the Data Loader
The Data Loader tool is a standalone utility to import health and population data from different
formats. It is an example for an Extract Transform Load (ETL) tool that can massage different data
sources into forms that are needed by the main RIF database. The data loader is used in parallel with
the RIF Tile Maker, a separate utility for converting and importing map data into the RIF database.

1.1 Purpose
The data loader tool transforms imported data sets into cleaned files that can then be loaded into
the RIF production database using simple scripts it makes for both PostgreSQL and SQL Server
databases. The data loader tool itself uses a temporary database to provide a means of iterating
through transformation steps that use temporary tables. Because the output of the data loader
includes both a finished version of an imported data set and scripts that could load it into the RIF
production database, the data loader’s database can itself be viewed as a temporary artefact. Once
data managers have processed all the files they want to load into the RIF database, they can elect to
delete the data loader’s temporary database.

1.2 Overview Schematic

This schematic gives an overview of the processes, data-files and components involved in the data
loader process.

2. Requirements
The RIF Data Loader is a standalone application written in Java. It uses the Java Swing user interface
library. The Data Loader is supplied as a JAR file (rifDataLoaderTool.jar) that contains all the
necessary dependencies.

The data loader needs read/write/update access to a temporary database, either MS SQL server or
Postgres

3. How to use
3.1 Configuring properties
The properties file for the data loader tool is named RIFDataLoaderToolStartupProperties.properties
which resides in C:\GitHub\rapidInquiryFacility\rifDataLoaderTool\src\main\resources. The
properties file contains various configuration options such as which database to use and the
database login details. The following table lists and describes the properties that can be configured.
Table 1. RIFDataLoaderToolStartupProperties.properties.
Property
databaseType

Valid values
(separated by ;)
ms;pg

databasePasswordFile

A valid path to an
existing password file

pg.driverClassName

org.postgresql.Driver

pg.jdbcDriverPrefix

jdbc:postgresql

pg.host

localhost; server name

pg.port

5432

pg.databaseName

tmp_sahsu_db

ms.driverClassName
ms.jdbcDriverPrefix

com.microsoft.sqlserver
.jdbc.SQLServerDriver
jdbc:sqlserver

ms.host

localhost; server name

ms.port

1433

ms.databaseName

tmp_sahsu_db

Description
Specifies which temporary database type to use.
ms = Microsoft SQL Server; pg = Postgres
Separate folders using double forward slashes.
E.g.
C://rif_scripts//db//RIFDatabaseProperties.txt
Only relevant when using Postgres db. Type of
Postgres driver to use
Only relevant when using Postgres db. Prefix of
jdbc driver
Only relevant when using Postgres db. Name of
server hosting Postgres database
Only relevant when using Postgres db. Port to
use to connect to Postgres database
Only relevant when using Postgres db. Name of
temporary database to use.
Only relevant when using MS SQL Server. Type
of SQL server driver to use
Only relevant when using MS SQL server. Prefix
of jdbc driver
Only relevant when using MS SQL server. Name
of server hosting database
Only relevant when using MS SQL server. Port to
use to connect to database
Only relevant when using MS SQL server. Name
of temporary database to use.

The Data Loader read a file which contains the userid and password to access the temporary
database used by the Data Loader, the location and name of the file is defined by
databasePasswordFile set in the properties file described above. The contents of the file should
contain a valid username and password to access the temporary database used by the dataloader.
E.g:
userID=postgres

password=XXXXXXX

3.2 Starting up
When the data loader is started up, the following screen is loaded:
Figure 1. RIF data loader tool, main screen

At any point during a data loader session, all the configuration settings can be saved in an XML file
by clicking File-Save As:

Previously defined configuration data can be loaded using File->Load… and opening an XML file:

The process of configuring the data loader is divided into 8 steps:

3.3. Step 1: Define Geographies
Click the ‘Browse’ button under the ‘define geographies’ area and select an appropriate XML file that
defines the geographies used by the data:

Once an xml file is selected in the ‘Geography Editor Dialog’, a summary of the geographies is
displayed. Press ‘OK’ if the geographies file is satisfactory.

3.4 Step 2: Define Health Themes
Once the geographies XML file has been selected, the ‘Add’ button is enabled in the ‘Step 2’ section.
This brings up the ‘Health Theme Editor Dialog’:

Here the use enters the name and description of the health theme being defined by the data being
loaded.

3.5 Step 3: There is no step 3!
3.6 Step 4: Define Custom Data Types (Optional)
Once a health theme has been defined, the ‘Edit’ button is enabled in step 4 allowing the user to edit
and add to the 11 pre-defined data types:

Only use this feature if you are confident in the changes required. New data fields may require new
database functions to be written to either clean or validate the data for the new data type.
Examples of database types that might be created include: maternal age (which could have a
minimum maximum plausible values defined); birth weight (minimum and maximum values).

3.7 Step 5: Define Data Importing Hints (Optional)
Once a health theme has been defined, the ‘Edit’ button is enabled in step 5 allowing the user
defined configuration hints:

When the Data Loader imports a CSV file, it tries to guess table and field properties based on parts of
the name of the imported file and the names of the CSV field columns. Here you can define regular
expression patterns so that when the fields are imported, the software can make better guesses.
The more configuration options that the system supports, the more work this can cause the data
manager. The data manager might have over 100 separate controls to adjust to configure the CSV
file, consequently the data loader includes a hint feature which allows it set intelligent default values
for fields.
The hints are based on naming conventions of data sets and fields. Data managers can associate
regular expression patterns with default values of general data set properties. These include:
version, a description and the target area of the RIF production database. Data managers can also
define regular expression patterns to match CSV field names with field properties.
As an example of a data set hint, the regular expression ^cancer.* could be used to set the target
area of the CSV file cancer_data_2012.csv so that it is set to “Health Numerator Data”. As an
example of a field hint, .*year.* could be used to set the data type to RIF data type “Year” for any
field name that contains ‘year’. A field hint of ^year$ would be more specific, meaning that the exact
field name was year. In this case, we may decide that this is a required field for the RIF, whose
numerator tables expect to have a field by that name.
Field configuration hints are defined under the ‘Data Set Field Configuration Hints tab:

The field hint editor screen sets the default values for the actual data set fields. See section 3.10 to
describe the phases and fields in this screen.

3.8 Step 6: Define Denominators
The ‘Add’ button in step 6 is enabled when the health theme has been defined (step 2). Clicking the
add button brings up a file selection box prompting the user to select a CSV file that contains tabular
denominator data. The delimiter is a comma by default, but other characters can be selected in the
denominator field. Once a suitable denominator file has been selected, the ‘preview’ button must be
clicked before the denominator file will be loaded:

If the data previewed in the selection dialog looks satisfactory, the ‘OK’ button brings up the
‘Denominator Configuration Editor Dialog’:

The dialog screen is used (with different titles) for editing all of the numerator, denominator and
confounder data sets; its functionality and validations are described for all three purposes in section
3.10.

3.8 Step 7: Define Numerators
The ‘Add’ button in step 6 is enabled when the denominators have been successfully defined (step
6). Clicking the add button brings up a file selection box prompting the user to select a CSV file that
contains tabular numerator data. The delimiter is a comma by default, but other characters can be
selected in the delimiter field. Once a suitable numerator file has been selected, the ‘preview’
button must be clicked before the numerator file will be loaded:

If the data previewed in the selection dialog looks satisfactory, the ‘OK’ button brings up the
‘Numerator Configuration Editor Dialog’:

The dialog screen is used (with different titles) for editing all of the numerator, denominator and
confounder data sets; its functionality and validations are described for all three purposes in section
3.10.

3.9 Step 8: Define Covariates
The ‘Add’ button in step 8 is enabled when the denominators have been successfully defined (step
6). Clicking the add button brings up a file selection box prompting the user to select a CSV file that
contains tabular numerator data. The delimiter is a comma by default, but other characters can be
selected in the delimiter field. Once a suitable covariates file has been selected, the ‘preview’ button
must be clicked before the covariates file will be loaded:

If the data previewed in the selection dialog looks satisfactory, the ‘OK’ button brings up the
‘Covariate Configuration Editor Dialog’:

The dialog screen is used (with different titles) for editing all of the numerator, denominator and
confounder data sets; its functionality and validations are described for all three purposes in section
3.10.

3.10 Configuration editor dialog
The configuration edit dialog used for editing denominator, numerator and covariate data. The title
of the screen and the validation rules change depending on which type pf data is being edited, other
than that the functionality stays the same.

Data Set Name defaults to the data file name without the extension. The name data set name can
be edited.
Version is used to keep track of the version number when several different file sets are employed.
Data Set Description is a descriptive field. Can contain details of where the data set came from etc.
Geography defines the link to the geography defined. Links to the geography meta data XML file.
The Data Set Fields list is populated with the fields defined in the csv file. The properties of each
field must be defined carefully before the data loader will run. As well as the field description (which
is free text), the properties are divided into 5 phases:
3.10.1 Data Field Properties. 1. Extract Phase
Load Field Name. The name of the field when it is first imported into the database. Defaults to the
name of the header in the csv file or it can be auto-generated if no header data is available.
Field Purpose. Drop-down list defining the purpose of the field. For some destination areas of the
RIF schema, processing the data set requires that a field satisfy a role. Selectable values are:
Field Purpose
Other

Covariate

Description
Default setting. No special
meaning in the data, will be
loaded as is.
Used if this field is to be a
covariate.

Geographical resolution

Used when the column defines
the geography of the record.
Examples would be state,
county, district, country

Health code

Used for fields that define a
health condition. For example a
column containing the ICD-10
code would be of type ‘health
code’. Typically used for
numerator data.
Used for fields that contain the
number of subjects. For
example the population in
denominator data, or the
incidence count for numerator
data.

Total count

Validation rules

Covariate data requires at least
one column that serves as a
covariate.
At least one field must be a
Geographical Resolution for
denominator, numerator and
covariate data.
The denominator and
numerators screens will
validate that the geographical
resolution fields match those
defined in the geographies file
(step 1). It has to be an exact
match.
Numerator screen requires at
least one field which is a health
code.

The denominator and
numerator screens will validate
that there is exactly 1 field
whose field purpose is ‘Total
Count’. ‘Total count’ type fields
must have a RIF Data Type of
‘Integer’

Field requirement level. Drop down list defining the level of validation required for this field.
Selectable values are:
•

•
•

Required by RIF – (default value) fields defined with this requirement level must have valid
values to successfully be loaded using the data loader. This means that the convert field
name must be one of a set of expected pre-defined values. Used for most denominator,
numerator and covariate data.
Extra field – will be loaded, but is not validated to contain valid values. . An extra field is one
the RIF manager wants to promote in a data set but which is not required by the RIF.
Ignore field – field is not to be loaded in the RIF and is ignored.

3.10.2 Data Field Properties. 2. Clean Phase
RIF Data Type. Drop-down list defining the data type this field should have in the RIF. Describes a set
of predefined data types, each of which has a data type, a validation aspect and a cleaning aspect.
Selectable values are:
RIF Data Type
Integer

Description
Must contain valid
integer values

Database functions
is_valid_integer

Year

the year that the data
applies to

clean_year

UK Postal Code

UK post code field

Sex

Sex field

clean_uk_postal_code
is_valid_uk_postal_code
convert_age_sex*
Must be exactly 1 field
whose type is ‘Sex’ for
denominator and
numerator data.

Quintilised field

Double

A field with 5 valid
values
Free text field
ASCII text field
ICD code for disease
identification
Numeric field

Age

Age.

Date

Date field

Text
ASCII Text
ICD

clean_icd
(or clean_icd_code)
is_numeric
(or is_valid_double?)
clean_age;
convert_age_sex*

Validation rules
Fields whose purpose
is ‘Total Count’ must
of RIF Data Type
‘Integer’
Must be exactly 1 field
whose type is ‘Year’
for all three data
types.

Must be exactly 1 field
whose type is ‘Age’ for
denominator and
numerator data.

clean_date
(or date
matches_format?)
* The convert_age_sex function combines the age and sex fields and groups the age into 5 year
groups. The format of the age_sex field is: xyy where x defines the sex (1 = m, 2 = f, 3 = other), and

yy represents the age group: 00 to 04 are ages 0 to 4; 05 is age 5 to 9, 06 is age 10 to 14 etc. up to
age 85 and above which is 21. So code 211 is female, aged 35 to 39.
Clean Field Name. The name of the field that the RIF manager would want to use to rename an
imported field. Often it is the same as Load Field Name but it can be different if the load field
names are auto-generated when data are first imported. Drop down field that can be edited (combo
box?), behaviour and validation changes depending on the ‘Field Purpose’:
•

•

•

If Field Purpose is ‘Covariate’, ‘Health Code’ or ‘Other’ the clean field name will default to
the same as the ‘Load Field Name’ (or the value set in the ‘Data set Field configuration Hints’
section if it exists) and can be edited.
If field Purpose is geographical resolution, the clean field name drop-down will be populated
with the geography display names defined in the geographies XML file, and the field cannot
be edited. The field is validated such that clean field names for geographies exactly match
the values in the XML file
If field purpose is ‘Total Count’ the clean field name will default to ‘total’ (or the value set in
the ‘Data set Field configuration Hints’ section if it exists) and can be edited.

Field Change Audit Level. Drop down that defines the level of auditing the data loader is to do when
loading the data. Describes the extent to which changes in the field should be recorded in audit logs.
Values are:
•
•
•

‘None’ - will not record any field changes at all
‘Include field name only’ - simply records that the field was changed
‘Include detailed change description’ - A detailed change description will include the field,
original and revised values

The audit records are held in the temporary database (tables aud_chg_*; aud_val_*; cln_val_*) and
also written as output for the dataloader session in the num_XXnameXX_date.zip file, ‘audit_trail’
folder, aud_chg_XXnameXX.csv and aud_val_XXnameXX.csv files.
Is required. Check box indicating if the data in this field can be empty. If the option is ticked then
validation will detect errors when this field is empty.
3.10.3 Data Field Properties. 3. Convert Phase
Once the data sets have been cleaned, they need to be mapped to fields that are expected in part of
the RIF schema. Sometimes this step involves combining fields.
Convert Field Name. Name of a field that is expected in the RIF. Drop-down box that can be edited.
Will always default to the value in Clean Field Name but can be changed.
Convert Function Name. The name of the function used to map clean fields to fields in a converted
version of the dataset. Will have the value ‘convert_age_sex’ for ‘age’ and ‘sex’ fields, blank for
everything else.
3.10.3 Data Field Properties. 4. Optimise Phase
Create Index. Check box indicating if an index will be created for this field. For fields that have a
Field Requirement Level of ‘Required by RIF’, the create index is checked and cannot be edited. For

fields that have a Field Requirement Level of ‘Extra field’, the create index is enabled and can the
user can decide to check it or leave it blank. For fields that have a Field Requirement Level of
‘Ignore’, the create index is disabled and unchecked.
3.10.3 Data Field Properties. 5. Check Phase
This is the stage when the user decides what data quality checks will be generated to supplement
the finished data set.
The percentage of records, by year, which have an empty field anywhere in the record.
The behaviour and default values of the 3 check boxes in this section a defined in the following
table:
Check box:

Include percent empty
check
If ticked, a data quality
check will be produced
field which calculates
the percentage of
values that are empty.

Include percent
empty-per-year check
If ticked, a data quality
check will be produced
which calculates the
percentage of values
that are empty
grouped by year. Only
applies numerator and
denominator tables.

Enabled/disabled

Enabled when Field
Requirement Level is
set to ‘extra field’.

Default value

Checked when Field
Requirement Level is
set to ‘extra field’.
Unchecked when Field
Requirement Level is
set to ‘ignore field’.

Enabled when Field
Requirement Level is
set to ‘extra field’ or
when editing covariate
data properties
Depends on setting in
the ‘configuration
hints’ section.

Description:

Identifies duplicates
Can the field be used
to help identify
duplicates in a
collection of records?
Sometimes multiple
fields together are
needed to identify
whether two records
are duplicates.
Enabled when Field
Requirement Level is
set to ‘extra field’.

Checked when Field
Requirement Level is
set to ‘extra field’.
Unchecked when Field
Requirement Level is
set to ‘ignore field’.

Once all the data field properties have been edited, clicking ‘OK’ will return the user to the main
application screen.

3.11 Running a data load
When the denominator, numerator and covariate data properties have been set (i.e. step 8 is
complete), the user needs to select an output directory by clicking ‘Browse’ then navigating to a
suitable directory using the ‘open’ dialog:

Once a suitable directory has been selected the user can finally click ‘Run’ on the main application
screen to start the data loader.
Depending on the amount of data to be loaded, the process may take several minutes to run. All the
data is read into the temporary database, checked, converted, optimised then exported to
processed csv files along with xml files describing the content of the data and database scripts (MS
SQL and Postgres) to load the data into the RIF database. A zip files containing audit trail, original
data, processing stages and results are also generated. All the output is generated in a new folder
with the name format:
run_DD-Mmm-yyyy_hh_mm_ss
This folder contains two master database scripts:
ms_run_data_loader_DD-Mmm-yyyy_hh_mm_ss.sql
pg_run_data_loader_DD-Mmm-yyyy_hh_mm_ss.sql
For each numerator, denominator and covariate data set there are 4 files:
XXX_datasetname.csv (CSV file of cleaned data to be load)
XXX_datasetname.fmt (XML file describing the format of the CSV file)
run_XXX_datasetname.sql (Postgres SQL script)
XXX_datasetname_DD-Mmm-yyyy_hh_mm_ss.zip (audit file)
Where XXX is ‘covar’ for covariates, ‘num’ for numerators, ‘pop’ for denominators data sets.

3.12 Database functions for tmp_sahsu_db
Are defined in

rapidInquiryFacility\rifDataLoaderTool\src\main\resources\MSCreateRIFDataLoaderToolDatabase.sq
l
and
rapidInquiryFacility\rifDataLoaderTool\src\main\resources\PGCreateRIFDataLoaderToolDatabase.sql
for MS SQL and Postgres respectively.



Source Exif Data:
File Type                       : PDF
File Type Extension             : pdf
MIME Type                       : application/pdf
PDF Version                     : 1.7
Linearized                      : No
Page Count                      : 19
Language                        : en-GB
Tagged PDF                      : Yes
XMP Toolkit                     : 3.1-701
Producer                        : Microsoft® Word 2016
Creator                         : Parkes, Brandon L
Creator Tool                    : Microsoft® Word 2016
Create Date                     : 2017:10:02 13:25:01+01:00
Modify Date                     : 2017:10:02 13:25:01+01:00
Document ID                     : uuid:9FDA52ED-D440-40F1-ACFB-E996E66DC68D
Instance ID                     : uuid:9FDA52ED-D440-40F1-ACFB-E996E66DC68D
Author                          : Parkes, Brandon L
EXIF Metadata provided by EXIF.tools

Navigation menu