Deployment Guide
VA Online Memorial - Data scraper improvements

Revision History

Author     Revision Number    Date
TCCODER    1.0                Feb 09, 2018

Deployment Instructions

1. Deployment Dependencies
2. Organization of Submission
3. Deployment Instructions
4. Verification
5. Resource Contact List

1. Deployment Dependencies

Before performing a deployment, it is assumed that the following have been set up:

● Node.js 8+
● PostgreSQL database
● libpq (pg_config)

2. Organization of Submission

● va-backend/ – source of the submission
● docs/ – the deployment guide

3. Deployment Instructions

Go to the va-backend/ folder and follow the instructions in the README.md file to install all packages.

To install libpq, do one of the following:

On OS X: brew install postgres
On Ubuntu/Debian: apt-get install libpq-dev g++ make
On RHEL/CentOS: yum install postgresql-devel

Change the database configuration in the packages/va-data-scraper and packages/va-models packages to match your PostgreSQL database configuration.

Go to va-backend/packages/va-data-scraper and follow the instructions in the README.md file on how to run the package. Don’t run “npm install” inside the va-backend/packages/va-data-scraper folder.

4. Verification

Go to the folder va-backend/packages/va-data-scraper and run the command:

$ npm run download-data

Wait until it has downloaded 20 files, then hit Ctrl+C (you can let it download all the files, but it is not necessary).

As described in https://apps.topcoder.com/forums/?module=Thread&threadID=912402, an option to ignore bad CSV lines has been added. To enable it, set the environment variable OPTION_IGNORE_BAD_CSV_LINE to “true” and run the script (check the va-backend/packages/va-data-scraper/services/data.js file for the OPTION_IGNORE_BAD_CSV_LINE usage).
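The actual implementation is in services/data.js; the following is only a minimal sketch of the pattern, assuming the CSV parser reports one error per malformed line (the function and logger names here are hypothetical):

// Hypothetical sketch of how OPTION_IGNORE_BAD_CSV_LINE can gate error
// handling; see services/data.js for the real implementation.
const ignoreBadCsvLine = process.env.OPTION_IGNORE_BAD_CSV_LINE === 'true';

function handleCsvError(err, logger) {
  if (ignoreBadCsvLine) {
    // Skip the malformed line and keep processing the rest of the file.
    logger.warn(`Ignoring bad CSV line: ${err.message}`);
  } else {
    // Abort the import so the surrounding transaction rolls back.
    throw err;
  }
}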
To start the data scraper script, run:

$ export OPTION_IGNORE_BAD_CSV_LINE=true
$ npm run import-data

You will see something like this:

> node --expose-gc --max_old_space_size=4096 scripts/import-data.js
[2018-02-08T11:15:21.140Z][INFO] Will ignore bad csv lines
[2018-02-08T11:15:21.143Z][INFO] connecting to database: postgres://postgres:topcoder@localhost:5432/test
[2018-02-08T11:15:21.330Z][INFO] Processing file downloads/ngl_alabama.csv
[▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇] 100% 387.8s 0.0s
[2018-02-08T11:21:54.236Z][INFO] Parsed file downloads/ngl_alabama.csv with 21872 lines. Inserted 20070 veterans
[2018-02-08T11:21:54.255Z][INFO] Processing file downloads/ngl_alaska.csv
[▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇] 100% 155.9s 0.0s
[2018-02-08T11:24:31.838Z][INFO] Parsed file downloads/ngl_alaska.csv with 8880 lines. Inserted 8411 veterans
[2018-02-08T11:24:31.936Z][INFO] Processing file downloads/ngl_arizona.csv
[▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇] 100% 2740.8s 0.0s
[2018-02-08T12:10:29.041Z][INFO] Parsed file downloads/ngl_arizona.csv with 96485 lines. Inserted 95108 veterans

While the files are being processed, you can check that memory consumption stays below 150 MB (it was 4 GB+ before these improvements).

To verify that the transaction handling works properly, move all files from the downloads/ folder to a temporary folder, keeping only ngl_california.csv. Drop the PostgreSQL database and create it again:

psql> DROP DATABASE test;
psql> CREATE DATABASE test;

Run the import command again with OPTION_IGNORE_BAD_CSV_LINE set to “false”:

$ export OPTION_IGNORE_BAD_CSV_LINE=false
$ npm run import-data

It will crash while reading the CSV file:

> node --expose-gc --max_old_space_size=4096 scripts/import-data.js
[2018-02-08T12:19:18.491Z][INFO] connecting to database: postgres://postgres:topcoder@localhost:5432/test
[2018-02-08T12:19:19.446Z][INFO] Processing file downloads/ngl_california.csv
[▇———————————————————————————————————————————————————————————] 1% 0.0s 0.0s
[2018-02-08T12:22:42.783Z][ERROR] Failed to recover CSV line
[2018-02-08T12:22:42.785Z][ERROR] Failed to read file. Stack: Error: Invalid closing quote at line 1; found "M" instead of delimiter ","
[2018-02-08T12:22:42.787Z][INFO] Operation completed!

Check the database tables: no data should be present.

Run the import command again with OPTION_IGNORE_BAD_CSV_LINE set back to “true”. This time the command will not crash, and some warnings or errors will be printed on screen. While this file, ngl_california.csv (the biggest one available), is being processed, you can check that memory consumption stays below 600 MB (4 GB+ before). The remaining memory usage comes from the Sequelize library, not from the line-by-line file reading.

Finally, verify the last requirement: “Third, the scraper will ignore importing any row that has no information in these columns: first/last name, birth/burial date and cemetery name/city/address. That does make our data complete, but also skips a lot of records. We want you to analyze those skipped rows and propose a different strategy for importing records that would yield better results (you can propose more than one).”

For that, set the environment variable OPTION_IMPORT_EXTRA_DATA to “true” and run the script again (check the va-backend/packages/va-data-scraper/services/data.js file for the OPTION_IMPORT_EXTRA_DATA usage). Drop the PostgreSQL database, create it again, and run the commands below:

$ export OPTION_IGNORE_BAD_CSV_LINE=true
$ export OPTION_IMPORT_EXTRA_DATA=true
$ npm run import-data

You should see something like this:

[2018-02-08T00:15:14.259Z][INFO] Will ignore bad csv lines
[2018-02-08T00:15:14.261Z][INFO] Will try to import extra data
[2018-02-08T00:15:14.262Z][INFO] connecting to database: postgres://postgres:topcoder@localhost:5432/test
[2018-02-08T00:15:14.425Z][INFO] Processing file downloads/ngl_alabama.csv
[▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇] 100% 375.6s 0.0s
[2018-02-08T00:21:34.670Z][INFO] Parsed file downloads/ngl_alabama.csv with 21872 lines. Inserted 21848 veterans
[2018-02-08T00:21:34.684Z][INFO] Processing file downloads/ngl_alaska.csv
[▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇] 100% 153.2s 0.0s
[2018-02-08T00:24:09.535Z][INFO] Parsed file downloads/ngl_alaska.csv with 8880 lines. Inserted 8869 veterans
[2018-02-08T00:24:09.629Z][INFO] Processing file downloads/ngl_arizona.csv
[▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇] 100% 3953.9s 0.0s
[2018-02-08T01:30:19.496Z][INFO] Parsed file downloads/ngl_arizona.csv with 96485 lines. Inserted 96474 veterans

Comparing this execution with the previous one:

File               Without extra flag    With extra flag    Increase
ngl_alabama.csv    20070 inserted        21848 inserted     8.8%
ngl_alaska.csv     8411 inserted         8869 inserted      5.4%
ngl_arizona.csv    95108 inserted        96474 inserted     1.4%

The results show an increase in the amount of imported data in all cases. The following rules have been applied:

1. If the relationship column is empty and the names are equal to v_names, relationship is set to Veteran (Self).
2. If the relationship column is empty and the last name is equal to v_last_name, relationship is set to Other Relative.
3. If the v_name or v_last_name columns are empty and relationship is equal to Veteran (Self), the names are copied to v_names.
4. If the birth date or death date is null, the veteran id is calculated using an MD5 hash of the entire CSV line. This could produce a duplicate id, but it is very unlikely.
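As an illustration of rules 1, 2, and 4, here is a minimal sketch; the row field names and function names below are assumptions for illustration, not the actual code in services/data.js:

const crypto = require('crypto');

// Sketch of rules 1 and 2: fill an empty relationship column by comparing
// the person's name columns against the veteran's name columns.
function inferRelationship(row) {
  if (row.relationship) return row.relationship;
  if (row.names === row.v_names) return 'Veteran (Self)';
  if (row.last_name === row.v_last_name) return 'Other Relative';
  return row.relationship;
}

// Sketch of rule 4: when the birth or death date is missing, derive a
// deterministic veteran id from the raw CSV line. An MD5 collision is
// possible but very unlikely.
function fallbackVeteranId(rawCsvLine) {
  return crypto.createHash('md5').update(rawCsvLine).digest('hex');
}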
5. Resource Contact List

Name       Resource Email
TCCODER    Through TopCoder Member Contact

©TopCoder, Inc. 2018