Dac-Man Documentation
version 0.0.1

September 7, 2018

Contents

1 Introduction
1.1 Overview
1.2 Features
1.3 Requirements

2 Installation
2.1 Installing Dac-Man
2.1.1 Getting the Package
2.1.2 Installation from Source
2.2 Testing the Installation
2.3 Dependencies

3 Using Dac-Man on Desktops
3.1 Quick Tutorial
3.2 Command-line
3.2.1 scan
3.2.2 index
3.2.3 compare
3.2.4 diff
3.3 Outputs

4 Using Dac-Man on HPC Clusters
4.1 Using MPI
4.2 Batch Script

5 Configuration & Customization
5.1 Staging
5.2 Plug-ins
5.3 Logging

6 License & Copyright
6.1 License
6.2 Copyright

7 Team

1 Introduction

1.1 Overview

Scientific datasets are updated frequently due to changes in instrument configuration, software updates, quality assessments or data cleaning algorithms.
However, due to the large size and complex data structures of these datasets,
existing tools either do not scale or are unable to generate meaningful change
information.
The Dac-Man (DAta Change Management) framework allows users to efficiently and effectively identify, track and manage data change and associated
provenance in scientific datasets. There are two main components of Dac-Man:
• Change tracker that keeps track of the changes between different versions
of a dataset.
• Query manager that manages data change related queries.

1.2 Features

The key features of Dac-Man include:
• HPC support. Dac-Man provides MPI support for enabling parallel
change capture in HPC environments.
• Offline comparison. Datasets can be compared away from the actual
location of the data, allowing users to find changes without necessarily
moving the datasets to a common location.
• Extendable. Users can plug in their own scripts to calculate changes.
• Flexible command-line options. Provides different options to configure
change detection.
• Detailed output. Dac-Man outputs contain details on the different
types and amount of change.
• Customizable logging. Users can customize where and what to log, including the detailed steps in the change capture process.

1.3 Requirements

Dac-Man is developed in Python and requires Python 2.7 or greater. Users
need Python setuptools and pip to install Dac-Man. For detailed instructions on the installation, please refer to Chapter 2.
Dac-Man is known to work on the following operating systems:
• Linux
• Unix-like OSs
• Mac OS

2 Installation
This chapter describes the steps required to install Dac-Man. Dac-Man requires
Python version 2.7 or greater and has been tested with both
Python 2.7 and 3.6. It is installed like any other Python package and uses
the Python setuptools package. Enabling advanced features in Dac-Man
may require additional packages, which you can install
using pip.
More general information about installing Python packages can be found in
Python’s documentation at http://www.python.org/doc/.

2.1 Installing Dac-Man

This section describes the steps for installing Dac-Man. Upon successful
installation, you can use Dac-Man as a command-line tool.

2.1.1 Getting the Package

You can download the package from the Dac-Man repository
(https://github.com/dghoshal-lbl/dac-man) as a tarball. Alternatively, you
can clone the source tree from the repository:
$ git clone git@github.com:dghoshal-lbl/dac-man.git

2.1.2 Installation from Source

Once the package is downloaded/cloned, Dac-Man can be installed by running the following commands:
$ cd dac-man
$ python setup.py install

If you are installing to a location that requires special permissions (like /usr/local), you may need to run the last command with sudo. Alternatively, you
can create and activate a build environment through virtualenv or conda
as described below.
Using virtualenv
You can install virtualenv using pip.
$ pip install virtualenv
More details on installing and using virtualenv can be found at
https://packaging.python.org/guides/installing-using-pip-and-virtualenv/.
After installing virtualenv, you need to create and activate the environment,
and then install Dac-Man.
$ virtualenv venv
$ source venv/bin/activate
(venv)$ cd dac-man
(venv)$ python setup.py install
Using conda
Conda can be installed using the OS-specific installer available from
https://conda.io/docs/user-guide/install/index.html. After installation,
create and activate the Python environment as follows:
$ conda create --name env
$ source activate env
(env)$ cd dac-man
(env)$ python setup.py install
More information about using conda environments can be found at
https://conda.io/docs/user-guide/tasks/manage-environments.html.

2.2 Testing the Installation

In order to test the Dac-Man installation, run the following commands:
$ cd examples/
$ ./simple.sh

On successful execution, this prints a summary of the changes and detailed
change information between two example directories.

2.3 Dependencies

Dac-Man primarily depends on the following packages:
• scandir>=1.5
• six>=1.10.0
• PyYAML==3.12
These dependencies are listed in the requirements.txt file and are automatically installed during the build process.
Additional dependencies for running Dac-Man on HPC environments include:
• numpy : Python library for operations on large, multi-dimensional arrays
• mpi4py : Python MPI bindings
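Because the HPC packages are not part of the core dependency list, it can help to check for them before trying the MPI features. The following is a minimal sketch, not part of Dac-Man itself; the helper name missing_packages is hypothetical, and the code assumes Python 3.4 or later (importlib.util.find_spec is not available in Python 2.7).

```python
import importlib.util

# Optional packages this guide lists for HPC use.
HPC_PACKAGES = ("numpy", "mpi4py")


def missing_packages(names):
    """Return the names that cannot be found on the current Python path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(HPC_PACKAGES)
if missing:
    # Any missing package can be installed with pip, as described above.
    print("Missing HPC packages: pip install " + " ".join(missing))
else:
    print("All HPC packages are available.")
```

The check only confirms that the packages can be located; it does not verify that mpi4py was built against a working MPI library.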

3 Using Dac-Man on Desktops

3.1 Quick Tutorial

To capture changes between two directories dir1 and dir2, run the following
command using the Dac-Man command-line:
$ dacman diff dir1 dir2
The above command identifies the number of files changed between the two
directories. To retrieve detailed information about the changes, you
can use the following command:
$ dacman diff dir1 dir2 --datachange

3.2 Command-line

Dac-Man enables change capture and analysis in four simple steps, which
give users flexibility in identifying and capturing changes. Dac-Man
provides four subcommands (scan, index, compare and diff) to manage each of these steps separately.

3.2.1 scan

This command scans and saves the directory structure and other metadata related
to a data path. You can specify an optional staging directory, where the
metadata information will be saved.
$ dacman scan  [-s STAGINGDIR] [-i [IGNORE [IGNORE ...]]]
[--nonrecursive] [--symlinks]
The arguments to the command are:

-s STAGINGDIR : directory where filesystem metadata and indexes are
saved.
-i [IGNORE [IGNORE ...]] : list of file types to be ignored.
--nonrecursive : do not scan the directory contents recursively.
--symlinks : include symbolic links.

3.2.2 index

This command indexes the files, mapping the files to their contents.
$ dacman index  [-s STAGINGDIR] [-m python,tigres,mpi]
The arguments to the command are:
-s STAGINGDIR : directory where filesystem metadata and indexes are
saved.
-m python,tigres,mpi : index manager for parallelizing the index creation.
The options are python/mpi/tigres. By default, it uses the Python multiprocessing module (manager=python) that is suitable for parallelizing on
one node. For multi-node parallelism, users can select between MPI (manager=mpi) or tigres (manager=tigres).

3.2.3 compare

This command compares two datapaths and calculates the different types of changes between them.
$ dacman compare   [-s STAGINGDIR]
The arguments to the command are:
-s STAGINGDIR : directory where filesystem metadata and indexes are
saved.

3.2.4 diff

This command retrieves changes between two datapaths.
$ dacman diff   [-s STAGINGDIR] [-o OUTDIR]
[-a ANALYZER] [--datachange] [-e default,mpi,tigres]
The arguments to the command are:

-s STAGINGDIR : directory where filesystem metadata and indexes are
saved.
-o OUTDIR : directory where the summary of changes is saved.
-a ANALYZER : user-defined scripts for analyzing data changes.
--datachange : calculate data level changes in addition to file changes.
-e python,tigres,mpi : type of executor (or runtime) for parallel data change
capture. The options are python/mpi/tigres. By default, it uses the Python
multiprocessing module (manager=python) that is suitable for parallelizing
on one node. For multi-node parallelism, users can select between MPI or
tigres.
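The four steps can also be chained from a script. The sketch below only builds the argument lists and runs them in order; it is an illustration, not a documented interface. It assumes that scan and index take a single data path and that compare and diff take the old and new paths as positional arguments, which is inferred from the dacman index and dacman diff examples elsewhere in this guide. The helper names dacman_steps and run_steps are hypothetical.

```python
import subprocess


def dacman_steps(old_path, new_path, staging_dir):
    """Build argument lists for scan, index, compare and diff, in order.

    Positional arguments are an assumption; only the -s flag is taken
    directly from the usage lines in this guide.
    """
    stage = ["-s", staging_dir]
    steps = []
    for path in (old_path, new_path):
        steps.append(["dacman", "scan", path] + stage)
        steps.append(["dacman", "index", path] + stage)
    steps.append(["dacman", "compare", old_path, new_path] + stage)
    steps.append(["dacman", "diff", old_path, new_path] + stage)
    return steps


def run_steps(steps):
    """Run each step, raising CalledProcessError at the first failure."""
    for argv in steps:
        subprocess.check_call(argv)


# Show the commands without executing them.
for argv in dacman_steps("/old/path", "/new/path", "mystage"):
    print(" ".join(argv))
```

Calling run_steps on the returned list requires dacman to be installed and on the PATH.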

3.3 Outputs

Dac-Man prints a summary of changes on standard output. The summary
lists the number of changes between two datasets. For example:
Added: 1, Deleted: 1, Modified: 1, Metadata-only: 0, Unchanged: 1

You can opt to save a more detailed output by specifying the output directory
where the detailed change information will be saved:
$ dacman diff /old/path /new/path -o output
The output/ directory contains a list of files with detailed information about
the changes. It also contains a summary of the change information as:
output/summary
counts:
  added: 1
  deleted: 1
  metaonly: 0
  modified: 1
  unchanged: 1
versions:
  base:
    dataset_id: /path/to/old/data
    nfiles: 3
  revision:
    dataset_id: /path/to/new/data
    nfiles: 3
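The counts in the example summary can be cross-checked against the nfiles values. The sketch below assumes that deleted files exist only in the base version and added files only in the revision, while modified, metadata-only and unchanged files exist in both. This relation is inferred from the example numbers and is not a documented guarantee; the function name expected_nfiles is hypothetical.

```python
# Counts copied from the example summary above.
counts = {"added": 1, "deleted": 1, "metaonly": 0, "modified": 1, "unchanged": 1}


def expected_nfiles(counts):
    """Derive file totals for the base and revision versions from change counts."""
    shared = counts["modified"] + counts["metaonly"] + counts["unchanged"]
    base = shared + counts["deleted"]    # deleted files exist only in the base
    revision = shared + counts["added"]  # added files exist only in the revision
    return base, revision


print(expected_nfiles(counts))  # prints (3, 3), matching nfiles for base and revision
```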

4 Using Dac-Man on HPC Clusters

4.1 Using MPI

Dac-Man allows you to parallelize the index and diff steps. To parallelize on
HPC clusters, you need to enable MPI support by using the appropriate
flags.
$ dacman index ... -m mpi
$ dacman diff ... -e mpi
To distribute the tasks to multiple workers, you need to use mpirun
or mpiexec. For example, to run Dac-Man on an HPC cluster with 8
nodes and 32 cores per node (8 × 32 = 256 processes), you can do the following:
$ mpiexec -n 256 dacman index ... -m mpi
$ mpiexec -n 256 dacman diff ... -e mpi
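The process count passed to mpiexec is the number of nodes multiplied by the cores per node. The sketch below makes that arithmetic explicit; it assumes one MPI process per core, which is what the example numbers imply, and the helper name mpiexec_command is hypothetical.

```python
def mpiexec_command(nodes, cores_per_node, dacman_args):
    """Return an mpiexec argument list that starts one process per core."""
    nprocs = nodes * cores_per_node
    return ["mpiexec", "-n", str(nprocs), "dacman"] + list(dacman_args)


cmd = mpiexec_command(8, 32, ["diff", "/old/data", "/new/data", "-e", "mpi"])
print(" ".join(cmd))  # prints: mpiexec -n 256 dacman diff /old/data /new/data -e mpi
```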

4.2 Batch Script

To submit a batch job on a cluster, include the Dac-Man command in your job script. The example below shows a batch script
(hpcEx.batch) for the Slurm scheduler.
hpcEx.batch
#!/bin/bash
#SBATCH -J example
#SBATCH -t 00:30:00
#SBATCH -N 8
#SBATCH -q myqueue
mpiexec -n 256 dacman diff /old/data /new/data -e mpi

The script can then be submitted to the batch scheduler as:
$ sbatch hpcEx.batch

5 Configuration & Customization

5.1 Staging

For every dataset, Dac-Man creates a directory in the staging area to save
all metadata and index information. Each directory in the staging area
uniquely identifies a dataset indexed by Dac-Man, using a hash representation
of the dataset path. By default, the staging area is located in
$HOME/.dacman/staging. You can change it to a custom location on the
command line with the -s option:
$ dacman index mydir/ -s mystage
The command above creates the indexes inside the mystage directory. You can
copy or move these indexes to compare and calculate the changes, without
necessarily copying or moving the data. This is especially useful when the
datasets to be compared are located on different systems. The example below
shows how the staged indexes and metadata information can be copied and
compared to find changes, without copying the data itself.
$ scp -r user:pwd@ /.dacman/staging/remotedir /path/to/mystage/
$ dacman diff /path/to/localdir/ /remotedir/ -s /path/to/mystage
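To illustrate how a hash of the dataset path can give each dataset its own staging directory, here is a minimal sketch. The guide does not state which hash function Dac-Man uses or how the path is normalized, so SHA-256 over the absolute path is purely an assumption, and the function name staging_dir_for is hypothetical.

```python
import hashlib
import os


def staging_dir_for(dataset_path, staging_root=None):
    """Map a dataset path to a hash-named directory under the staging area."""
    if staging_root is None:
        # Default staging area described above.
        staging_root = os.path.join(os.path.expanduser("~"), ".dacman", "staging")
    absolute = os.path.abspath(dataset_path)
    # Assumed hash function; the one used by Dac-Man is not documented here.
    digest = hashlib.sha256(absolute.encode("utf-8")).hexdigest()
    return os.path.join(staging_root, digest)


print(staging_dir_for("mydir/", "mystage"))
```

Because the directory name depends only on the path, the same dataset always maps to the same staging directory, and two different paths map to different ones.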

5.2 Plug-ins

By default, Dac-Man compares the data by reshaping file data into one-dimensional arrays. However, you can use your own custom scripts for comparing data changes. You can specify an external script (for example, myscript)
as:
$ dacman diff /old/path/file1 /new/path/file1 -a myscript

The command above uses the script myscript to compare the contents of files
/old/path/file1 and /new/path/file1 instead of the default Dac-Man data comparator. If you want to use Unix diff to compare all the modified
files in the directories dir1 and dir2, run the following command:
$ dacman diff /path/to/dir1 /path/to/dir2 --datachange -a /usr/bin/diff
The --datachange option tells Dac-Man to compare the data within
the files of the two directories.
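A custom analyzer can be any executable script. The sketch below is a hypothetical analyzer that counts the lines that differ between two text files. It assumes that Dac-Man invokes the analyzer with the old and new file paths as its two command-line arguments, as the /usr/bin/diff example suggests; the exact calling convention and the output format Dac-Man expects are not documented here.

```python
#!/usr/bin/env python
"""Hypothetical analyzer: count the lines that differ between two text files."""
import difflib
import sys


def count_changed_lines(old_path, new_path):
    """Return the number of lines removed from or added to the old file."""
    with open(old_path) as old_file, open(new_path) as new_file:
        old_lines = old_file.readlines()
        new_lines = new_file.readlines()
    # ndiff prefixes removed lines with "- " and added lines with "+ ".
    diff = difflib.ndiff(old_lines, new_lines)
    return sum(1 for line in diff if line.startswith(("+ ", "- ")))


# Assumed calling convention: myscript OLD_FILE NEW_FILE
if __name__ == "__main__" and len(sys.argv) == 3:
    print(count_changed_lines(sys.argv[1], sys.argv[2]))
```

Saved as an executable file, such a script could be passed to the -a option in place of myscript.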

5.3 Logging

Dac-Man uses the standard Python logging module to create execution logs. The
default logging configuration is saved in the file $HOME/.dacman/config/logging.yaml.
Dac-Man logs all INFO level messages, and prints messages at or above
the WARNING level. You can adjust the logging
to your requirements by modifying the configuration file.
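The described defaults correspond to a common pattern in Python logging: one handler writes INFO and above to a file, and another prints WARNING and above to the console. The sketch below expresses that pattern with logging.config.dictConfig. It is not the content of Dac-Man's logging.yaml, whose schema is not shown in this guide; the handler names, format string and log file location are all assumptions.

```python
import logging
import logging.config
import os
import tempfile

# Hypothetical configuration mirroring the described defaults.
LOGGING_CONFIG = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "simple": {"format": "%(asctime)s %(name)s %(levelname)s %(message)s"},
    },
    "handlers": {
        "console": {  # prints WARNING and above
            "class": "logging.StreamHandler",
            "level": "WARNING",
            "formatter": "simple",
        },
        "logfile": {  # records INFO and above
            "class": "logging.FileHandler",
            "level": "INFO",
            "formatter": "simple",
            # Assumed location, chosen only for this example.
            "filename": os.path.join(tempfile.gettempdir(), "dacman-example.log"),
        },
    },
    "root": {"level": "INFO", "handlers": ["console", "logfile"]},
}

logging.config.dictConfig(LOGGING_CONFIG)
log = logging.getLogger("example")
log.info("recorded in the log file only")
log.warning("recorded in the log file and printed to the console")
```

A YAML file with the same keys can be loaded with PyYAML and passed to dictConfig in the same way.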

6 License & Copyright

6.1 License

Dac-Man is licensed under the “new” or “revised” BSD license.
Dac-Man (DAta Change Management) Copyright (c) 2018, The Regents of the
University of California, through Lawrence Berkeley National Laboratory
(subject to receipt of any required approvals from the U.S. Dept. of
Energy). All rights reserved.

Redistribution and use in source and binary forms, with or without modification,
are permitted provided that the following conditions are met:

(1) Redistributions of source code must retain the above copyright notice,
this list of conditions and the following disclaimer.

(2) Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

(3) Neither the name of the University of California, Lawrence Berkeley
National Laboratory, U.S. Dept. of Energy nor the names of its contributors
may be used to endorse or promote products derived from this software
without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS
BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE
GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY
WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY
OF SUCH DAMAGE.

6.2 Copyright

The copyright for Dac-Man is described below.
Dac-Man (DAta Change Management) Copyright (c) 2018, The Regents of the
University of California, through Lawrence Berkeley National Laboratory
(subject to receipt of any required approvals from the U.S. Dept. of
Energy). All rights reserved.

If you have questions about your rights to use or distribute this software,
please contact Berkeley Lab’s Innovation & Partnerships Office at IPO@lbl.gov.

NOTICE. This Software was developed under funding from the U.S. Department
of Energy and the U.S. Government consequently retains certain rights.
As such, the U.S. Government has been granted for itself and others acting
on its behalf a paid-up, non-exclusive, irrevocable, worldwide license
in the Software to reproduce, distribute copies to the public, prepare
derivative works, and perform publicly and display publicly, and to permit
other to do so.

7 Team
Dac-Man is developed as part of the Deduce project, whose PI is Deborah
Agarwal [daagarwal-AT-lbl.gov]. The development of Dac-Man is led by
Lavanya Ramakrishnan [lramakrishnan-AT-lbl.gov].
As of now, the following developers have contributed to the development of
Dac-Man:
• Devarshi Ghoshal [dghoshal-AT-lbl.gov]. Initial design and development
of Dac-Man.
• Drew Paine [pained-AT-lbl.gov]. User interviews and initial evaluation.
• Abdelrahman Elbashandy [aaelbashandy-AT-lbl.gov]. Extending Dac-Man to handle streaming data.
