hashdb
USERS MANUAL
Quickstart Guide Included
August 22, 2014
Authored by:
Bruce D. Allen
Jessica R. Bradley
Simson L. Garfinkel
One Page Quickstart for Linux and Mac Users
This page provides a very brief introduction to downloading, installing and running
hashdb (creating a database and populating it) on Linux and MacOS systems.
1. Download the latest version of hashdb. It can be obtained from http://digitalcorpora.org/downloads/hashdb.
The file is called hashdb-x.y.z.tar.gz where x.y.z is the
latest version.
2. Un-tar and un-zip the file. In the newly created hashdb-x.y.z directory, run the
following commands:
./configure
make
sudo make install
[Refer to Subsection 3.1. Note, for full functionality, some users may need to
first download and install dependent library files. Instructions are outlined in the
referenced section.]
3. Navigate to the directory where you would like to create a hash database. Then,
to run hashdb from the command line, type the following instructions:
hashdb create sample.hdb
In the above instructions, sample.hdb is the empty database that will be created
with default database settings.
4. Next, import data into the database. You will need a DFXML file containing block
hash values. If you do not already have one, see Subsection 2.2 for instructions
on creating one. To populate the hash database with the hashes from the DFXML
file called sample.xml, type the following instructions from the directory where
you created the database:
hashdb import sample.xml sample.hdb
This command, if executed successfully, will print the number of hash values inserted. For example:
hashdb changes (insert):
hashes inserted: 2595
5. Additionally, the file log.xml contained in the directory sample.hdb will be updated with change statistics. It will show the number of hash values that have
been inserted [see Subsection 4.4 for more information on the change statistics
tracked in the log file].
One Page Quickstart for Windows Users
This page provides a very brief introduction to downloading, installing and running
hashdb on Windows systems.
1. Download the Windows installer for the latest version of hashdb. It can be obtained from http://digitalcorpora.org/downloads/hashdb. The file is called
hashdb-x.y.z-windowsinstaller.exe where x.y.z is the latest version.
2. Run the installer file. This will automatically install hashdb on your machine.
3. Navigate to the directory where you would like to create a hash database. Then,
to run hashdb from the command line, type the following instructions:
hashdb create sample.hdb
In the above instructions, sample.hdb is the empty database that will be created
with default database settings.
4. Next, import data into the database. You will need a DFXML file containing sector
hash values. If you do not already have one, see Subsection 2.2 for instructions
on creating one. To populate the hash database with the hashes from the DFXML
file called sample.xml, type the following instructions from the directory where
you created the database:
hashdb import sample.xml sample.hdb
This command, if executed successfully, will print the number of hash values inserted. For example:
hashdb changes (insert):
hashes inserted: 2595
5. Additionally, the file log.xml contained in the directory sample.hdb will be updated with change statistics. It will show the number of hash values that have
been inserted [see Subsection 4.4 for more information on the change statistics
tracked in the log file].
Contents
1 Introduction
1.1 Overview of hashdb
1.2 Purpose of this Manual
1.3 Conventions Used in this Manual
2 How hashdb Works
2.1 Hash Blocks
2.2 DFXML
2.2.1 Creating a DFXML file using md5deep
2.2.2 Creating a DFXML file using fiwalk
2.2.3 Creating a DFXML file using hashdb
2.3 Contents of a Hash Database
2.4 Using the Hash Databases
2.4.1 bulk_extractor
3 Running hashdb
3.1 Installation Guide
3.1.1 Installing on Linux or Mac
3.1.2 Installing on Windows
3.1.3 Installing Other Related Tools
3.2 hashdb Commands
3.2.1 Creating a Hash Database
3.2.2 Importing and Exporting between a DFXML File and a Hash Database
3.2.3 Manipulating Hash Databases
3.2.4 Tracking Changes in Hash Databases
3.2.5 Scanning Media for Hash Values
3.2.6 Statistics
3.2.7 Tuning
3.2.8 Performance Analysis
3.3 Importing and Scanning Using the bulk_extractor hashdb Scanner
4 Use Cases for hashdb
4.1 Querying for Source or Database Information
4.1.1 Querying a Remote Hash Database
4.2 Writing Software that works with hashdb
4.3 Scanning or Importing to a Database Using bulk_extractor
4.4 Updating Hash Databases
4.4.1 Update Commands and “Duplicate” Hashes
4.5 Optimizing a Hash Database
4.6 Exporting Hash Databases
5 Worked Example: Finding Similarity Between Disk Images
6 Troubleshooting
7 Related Reading
Appendices
A hashdb Quick Reference
B Output of hashdb Help Command
C hashdb API: hashdb.hpp
D bulk_extractor hashdb Scanner Usage Options
1 Introduction
1.1 Overview of hashdb
hashdb is a tool that can be used to find data in raw media using cryptographic hashes
calculated from blocks of data. It is a useful forensic investigation tool for tasks such
as malware detection, child exploitation detection or corporate espionage investigations.
The tool provides several capabilities that include:
• Creating hash databases of MD5 block hashes, as opposed to file hashes.
• Importing hash values from Digital Forensic XML (DFXML) files created by other
programs such as md5deep.
• Scanning the hash database for matching hash values using either the local or
remote system.
• Providing the source information for hash values.
Using hashdb, a forensic investigator can take a known set of blacklisted media and generate a hash database. The investigator can then use the hash database to search against
raw media for blacklisted information. For example, given a known set of malware, an
investigator can generate a sector hash database representing that malware. The investigator can then search a given corpus for fragments of that malware and identify the
specific malware content in the corpus using hashdb and the bulk_extractor program.
hashdb relies on block hashing rather than full file hashing. Block hashing provides an
alternative methodology to file hashing with a different capability set. With file hashing,
the file must be complete to generate a file hash, although a file carver can be used to
pull together a file and generate a valid hash. File hashing also requires the ability to
extract files, which requires being able to understand the file system used on a particular
storage device. Block hashing, as an alternative, does not need a file system or files.
Artifacts are identified at the block scale (usually 4096 bytes) rather than at the file
scale. While block hashing does not rely on the file system, artifacts do need to be
sector-aligned for hashdb to find hashes [3].
hashdb provides an advantage when working with hard disks and operating systems that
fragment data into discontiguous blocks yet still sector-align media. This is because
scans are performed along sector boundaries. Because hashdb works at the block resolution, it can find part of a file when the rest of the file is missing, such as with a large
video file where only part of the video is on disk. hashdb can also be used to analyze
network traffic (such as that captured by tcpflow). Finally, hashdb can identify artifacts
that are sub-file, such as embedded content in a .pdf document.
hashdb stores cryptographic hashes (along with their source information) that have been
calculated from hash blocks. It also provides the capability to scan other media for
hash matches. Many of the capabilities of hashdb are best utilized in connection with
the bulk_extractor program. This manual describes use cases for the hashdb tool,
including its use with bulk_extractor, and demonstrates how users can take full advantage of all of its capabilities.
1.2 Purpose of this Manual
This Users Manual is intended to be useful to new, intermediate and experienced users
of hashdb. It provides an in-depth review of the functionality included in hashdb and
shows how to access and utilize features through command line operation of the tool.
This manual includes working examples with links to the input data used, giving users
the opportunity to work through the examples and utilize all aspects of the system.
1.3 Conventions Used in this Manual
This manual uses standard formatting conventions to highlight file names, directory
names and example commands. The conventions for those specific types are described
in this section.
Names of programs including the post-processing tools native to hashdb and third-party
tools are shown in bold, as in bulk_extractor.
File names are displayed in a fixed width font. They will appear as filename.txt within
the text throughout the manual.
Directory names are displayed in italics. They appear as directoryname/ within the text.
The only exception is for directory names that are part of an example command. Directory names referenced in example commands appear in the example command format.
Database names are denoted with bold, italicized text. They are always specified in
lower-case, because that is how they are referred to in the options and usage information
for hashdb. Names will appear as databasename.
This manual contains example commands that should be typed in by the user. A command entered at the terminal is shown like this:
command
The first character on the line is the terminal prompt, and should not be typed. The
black square is used as the standard prompt in this manual, although the prompt shown
on a user's screen will vary according to the system they are using.
2 How hashdb Works
The hashdb tool provides capabilities to create, edit, access and search databases of
cryptographic hashes created from hash blocks. The cryptographic hashes are imported
into a database from DFXML files created by other programs (which could include
md5deep) or exported from another hashdb database. hashdb databases can also be
populated using bulk_extractor and the hashdb scanner. Once a database is created, hashdb provides users with the capability to scan the database for matching hash
values and identify matching content. Hash databases can also be exported, added to,
subtracted from and shared.
Figure 1 provides an overview of the capabilities included with the hashdb tool. hashdb
populates databases from DFXML files created by other programs. The sources of those
files can be virtually any type of raw digital media including black list files and disk
images. Users can also add or remove data from the database after it is created. Once
the database is populated, hashdb can export content from the database in DFXML
format. It also provides an API that can be used by third party tools (as it is used in
the bulk_extractor program) to create, populate and access hash databases. Finally,
hashdb allows users to scan the hash database for matching hash values.
Figure 1: Overview of the hashdb system
[Diagram labels: Disk Image Files, Blacklist Files, 3rd Party Programs, DFXML File, hashdb, Create & Populate Hash DB, Hash Database, Export, API Library, Raw Media, Match Hash Values, Matching Hash Values]
|-512-|-512-|-512-|-512-|-512-|-512-|-512-|-512-|-512-|-512-| ...
|-----------------------4K----------------------|
      |-----------------------4K----------------------|
            |-----------------------4K----------------------|
etc.
Figure 2: Hashes generated over overlapping sector boundaries. 4K lines represent the
hash blocks.
2.1 Hash Blocks
hashdb relies on block hashing rather than file hashing. A hash block is a contiguous
sequence of bytes, typically 4KiB in size. Tools using block hashing calculate cryptographic hashes from hash blocks, along with information about where the hash blocks
are sourced from. To increase the probability of finding matching hashes in sector-based
disk images, hashes are generated at each sector boundary. Figure 2 illustrates cryptographic hashes generated from 4KiB hash blocks aligned on 512 byte sector boundaries.
Block size is selectable in tools such as md5deep. In our work, we use a block size of
4KiB.
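The overlapping layout in Figure 2 can be sketched in a few lines of Python. This is an illustration only, not code from hashdb or md5deep: the function name and the in-memory sample buffer are hypothetical, and the sketch assumes MD5 hashes over 4096-byte blocks that start at every 512-byte sector boundary.

```python
import hashlib

SECTOR_SIZE = 512    # sector size shown in Figure 2
BLOCK_SIZE = 4096    # hash block size used in this manual


def sector_aligned_block_hashes(data, block_size=BLOCK_SIZE, sector_size=SECTOR_SIZE):
    """Yield (offset, md5_hex) for every full block that starts on a sector boundary."""
    last_start = len(data) - block_size
    # Step by one sector so that consecutive hash blocks overlap.
    for offset in range(0, last_start + 1, sector_size):
        block = data[offset:offset + block_size]
        yield offset, hashlib.md5(block).hexdigest()


if __name__ == "__main__":
    sample = bytes(range(256)) * 32   # 8192 bytes of illustrative data
    for offset, digest in sector_aligned_block_hashes(sample):
        print(offset, digest)
```

For an 8192-byte buffer the sketch produces nine hashes, at offsets 0, 512, 1024 and so on up to 4096, because a 4096-byte block can start at any sector boundary that still leaves a full block of data.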
Listing 1: Excerpt of a DFXML report file showing the MD5 output
<fileobject>
<filename>/home/bdallen/demo/mock_video.mp4</filename>
<filesize>10630146</filesize>
<ctime>2014-01-30T20:20:39Z</ctime>
<mtime>2014-01-30T19:04:59Z</mtime>
<atime>2014-01-30T20:04:52Z</atime>
<byte_run file_offset='0' len='4096'>
<hashdigest type='MD5'>63641a3c008a3d26a192c778dd088868</hashdigest></byte_run>
<byte_run file_offset='4096' len='4096'>
<hashdigest type='MD5'>c7dd2354e223c10856469e27686b8c6b</hashdigest></byte_run>
<byte_run file_offset='8192' len='4096'>
<hashdigest type='MD5'>ff540fda05d008ccebf2cca2ec71571d</hashdigest></byte_run>
<byte_run file_offset='12288' len='4096'>
<hashdigest type='MD5'>d3de47d704e85e0f61a91876236593d3</hashdigest></byte_run>
...
<byte_run file_offset='10625024' len='4096'>
<hashdigest type='MD5'>d2d958b44c481cc41b0121b3b4afae85</hashdigest></byte_run>
<byte_run file_offset='10629120' len='1026'>
<hashdigest type='MD5'>4640564a8655d3b201a85b4a76411b00</hashdigest></byte_run>
<hashdigest type='MD5'>a003483521c181d26e66dc09740e939d</hashdigest>
</fileobject>
2.2 DFXML
hashdb can be used to populate hash databases by importing block hashes from DFXML
files. DFXML is an XML language designed to represent a wide range of forensic information and forensic processing results. It allows the sharing of structured information
between independent tools and organizations [2].
Please note that hashdb does not require DFXML files to import hashes. The bulk_extractor
hashdb scanner can import hashes directly into a new hash database; see Section 3.3 for
importing using the bulk_extractor hashdb scanner. Also, third party tools can be
created for importing hashes directly into a hash database by interfacing with the hashdb
library API; see Section 2.4.
2.2.1 Creating a DFXML file using md5deep
The md5deep tool creates cryptographic hashes from hash blocks and produces DFXML
files. Listing 1 shows an excerpt of the DFXML file created by md5deep. The portion
of the file of interest to hashdb is contained in the “byte_run” tag. The “file_offset”
attribute is the number of bytes into the file where the cryptographic block hash was
calculated. The “len” attribute indicates the size of the block. The “hashdigest” tag
identifies the hash algorithm (MD5) and holds the long hexadecimal hash value. The “filename” tag indicates the file to which the hashes can be attributed.
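To make the roles of these tags concrete, the following Python sketch pulls the block hashes out of DFXML text using only the standard library. It is an illustration, not part of hashdb: the function names are hypothetical, and it assumes that each “hashdigest” block hash is nested inside its “byte_run” element and that the “filename” element precedes the “byte_run” elements within a “fileobject”.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{uri}' from a tag name."""
    return tag.rsplit("}", 1)[-1]


def block_hashes_from_dfxml(xml_text):
    """Return (filename, file_offset, length, hash_type, hash_value) tuples."""
    results = []
    root = ET.fromstring(xml_text)
    for fileobject in root.iter():
        if local_name(fileobject.tag) != "fileobject":
            continue
        filename = None
        for child in fileobject:
            name = local_name(child.tag)
            if name == "filename":
                filename = (child.text or "").strip()
            elif name == "byte_run":
                # Assumption: the block hash is a child of its byte_run element.
                for digest in child:
                    if local_name(digest.tag) == "hashdigest":
                        results.append((
                            filename,
                            int(child.attrib["file_offset"]),
                            int(child.attrib["len"]),
                            digest.attrib.get("type"),
                            (digest.text or "").strip(),
                        ))
    return results
```

A “hashdigest” element that sits directly under “fileobject” (the hash of the whole file) is deliberately skipped, because only the per-block hashes carry a file offset and length.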
Users may create DFXML files to import hashes from by using the md5deep tool.
md5deep is available at http://md5deep.sourceforge.net. For additional instructions on downloading and installing md5deep, go to http://github.com/simsong/hashdb/wiki/Installing-md5deep.
Choose a file or directory to use as the source of data for the hash file output. For
this manual, we use the file mock_video.mp4 available at http://digitalcorpora.org/downloads/hashdb/demo/.
Then, run md5deep with the following command:
md5deep -p 4096 -d mock_video.mp4 > mock_video.xml
The above command specifies:
• a block size of 4096 bytes (-p option)
• that the hash output will be written to a DFXML file (-d option)
• to write the output to the file mock_video.xml (the > symbol redirects the
output into the file)
The file mock_video.xml will be used in the next step to create the hash database.
However, any DFXML file containing block hash values can be used in hashdb.
Note, for this example we are using only one file to populate the DFXML. However,
users will typically be creating a block hash file from thousands of files in hundreds of
directories. To create a block hash file that recursively includes all files and directories
contained within a directory, use the command md5deep -r along
with the other options specified above.
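The piecewise hashing that the -p option requests can be approximated with the Python sketch below. It is not md5deep's implementation: the function names are hypothetical, and the sketch only assumes what Listing 1 shows, namely consecutive 4096-byte blocks with a shorter final block.

```python
import hashlib


def piecewise_md5(path, block_size=4096):
    """Yield (file_offset, length, md5_hex) for consecutive blocks of a file."""
    offset = 0
    with open(path, "rb") as handle:
        while True:
            block = handle.read(block_size)
            if not block:
                break
            # The final block may be shorter than block_size.
            yield offset, len(block), hashlib.md5(block).hexdigest()
            offset += len(block)


def block_layout(file_size, block_size=4096):
    """Return (number of full blocks, length of the trailing partial block)."""
    return divmod(file_size, block_size)
```

For the 10630146-byte file in Listing 1, block_layout(10630146) gives 2595 full blocks and a trailing block of 1026 bytes, which agrees with the final byte_run in the listing (file_offset 10629120, len 1026).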
2.2.2 Creating a DFXML file using fiwalk
The fiwalk tool can create block hashes of files in filesystems in an image, see http://www.forensicswiki.org/wiki/Fiwalk.
fiwalk is part of The Sleuth Kit® (TSK),
available from https://github.com/sleuthkit/sleuthkit.
For example run fiwalk with the following command:
fiwalk -x -S 4096 my_image.E01 > my_image.xml
The above command specifies:
• Send output to stdout (-x option).
• Perform sector hashes every 4096 bytes (-S option).
• Perform sector hashes on the file system in the my_image.E01 image.
• Direct output to the file my_image.xml.
2.2.3 Creating a DFXML file using hashdb
The export command of the hashdb tool writes out the block hashes in a hash database
along with their source information.
For example run hashdb with the following command:
hashdb export mock_video.hdb demoVideoHashes.xml
The above command exports the hashes and their source information from the hash
database mock_video.hdb to the DFXML file demoVideoHashes.xml.
2.3 Contents of a Hash Database
Each hashdb database is contained in a directory with the .hdb extension (for example, sample.hdb) that holds a number of files. These files are:
Bloom_filter_1
hash_store
history.xml
log.xml
settings.xml
source_filename_store.dat
source_filename_store.idx1
source_filename_store.idx2
source_lookup_store.dat
source_lookup_store.idx1
source_lookup_store.idx2
source_repository_name_store.dat
source_repository_name_store.idx1
source_repository_name_store.idx2
These files include XML files containing configuration settings and logs, a Bloom filter
file used for improving the speed of hash lookups, binary files containing stored hashes
from multiple sources and binary files that allow lookup of hash source names. Of these
files, the history, settings, and log files may be of interest to the user:
• log.xml
Every time a command is run that changes the content of the database, this file
is replaced with a log of the run. The log includes the command name, information about hashdb including the command typed and how hashdb was compiled,
information about the operating system hashdb was just run on, timestamps indicating how much time the command took, and the specific hashdb changes applied,
described in more detail in Section 3.2.
• history.xml
The purpose of this file is to provide full attribution for a database. Every hashdb
command executed that changes the state of the database is logged into the
log.xml file and is appended to the history.xml file. For hashdb commands
that involve manipulations from another database (or from two databases, as is
the case with the add_multiple command), the history files of those databases are
also appended. It can be difficult to follow the history.xml file because of its
XML format, but it provides full attribution nonetheless.
• settings.xml
This file contains the settings requested by the user when the block hash database
was created, see hashdb settings and Bloom filter settings options. This file also
contains internal hashdb configuration and versioning information that is specific
to how the hashdb tool was compiled.
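The Bloom filter file listed among the database files is described above only as a way to speed up hash lookups. The Python sketch below shows the general idea behind a Bloom filter: a compact bit array can answer "definitely not present" without consulting the full hash store. It is a toy model under stated assumptions; the class name, bit count and number of hash functions are hypothetical and do not reflect hashdb's actual Bloom filter settings or file format.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by salting the item with a counter.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        """Record an item (bytes) by setting its bit positions."""
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        """False means the item was never added; True means it may have been."""
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

A lookup that returns False can skip the more expensive search of the stored hashes, which is why such a filter helps when most scanned blocks are not in the database.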
2.4 Using the Hash Databases
hashdb provides the capability for users to scan the database for matching hash blocks
locally or remotely via a socket. Users can also query for hash source information and
information about the hash database itself. hashdb provides an API to access the import
and scan capabilities. The import capability allows third party tools to create a new
database at a specified directory, import an array of hashes with source information
and write changes to the log.xml file. The scan capability provided by the API allows
third party tools to open an existing database and perform a scan. Most importantly,
the bulk_extractor hashdb scanner uses the hashdb API to provide users with the
capability to create databases from disk images or scan digital media and find matching
hash blocks within the data bulk_extractor is processing. In later sections, this
manual describes the methods for using bulk_extractor together with the hashdb
tool.
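The scan capability described in this section can be pictured with a small in-memory model. The Python sketch below is not the hashdb API (which is a C++ library, see Appendix C); the function names and the dictionary used as a stand-in for a hash database are hypothetical. It assumes MD5 hashes of 4096-byte blocks tested at every 512-byte sector boundary of the media being scanned.

```python
import hashlib


def build_lookup(known_blocks):
    """Map md5 hex digest -> list of source labels for a dict {label: block_bytes}."""
    lookup = {}
    for source, block in known_blocks.items():
        lookup.setdefault(hashlib.md5(block).hexdigest(), []).append(source)
    return lookup


def scan(media, lookup, block_size=4096, sector_size=512):
    """Return (media_offset, digest, sources) for each sector-aligned block found in lookup."""
    matches = []
    for offset in range(0, len(media) - block_size + 1, sector_size):
        digest = hashlib.md5(media[offset:offset + block_size]).hexdigest()
        if digest in lookup:
            matches.append((offset, digest, lookup[digest]))
    return matches
```

Each match reports where in the media the block was found together with the sources recorded for that hash, which mirrors the two questions a scan answers: is this block known, and where did it come from.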
2.4.1 bulk_extractor
bulk_extractor is an open source digital forensics tool that extracts features such
as email addresses, credit card numbers, URLs and other types of information from
digital evidence files. It operates on disk images, files or a directory of files and extracts useful information without parsing the file system or file system structures. For
more information on how to use bulk_extractor for a wide variety of applications,
refer to the separate publication The bulk_extractor Users Manual available at http://digitalcorpora.org/downloads/bulk_extractor/BEUsersManual.pdf [1].
bulk_extractor has multiple scanners that extract features. One particular scanner,
the hashdb scanner, links the full set of bulk_extractor capabilities directly to the
hashdb tool. The hashdb scanner uses the hashdb API to create and import data into
hash databases directly from the data processed by bulk_extractor. When run with a
hash database as input, the scanner (again using the hashdb API) scans the
data processed by bulk_extractor for matching hash values.
The functionality of hashdb is provided through command line operation and the available API. The following section describes how to download, install and run hashdb.
3 Running hashdb
hashdb is a command line tool that can be run on Linux, MacOS or Windows systems.
Here we describe the installation procedures for those systems as well as the basic commands used to run the tool, including creating and maintaining a database and scanning
media for hash values.
3.1 Installation Guide
The following sections explain how to install the required dependencies as well as download hashdb and compile the release or run the executable.
3.1.1 Installing on Linux or Mac
Before compiling hashdb for your platform, you may need to install other packages on
your system which hashdb requires to compile cleanly and with a full set of capabilities.
Dependencies for Linux
The following commands should add the appropriate packages:
sudo yum update
sudo yum groupinstall development-tools
sudo yum install gcc-c++
sudo yum install libxml2-devel openssl-devel tre-devel boost-devel
Dependencies for Mac Systems
Mac users must first install Apple’s Xcode development system. Other components
should be downloaded using the MacPorts system. If you do not have MacPorts, go to
the App store and download and install it. It is free. Once it is installed, try:
sudo port install autoconf automake libxml2
Download and Install hashdb
Next, download the latest version of hashdb. The software can be downloaded from http://digitalcorpora.org/downloads/hashdb/.
The file to download is hashdb-x.y.z.tar.gz
where x.y.z is the latest version. As of publication of this manual, the latest version of
hashdb is 1.0.0.
After downloading the file, un-tar it by either right-clicking on the file and choosing
“extract to...” or typing the following at the command line:
tar -xvf hashdb-x.y.z.tar.gz
Then, in the newly created hashdb-x.y.z directory, run the following commands to install
hashdb in /usr/local/bin (by default):
./configure
make
sudo make install
hashdb is now installed on your system and can be run from the command line.
Note: sudo is not required. If you do not wish to use sudo, build and install hashdb and
bulk_extractor in your own space at “$HOME/local” using the following commands:
./configure --prefix=$HOME/local/ --exec-prefix=$HOME/local CPPFLAGS=-I$HOME/local/include/ LDFLAGS=-L$HOME/local/lib/
make
make install
Figure 3: Windows 8 warning when trying to run the installer. Select “More Info” and
then “Run Anyway.”
3.1.2 Installing on Windows
Windows users should download the Windows Installer for hashdb. The file to download
is located at http://digitalcorpora.org/downloads/hashdb and is called
hashdb-x.y.z-windowsinstaller.exe where x.y.z is the latest version number (1.0.0 as of publication of this manual).
You should close all Command windows before running the installation executable. Windows will not be able to find the hashdb tools in a Command window if any are open
during the installation process. If you do not do this before installation, simply close all
Command windows after installation. When you re-open, Windows should be able to
find hashdb.
Next run the hashdb-x.y.z-windowsinstaller.exe file. This will automatically install
hashdb on your machine. Some Windows safeguards may try to prevent you from running
it. Figure 3 shows the message Windows 8 displays when trying to run the installer. To
run anyway, click on “More info” and then select “Run Anyway.”
When the installer file is executed, the installation will begin and show a dialog like the
one shown in Figure 4. Users should select the default configuration, which will be the
64-bit configuration for 64-bit Windows systems, or the 32-bit configuration for 32-bit
Windows systems. Click on “Install” and the installer will install hashdb on your system
and then notify you when it is complete. hashdb is now installed on your system and can be
run from the command line.
Figure 4: Dialog appears when the user executes the Windows Installer. Select the
default configuration.
3.1.3 Installing Other Related Tools
Download and Install bulk_extractor
The bulk_extractor hashdb scanner provides the capability to import block hashes
into a new hash database and to scan for hashes against an existing hash database.
This scanner is included in bulk_extractor version 1.4.5 or later. For detailed instructions on downloading and installing bulk_extractor, please refer to the Users Manual
found at http://digitalcorpora.org/downloads/bulk_extractor/BEUsersManual.pdf.
Note: hashdb must be installed first for bulk_extractor to build properly with
hashdb. bulk_extractor will automatically install the hashdb scanner but only if the
hashdb library has been installed. Otherwise, bulk_extractor will build without the
hashdb scanner. To check that the hashdb scanner is enabled, look for it
in the output of running ./configure, or type bulk_extractor -h and look for hashdb
setting options.
Download and Install md5deep
md5deep is available at https://github.com/jessek/hashdeep/releases/tag/release4.4. Additional platform-specific installation instructions are provided at https://github.com/simsong/hashdb/wiki/Installing-md5deep.
Download and Install fiwalk
Please see http://www.forensicswiki.org/wiki/Fiwalk. fiwalk is part of The Sleuth
Kit® (TSK), available from https://github.com/sleuthkit/sleuthkit.
3.2 hashdb Commands
The core capabilities provided by hashdb involve creating and maintaining a database of
hash values and scanning media for those hash values. To perform those tasks, hashdb
users need to start by building a database (if an existing database is not available for
use). Users then import hashes using a DFXML file or by using the bulk_extractor
hashdb scanner, and then possibly merge or subtract hashes to obtain the desired set of
hashes to scan against. Users then scan for hashes that match. Additional commands
are provided to support statistical analysis, performance tuning and performance analysis.
This section describes hashdb commands, along with examples, for performing these
tasks. For more examples of command usage, please see Section 4. For a hashdb
quick reference summary, please see Appendix A and http://digitalcorpora.org/downloads/hashdb/hashdb_quick_reference.pdf.
3.2.1 Creating a Hash Database
A hash database must be created before hashes can be added to it. The command to
create a hash database is shown in Table 1. Table 2 shows the optional parameters that
can be used to specify database settings. Bloom filter settings for performance tuning
are not shown.
Hash Block Size
This setting specifies the hash block size used to generate hashes. The hash block size
must be greater than or equal to the sector size of 512, and must be divisible by 512 in
order to be byte aligned, as discussed in Section 2.1.
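The two constraints on the hash block size can be written as a single check. The Python sketch below is illustrative only; the function name is hypothetical and it is not how hashdb validates its settings.

```python
SECTOR_SIZE = 512  # sector size stated in this section


def is_valid_hash_block_size(hash_block_size, sector_size=SECTOR_SIZE):
    """True if the size is at least one sector and a whole multiple of the sector size."""
    return hash_block_size >= sector_size and hash_block_size % sector_size == 0
```

Under this check 512 and 4096 are acceptable sizes, while 256 (smaller than a sector) and 1000 (not a multiple of 512) are not.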
Maximum Duplicates
This setting specifies the maximum number of duplicates of a hash value that hashdb
may put into the database. A default value of 0 means unlimited, but this may be
unreasonable. For example if a block is repeated many times and is thus not interesting,
limit storing its duplicates using this setting.
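The effect of a duplicate limit can be pictured with a toy in-memory store. The Python sketch below is not hashdb's storage code: the class and its counters are hypothetical, and it assumes that the limit caps the total number of entries kept for one hash value, with 0 meaning no limit as stated above.

```python
from collections import defaultdict


class ToyHashStore:
    """Toy in-memory model of a duplicate cap; not hashdb's actual storage."""

    def __init__(self, max_duplicates=0):
        self.max_duplicates = max_duplicates      # 0 means unlimited
        self.sources = defaultdict(list)          # hash value -> list of sources
        self.inserted = 0
        self.not_inserted = 0

    def insert(self, hash_value, source):
        """Store the source for a hash unless the duplicate cap is already reached."""
        entries = self.sources[hash_value]
        if self.max_duplicates and len(entries) >= self.max_duplicates:
            self.not_inserted += 1
            return False
        entries.append(source)
        self.inserted += 1
        return True
```

With a limit of 2, a third insertion of the same hash value is refused and counted separately, so a frequently repeated and uninteresting block does not grow the store without bound.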
Example
To create an (empty) hash database named mock_video.hdb, type the following command:
hashdb create mock_video.hdb
The above command will create a database with all of the default hash database settings.
Most users will not need to change those settings. Our DFXML file was created with
a block size of 4096 bytes, which matches the default hash block size. Users can specify
each setting with either its short option or its verbose option along with the create command, as in:
hashdb create --max_duplicates=20 mock_video.hdb
hashdb create -m 20 mock_video.hdb
The above two commands produce identical results, creating the database mock_video.hdb
that will accept a maximum of 20 hash duplicates.
Table 1: Commands Available in hashdb Command Line Tool to Create a Database
Command   Usage                                                       Description
create    create [-p <hash_block_size>] [-m <maximum>] <hashdb.hdb>   Creates a new hash database with the given configuration parameters.
3.2.2 Importing and Exporting between a DFXML File and a Hash Database
Commands to import and export hashes are shown in Table 3. Once a database has
been created, it may be populated with hash values from a DFXML file. Note that there
are other ways to populate a database besides importing from a DFXML file, including
using other hash databases (discussed in Section 4.4), by using the bulk_extractor
hashdb scanner (discussed in Section 4.3), and through the use of the import capability
provided by the API (discussed in Section 4.2).
Table 2: Settings for New Databases
Option   Verbose Option                      Specification
-p       --hash_block_size=hash_block_size   Specifies the block size (hash_block_size) in bytes used to generate the hashes that will be stored in the database. Default is 4096 bytes.
-m       --max_duplicates=maximum            Specifies the maximum number of hash duplicates allowed. A value of 0 indicates there is no limit. Default is 0.
Using the DFXML file created in the previous section, type the following command:
hashdb import -r mock_video_repository mock_video.xml mock_video.hdb
In the above command the option -r is used along with the repository name
mock_video_repository to indicate the repository source of the block hashes being imported into
the database. The repository name is used to keep track of the sources of hashes. Hash
blocks contained in one database often originate from many different sources and the
filename may be the same. For example, if we add two separate but similar databases with
partial overlap to a database, this will result in some duplicate hashes from multiple
sources with the same filename. The repository name can be used with those duplicates
to allow users to track all hashes back to their original sources. By default, the repository name used is the text repository_ with the filename of the file being imported
from appended after it.
Table 3: Commands Available in hashdb Command Line Tool to Import and Export
between DFXML Files and Hash Databases
Command   Usage
import    import [-r <repository name>] <DFXML file> <hashdb.hdb>
export    export <hashdb.hdb> <DFXML file.xml>
Listing 2: Excerpt of the log file
...
mock_video_repository
<timestamp name='begin import' delta='0.024016' total='0.024016'/>
<timestamp name='end import' delta='0.015009' total='0.039025'/>
<hashdb_changes>
<hashes_inserted>2595</hashes_inserted>
...
The database mock_video.hdb now holds 2595 hash values. Navigate into the directory
mock_video.hdb. It will contain a set of database files; the following lists the
contents:
 4097 Mar 9 21:52 Bloom_filter_1
90112 Mar 9 21:56 hash_store
 3788 Mar 9 21:52 history.xml
 3573 Mar 9 21:56 log.xml
 3105 Mar 9 21:52 settings.xml
   47 Mar 9 22:21 source_filename_store.dat
 8192 Mar 9 21:56 source_filename_store.idx1
 8192 Mar 9 21:56 source_filename_store.idx2
   25 Mar 9 22:21 source_lookup_store.dat
 8192 Mar 9 21:56 source_lookup_store.idx1
 8192 Mar 9 21:56 source_lookup_store.idx2
   37 Mar 9 22:21 source_repository_name_store.dat
 8192 Mar 9 21:56 source_repository_name_store.idx1
 8192 Mar 9 21:56 source_repository_name_store.idx2
The file log.xml will show that a set of hash blocks have just been inserted. Listing
2 shows the excerpt of the log file that tracks this statistic. Users can also run the
following command to get information about the contents of the database (and confirm
that values were inserted):
hashdb statistics mock_video.hdb
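The change statistics can also be read from the log file programmatically. The Python sketch below is an illustration, not a hashdb utility: the function name is hypothetical, only the hashdb_changes and hashes_inserted element names come from the log excerpt in this manual, and the sketch assumes the log is well-formed XML without namespaces.

```python
import xml.etree.ElementTree as ET


def read_hashdb_changes(log_xml_text):
    """Return {statistic_name: int} for the children of <hashdb_changes>."""
    root = ET.fromstring(log_xml_text)
    if root.tag == "hashdb_changes":
        changes = root
    else:
        changes = root.find(".//hashdb_changes")
    if changes is None:
        return {}
    return {
        child.tag: int(child.text)
        for child in changes
        if child.text and child.text.strip().isdigit()
    }
```

As a usage example, passing the text of mock_video.hdb/log.xml to read_hashdb_changes would be expected to yield a dictionary containing hashes_inserted, which can then be compared with the count printed on the console.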
3.2.3 Manipulating Hash Databases
Databases may need to be merged together or common hash values may need to be
subtracted out in order for them to be more suitable for scanning against. Commands
that manipulate hash databases are outlined in Table 4. Except for the deduplicate
command, the target database must already exist. For the deduplicate command, if
the target does not exist, one will be created with the same configuration settings as the
source.
3.2.4 Tracking Changes in Hash Databases
Statistics about hash database changes are reported on the console and to the log file
and history file inside the hash database. These statistics show the number of hashes
inserted or removed as a result of a command, and also show the number of hashes not
inserted or not removed because specific conditions were not met. These statistics are
shown in Table 5.
Table 4: Commands Available in hashdb Command Line Tool to Manipulate Hash
Databases
Command
add
Usage
add