Bat Manual
User Manual: Pdf
Open the PDF directly: View PDF .
Page Count: 57
Download | |
Open PDF In Browser | View PDF |
Binary Analysis Tool User and Developer Manual - describing version 37 Armijn Hemel – Binary Analysis Tool project April 24, 2018 Contents 1 Introducing the Binary Analysis Tool 5 2 Installing the Binary Analysis Tool 2.1 Hardware requirements . . . . . . . . . . . 2.2 Software requirements . . . . . . . . . . . 2.2.1 Security warning . . . . . . . . . . 2.2.2 Installation on Fedora . . . . . . . 2.2.3 Installation on Debian and Ubuntu 2.2.4 Installation on CentOS . . . . . . . . . . . . . . . . . . 3 Analysing binaries with the Binary Analysis 3.1 Running bat-scan . . . . . . . . . . . . . . . 3.2 Interpreting the results . . . . . . . . . . . . . 3.2.1 Output archive . . . . . . . . . . . . . 3.2.2 Viewing results with batgui . . . . . 3.2.3 Viewing results with batgui2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 5 5 5 6 6 6 Tool . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 6 7 7 8 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 9 9 9 10 10 11 11 . . . . . . . . . . . . . . . . . . 4 Additional programs in the Binary Analysis Tool 4.1 busybox.py and busybox-compare-configs.py . 4.1.1 Extracting a configuration from BusyBox . 4.1.2 Comparing two BusyBox configurations . . 4.2 comparebinaries.py . . . . . . . . . . . . . . . . 4.3 sourcewalk.py . . . . . . . . . . . . . . . . . . . . 4.4 verifysourcearchive.py . . . . . . . . . . . . . . 4.5 findxor.py . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 Binary Analysis Tool extratools collection A BAT scanning phases A.1 Identifier search . . A.2 Pre-run checks . . A.3 Unpackers . . . . . A.4 Leaf scans . . . . . A.5 Aggregators . . . . A.6 Post-run methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 12 13 13 14 15 16 B Scan configuration B.1 Global configuration . . . . . . . . . . . . . . . . . . . . . . . B.1.1 multiprocessing and processors . . . . . . . . . . . B.1.2 writeoutputfile . . . . . . . . . . . . . . . . . . . . B.1.3 outputlite . . . . . . . . . . . . . . . . . . . . . . . . B.1.4 configdirectory . . . . . . . . . . . . . . . . . . . . B.1.5 unpackdirectory . . . . . . . . . . . . . . . . . . . . B.1.6 temporary unpackdirectory . . . . . . . . . . . . . . B.1.7 debug and debugphases . . . . . . . . . . . . . . . . . B.1.8 postgresql user, postgresql password, postgresql postgresql host and postgresql port . . . . . . . . B.1.9 usedatabase . . . . . . . . . . . . . . . . . . . . . . . B.1.10 reporthash . . . . . . . . . . . . . . . . . . . . . . . . B.1.11 reportendofphase . . . . . . . . . . . . . . . . . . . . B.1.12 packconfig and scrub . . . . . . . . . . . . . . . . . B.1.13 template . . . . . . . . . . . . . . . . . . . . . . . . . B.1.14 scansourcecode . . . . . . . . . . . . . . . . . . . . . B.1.15 dumpoffsets . . . . . . . . . . . . . . . . . . . . . . . B.1.16 cleanup . . . . . . . . . . . . . . . . . . . . . . . . . . B.1.17 compress . . . . . . . . . . . . . . . . . . . . . . . . . B.1.18 packpickles . . . . . . . . . . . . . . . . . . . . . . . B.1.19 markersearchminimum . . . . . . . . . . . . . . . . . . B.1.20 tasktimeout . . . . . . . . . . . . . . . . . . . . . . . B.1.21 tlshmaxsize . . . . . . . . . . . . . . . . . . . . . . . B.1.22 Global environment variables . . . . . . . . . . . . . . B.2 Viewer configuration . . . . . . . . . . . . . . . . . . . . . . . B.3 Enabling and disabling scans . . . . . . . . . . . . . . . . . . B.4 Blacklisting and whitelisting scans . . . . . . . . . . . . . . . B.5 Passing environment variables . . . . . . . . . . . . . . . . . . B.6 Scan names . . . . . . . . . . . . . . . . . . . . . . . . . . . . B.7 Scan conflicts . . . . . . . . . . . . . . . . . . . . . . . . . . . B.8 Storing results . . . . . . . . . . . . . . . . . . . . . . . . . . B.9 Running setup code . . . . . . . . . . . . . . . . . . . . . . . C Analyser internals C.1 Code organisation . . . . . . . . . . . . . . C.2 Pre-run methods . . . . . . . . . . . . . . . C.2.1 Writing a pre-run method . . . . . . C.3 Unpackers . . . . . . . . . . . . . . . . . . . C.3.1 Writing an unpacker . . . . . . . . . C.3.2 Adding an identifier for a file system C.3.3 Blacklisting and priorities . . . . . . C.4 Leaf scans . . . . . . . . . . . . . . . . . . . C.4.1 Writing a leaf scan . . . . . . . . . . C.5 Aggregators . . . . . . . . . . . . . . . . . . C.5.1 Writing an aggregator . . . . . . . . C.6 Post-run methods . . . . . . . . . . . . . . . C.6.1 Writing a post-run method . . . . . . . . . . . . . . . . . . . . . db, . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . or compressed file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 17 17 17 17 17 18 18 18 18 19 19 19 19 20 20 20 20 20 20 21 21 21 21 21 22 22 22 22 22 23 23 23 24 25 25 26 26 27 27 28 28 29 29 30 30 D Building binary packages of the Binary Analysis Tool D.1 Building packages for RPM based systems from Git . . . . D.1.1 Building bat . . . . . . . . . . . . . . . . . . . . . . D.1.2 Building bat-extratools . . . . . . . . . . . . . . . D.2 Building packages for DEB based systems from releases . . D.3 Building packages for DEB based systems from Subversion . D.3.1 Building bat . . . . . . . . . . . . . . . . . . . . . . D.3.2 Building bat-extratools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 30 30 31 31 31 31 32 E Binary Analysis Tool knowledgebase E.1 Generating the package list . . . . . . . . . . . . . . . . E.2 Creating the database . . . . . . . . . . . . . . . . . . . E.3 License extraction and copyright information extraction E.4 Setting up PostgreSQL . . . . . . . . . . . . . . . . . . . E.4.1 Authentication configuration . . . . . . . . . . . E.4.2 Creating the database and database user . . . . E.5 Database design . . . . . . . . . . . . . . . . . . . . . . E.5.1 processed table . . . . . . . . . . . . . . . . . . E.5.2 processed file table . . . . . . . . . . . . . . . E.5.3 extracted string table . . . . . . . . . . . . . . E.5.4 extracted function table . . . . . . . . . . . . E.5.5 extracted name table . . . . . . . . . . . . . . . E.5.6 extracted copyright table . . . . . . . . . . . . E.5.7 hashconversion table . . . . . . . . . . . . . . . E.5.8 kernel configuration table . . . . . . . . . . . E.5.9 kernelmodule alias table . . . . . . . . . . . . E.5.10 kernelmodule author table . . . . . . . . . . . . E.5.11 kernelmodule description table . . . . . . . . E.5.12 kernelmodule firmware table . . . . . . . . . . E.5.13 kernelmodule license table . . . . . . . . . . . E.5.14 kernelmodule parameter table . . . . . . . . . . E.5.15 kernelmodule parameter description table . . E.5.16 kernelmodule version table . . . . . . . . . . . E.5.17 licenses table . . . . . . . . . . . . . . . . . . . E.5.18 renames table . . . . . . . . . . . . . . . . . . . . E.5.19 security cert table . . . . . . . . . . . . . . . . E.5.20 security cve table . . . . . . . . . . . . . . . . E.5.21 security password table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 32 33 34 34 34 35 35 36 36 37 37 37 38 38 38 38 39 39 39 39 40 40 40 40 41 41 41 41 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . F Identifier extraction and ranking scan 42 F.1 Configuring identifier extraction . . . . . . . . . . . . . . . . . . . 42 F.2 Configuring the ranking method . . . . . . . . . . . . . . . . . . 42 F.2.1 Interpreting the results . . . . . . . . . . . . . . . . . . . 43 G BusyBox script internals G.1 Detecting BusyBox . . . . . . . . . . . . . . . . . . G.2 BusyBox version strings . . . . . . . . . . . . . . . G.3 BusyBox configuration format . . . . . . . . . . . . G.4 Extracting a configuration from a BusyBox binary G.4.1 BusyBox linked with uClibc . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 44 44 45 45 46 G.4.2 BusyBox linked with glibc & uClibc exceptions G.5 Pretty printing a BusyBox configuration . . . . . . . . G.6 Using BusyBox configurations . . . . . . . . . . . . . . G.7 Extracting configurations from BusyBox sourcecode . . . . . . . . . . . . . . . . . . . . . . . . . 46 47 48 48 H Linux kernel identifier extraction H.1 Extracting visible strings from the Linux kernel binary H.2 Extracting visible strings from a Linux kernel module H.3 Extracting strings from the Linux kernel sources . . . H.3.1 EXPORT SYMBOL and EXPORT SYMBOL GPL . . . . H.3.2 module param . . . . . . . . . . . . . . . . . . . H.4 Forward porting and back porting . . . . . . . . . . . H.5 Corner cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 49 49 49 49 50 50 50 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 51 51 51 51 52 I Binary Analysis Tool performance tips I.1 Choose the right hardware . . . . . . . . . I.2 Use outputlite . . . . . . . . . . . . . . I.3 Use AGGREGATE CLEAN when scanning Java I.4 Disable tmp on tmpfs . . . . . . . . . . . . I.5 Use tmpfs for writing temporary results . . . . . . . JAR . . . . . . . . . . . . files . . . . . . . . . . . J Description for scans using the database 52 J.1 file2package . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 K Parameter description for K.1 compress . . . . . . . . K.2 generatejson . . . . . . K.3 jffs2 . . . . . . . . . . K.4 lzma . . . . . . . . . . . K.5 tar . . . . . . . . . . . . K.6 xor . . . . . . . . . . . . K.7 zip . . . . . . . . . . . . K.8 findlibs . . . . . . . . K.9 findsymbols . . . . . . K.10 generateimages . . . . K.11 identifier . . . . . . . K.12 licenseversion . . . . K.13 prunefiles . . . . . . . K.14 hexdump and images . . default . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . L Default ordering of scans in L.1 Pre-run scans . . . . . . . L.2 Unpack scans . . . . . . . L.3 Leaf scans . . . . . . . . . L.4 Aggregate scans . . . . . . BAT . . . . . . . . . . . . . . . . scans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 52 53 53 53 53 53 53 54 54 54 54 54 54 54 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 55 55 56 56 . . . . . . . . . . . . 1 Introducing the Binary Analysis Tool The Binary Analysis Tool (BAT) is a generic framework that can help developers and companies analyse binary files. Its primary application is for Open Source software license compliance, with a special focus on supply chain management in consumer electronics, but it can also be used for other checks of binary for example the presence of security bugs. BAT consists of several programs written in Python. The most important program is the scanner for binary objects to unpack binaries recursively and apply a number of scans, for example for open source license compliance, visualising linking information, finding version information, and so on. There are also other programs to help with specific license compliance tasks, such as verifying if configurations for a given BusyBox binary match with the configuration in source code. Also included is a very experimental program to derive a possible configuration from a Linux kernel image, as well as programs to verify results from a binary scan with a source code archive. Development of security analysis features in BAT has been made possible through a joint grant from NLnet foundation and the programme “veilig door innovatie” from NCTV. BAT is currently no longer actively maintained. 2 2.1 Installing the Binary Analysis Tool Hardware requirements The tools in the Binary Analysis Tool can be quite resource intensive. They are largely I/O-bound (database access, reading files from disk), so it is better to invest in faster disks or ramdisks than in raw CPU power. Using more cores is also highly recommended, since most of the programs in the Binary Analysis Tool can take advantage of this and will run significantly faster. 2.2 Software requirements To run BAT a recent Linux distribution is needed, such as a recent Fedora or Ubuntu. Ubuntu versions older than 14.04, will not work due to a broken version of the PyDot package (unless a newer version is installed using pip). Debian versions older than 7 are unsupported. Versions older than Fedora 20 might not work scanning because of a bug in the version of matplotlib shipped on those distributions will throw errors in some cases. If the latest version from version control is used it is important to look at the file setup.cfg to get a list of the dependencies that should be met on the host system before installing BAT if the host system is Fedora. If the host system is Ubuntu or Debian this information will be in debian/control. 2.2.1 Security warning Do not install BAT on a machine that is performing any critical functions for your organisation. There are certain pieces of code in BAT that have known security issues, such as some of the Squashfs unpacking programs in bat-extratools that have been lifted from vendor SDKs. When scanning untrusted binary code there might be security risk. 2.2.2 Installation on Fedora To install on Fedora two packages are needed: bat-extratools, and bat. These packages can easily be created as both binary versions and as source RPM files from the source code in Git (as described later). When installing the two files there should be a list of dependencies that should be installed to let BAT work successfully. Some of the dependencies are not in Fedora by default but need to be installed through external repositories, such as RPMfusion. 2.2.3 Installation on Debian and Ubuntu To install on Debian and Ubuntu two packages are needed: bat-extratools, and bat. These packages can easily be generated as DEB packages from the source code in Git using the commands described later in this manual. When installing the two files there should be a list of dependencies that should be installed to let BAT work successfully. Some of these packages are not in Debian by default but need to be installed by enabling extra repositories such as Debian non-free. 2.2.4 Installation on CentOS In some cases it is possible to run BAT on CentOS (6.7 or 7 has been tested with) but some functionality will not be available, such as UBI/UBIFS unpacking and the scans creating graphs with PyDot (ELF linking, kernel module linking). It might be necessary to enable the EPEL repository ( https://fedoraproject. org/wiki/EPEL ) as well as RepoForge. A few packages might have to be installed manually. 3 Analysing binaries with the Binary Analysis Tool BAT consists of several programs and a few helper scripts (not meant to be used directly). The main purpose of the Binary Analysis tool is to analyse arbitrary binaries and review results. Analysis of the binary is done via a commandline tool (bat-scan), while the results can be viewed using a special graphical viewer (batgui). 3.1 Running bat-scan The bat-scan program can scan in two modes: either scan a single binary, or scan a whole directory of files. These are mutually exclusive and you cannot mix and match parameters for both modes. To scan a single binary you will need to supply three parameters to bat-scan: 1. -c : path to a configuration file 2. -b : path to the binary to be scanned 3. -o : path to an output file, where unpacked files, reports, plus the final program state be written to. This file can later be opened with the viewer. The default install of BAT comes with a configuration file (installed in /etc/bat/ although this will likely change in the future) with default settings that have proven to work well but almost everything can be changed or tweaked. A lengthy explanation of the different types of scans and their configuration can be found in the appendix. A typical invocation looks like this: python bat-scan -c /path/to/configuration -b /path/to/binary -o /path/to/outputfile To scan a directory you will need to supply three parameters to bat-scan: 1. -c : path to a configuration file 2. -d : path to a directory with files to be scanned 3. -u : path to a directory where output files will be written to For example: python bat-scan -c /path/to/configuration -d /path/to/dirwithbinaries -u /path/to/dirwithoutputfiles The format of output files in “directory scan” mode will be the name of the original file with the suffix .tar.gz. If there is already a file with that name in the output directory the file will not be scanned again. If the file should be scanned again, then the output file should be (re)moved. 3.2 Interpreting the results bat-scan will output an rchive file containing program state, complete unpacked directory tree containing all unpacked data (unless outputlite was set to yes), plus possibly some extra generated data, such as pictures and more reporting. These dumps are meant to be used by batgui or processed by other programs. 3.2.1 Output archive The output archive contains a few files and directories (depending on scan configuration): • scandata.pickle - Python pickle containing information about the structure of the binary, including offsets, paths, tags, and so on. It does not contain any of the actual scan results. • scandata.json - JSON file containing a subset of the information in scandata.pickle. This file is only generated if the generatejson scan is enabled. • STATISTICS - text file containing some statistics about the version of BAT used, the underlying Python implementation, and the scan times for each of the phases and some subphases. This file will likely be replaced by a JSON file in the future. • data - directory containing the full unpacked directory tree. If outputlite is set to yes this directory will be omitted from the output archive. • filereports - directory containing Python pickle files (gzip compressed) with scan results. Since identical files might be present the results are stored per checksum, not file name. • images - directory containing various images with results of scans (depending on which scans are enabled), per checksum • offsets - directory with Python pickle files containing the offsets of possible file systems, compressed files and media files found in the file. This directory as well as its files will only be created if dumpoffsets is set to yes in the global configuration. • reports - directory containing HTML and (optionally) JSON reports, per checksum 3.2.2 Viewing results with batgui The batgui program was made to view the results of the analysis process easily. The viewer has two modes: simple and advanced. In simple mode a tree of the unpack results will be shown, and each file in the tree can be clicked to display more information. Depending on which scans were run the tree will be decorated with more information, such as the type of the file (based on tags), or if matches were found with the ranking method. Using a filtering system (available from the menu) files that are typically uninteresting for license compliance engineering (empty files/directories, symbolic links, graphics files and so on) can be ignored. Information that is shown per file depends on the scans that were run and the type of file. For most files information like size, type, path (both relative inside the unpacked binary, as well as absolute in the scanning tree) will be shown. If the ranking method was enabled results of the ranking process such as matched strings, function names, a license guess etecetera will be displayed as well. In the optional advanced mode more results will be shown, such as a graphical representation of a file, where every bit in the binary has been assigned a grayscale value, plus a textual representation of a file generated with hexdump. Advanced mode is disabled by default, since loading the additional pictures and data is quite resource intensive and it will only be useful in very specific cases. It also requires that these special files are generated by BAT when scanning a file. This is not done by default but needs an explicit configuration change. Advanced mode might be removed from the GUI in future versions of BAT. The batgui program is no longer part of the default distribution of BAT, but can be downloaded from the BAT repository. 3.2.3 Viewing results with batgui2 A new user interface was created, based on Qt5. It is not part of the regular distribution of BAT, but can be grabbed from https://github.com/monkeyiq/ batgui2 4 4.1 Additional programs in the Binary Analysis Tool busybox.py and busybox-compare-configs.py Two other tools in BAT are busybox-compare-configs.py and busybox.py (in the subdirectory bat). These two tools are specifically used to analyse BusyBox binaries. BusyBox is in widespread use on embedded devices and the license violations of BusyBox are actively enforced in court. BusyBox binaries on embedded machines often have different configurations, depending on the needs of the manufacturer. Since providing the correct configuration is one of the requirements for license compliance it is important to be able to determine the configuration of a BusyBox binary and verify that there is a corresponding configuration file in the source code release. The BusyBox processing tools in BAT try to extract the most likely configuration from the binary and print it in the right format for that version of BusyBox. busybox.py is used to extract the configuration from a binary. Afterwards busybox-compare-configs.py can be used to compare the extracted configuration with a vendor supplied configuration. 4.1.1 Extracting a configuration from BusyBox Extracting a configuration from a BusyBox executable is done using busybox.py which can be found in the bat directory. It needs two commandline parameters: the path to the binary and the path to a directory which has files containing mappings from BusyBox applet names to BusyBox configuration directives. By default this value is hardcoded as /etc/bat, but this might change in the future. Some pre-extracted configurations can be found in the bat-data package (coming soon). The output (a possible BusyBox configuration) is written to standard output. python bat/busybox.py -b /path/to/busybox/binary -c /path/to/pre/extracted/configs > /path/to/saved/config This command will save the configuration to a file, which can be used as an input to busybox-compare-configs.py. 4.1.2 Comparing two BusyBox configurations After extracting the configuration the extracted configuration can be compared to another configuration, for example a configuration as supplied by a vendor in a source code archive: python busybox-compare-configs.py -e /path/to/saved/config -f /path/to/vendor/configuration -n $version 4.2 comparebinaries.py The comparebinaries.py program compares two file trees with for example unpacked firmwares. It is intended to find out which differences there are between two binaries (like firmwares) unpacked with BAT. There are two scenarios where this program can be used: 1. comparing an old firmware (that is already known and which has been verified) to a new firmware (update) and see if there are any differences. 2. comparing a firmware to a rebuild of a firmware as part of compliance engineering. A few assumptions are made: 1. both firmwares were unpacked using the Binary Analysis Tool 2. files that are in the original firmware, but not in the new firmware, are not reported (example: removed binaries). This will change in a future version. 3. files that are in the new firmware but not not in the original firmware are reported, since this would mean additions to the firmware which need to be checked. 4. files that appear in both firmwares but which are not identical are checked using bsdiff to determine the size of the difference. With checksums it is easy to find the files that are different. Using bsdiff it becomes easier to prioritise based on the size of the difference. Small differences are probably not very interesting at all: 1. time stamps (BusyBox, Linux kernel, and others record a time stamp in the binary) 2. slightly different build system settings (home directories, paths, and so on). Bigger differences are of course much more interesting. 4.3 sourcewalk.py This program can quickly determine whether or not source code files in a directory can be found in known upstream sources. It uses a pregenerated database containing names and checksums of files (for example the Linux kernel) and reports whether or not the source code files can be found in the database based on these checksums. The purpose of this script is to find source code files that cannot be found in upstream sources to reduce the search space during a source code audit. This script will not catch: • binary files • patch/diff files • anything that does not have an extension from the list in the script • configuration files/build scripts 4.4 verifysourcearchive.py The verifysourcearchive.py program is to verify a source code archive using the result of a scan done with BAT. 4.5 findxor.py The findxor.py program can be used to find possible XOR “encryption” keys. It prints the top 10 (hardcoded limit) of most common byte sequences (16 bytes) in the file. These can then be added to the batxor.py module in BAT. This will likely change in the future. 5 Binary Analysis Tool extratools collection To help with unpacking non-standard file systems, or standard file systems for which there are no tools readily available on Fedora or Ubuntu there is also a collection of tools that can be used by BAT to unpack more file systems. These tools are not part of the standard distribution, but have to be installed separately. They are governed by different license conditions than the core BAT distribution. Currently the collection consists of: • bat-minix has a Python script to unpack Minix v1 file systems that are frequently found on older embedded Linux systems, such as IP cameras. • modified version of code2html (which is unmaintained by the upstream author) that adds support for various more programming languages. This tool is not needed by BAT. • unmodified version of simg2img needed for converting Android sparse files to ext4 file system images. • unmodified version of romfsck needed for unpacking romfs file systems. • modified version of cramfsck that enables unpacking cramfs file systems. • reimplementation of unyaffs that enables unpacking for various YAFFS2 file systems. • various versions of unsquashfs that enable unpacking variants of SquashFS. These versions have either been lifted from vendor SDKs, the OpenWrt project, DD-WRT, or upstream SquashFS project. • ubi reader is a set of tools to deal with UBI/UBIFS images, currently not used by default. • bat-visualisation containing a few custom tools to help generate pictures. These might be removed in the future. The collection is split in two packages: the ubi reader package contains UBI/UBIFS specific tools and the bat-extratools package contains the rest. A BAT scanning phases BAT uses a brute force approach for analysing a binary. It assumes no prior knowledge of how a binary is constructed or what is inside the binary. Instead it tries to determine what is inside by applying a wide range of methods, such as looking for known identifiers of file systems and compressed files and running external tools to find contents in the binary. It should be noted that there are possibilities to add more information to the system to speed up scanning and skip phases. During scanning of a file the following steps are taken: 1. identifier search, using a list of known identifiers, like headers, footers or identifiers that indicate the start or end of a file system, compressed file or media file. 2. verifying file type of a file and, if successful, tagging it. Tags can be used later on to give more information to the scanner. 3. unpacking file systems, compressed files and media files from the file, carving them out of the larger file first. 4. repeat steps 1 - 3 for each file that was unpacked in step 3 5. run individual scans on each file if no further unpacking is possible 6. optionally aggregate scan results or modify results based on information that has become available during the scan 7. process results from scans in step 5 and 6 and generate reports 8. pack results into an archive that can be used by the viewer application or other applications A.1 Identifier search The first action performed is scanning a file for known identifiers of compressed files, file systems and media files. The identifers are important for a few reasons: first, they are used to determine which checks will run. They are also used frequently throughout the code for verification and speeding up unpacking. If a scan depends on a specific identifier being present it can be set using the magic attribute in the configuration. If an identifier is not defined anywhere in the configuration file as needed it will be skipped during the identifier search to speed up the identifier search. Some scans define an additional magic header in optmagic. The values defined in optmagic are not authoritive, but should be treated as hints. A good example is the YAFFS2 scan. The marker search cannot be enabled or disabled via the configuration file. The markers that are searched for can be found in bat/fsmagic.py. As an optimization the marker search can be skipped for some files if they have an extension which gives a possible hint about what kind of file it might be. For example, the extension gz is frequently used for gzip compressed files, so for files with the extension gz a special method (configured in the configuration for the gzip unpacker) is first run to see if the whole file is actually a gzip file, without looking at any other markers, or trying other scans first. If the whole file is indeed gzip compressed (which will be the case for the vast majority of files) then all other unpacking scans will be skipped. If the file is not gzip compressed, or only part of the file is gzip compressed (and there is trailing data), then the file will be processed in the normal way instead. If multiple CPUs are available and the top level file is larger than a certain limit and does not have a known extension as described above the marker search will be done in parallel as a speed up. The limit can be set in the global configuration using the variable markersearchminimum. The default value for this variable is 20 million bytes. A.2 Pre-run checks Before files are unpacked they are briefly inspected and if possible tagged. Tags are used to pass hints to methods that are run later to avoid unnecessarily scanning a file and to reduce the amount of false positives. For example, files that only contain text are tagged as text, all other files are tagged as binary (this depends on the implementation of Python. Python 2 only considers (by default) ASCII to be valid text). Methods that only work on binaries can then ignore anything that has been tagged as text. Other checks that are available are for valid XML, various Android formats, ELF executables and libraries, certain graphics and audio files, and so on. The prerun checks can easily be identified in the configuration, since it has its type set to prerun: [verifytext] type = module = method = priority = description = enabled = prerun bat.prerun verifyText 3 Check if file contains just ASCII text yes Prerun verifiers can optionally make use of tags that are already present by using magic and noscan attributes, which will be explained in detail later for the unpackers. A.3 Unpackers Unpackers can be recognized in the configuration because their type is set to unpack, for example: [jffs2] type module method priority magic noscan description enabled = = = = = = = = unpack bat.fwunpack searchUnpackJffs2 2 jffs2_le:jffs2_be text:xml:graphics:pdf:compressed:audio:video:mp4:elf:java:resource:dalvik Unpack JFFS2 file systems yes In BAT 37 the following file systems, compressed files and media files can be unpacked or extracted: • file systems: Android sparse files, cramfs, ext2/ext3/ext4, ISO9660, JFFS2 (no LZO compression), Minix (specific variant of v1 often found on older embedded Linux systems), SquashFS (several variants), romfs, YAFFS2 (specific variants), ubifs (not on all systems), PLF (Parrot specific file format) • compressed files and executable formats: 7z, ar, ARJ, BASE64, BZIP2, compressed Flash, CAB, compress, CPIO, ELF, EXE (specific compression methods only), GZIP, InstallShield (old versions), Java class file, LRZIP, LZIP, LZMA, LZOP, MSI, pack200, RAR, RPM, RZIP, serialized Java, TAR, UPX, XZ, ZIP (including APK, EAR, JAR and WAR), WIM, Intel HEX (whole files only, comments allowed), XAR • media files: BMP, GIF, JPEG, PNG, ICO, PDF, CHM, OTF, TTF, WOFF Most of the unpackers for these file systems, compressed files and media files are located in the file bat/fwunpack.py. Unpacking differs per file type. Most files use one or more identifiers that can be searched for in a binary blob. Using this information it is possible to carve out the right parts of a binary blob and verify if it indeed contains a compressed file, media file or file system. There is not always an identifier that can be searched for. The YAFFS2 file system layout for example is dependent on the hardware specifics of the underlying flash chip. Without knowing these specifics it is not possible to specifically search for a valid YAFFS2 file system. This scan therefore tries to run on every file, unless explicitely filtered out (using noscan and tags). Other file types (such as ARJ files) have a very generic identifier, so there are a lot of false positives. This causes a big increase in runtime. The ARJ unpacker is therefore disabled by default. LZMA is another special case: there are many different valid headers for LZMA files, but in practice only a handful are used. If unpacking is successful a directory with unpacked files is returned, and, if available, some meta information to avoid duplicate scanning (blacklisting information and tags). The unpacked files are added to the scan queue and scanned recursively. A.4 Leaf scans Leaf scans are scans that are run on every single file after unpacking, including files that contained files that were found and extracted by unpackers. Leaf scans can be recognized in the configuration because their type is set to leaf, for example: [markers] type module method = leaf = bat.checks = searchMarker noscan = text:xml:graphics:pdf:compressed:audio:video description = Determine presence of markers of several open source programs enabled = yes The current leaf scans that are available in BAT are: • marker scan searching for signature scans of a few open source programs (dproxy, ez-ipupdate, hostapd, iptables, iproute, libusb, loadlin, RedBoot, U-Boot, vsftpd, wireless-tools, wpa-supplicant) • advanced search mode using ranking of strings, function names, variable names, field names and Java class names using a database (for ELF and Java, both regular JVM and Dalvik) • BusyBox version number • dynamic library dependencies (ELF files only) • file architecture (ELF files only) • Linux kernel module license (Linux kernel modules only) • Linux kernel version number, plus detection for several subsystems • PDF meta data extraction • presence of URLs indicating an open source license • presence of URLs indicating forges/collaborative software development sites (SourceForge, GitHub, etcetera) The fast string searches are meant for quick sweep scanning only. They have their limits, can report false positives or fail to identify a binary. They should only be used to signal that further inspection is necessary. For a thorough investigation the advanced search mode should be used. These scans are likely to be disabled in the future in the default configuration. A.5 Aggregators Sometimes it helps to aggregate results of a number of files, or it could be useful to perform other actions after all the individual scans have run. The best example is dealing with JAR-files (Java ARchives). Individual Java class files often contain too little information to map them reliably to a source code package. Typically a class file contains just a few method names, or field names, or strings. If inner classes are used it can be even worse and information from a single source code file could be scattered across several class files. Since Java programs (note: excluding Android) are typically distributed as a JAR that is either included at runtime or directly executed, similar to an ELF library or ELF executable, it makes perfect sense to treat the JAR file as a single unit and aggregate results for the individual class files and assign them to the JAR file. Aggregators take all results of the entire scan as input. Currently the following aggregators are available: • advanced identifier search and classification • aggregating result of individual Java class files in case they come from the same JAR file. • cleaning up/fixing results of duplicate files: often firmwares contain duplicate files. Sometimes some more information is available to make a better choice as to which file is the duplicate and which one is the original version • checking dynamically linked ELF files • finding duplicate files • finding licenses and versions of strings and function names that were found and optionally pruning the result set to remove unlikely results. • pruning files from the scan completely if they are not interesting (such as pictures, or text files) using tags. • generating pictures of results of a scan • generating reports of results of a scan A.6 Post-run methods In BAT there are methods that are run after all the regular work has been performed, or “post-run”. These methods should not alter the scan results in any way, but just use the information from the scanning process. A typical use case would be to present the data in a nicer to use format than the standard report, to use more external data sources or generate graphical representations of data. The post-run methods have the type postrun in the configuration, for example: [hexdump] type module method noscan envvars description enabled storetarget storedir storetype cleanup B = = = = = = = = = = = postrun bat.generatehexdump generateHexdump text:xml:graphics:pdf:audio:video BAT_REPORTDIR=/tmp/images:BAT_IMAGE_MAXFILESIZE=100000000 Create hexdump output of files no reports /tmp/images -hexdump.gz no Scan configuration The analysis process is highly configurable: methods can be simply enabled and disabled, based on need: some methods can run for quite a long time, which might be undesirable at times. Configuration is done via a simple configuration file in Windows INI format. Most sections are specific to scanning methods, except two sections: a global section and one section specific for the viewer tool. B.1 Global configuration The global configuration section is called batconfig. In this section various global settings are defined which are described below. The section can be identified in the configuration file by looking for this: [batconfig] B.1.1 multiprocessing and processors The multiprocessing configuration option determines whether or not multiple CPUs (or cores) should be used during scanning. The default configuration as shipped in the official BAT distribution is to use multiple threads: multiprocessing = yes If set to yes the program will start an extra process per CPU that is available for parts of the program that can be run in parallel. In most cases it is completely safe to use multiprocessing. It might be desirable to not use all processors on a machine, for example if there are multiple scans of BAT running at the same time, or if other tasks need to run on the machine. It is possible to set the maximum amount of processors to use with the processors option: processors = 2 B.1.2 writeoutputfile Sometimes it is useful to let BAT just unpack, but not write an output file, saving time packing data and writing it to disk. To prevent an output archive being created set writeoutputfile to no. By default it will be set to yes. B.1.3 outputlite Another setting in this section is outputlite: outputlite = yes It defaults to yes. If set to yes the output archive will omit a full copy of the unpacked data, significantly decreasing the size of the output archive, but making it harder to do a “post mortem” on the unpacked data (a new analysis should be run to get it again). B.1.4 configdirectory BAT allows configurations for scans to be split in different files and stored in a separate directory. The directory with configurations can be set using the configdirectory parameter: configdirectory = /home/bat/configs/ Important note: the configuration files have to use the extension .conf. B.1.5 unpackdirectory The unpacking directory where BAT will store its runtime state can be set using unpackdirectory. By default this is /tmp. Each run of BAT will create a new directory underneath this directory. In some cases it can be wise to change it to a different location, with more storage (more and more Linux distributions have /tmp mounted on a ramdisk) or less latency (SSD). It can be use as follows: unpackdirectory = /ssd/tmp B.1.6 temporary unpackdirectory There is one setting to set the prefix for creating temporary files or directories, namely temporary unpackdirectory. By default the directory for creating temporary files and directories is /tmp. There might be situations where the temporary directory might need to be changed, for example for unpacking on a faster medium (ramdisk, SSD) than a normal harddisk. It can be used as follows: temporary_unpackdirectory = /ramdisk B.1.7 debug and debugphases To assist in debugging and finding errors in scans of BAT there are two settings: debug and debugphases. The setting debug can be used to enable and disable debugging. If set multiprocessing will be disabled and information about which file is scanned and which method is run will be printed on standard error. If specified without debugphases this will apply to all scan phases. The debugphases parameter can be used to limit this behaviour to just one or a few phases. The other phases will behave normally. For example, this will enable debugging, but just for the leaf scans and aggregate scans: debug = yes debugphases = leaf:aggregate B.1.8 postgresql user, postgresql password, postgresql db, postgresql host and postgresql port The PostgreSQL database used by BAT is configured in the global section. A few variables have to be set to be able to connect with the database server, namely the username, password and database name: postgresql_user = bat postgresql_password = bat postgresql_db = bat Optionally a port and host can be set too if another port and/or host need to be used: postgresql_host postgresql_port = 127.0.0.1 = 5432 Depending on the version of python-psycopg2 it could be that postgresql host and postgresql port both have to be specified. For example on CentOS 6.x both parameters have to be set when using a different port, even if the database resides on the local machine. B.1.9 usedatabase By default the database is enabled. There could be situations where it is undesirable to use the database and it needs to be temporarily disabled. By setting usedatabase to anything but yes the database will be disabled: usedatabase = no B.1.10 reporthash If reporthash is set, then hashes in the ranking scan that come from the BAT database will be converted from SHA256 (default) to the hash if supported (currently MD5, SHA1 and CRC32 are supported) in the default BAT database as created by the database scripts. reporthash = sha256 B.1.11 reportendofphase If reportendofphase is set to yes, then BAT will write a line with some statistics about when a scanning phase has ended on standard output. This can be useful to track progress of BAT. reportendofphase = yes B.1.12 packconfig and scrub The output archive by default does not contain the configuration file that was used during the scan. In some situations it is actually desirable to store the configuration with the scan archive results, for example to debug an issue, or to recreate results with the same configuration. For this the option packconfig can be set to yes. packconfig = yes Because the configuration file can contain confidential information (such as database credentials) it is desirable to scrub this information from the configuration file. For this the scrub setting can be used. Its value should be a colon separated list of configuration options for which the value in the configuration file (all occurances) should be replaced. For example, to scrub the values of postgresql user and postgresql password the following would be used: scrub = postgresql_user:postgresql_password B.1.13 template Some compression formats or file systems are stored anonymously without a name. Examples are certain gzip-compressed files (like a ramdisk), or an LZMAstream. template = unpacked-by-bat-from-%s B.1.14 scansourcecode The scansourcecode option can be used to check if a file that is scanned can actually be found in the BAT database of source code files: scansourcecode = yes The underlying rationale for this option is that various people have tried to use BAT for source code scanning and did not get the results they expected. By filtering out exact matches to the BAT database beforehand there are fewer false positives. B.1.15 dumpoffsets During the BAT marker scans a dictionary with possible offsets for compressed files and file systems is generated. Although most of these are discarded during unpacking (as they are false positives) it could be useful to store this data. By setting dumpoffsets to yes the offsets will be stored as Python pickles in the offsets directory: dumpoffsets = yes B.1.16 cleanup If cleanup is set to yes the BAT working directory will be removed after the scan has finished. By default the working directories are not removed: cleanup = yes B.1.17 compress BAT outputs several result files. To save disk space these can be compressed using gzip, at the expense of processing time. If not specified in the configuration file compress will default to no. B.1.18 packpickles BAT internally uses Python pickles to (temporarily) store information on disk. Since recent versions of BAT the preferred reporting and data exchange format is JSON and pickles are no longer needed and don’t need to be packed in the output archive. If not specified in the configuration file packpickles will default to no. B.1.19 markersearchminimum When a file is scanned for markers it is done in a single process. If many files have to be scanned for markers at once this makes sense. However, if there are multiple processors available and the top level file is big, then it is a bit of a waste of time to not be able to use the extra processor power. With markersearchminimum it is possible to set a minimum size for a file to search for markers in the top level file in parallel. By default top level files larger than 20 million bytes are processed in allel. B.1.20 tasktimeout Several of the scanning phases in BAT use a task queue. Unfortunately it could be that due to unknown bugs in BAT there are uncaught errors, which make it seem like BAT hangs. For this the task queues have time outs. The default for the timeout is 2592000 seconds (roughly one month). The task queue timeout can be shortened by setting tasktimeout to a lower value: tasktimeout = 2592000 It should be noted that the value should not be set to 0, because otherwise the queues will timeout immediately and BAT will barf. B.1.21 tlshmaxsize Beyond a certain size it no longer makes sense to compute TLSH checksums for files. Using tlshmaxsize this limit can be set. By default it is set to 52428800 bytes (50 MiB). B.1.22 Global environment variables Global environment variables are shared between scans. They can be overridden by individual scans. For example to set the environment variable FOO for all scans you would put something like this in the global configuration: envvars = FOO=/home/bat/bar To pass two or more environment variables use a semicolon: envvars = FOO=/home/bat/bar:XYZZY=1 As a rule of thumb: settings that are shared between all scans should be set in the global sections, while scan specific options should be in the scan specific sections. B.2 Viewer configuration The other global section is viewer. This section is specific for the graphical frontend and is not used in any other parts of BAT and might be moved to a separate configuration file in a future version of BAT. B.3 Enabling and disabling scans The standard configuration file enables most of the scans and methods implemented in BAT by default. Scans can be enabled and disabled by setting the option enabled to yes and no respectively. Another way to not run a scan is to outcomment the entry in the configuration file (by starting the line with the # character), or by removing the section from the configuration file. B.4 Blacklisting and whitelisting scans Files can be explicitely blacklisted for scanning by using the noscan configuration setting. The value of this parameter is a list of tags, separated by colons: noscan = text:xml:graphics:pdf:audio:video Similarly files can be whitelisted by using the scanonly setting. Only files that are tagged with any of the values in this list (if not empty) will be scanned. If there is an overlapping value in scanonly and noscan then the file will not be scanned. B.5 Passing environment variables All scans have an optional parameter scanenv defaulting to an empty Python dictionary. In the configuration file a colon separated list of name/value pairs can be specified using the keyword envvars. These will then become available in the environment of the scan: envvars = BAT_REPORTDIR=/tmp/images:BAT_IMAGE_MAXFILESIZE=100000000 If the environment of a scan needs to be adapted in the context of a single file it is important to first make a copy of the environment or the environment might be modified for the scan for all other files that are scanned. B.6 Scan names The name of the scan is used in various places, for example for storing results or for determining scan conflicts. The name parameter can be used to set the name for the scan. If no name is specified the name of the section of the scan is used instead. name = gzip B.7 Scan conflicts Possibly scans can conflict with other scans in the same phase and they should not be enabled at the same time. To indicate that a scan conflicts with others the conflicts option can be set: conflicts = gzip:bzip2 If there is a conflict in the configuration BAT will refuse to run. Currently BAT only looks at conflicts in the same unpacking phase and only for scans that are enabled. B.8 Storing results Postrun scans and aggregate scans that output data, for example graphics files or reports, can specify which files should be added to the output file. There are three settings that should be set together: storetarget = images storedir = /tmp/images storetype = -piechart.png:-version.png The storetarget setting specifies the relative directory inside the output TAR archive. The storedir setting tells where to get the files that need to be stored can be found (this should be where the postrun scan or aggregate scan stores its results). The storetype setting is a colon separated list of extensions/partial file names that the files should end in (typically the rest of the filename is a SHA256 value). The additional setting cleanup can be used to instruct BAT that the files generated by this postrun scan or aggregate scan should be removed after copying them into the result archive: cleanup = yes The cleanup setting should be set to yes unless the results do not change in between subsequent runs of BAT. Currently (BAT 37) if cleanup is set the files are written directly to output directories. The values of these directories are hardcoded (and match values that the GUI expects) but these will be replaced by the value of storetarget in a later release. B.9 Running setup code For some scans it is necessary to run some setup code to ensure that certain conditions are met, for example to see if database tables exist, or if locations are readable/writeable. These checks only need to be run once. Based on the result of the setup code the scan might be disabled if certain conditions are not met. There is a special hook for unpack scans, leaf scans and aggregate scans to run setup code for the scan: setup = nameOfSetupMethod The result of the setup method is a tuple containing a boolean to indicate wheter or not the scan should be run, and a (possibly adapted) environment. The files bat/identifier.py and bat/licenseversion.py contain very extensive examples of setup hooks. C Analyser internals The analyser was written with extensibility in mind: new file systems or variants of old ones tend to appear regularly (for example: there are at least 5 or more versions of SquashFS with LZMA compression out there), and sometimes it is needed to plug in a new unpacker for a file system or compressed file type. C.1 Code organisation bat-scan is merely a frontend for the real scanner and only handle the list of scans, the binary/binaries to scan and where to write the output file(s). The meaty bits of the analyser can be found in files in the bat subdirectory (note that this directory currently contains more files than are actually used by BAT at the moment): • batxor.py contains experimental code to deal with files that have been obfuscated with XOR. • bruteforcescan.py contains the main logic of the program: it launches scans based on what is inside the binary and the scans that are enabled, collects results from scans and writes results to an output file. • busybox.py and busyboxversion.py contain code to extract useful information from a BusyBox binary, such as the version number. • checks.py contains various leaf scans, like scanning for certain marker strings, or the presence of license texts and URLs of forges/collaborative software development sites. • ext2.py implements some functionality needed for unpacking ext2 file systems. • extractor.py provides convenience functions that are used throughout the code. • file2package.py has code to match names of files to names of packages from popular distributions in a database. • findduplicates.py is used to find duplicate files in the scanned archive. • findlibs.py and interfaces.py are for researching dynamically linked ELF files in the archive. • fixduplicates.py is used to correct tagging of files that were tagged incorrectly as duplicates, as they are the original, not the copy. For now this is only for ELF files. • fsmagic.py contains identifiers of various file systems and compressed files, like magic headers and offsets for which might need to be corrected. • fwunpack.py includes most of the functionality for unpacking compressed files and file systems. • generatehexdump.py and images.py generate textual and graphical representations of the input files. • generatereports.py, generateimages.py, guireport.py, generatejson.py and piecharts.py generate textual and graphical representations of results of the analysis. • identifier.py implements functionality to extract identifiers (string constants, function names, method names, variable names, and so on) from binary files and make them available for further analysis. • javacheck.py has code to parse Java class files. • jffs2.py has code specific to handling JFFS2 file systems. • kernelanalysis.py includes code to extract information from Linux kernel images and Linux kernel modules. • kernelsymbols.py is used for generating dependency graphs for Linux kernel modules and indicating any possible license issues of exported symbols and declared licenses. • licenseversion.py gets version and licensing information for uniquely identified strings and function names (and in the future variable names too) from the database. It can optionally prune the result set to only include relevant versions. It also contains code to aggregate results of Java class files from a JAR file and assign results to the JAR file instead of the individual class files. • prerun.py contains scans that are run in the pre-run phase for correctly tagging files as early in the process as possible. • prunefiles.py can be used to remove files with a certain tag from the scan results. This is useful for for example graphics files. • renamefiles.py is used for renaming files to use a more logical name after more contextual information from the scan has become available. For example: detect an initramfs in the Linux kernel and rename the temporary file to initramfs. • security.py contains several security scans. • unpackrpm.py has code specifically for unpacking RPM archives. C.2 Pre-run methods Pre-run methods check and tag files, so the files can be ignored by later methods and scans, reducing scanning time and preventing false positives. While tagging is not exclusive to pre-run methods it is their main purpose. C.2.1 Writing a pre-run method Pre-run methods have a strict interface. Parameters are: • filename is the absolute path of the file that needs to be tagged • tempdir is the (possibly) empty name of a directory where the file is. This is currently unused and might be removed in the future. • tags is the set of tags that have already been defined for the file. • offsets is the set of offsets that have been found for the file • scanenv is an optionally empty dictionary of environment variables that can be used to pass extra information to the pre-run method. • debug is an environment variable that can be used to optionally set the scan in debugging mode so it can print more information on standard error. By default it is set to False. • unpacktempdir is the location of a directory for writing temporary files. This value is optional and by default it is set to None. Return values are: • a list containing tags Example: def prerunMethod(filename, tempdir=None, tags=[], offsets={}, scanenv={}, debug=False, unpacktempdir=None): newtags = [] newtags.append(’helloworld’) return newtags C.3 Unpackers Unpackers are responsible for recursively unpacking binaries until they can’t be unpacked any further. C.3.1 Writing an unpacker The unpackers have a strict interface: def unpackScan(filename, tempdir=None, blacklist=[], offsets={}, scanenv={}, debug=False): ## code goes here The last four parameters are optional, but in practice they are always passed by the top level script. • tempdir is the directory into which files and directories for unpacking should be created. If it is None a new temporary directory should be created. • blacklist is a list of byte ranges that should not be scanned. If the current scan needs to blacklist a byte range it should add it to this list after finishing a scan. • offsets is a dictionary containing a mapping from an identifier to a list of offsets in the file where these identifiers can be found. This list is filled by the scan genericMarker which always runs before anything else. • scanenv is an optionally empty dictionary of environment variables that can be used to pass extra information to the pre-run method. • debug is an environment variable that can be used to optionally set the scan in debugging mode so it can print more information on standard error. By default it is set to False. Return values are: • the name of a directory, containing files that were unpacked. • the blacklist, possibly appended with new values • a list of tags, in case any tags were added, or an empty list Most scans have been split in two parts: one part is for searching the identifiers, correctly setting up temporary directories and collecting results. The other part is doing the actual unpacking of the data and verification. The idea behind this split is that sometimes functionality is shared between two scans. For example, unpackCpio is used by both searchUnpackCpio and unpackRPM. C.3.2 Adding an identifier for a file system or compressed file Identifiers for new file systems and compressed files are, if available, added to fsmagic.py in the directory bat. These identifiers will be available in the offsets parameter that is passed to a scan, if any were found. Good sources to find identifiers are /usr/share/magic, documentation for file systems or compressed files, or the output of hexdump -C. C.3.3 Blacklisting and priorities In BAT blacklists are used to prevent some scans from running on a particular byte range, because other scans have already covered these bytes, or will cover them. The most obvious example is the ext2 file system: in a normal setup (no encryption) it is trivial to see the content of all the individual files if an ext2 file system image is opened. This is because this file system is mostly a concatenation of the data, with some meta data associated with the files in the file system. If another compressed file is in the ext2 file system it could be that it will be picked up by BAT twice: once it will be detected inside the ext2 file system and once after the file system has been unpacked by the ext2 file system unpacker. Other examples are: • cpio (files are concatenated with a header and a trailer) • TAR (files are concatenated with some meta data) • RPM (files are in a compressed archive with some meta data) • ar and DEB • some flavours of cramfs • ubifs To avoid duplicate scanning and false positives it is therefore necessary to prevent other scans from running on the byte range already covered by one of these files. In BAT this is achieved by using blacklists. All unpackers have a parameter called blacklist which is consulted every time a file is unpacked. If a file system offset is in a blacklist the scan could use the next offset, or skip scanning the entire file, depending on the scan. The blacklist is set for every file individually and is initially empty. If a scan is successful it adds a byte range to the blacklist. Subsequent scans will skip the byte range added by the scan. The scans are run in a particular order to make the best use of blacklists. The order of scans is determined by the priority parameter in the configuration file. The file systems and concatenated files mentioned above have a higher priority and are scanned earlier than other scans that could also give a match. It is not a fool proof system, but it seems to work well enough. C.4 Leaf scans After everything has been unpacked each file, including the files from which other files were carved, will be scanned by the leaf scans. C.4.1 Writing a leaf scan The leaf scans have a simple interface. There are eight parameters passed to the scan, namely the absolute path of the file, the tags of the file, a database cursor and connection, an optional blacklist with byte ranges that should not be scanned, an optional list of environment variables and an optional name of a directory for writing temporary results. For example: def leafScan(path, tags, cursor, conn, blacklist=[], scanenv={}, debug=False, unpacktempdir=None): ## code goes here There are no restrictions on the return values of the leaf scan, except if nothing could be found (in which case None is usd as return value). The result value is a tuple with a list of tags as well as one of the following: • None if nothing can be found • simple values (booleans, strings) • custom data structure. Code that processes this data should know about its structure. There is no restriction on the code that is run as part of the leaf scan and basically anything can be done. In BAT there are for example checks that invoke other external programs to discover dynamically linked libraries using readelf, find the license of a kernel module using modinfo or simple checks for the presence of strings in the binary that indicate the use of certain software. The simplest scans are the ones that search for hardcoded strings. These strings are frequently found just in the package for which the check is written for. For example, the following strings can often be found in copies of the iptables binary and the related libiptc library: markerStrings = [ ’iptables who? (do you need to insmod?)’ , ’Will be implemented real soon. I promise ;)’ , ’can\’t initialize iptables table ‘%s\’: \%s’ ] Although searching for hardcoded strings is very fast, this method has some drawbacks: • a binary sometimes does not have these exact strings embedded • this method will only find the strings that are hardcoded and not any other significant strings • if another package includes the string, it will be a false positive The quick checks should therefore only be used as an indication that further inspection of the binary is needed. A much better method is the ranking method that is also available in BAT, but which requires a special setup with a database. C.5 Aggregators Aggregators take all information from the entire scan process and possibly modify results. C.5.1 Writing an aggregator Aggregators have a strict interface: def aggregateexample(unpackreports, scantempdir, topleveldir, scanenv, batcursors, batcons, debug=False, unpacktempdir=None) • unpackreports are the reports of the unpackers for all files • scantempdir is the location of the top level data directory of the scan • topleveldir is the location of the top level directory of the scan • scanenv is a dictionary of environment variables • batcursors is a list of PostgreSQL database cursors. If no database is used this list will be empty. • batcons is a list of PostgreSQL database connections. If no database is used this list will be empty. • debug is an environment variable that can be used to optionally set the scan in debugging mode so it can print more information on standard error. By default it is set to False. • unpacktempdir is the location of a directory for writing temporary files. This value is optional and by default it is set to None. The aggregators should read any results of the leaf scans from the pickles on disk. If there is any result it should be returned as a dictionary with one key. It will be assigned to the results of the top level element. Examples are: the names of files which are duplicates in an archive or firmware. C.6 Post-run methods Post-run methods don’t change the result of the whole scanning process, but only use the data from the process. For example prettyprinting a fancy report would be a typical post-run method. C.6.1 Writing a post-run method Post-run methods have a strict interface: def postrunHelloWorld(filename, unpackreport, scantempdir, topleveldir, scanenv, cursor, conn, debug=False): print "Hello World" • filename is the absolute path of the scanned file, after unpacking. • unpackreport is the report of unpacking the file • scantempdir is the directory that contains the unpacked data • topleveldir is the top level directory containing the data directory and the directory with the per file result pickles. • scanenv is a dictionary of environment variables • cursor is the database cursor (or None if there is no database) • conn is the database connection (or None if there is no database) • debug is an environment variable that can be used to optionally set the scan in debugging mode so it can print more information on standard error. By default it is set to False. The post-run methods should read any results of the leaf scans from the pickles stored on disk. Since the post-run methods don’t change the result in any way, but just have side effects there is no need to return anything. Any return value will be ignored. D Building binary packages of the Binary Analysis Tool If you want to install BAT through the package manager of your distribution you might first need to generate packages for your distribution if none exist. For BAT there is currently support to build packages for RPM-based systems and for DEB-based systems. D.1 D.1.1 Building packages for RPM based systems from Git Building bat Building the bat package is fairly straightforward. 1. Git clone the BAT repository from GitHub 2. cd to the directory src 3. run the command: python setup.py bdist rpm This will create an RPM file and an SRPM file. If you need to install BAT on other versions of Fedora or on other RPM based distributions you can simply rebuild the SRPM using: rpmbuild --rebuild D.1.2 Building bat-extratools Building packages for bat-extratools is unfortunately a bit more elaborate. 1. Git clone the bat-extratools repository from GitHub 2. change the name of bat-extratools directory to contain the version name of the release (for example bat-extratools-14.0). 3. Make a tar.gz archive of the directory: tar zcf bat-extratools-14.0.tar.gz bat-extratools-14.0 4. run rpmbuild to create binary packages: rpmbuild -ta bat-extratools-14.0.tar.gz D.2 Building packages for DEB based systems from releases Currently no rebuildable packages for DEB based systems are made for releases. D.3 D.3.1 Building packages for DEB based systems from Subversion Building bat The Debian scripts were written according to the documentation for debhelper found at https://wiki.ubuntu.com/PackagingGuide/Python. Package building and testing is done on Ubuntu 14.04 LTS. Older versions of Ubuntu are no longer supported and its use is discouraged. This is because versions of Ubuntu older than 14.04 use a broken version of the PyDot package. To build a .deb package do an export of the Subversion repository first. Change to the directory src and type: debuild -uc -us to build the package. This assumes that you will have the necessary packages installed to build the package (like devscripts and debhelper). The build process might complain about not being able to find the original sources. In our experience it is safe to ignore this. The command will build a .deb package which can be installed with dpkg -i. D.3.2 Building bat-extratools To build a .deb package clone the Git repository first. Change to the correct directory (bat-extratools and type: debuild -uc -us to build the package. There are some dependencies that need to be installed beforehand, such as zlib1g-dev, liblzo2-dev and liblzma-dev for building bat-extratools. These dependencies are documented in the file debian/control and debuild will warn if these packages are missing. E Binary Analysis Tool knowledgebase BAT comes with a mechanism to use a database backend. The default version of BAT only unpacks file systems and compressed files and runs a few simple checks on the leaf nodes of the unpacking process. In the paper “Finding Software License Violations Through Binary Code Clone Detection” by Hemel et. al. (ACM 978-1-4503-0574-7/11/05), presented at the Mining Software Repositories 2011 conference, a method to use a database with strings extracted from source code was described. This functionality is available in the ranking module in the file licenseversion.py. This code is enabled by default, but if no database is present it will not do anything. To give good results the database that is used needs to be populated with as many packages as possible, from a cross cut of all of open source software, to prevent bias towards certain packages: if you only would have BusyBox in your database, everything would look like BusyBox. A more detailed description about how to create the database can be found in the BAT GitHub repository in the doc E.1 Generating the package list The code and license extractor wants a description file of which packages to process. This file is hardcoded to LIST relative to the directory that contains all source archives. The reason there is a specific file is that some packages do not follow a consistent naming scheme. By using this extra file we can cleanup names and make sure that source code archives are recognized correctly. The file contains four values per line: • name • version • archivename • origin (defaults to “unknown” if not specified) separated by whitespace (spaces or tabs). An example would look like this: amarok 2.3.2 amarok-2.3.2.tar.bz2 kde This line says that the package is amarok, the version number is 2.3.2, the filename is amarok-2.3.2.tar.bz2 and the file was downloaded from the KDE project. There is a helper script (generatelist.py) to help generate the file. It can be invoked as follows: python generatelist.py -f /path/to/directory/with/sources -o origin The output is printed on standard output, so you want to redirect it to a file called LIST (as expected by the string extraction script) and optionally sorting it first: python generatelist.py -f /path/to/directory/with/sources -o origin | sort > /path/to/directory/with/sources/LIST generatelist.py tries to determine the name of the package by splitting the file name on the right on a - (dash) character. This is not always done correctly because a package uses multiple dashes, or because it does not contain a dash. In the latter case an error will be printed on standard error, informing you that a file could not be added to the list of packages and it should be added manually. It is advised to manually inspect the file after generating it to ensure the correctness of the package names. Packages can have been renamed for a number of reasons: • upstream projects decided to use a new name for archives (AbiWord archives for example were renamed from abi-$VERSION.tar.gz (used for early versions) to abiword-$VERSION.tar.gz). • a distribution has renamed packages to avoid clashes during installation and allow different versions to be installed next to eachother. • a distribution has renamed a package. For example, Debian renamed httpd to apache2. In these cases you need to change the names of the packages, otherwise different versions of the same package will be recorded in the database as different packages, which will confuse the rating algorithm and cause it to give suboptimal results. Other helper scripts are dumplist.py which recreates a package list file from a database, and rewritelist.py which takes two package list files and outputs a new file with package names and versions rewritten for filenames that occur in both files. These two scripts are useful if a database needs to be regenerated, possibly with new packages. E.2 Creating the database The program to extract strings from sourcecode is createdb.py. It is not part of the standard installation of BAT, but needs to be retrieved separately from version control together with generatelist.py. This will be changed at some point in the future. It parses the file generated by generatelist.py, unpacks the files (gzip compressed TAR, bzip2 compressed TAR, LZMA compressed TAR, XZ compressed TAR and ZIP are currently supported) and scans each individual source code file (written in C, C++, assembler, QML, C#, Java, Scala, JSP, Groovy, PHP, Python, Ruby and ActionScript) for string constants, methods, functions, variables and, if enabled, licenses using Ninka and FOSSology and copyright information using FOSSology and regular expressions lifted from FOSSology. For the Linux kernel additional information is extracted about kernel functions and variables, module information (author, license, parameters, and so on), and kernel symbol information. E.3 License extraction and copyright information extraction The configuration for createdb.py has a few options. The most important ones to consider are whether or not to also extract licenses and copyrights from the source code files. License extraction is done using the Ninka license scanner and the Nomos license scanner from FOSSology. Copyright scanning is done using the copyright scanner from FOSSology. These options are disabled by default for a few reasons: • extracting licenses and copyrights costs significantly more time • there are no packages for Fedora and Debian/Ubuntu for Ninka If you want to enable license extraction, you will have to install Ninka first and change one hardcoded path that points to the main Ninka script in createdb.py. You will also have to install FOSSology (for which packages are available for most distributions). E.4 Setting up PostgreSQL Setting up the PostgreSQL server itself is out of scope for this document. The rest of this section should be considered as one potential way to set up a PostgreSQL database. Any changes to the PostgreSQL installation should be discussed with a local database administrator. When starting from scratch (no existing database server) then the following command can be used to initiate the database: # postgresql-setup initdb E.4.1 Authentication configuration The main authentication configuration file of PostgreSQL is called pg hba.conf. Usually this file resides in the top level PostgreSQL directory, for example /var/lib/pgsql/data (but this depends on the local configuration). In this file various configuration options are set, such as how clients can connect and how they should authenticate. In a default installation of PostgreSQL local users connecting over a local socket in the file system are implicitely trusted (this depends on the distribution, some use peer or ident instead of trust): local all all trust To prompt for the password (recommended) it should be changed in: local all all password Local users connecting over an IPv4 TCP/IP socket (over the network stack) are also implicitely trusted: host all all 127.0.0.1/32 trust 127.0.0.1/32 password and can be changed to: host all all Something similar can be done for the local IPv6 connections. To allow connections on a different port the main configuration file for PostgreSQL (called postgresql.conf) should be adapted. By default the server only listens on localhost, as defined by the listen address configuration. To allow PostgreSQL to also listen on a different interface this should be changed, for example: listen_addresses = ’localhost,172.16.0.1’ The authentication in pg hba.conf should also be changed: host all all 172.16.0.1/16 password Note: in this case the network mask is 255.255.0.0, but this could of course be different. E.4.2 Creating the database and database user After setting up the PostgreSQL server the following commands can be used to create a database and a user: 1. create database bat; 2. create user bat with password ’bat’; 3. grant all privileges on database bat to bat; The database name, user name and password should correspond to the database name, user name and password in the BAT configuration. As a next step the database can be filled, either by loading an existing dump file or by using the database creation scripts. E.5 Database design The database currently has 16 tables, 9 of which are Linux kernel specific. • processed • processed file • extracted string • extracted function • extracted name • kernel configuration • kernelmodule alias • kernelmodule author • kernelmodule description • kernelmodule firmware • kernelmodule license • kernelmodule parameter • kernelmodule parameter description • kernelmodule version • renames • hashconversion • extracted copyright • licenses • security E.5.1 processed table This table is to keep track of which versions of which packages were scanned. Its only purpose is to avoid scanning packages multiple times. It is not actively used in the ranking code. It has the following fields: • package: name of the package • version: version of the package • filename: name of the archive • origin: site/origin where the archive was downloaded (optional) • checksum: SHA256 checksum of the archive • downloadurl: download URL of the source code package (optional) • website: the website of the project (optional) E.5.2 processed file table This table contains information about of individual source code files that were scanned. It has the following fields: • package: name of the package the file is from (same as in processed) • version: version of the package the file is from (same as in processed) • pathname: relative path inside the source code archive • checksum: SHA256 checksum of the file • filename: filename of the file, without path component • thirdparty: boolean indicating if the file is an obvious copy of a file from another package. E.5.3 extracted string table This table stores the individual strings that were extracted from files and that could possibly end up in binaries. It has the following fields: • stringidentifier: string constant that was extracted • checksum: SHA256 checksum of file the string constant was extracted from • language: language the source code file was written in (mapped to a language family, such as C or Java) • linenumber: line number where the string constant can be found in the source code file (if determined using using xgettext) or 0 (if determined using a regular expression). E.5.4 extracted function table In this table information about C functions and Java methods is stored. • checksum: SHA256 checksum of the file • functionname: function name or method name that was extracted • language: language the source code file was written in (mapped to a language family, such as C or Java) • linenumber: line number where the function/method can be found in the source code file (if determined using using xgettext) or 0 (if determined using a regular expression). E.5.5 extracted name table This table stores information of various names extracted from source code. Included are variable names (C), field names (Java) and class names (Java) and Linux kernel variable names. It has the following fields: • checksum: SHA256 checksum of the file • name: name of variable, field or class name that was extracted • type: type (field, variable, class name, etcetera) • language: language the source code file was written in (mapped to a language family, such as C or Java) • linenumber: line number where the function/method can be found in the source code file (if determined using using xgettext) or 0 (if determined using a regular expression). E.5.6 extracted copyright table This table stores copyright information that was extracted from files by FOSSology. It has the following fields: • checksum: SHA256 checksum of the file • copyright: copyright information that was extracted • type: type of information that was extracted, currently url, email or statement • offset: byte offset in the file where the copyright statement can be found E.5.7 hashconversion table The hashconversion table is used as a lookup table to translate between different hashes and use these for checks or reporting. The table has the following mandatory field: • sha256: SHA256 checksum of the file Any other hashes (limited to values that Python’s hashlib supports, as well as CRC32 and TLSH) listed in extrahashes in the database creation script configuration file will be added as columns to this database. The database creation scripts defaults to MD5, SHA1, CRC32 and TLSH. E.5.8 kernel configuration table The Makefiles in the Linux kernel configuration contain a lot of information about which configuration includes which files. This information can be used to reconstruct a possible kernel configuration that was used to create the Linux binary image. The table has the following fields: • configstring: configuration directive in Linux kernel • filename: filename/directory name to which the configuration directive applies • version: Linux kernel version E.5.9 kernelmodule alias table This table is used to store information about Linux kernel module aliases. This information is declared in the Linux kernel source code using the MODULE ALIAS macro. The table has the following fields: • checksum: SHA256 checksum of the file • modulename: name of the source code file • alias: contents of the MODULE ALIAS macro E.5.10 kernelmodule author table This table is used to store information about Linux kernel module author(s). This information is declared in the Linux kernel source code using the MODULE AUTHOR macro. The table has the following fields: • checksum: SHA256 checksum of the file • modulename: name of the source code file • author: contents of the MODULE AUTHOR macro E.5.11 kernelmodule description table This table is used to store information about Linux kernel module descriptions. This information is declared in the Linux kernel source code using the MODULE DESCRIPTION macro. The table has the following fields: • checksum: SHA256 checksum of the file • modulename: name of the source code file • description: E.5.12 kernelmodule firmware table This table is used to store information about Linux kernel module firmware. This information is declared in the Linux kernel source code using the MODULE FIRMWARE macro. The table has the following fields: • checksum: SHA256 checksum of the file • modulename: name of the source code file • firmware: contents of the MODULE FIRMWARE macro E.5.13 kernelmodule license table This table is used to store information about Linux kernel module licenses. This information is declared in the Linux kernel source code using the MODULE LICENSE macro. The table has the following fields: • checksum: SHA256 checksum of the file • modulename: name of the source code file • license: contents of the MODULE LICENSE macro E.5.14 kernelmodule parameter table This table is used to store information about Linux kernel module parameters. This information is declared in the Linux kernel source code using the MODULE PARM and module param macros, as well as variations of the module param macro. These different notations were used for different versions of the Linux kernel and both formats have been used in the kernel at the same time. The table has the following fields: • checksum: SHA256 checksum of the file • modulename: name of the source code file • paramname: name of the parameter • paramtype: type of the parameter, as specified in the source code (various formats have been used) E.5.15 kernelmodule parameter description table This table is used to store information about Linux kernel module parameters descriptions. This information is declared in the Linux kernel source code using the MODULE PARM DESC macro. The table has the following fields: • checksum: SHA256 checksum of the file • modulename: name of the source code file • paramname: name of the parameter • description: descriptio of the parameter E.5.16 kernelmodule version table This table is used to store information about Linux kernel module versions. This information is declared in the Linux kernel source code using the MODULE VERSION macro. The table has the following fields: • checksum: SHA256 checksum of the file • modulename: name of the source code file • version: contents of the MODULE VERSION macro E.5.17 licenses table This table stores the licenses that were extracted from files using a source code scanner, like Ninka or FOSSology. If a file has more than one licenses there will be multiple rows for a file. It has these fields: • checksum: SHA256 checksum of the file • license: license as found by the scanner • scanner: scanner name. Currently only Ninka and FOSSology are used in BAT, but is not limited to that: the scanner could also be a person doing a manual review. • version: version of scanner. This is useful if there is for example a bug in a scanner, or to compare results from various versions. E.5.18 renames table This is a lookup table to deal with packages that have been cloned or renamed and should be treated as another package when scanning. Examples are packages in Debian that have been renamed for trademark reasons (Firefox is called Iceweasel), forks (KOffice versus Calligra), and so on. • originalname: name the package was published under • newname: name that the package name should be translated to The script clonedbinit.py in the maintenance directory generates a minimal translation database. E.5.19 security cert table This table stores security information that was extracted from files. It has these fields: • checksum: SHA256 checksum of the file • securitybug: identifier for a security bug, for example identifiers for the CERT secure coding standard. • linenumber: line number where the security bug can be found • whitelist: boolean value indicating whether or not the bug can safely be ignored. The idea is that this can be set by security reviewers if the security bug cannot be triggered to lower the amount of false positives. E.5.20 security cve table This table stores information about relations between paths and CVE numbers. • checksum: SHA256 checksum of the file • cve: CVE identifier E.5.21 security password table This table stores information about relations between hashes and derived passwords. • hash: hash value as found in password or shadow file • password: password found with a password cracker F Identifier extraction and ranking scan As explained identifying binaries works in two phases: first identifiers are extracted from the binaries, then the identifiers are processed by one or more scans, for example the ranking scan. Apart from making it possible to process the identifiers with various methods there is another reason that the code is split in two parts and that is performance: extracting identifiers is very quick and can be done in parallel for many files. Computing a score can be quite expensive to do for certain files (such as a Linux kernel image). Processing identifiers per file in parallel instead of processing files in parallel turns out to be much faster. This is why the current ranking scan(s) are all aggregate scans and not leaf scans. F.1 Configuring identifier extraction [identifier] type = module = method = envvars = noscan = description enabled setup priority = = = = leaf bat.identifier searchGeneric ramdisk:BAT_STRING_CUTOFF=5 text:xml:graphics:pdf:compressed: resource:audio:video:mp4:vimswap:timezone:ico Classify packages using advanced ranking mechanism yes extractidentifiersetup 1 The parameter is: • BAT STRING CUTOFF - this value is the mimimal length of the string that is matched (default value is 5). If extracted strings are shorter than this value they will be ignored. It is important to keep this parameter in sync with the minimum length of strings in the database extract script. F.2 Configuring the ranking method The ranking method can be found in bat/licenseversion.py. The ranking method looks up strings in the database, optionally aggregates results for Java class files at the JAR level, determines versions and licenses while also removing unlikely versions from the result set. For the first part (determining which package a string belongs to) it uses tables with caching information for string constants, function names, variable names and so on. These caching tables contain a subset of information to vastly speed up scanning by using pregenerated results to avoid expensive database join operations. There is no script in the standard distribution of BAT to create these caching tables, but the format has been described in the database schema. For the second part (determining versions and licenses) other tables containing the raw package data are used. In the database the strings, averages, function names, variable names, etcetera are split per language family (C, Java, C#, and so on). The reason for this is that strings/function names that are very significant in one programming language family could be very generic in another programming language family and vice versa. During scanning a guess will be made to see which language the program was written in and the proper caching database will be queried. Since there are relatively few binaries (at least on Linux) that combine code from both languages the caching databases are split. This makes the caching databases a lot smaller so they can easier fit into memory. There are of course programs with language embeddeding and better support for these will be added in the future. An optional table in the database to deal with copied and renamed packages can be generated with clonedbinit.py in the maintenance directory. If this table is populated the ranking scan will use information from this table to rewrite package names. This is useful if a package was renamed for a reason and different packages should be treated as if they were a single package. Examples are Ethereal that had to be renamed to Wireshark, or KOffice that was forked into Calligra, after which development on KOffice effectively stopped and everyone moved to Calligra. If BAT RANKING LICENSE is not set to 1 no license information will be extracted. If BAT RANKING VERSION is not set to 1 no version information will be extracted. If BAT RANKING LICENSE is set to 1 it automatically sets BAT RANKING VERSION to 1 as well. The parameter USE SOURCE ORDER can be used to tell the matching algorithm to assume that identifiers in the binary code are similar as in the source code and that the compiler has not reordered these. As compilers often keep the order this assigns more strings to packages. As soon as compilers start reordering identifiers this method will not work. The default setting is to not use the order of identifiers. The parameter BAT STRING CUTOFF indiciates the mimimal length of the string that is matched (default value is 5). If extracted strings are shorter than this value they will be ignored. It is important to keep this parameter in sync with the minimum length of strings in the database extract script. Results of Java class files are aggregated per JAR where the class files were found in. If the parameter AGGREGATE CLEAN is set to 1 the class files will be removed from the result set after aggregating the results. By default class files will not be removed. The parameters BAT KEEP VERSIONS, BAT MINIMUM UNIQUE and BAT KEEP MAXIMUM PERCENTAGE are used to tell the pruning methods how many versions to keep, how many unique strings minimally should be found, and so on. F.2.1 Interpreting the results There are two ways to interpret the results. The recommended way is to load the result file into the graphical user interface. If there are any matches the report contains the following: • number of lines that were extracted from the binary • number of lines that could be matched exactly with an entry in the database (unique matches) • number of lines that were assigned to a package (assigned matches) • number of lines which could not be matched (unmatched lines) Per package the following is reported: • name of the package • all unique matches (strings that can only be found in this package) • relative ranking • percentage of the total score Finally piecharts are also generated providing a visual representation of these results. G BusyBox script internals The BusyBox processing scripts look simple, but behind the internals are a bit hairy. Especially extracting the correct configuration is not trivial. G.1 Detecting BusyBox Detecting if a binary is indeed BusyBox is trivial, since in a BusyBox binary there are almost always clear indication strings if BusyBox is used (unless they it was specifically altered to hide the use of BusyBox). A significant set of strings to look for is: BusyBox is a multi-call binary that combines many common Unix utilities into a single executable. Most people will create a link to busybox for each function they wish to use and BusyBox will act like whatever it was invoked as! Another clear indicator is a BusyBox version string, for example: BusyBox v1.15.2 (2009-12-03 00:14:42 CET) As an exception a BusyBox binary configured to include just a single applet will not contain contain the marker strings, or the BusyBox version string. In such a case a different detection mechanism will have to be used, for example the ranking code as used by bat-scan, although this will only be necessary in a very small percentage of cases, since the vast majority of BusyBox instances include more than one applet. G.2 BusyBox version strings The BusyBox version strings have remained fairly consistent over the years: BusyBox v1.00-rc2 (2006.09.14-03:08+0000) multi-call binary BusyBox v1.1.3 (2009.09.11-12:49+0000) multi-call binary BusyBox v1.15.2 (2009-12-03 00:14:42 CET) The time stamps in the version string are irrelevant, since they are generated during build time and are not hardcoded in the source code. Extracting version information from the BusyBox binary is not difficult. Using regular expression it is possible to look for BusyBox v which indicates the start of a BusyBox version string. The version number can be found immediately following this substring until ( (including leading space) is found. Apart from reporting, the BusyBox version number is also used for other things, such as determining the right configuration format and accessing a knowledgebase of known applet names extracted from the standard BusyBox releases from busybox.net. G.3 BusyBox configuration format During the compilation of BusyBox a configuration file is used to determine which functionality will be included in the binary. The format of this configuration file has changed a few times over the years. Early versions used a simple header format file, with GNU C/C++ style defines. Later versions, starting 1.00pre1, moved to Kbuild, the same configuration system as used by for example the Linux kernel or OpenWrt. This format is still in use today (BusyBox 1.20.0 being the latest version at the time of writing). Each configuration directive determines whether or not a certain piece of source code will be compiled and up in the BusyBox binary. This source code can either be a full applet, or just a piece of functionality that merely extends an existing applet. G.4 Extracting a configuration from a BusyBox binary Extracting the BusyBox configuration from a binary is not entirely trivial. There are a few methods which can be used: 1. run busybox (on a device, or inside a sandbox) and see what functionality is reported. This is probably the most accurate method, but also the hardest, since it requires access to a device, or a sandbox that has been properly set up, with all the right dependencies, and so on. When running busybox without any arguments, or with the --help parameter it will output a list of functions that are defined inside the binary: Currently defined functions: ar, cal, cpio, dpkg, dpkg-deb, gunzip, zcat These can be mapped to a configuration, using information extracted from BusyBox source code about which applets map to which configuration option. 2. extract the configuration from the binary by searching for known applet names in the firmware. The end result is the same as a previous step, but possibly with less accuracy in some cases but it is the only feasible solution if you only have a binary. The BusyBox binary has a string embedded for every applet that is included. This is the string that is printed out if --help is given as a parameter to an invocation of busybox. Using information about the configuration extracted from BusyBox source code these strings can be mapped to a configuration directive and a possible configuration can be reconstructed. Depending on how the binary was compiled this can be trivial, or quite hard. G.4.1 BusyBox linked with uClibc In binaries that link against uClibc (a particular C library) the name of the main function of the applet is sometimes (but not always) included in the busybox binary as follows (a good way is to run strings on the binary and look at the output). wget_main This string maps to the name of the main function for the wget applet (networking/wget.c): int wget_main(int argc, char **argv) MAIN_EXTERNALLY_VISIBLE; The BusyBox authors are pretty strict in their naming and usually have a configuration directive in the a specific format (CONFIG-$appletname) in the Makefile, like: lib-$(CONFIG_WGET) += wget.o (example taken from networking/Kbuild in BusyBox 1.15.2). There are cases where the format could be slightly different. G.4.2 BusyBox linked with glibc & uClibc exceptions Sometimes the method described in the previous section does not work for binaries that are linked with uClibc. It also does not work with binaries compiled with glibc. If the binary is unstripped and the binary still contains symbol information it is possible to extract the right information using readelf (part of GNU binutils) in a similar fashion as the earlier described method. In case there is no information available it is still possible to search inside the binary for the applet names. Because most instances of BusyBox that are installed on devices have not been modified the list of applets in the stock version of BusyBox serves as an excellent starting point. The list as printed by busybox if the --help parameter is given is embedded in the binary. The applet names are alphabetically sorted and separated by NUL characters. By searching for this list and splitting it accordingly it is possible to get the list of all applets that are defined. The only caveats are that a new applet that was added appears alphabetically before any of the applets that can be recognized using a list of applet names extracted from the source code, or it appears alphabetically after the last one that can be recognized. G.5 Pretty printing a BusyBox configuration Pretty printing a BusyBox configuration is fairly straightforward, but there are a few cases where it is hard to make a good guess: 1. aliases 2. functionality that is added to an applet, depending on a configuration directive 3. applets that use non-standard configuration names (like CONFIG APP UDHCPD instead of CONFIG UDHCPD in some versions of BusyBox) 4. features For some applets aliases are installed by default as symlinks. These aliases are recorded in the binary, but there is no separate applet for it. In the BusyBox sources (1.15.2, others might be different) these are defined as: IF_CRYPTPW(APPLET_ODDNAME(mkpasswd, cryptpw, _BB_DIR_USR_BIN, _BB_SUID_DROP, mkpasswd)) So if the cryptw tool is built, an additional symlink called mkpasswd is added during installation. If extra functionality is added to an applet in BusyBox it is defined in the source code by macros like the following: IF_SHA256SUM(APPLET_ODDNAME(sha256sum, md5_sha1_sum, _BB_DIR_USR_BIN, _BB_SUID_DROP, sha256sum)) IF_SHA512SUM(APPLET_ODDNAME(sha512sum, md5_sha1_sum, _BB_DIR_USR_BIN, _BB_SUID_DROP, sha512sum)) The above configuration tells to add extra symlinks for sha256sum and sha512sum if BusyBox is configured for suppport for the SHA256 and SHA512 algorithms. The applet that implements this functionality is md5 sha1 sum. Non-standard configuration names can be fixed by using a translation table that translates to the non-standard name. The current code has a translation table for BusyBox 1.15 and higher. Detecting features is really hard to do in a generic way. In most cases it will even be impossible, because there are no clear markers (strings, applet names) in the binary that indicate that a certain feature is enabled. In cases there are clear marker strings these would still need to be linked to specific features. One possibility would be to parse the BusyBox sources and link strings to features, for example (from BusyBox 1.15.3, editors/diff.c): #if ENABLE_FEATURE_DIFF_DIR diffdir(f1, f2); return exit_status; #else bb_error_msg_and_die("no support for directory comparison"); #endif The string "no support for directory comparison" only appears if the feature ENABLE FEATURE DIFF DIR is not enabled. Implementing this will be a lot of work and it will likely not be very useful. G.6 Using BusyBox configurations By referencing with information extracted from the standard BusyBox sourcecode it is possible to get a far more accurate configuration, because it is known which applets use which configuration, unless: • new applets were added to BusyBox • applets use old names, but contain different code The names of applets that are defined in BusyBox serve as a very good starting point. How these are recorded in the sources has changed a few times and depends on the version of BusyBox. The tool appletname-extractor.py can extract these from the BusyBox sources and store them for later reference as a simple lookup table in Python pickle format. Names of applets per version breakdown: • 1.15.x and later: include/applets.h • 1.1.1-1.14.x: include/applets.h USE syntax • 1.00-1.1.0: include/applets.h (different syntax) • 0.60.5 and earlier: applets.h, like 1.00-1.1.0 but with a slightly different syntax In one particular version of BusyBox (namely 1.1.0) there is a mix of three different syntaxes: (0.60.5, 1.00 and another) for a few applets (runlevel, watchdog, tr). There are also a few applets in 1.1.0 which seem to be a bit harder to detect: busybox, mkfs.ext3, e3fsck and [[. These can easily be added by hand, since there are just four of them. Another issue that is currently unresolved is that not all the shells are correctly recognized. G.7 Extracting configurations from BusyBox sourcecode The busybox.py script makes use of a table that maps applet names to configuration directives. These tables are stored in a Python pickle and read by busybox.py upon startup. To generate these pickle files the appletname-extractor.py should be used. In the standard distribution for BAT the configurations for most versions of BusyBox are shipped. The applet names are extracted from a file called applets.h. It might be that this file first has to be generated if the only file present is applets.src.h. In that case: 1. unpack the BusyBox archive 2. cd to the root of the unpacked archive 3. run ./scripts/gen build files.sh . . to regenerate applets.h python appletname-extractor.py -a /path/to/applets.h -n $VERSION The configuration will be written to a file $VERSION-config and should be moved into the directory containing the other configurations. H Linux kernel identifier extraction The createdb.py program processes Linux kernel source code files in a slightly different way than normal source code files. There is a lot of interesting information that can be extracted from the Linux kernel sources, as well as the binary. There are a few challenges when working with Linux kernel source code and Linux kernel binaries. First of all there are many different variants in use and many vendors have their own slightly modified version, with extra drivers, or bug fixes from later versions, or bug fixes that might not yet have been applied to the version on kernel.org. Second is that in the Linux kernel binary string constants, function names, symbols, module parameters, and so on, are intertwined and some steps need to be taken to correctly split these to avoid false positives (there are other packages where kernel function names, module parameters, symbols, and so on, are valid string constants). H.1 Extracting visible strings from the Linux kernel binary If a kernel is an ELF binary (sometimes) the relevant sections of the binary can be read using readelf. Otherwise strings can be run on the binary. This method will return more strings than if using readelf, but the extra strings are mostly extra cruft that have a low chance of matching. H.2 Extracting visible strings from a Linux kernel module If a kernel module is an ELF binary (most cases) the relevant sections of the binary can be read using readelf. Otherwise strings can be run on the binary. This method will return more strings than if using readelf, but the extra strings are mostly extra cruft that have a low chance of matching. H.3 Extracting strings from the Linux kernel sources The Linux kernel is full of strings that can end up in a binary. Some programmers have defined macros just specific to their part of the kernel for ease of use (often a wrapper around printk, other programmers use more standard mechanisms like printk. Most strings can be extracted from the Linux kernel using xgettext. A minority of strings needs to be extracted using a custom regular expression. The following two cases are worth a closer look: H.3.1 EXPORT SYMBOL and EXPORT SYMBOL GPL The symbols defined in the EXPORT SYMBOL and EXPORT SYMBOL GPL macros end up in the kernel image. The EXPORT SYMBOL GPL symbol could be interesting for licensing reporting as well, since anything that uses this symbol should be released under the GPLv2. This is a topic for future research. H.3.2 module param The names of parameters for kernel modules can end up in the kernel, or in the kernel module itself. The names of these parameters are typically prefixed with the name of the module (which is often, but not always) and a dot, but without the extension of the file. In cases where the module name does not match the name of the file it was defined in extra information from the build system needs to be added to determine the right string. The code for this is in the function init param sysfs builtin in kernel/params.c. Module names are extracted from the kernel Makefiles and stored in the database together with module information (author, license, description, parameters, and so on). H.4 Forward porting and back porting There are some strings we scan for which might not be present in certain versions, because they were removed, or not yet included in the mainline kernel. A good example is devfs. This subsystem was removed in Linux kernel 2.6.17, but it is not safe to assume that this was done for every 2.6.17 (or later) kernel that is out in the wild, since some vendors might have kept it and ported it to newer versions (forward porting). Similarly code from newer kernels might have been included in older versions (backporting). H.5 Corner cases Sometimes a #define or some configuration directive causes that our string matching method will not work, because the string is prepended with extra characters. An example from arch/arm/mach-sa1100/dma.c from kernel 2.6.32.9: #undef DEBUG #ifdef DEBUG #define DPRINTK( s, arg... ) #else #define DPRINTK( x... ) #endif printk( "dma<%p>: " s, regs , ##arg ) Other examples include pr debug, DBG, DPRINTK and pr info. To work around this there are two ways: 1. do substring matches 2. parse the source code and record where extra code is being added as in the example above and only do substring matches in a small number of cases. Substring matching is expensive and since it only happens in a minority of cases the second method, although not trivial to implement, would be easier. This is future work. I Binary Analysis Tool performance tips This section describes a few methods to increase performance of the Binary Analysis Tool, plus describe drawbacks of methods named. The standard configuration of BAT tries to be sensible, with a trade off between performance and completeness. In some cases there is quite a bit of performance to be gained by simply tweaking the configuration. I.1 Choose the right hardware BAT will benefit a lot from fast disk, enough memory and multiple cores. Many of the scans in BAT can be run in parallel and will scale very well (until of course disk I/O limits are reached). Invest in SSD to reduce disk I/O and more cores instead of a faster CPU. Enough memory will prevent swapping which just kills performance, especially because the ranking scan in BAT can be very I/O intensive. I.2 Use outputlite Using the default configuration the original unpacked data is not included into the result archive. There are situations where it makes sense to include the data into the result archive, for example to make it easier to do a “post mortem” after a scan. The original data can take up a lot of space, since every original file, plus everything that might have been extracted from that file, will be included, which leads to large archives and long associated packing time. It also has performance impact on the BAT viewer, which needs to unpack some data from the archive. The smaller the archive is, the faster unpacking is. If the original data and the unpacked data is not relevant, then setting the option outputlite to yes in the section [batconfig] is highly recommended: outputlite = yes I.3 Use AGGREGATE CLEAN when scanning Java JAR files If Java JAR files are scanned then pictures and reports will be generated for each of the individual .class files. If only the results of the JAR file are needed, then setting AGGREGATE CLEAN to 1 will prevent pictures and reports to be generated for the individual .class files, which can save quite some processing time and help declutter the interface as well. Of course, not generating the pictures for individual .class files means that some detail might be lost, especially if there are .class files that contain some unexpected results. I.4 Disable tmp on tmpfs Some Linux distributions (most notably Fedora 18 and later) store the /tmp file system on tmpfs. This means that part of the system memory is used for the /tmp file system. By default on Fedora it is set to 50% of the system’s memory. This could influence BAT in two ways: 1. less memory available for processing 2. BAT unpacks to /tmp by default, unless configured differently. If the unpack results grow big enough (which is fairly easy with big firmwares) it could fill up the partition. However, there are some external tools that will write temporary results to /tmp. There are various solutions, apart from adding more memory to the machine: • configure BAT to use another path than /tmp for unpacking and storing results and configure some scans in BAT to use /tmp or a different ramdisk (recommended) • disable tmp on tmpfs (not recommended) I.5 Use tmpfs for writing temporary results A few scans can use tmpfs or a ramdisk to write temporary results. The scans that can benefit from this are LZMA unpacking, ranking (temporary results of DEX and ODEX unpacking), compress unpacking, JFFS2 unpacking and TAR unpacking. The parameter temporary unpackdirectory in the global configuration can be used to set this location. J J.1 Description for scans using the database file2package The file2package leaf scan uses the table file that contains information with checksums of file names found in standard Linux distributions. This table can be populated using the scripts createfiledatabasedebian.py and createfiledatabasefedora.py which can be found in the subdirectory maintenance in the BAT source tree and are for Debian and Fedora respectively. K Parameter description for default scans This section describes the default parameters for several of the scans as shipped in BAT, if not described earlier in this document. These parameters are passed to the scans as part of the environment and are defined in the envvars setting in the configuration file. K.1 compress The COMPRESS MINIMUM SIZE parameter instructs the scan to ignore output files that are COMPRESS MINIMUM SIZE bytes in size or less. This parameter was introduced because false positives in compress unpacking are very common on Debian and Ubuntu, often leading to small sized files that contain no useful data and which could interfere with scanning. The scan can also use a different directory for unpacking temporary files. The location is set in the global configuration using the parameter temporary unpackdirectory. By setting this to for example SSD or ramdisk it can help avoid disk I/O on a slower disk and speed up scanning. K.2 generatejson The generatejson scan has an optional parameter BAT JSONDIR which can be set to the location of a directory where the top level JSON file will be written to (as scandata-$SHA256.json). This is useful when scanning files in batch mode. K.3 jffs2 The scan can use a different directory for unpacking temporary files. The location is set in the global configuration using the parameter temporary unpackdirectory. By setting this to for example SSD or ramdisk it can help avoid disk I/O on a slower disk and speed up scanning. K.4 lzma The lzma unpack scan has one parameter: LZMA MINIMUM SIZE. The LZMA MINIMUM SIZE parameter instructs the scan to ignore output files that are LZMA MINIMUM SIZE bytes in size or less. This parameter was introduced because false positives in LZMA unpacking are very common, often leading to small sized files that contain no useful data. By default LZMA MINIMUM SIZE is set to 10 bytes, but this is a very conservative setting and can likely be set higher safely. The scan can also use a different directory for unpacking temporary files. The location is set in the global configuration using the parameter temporary unpackdirectory. By setting this to for example SSD or ramdisk it can help avoid disk I/O on a slower disk and speed up scanning. K.5 tar The scan can use a different directory for unpacking temporary files. The location is set in the global configuration using the parameter temporary unpackdirectory. By setting this to for example SSD or ramdisk it can help avoid disk I/O on a slower disk and speed up scanning. K.6 xor The XOR MINIMUM parameter is used to set the minimum amount of occurences of a key that have to be present in the file before XOR unpacking is done. This is to reduce false positives. K.7 zip The ZIP MEMORY CUTOFF parameter is used to set the maximum size of ZIP data that should be read into memory. If the ZIP data is larger it will be carved out from the larger file using dd. If not set a value of 50 million bytes will be used. K.8 findlibs For the findlibs aggregate scan the ELF SVG parameter can be set to 1 to output the graphs in SVG format. K.9 findsymbols For the findsymbols aggregate scan the KERNELSYMBOL SVG parameter can be set to 1 to output the graphs in SVG format. The KERNELSYMBOL CSV parameter can be set to output a spreadsheet in Excel-format. K.10 generateimages The generateimages postrun scan has five optional parameters: AGGREGATE IMAGE SYMLINK, BAT IMAGEDIR, BAT PICKLEDIR, MAXIMUM PERCENTAGE MINIMUM PERCENTAGE K.11 identifier The scan can use a different directory for unpacking temporary files. The location is set in the global configuration using the parameter temporary unpackdirectory. By setting this to for example SSD or ramdisk it can help avoid disk I/O on a slower disk and speed up scanning. K.12 licenseversion The licenseversion aggregate scan has a few parameters that can influence performance. One of them is AGGREGATE CLEAN. This parameter instructs the scan to remove results for individual Java class files from the result set after aggregating results at the JAR level. Java class files that are not unpacked from a JAR file are not removed from the result set. By default this parameter is set to 0 which means that results for Java class files are not removed from the result set. K.13 prunefiles The prunefiles aggregate scan has two parameters: PRUNE TAGS and PRUNE FILEREPORT CLEAN. The PRUNE TAGS parameter contains a comma-separated list of tags that should be ignored and removed from the scan results. The PRUNE FILEREPORT CLEAN parameter can be set to indicate whether or not the result pickles for the pruned files should also be removed from disk. Example: PRUNE_TAGS=png,gif:PRUNE_FILEREPORT_CLEAN=1 K.14 hexdump and images The hexdump and images scans (disabled by default) have two parameters. The BAT IMAGE MAXFILESIZE parameter is set to specify the maximum size of a file for which a result is generated. Since output from this scan can be extremely large, and the results are not very interesting for large files it is strongly advised to cap this value. L Default ordering of scans in BAT BAT comes with a default configuration file. In this file an order for running the scans is specified, using the priority field: the higher the priority, the earlier the scan is run in the process. In this section the rationale behind this ordering is explained. The order for pre-run scans, leaf scans, unpack scans and aggregate scans is described below. Since postrun scans do not change the result files and they are independent there is no order defined for them (although this might change in the future). L.1 Pre-run scans Most pre-run scans have the same priority, with a few exceptions, the most important being verifytext to find out if a file is ASCII only, or if there are any non-ASCII characters in the file. Since many of the scans (including prerun scans) only work on non-ASCII files it is important to find out soon if a file contains only ASCII characters or not. The order for pre-run scans is: 1. checkXML 2. verifytext 3. verifyjava 4. verifyelf, verifysqlite3 5. verifyandroiddex, verifyandroidodex, verifyandroidresource, verifyandroidxml, verifycertificate, verifychromepak, verifyico, verifyihex, verifyjava, verifymessagecatalog, verifyresourcefork, verifyrsacertificate, verifyterminfo, verifytz, verifywebp, vimswap L.2 Unpack scans As a general rule of thumb: compressed formats are scanned last, while simple containers that concatenate contents, or where the original content can still be (partially) recognised, are scanned first. An example of a container is TAR: content is simply concatenated without compression. If the TAR archive would contain a file of a certain type (such as a gzip compressed file) and the unpacker for that type is run first it will try to carve it from the TAR file, blacklist the byte range, and the TAR unpacker would not successfully run. For the compressed files on the other hand the original content isn’t visible without unpacking so no other scans will pick it up and they can have a low priority. The order that is defined starts with byteSwap, a special unpacker that is needed to unpack firmwares of certain devices, where a different kind of flash chip is used, needing bytes in a firmware to be swapped first before any other scan can be run. Then the unpack scans for various container formats and file systems are run. The order in which they appear is not fool proof: container files could be embedded in container files with a lower priority, but BAT comes with (hopefully) sane defaults to prevent this. As a second to last step the unpack scans for compressed files where all data is packed in such a way that the original content can’t be seen without unpacking are run. Finally there are some scans that unpack text files (base64) or media files. The lzma unpack scan also has the lowest priority because of possibly many false positives. The order of the unpack scans as defined in BAT 37 is: 1. byteswap 2. tar, android-sparse 3. pdf unpack, iso9660, plf, wim 4. cramfs, ext2fs, ubi 5. ar, cpio, java serialized, romfs, rpm, upx, yaffs 6. exe, jffs2, squashfs, xar 7. 7z, arj, bzip2, cab, chm, compress, gzip, installshield, intelhex, lrzip, lzip, lzo, minix, msi, pack200, rar, rzip, xz, zip, 8. android-backup, base64, bmp, gif, ico, jpeg, lzma, ogg, otf, png, swf, ttf, woff L.3 Leaf scans There is currently only one explicit ordering: kernelchecks is run before identifier because identifier depends on the result of kernelchecks. For the rest the order of the leaf scans does not matter. L.4 Aggregate scans Aggregate scans have a clear order. Reports and (most) images are generated at the very end when all information is known. Other scans are mostly independent of eachother, but are usually run before versionlicensecopyright to prevent having to read big report pickles from disk. The order for aggregate scans is: 1. fixduplicates 2. prunefiles (disabled by default) 3. findduplicates 4. findlibs, findsymbols, copyright, file2package 5. kernelversions 6. versionlicensecopyright 7. passwords (disabled by default), shellinvocations 8. generateimages, generatereports, generatejson, searchlogins
Source Exif Data:
File Type : PDF File Type Extension : pdf MIME Type : application/pdf PDF Version : 1.5 Linearized : No Page Count : 57 Producer : pdfTeX-1.40.17 Creator : TeX Create Date : 2018:04:24 00:37:26+02:00 Modify Date : 2018:04:24 00:37:26+02:00 Trapped : False PTEX Fullbanner : This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016) kpathsea version 6.2.2EXIF Metadata provided by EXIF.tools