Manual

User Manual:

Open the PDF directly: View PDF PDF.
Page Count: 36

Using the Software Analysis Workbench (SAW)
Galois, Inc.
421 SW 6th Ave., Ste. 300
Portland, OR 97204
Overview
The Software Analysis Workbench (SAW) is a tool for constructing mathematical models of the computational
behavior of software, transforming these models, and proving properties about them.
SAW can currently construct models of a subset of programs written in Cryptol, LLVM (and therefore C), and
JVM (and therefore Java). The models take the form of typed functional programs, so in a sense SAW can
be considered a translator from imperative programs to their functional equivalents. Various external proof
tools, including a variety of SAT and SMT solvers, can be used to prove properties about SAW functional
models. Models can be constructed from arbitrary Cryptol programs, and can typically be constructed from
C and Java programs that have fixed-size inputs and outputs and that terminate after a fixed number of
iterations of any loop (or a fixed number of recursive calls). One common use case is to verify that an
algorithm specification in Cryptol is equivalent to an algorithm implementation in C or Java.
The process of extracting models from programs, manipulating them, forming queries about them, and sending
them to external provers is orchestrated using a special purpose language called SAWScript. SAWScript is a
typed functional language with support for sequencing of imperative commmands.
The rest of this document first describes how to use the SAW tool,
saw
, and outlines the structure of
the SAWScript language and its relationship to Cryptol. It then presents the SAWScript commands that
transform functional models and prove properties about them. Finally, it describes the specific commands
available for constructing models from imperative programs in a variety of languages.
Invoking SAW
The primary mechanism for interacting with SAW is through the
saw
executable included as part of the
standard binary distribution. With no arguments,
saw
starts a read-evaluate-print loop (REPL) that allows
the user to interactively evaluate commands in the SAWScript language. With one file name argument, it
executes the specified file as a SAWScript program.
In addition to a file name, the saw executable accepts several command-line options:
-h, -?, --help Print a help message.
-V, --version Show the version of the SAWScript interpreter.
-c path, --classpath=path Specify a colon-delimited list of paths to search for Java classes.
-i path, --import-path=path Specify a colon-delimited list of paths to search for imports.
-t, --extra-type-checking Perform extra type checking of intermediate values.
-I, --interactive Run interactively (with a REPL).
-j path, --jars=path Specify a colon-delimited list of paths to .jar files to search for Java classes.
-d num, --sim-verbose=num Set the verbosity level of the Java and LLVM simulators.
-v num, --verbose=num Set the verbosity level of the SAWScript interpreter.
SAW also uses several environment variables for configuration:
©2017 Galois, Inc. Page 1
Using the Software Analysis Workbench (SAW)
CRYPTOLPATH
Specify a colon-delimited list of directory paths to search for Cryptol imports (including the
Cryptol prelude).
SAW_IMPORT_PATH Specify a colon-delimited list of directory paths to search for imports.
SAW_JDK_JAR Specify the path of the .jar file containing the core Java libraries.
On Windows, semicolon-delimited lists are used instead of colon-delimited lists.
Structure of SAWScript
A SAWScript program consists, at the top level, of a sequence of commands to be executed in order. Each
command is terminated with a semicolon. For example, the
print
command displays a textual representation
of its argument. Suppose the following text is stored in the file print.saw:
print 3;
The command saw print.saw will then yield output similar to the following:
Loading module Cryptol
Loadi ng file " print . saw "
3
The same code can be run from the interactive REPL:
sawscript > print 3;
3
At the REPL, terminating semicolons can be omitted:
sawscript > print 3
3
To make common use cases simpler, bare values at the REPL are treated as if they were arguments to
print
:
sawscript > 3
3
One SAWScript file can be included in another using the
include
command, which takes the name of the file
to be included as an argument. For example:
include " print . saw ";
Syntax
The syntax of SAWScript is reminiscent of functional languages such as Cryptol, Haskell and ML. In particular,
functions are applied by writing them next to their arguments rather than by using parentheses and commas.
Rather than writing f(x, y), write fxy.
In SAWScript, all text from
//
until the end of a line is ignored. Additionally, all text between
/*
and
*/
is
ignored, regardless of whether the line ends.
Basic Types and Values
All values in SAWScript have types, and these types are determined and checked before a program runs (that
is, SAWScript is statically typed). The basic types available are similar to those in many other languages.
The
Int
type represents unbounded mathematical integers. Integer constants can be written in decimal
notation (e.g.,
42
), hexadecimal notation (
0x2a
), and binary (
0b00101010
). However, unlike many
languages, integers in SAWScript are used primarily as constants. Arithmetic is usually encoded in
Cryptol, as discussed in the next section.
©2017 Galois, Inc. Page 2
Using the Software Analysis Workbench (SAW)
The Boolean type,
Bool
, contains the values
true
and
false
, like in many other languages. As with
integers, computations on Boolean values usually occur in Cryptol.
Values of any type can be aggregated into tuples. For example, the value
(true, 10)
has the type
(Bool, Int).
Values of any type can also be aggregated into records, which are exactly like tuples except that their
components have names. For example, the value
{b=true,n=10}
has the type
{ b : Bool, n
: Int }.
A sequence of values of the same type can be stored in a list. For example, the value
[true, false,
true] has the type [Bool].
Strings of textual characters can be represented in the
String
type. For example, the value
"example"
has type String.
The “unit” type, written
()
, is essentially a placeholder. It has only one value, also written
()
. Values
of type () convey no information. We will show in later sections several cases where this is useful.
SAWScript also includes some more specialized types that do not have straightforward counterparts in most
other languages. These will appear in later sections.
Basic Expression Forms
One of the key forms of top-level command in SAWScript is a binding, introduced with the
let
keyword,
which gives a name to a value. For example:
sawscript > let x = 5
sawscript > x
5
Bindings can have parameters, in which case they define functions. For instance, the following function takes
one parameter and constructs a list containing that parameter as its single element.
sawscript > let f x = [ x]
sawscript > f " text "
[" text "]
Functions themselves are values and have types. The type of a function that takes an argument of type
a
and returns a result of type bis a -> b.
Function types are typically inferred, as in the example
f
above. In this case, because
f
only creates a list
with the given argument, and because it is possible to create a list of any element type,
f
can be applied to
an argument of any type. We say, therefore, that
f
is polymorphic. Concretely, we write the type of
f
as
{a} a -> [a]
, meaning it takes a value of any type (denoted
a
) and returns a list containing elements of
that same type. This means we can also apply fto 10:
sawscript > f 10
[10]
However, we may want to specify that a function has a more specific type. In this case, we could restrict
f
to
operate only on Int parameters.
sawscript > let f ( x : Int ) = [x ]
This will work identically to the original fon an Int parameter:
sawscript > f 10
[10]
©2017 Galois, Inc. Page 3
Using the Software Analysis Workbench (SAW)
However, it will fail for a String parameter:
sawscript > f " text "
type mi sm at ch : String -> t .0 and Int - > [ Int ]
at " _" ( REPL )
mism atch ed type co nst ruc tor s : String and Int
Type annotations can be applied to any expression. The notation
(e : t)
indicates that expression
e
is
expected to have type
t
and that it is an error for
e
to have a different type. Most types in SAWScript are
inferred automatically, but specifying them explicitly can sometimes enhance readability.
Because functions are values, functions can return other functions. We make use of this feature when writing
functions of multiple arguments. Consider the function g, similar to fbut with two arguments:
sawscript > let g x y = [x , y ]
Like
f
,
g
is polymorphic. Its type is
{a} a -> a -> [a]
. This means it takes an argument of type
a
and
returns a function that takes an argument of the same type
a
and returns a list of
a
values. We can therefore
apply gto any two arguments of the same type:
sawscript > g 2 3
[2,3]
sawscript > g true false
[ true , false ]
But type checking will fail if we apply it to two values of different types:
sawscript > g 2 false
type mi sm at ch : Boo l -> t .0 and Int -> [ Int ]
at " _" ( REPL )
mism atch ed type co nst ruc tor s : Bool and Int
So far we have used two related terms, function and command, and we take these to mean slightly different
things. A function is any value with a function type (e.g.,
Int -> [Int]
). A command is a function where
the result type is one of a specific set of special types. These special types are parameterized (like the list
type), and allow us to restrict command usage to specific contexts.
The most important command type is the
TopLevel
type, indicating a command that can run at the top level
(directly at the REPL, or as one of the top level commands in a script file). The
print
command has the
type
{a} a -> TopLevel ()
, where
TopLevel ()
means that it is a command that runs in the
TopLevel
context and returns a value of type
()
(that is, no useful information). In other words, it has a side effect
(printing some text to the screen) but doesn’t produce any information to use in the rest of the SAWScript
program. This is the primary usage of the () type.
It can sometimes be useful to bind a sequence of commands together in a unit. This can be accomplished
with the do { ... } construct. For example:
sawscript > let print_two = do { print " first "; print " second "; }
sawscript > prin t_tw o
first
second
The bound value, print_two, has type TopLevel (), since that is the type of its last command.
Note that in the previous example the printing doesn’t occur until
print_two
directly appears at the REPL.
The
let
expression does not cause those commands to run. The construct that runs a command is written
©2017 Galois, Inc. Page 4
Using the Software Analysis Workbench (SAW)
using the
<-
operator. This operator works like
let
except that it says to run the command listed on the
right hand side and bind the result, rather than binding the variable to the command itself. Using
<-
instead
of let in the previous example yields:
sawscript > print_tw o <- do { print " first "; print " second "; }
first
second
sawscript > prin t_tw o
()
Here, the
print
commands run first, and then
print_two
gets the value returned by the second
print
command, namely
()
. Any command listed alone at the REPL, the top level in a script, or inside a
do
block
is treated as implicitly having a <- that binds its result to an unnamed variable (that is, discards it).
In some cases it can be useful to have more control over the value returned by a
do
block. The
return
command allows us to do this. For example, say we wanted to write a function that would print a message
before and after running some arbitrary command and then return the result of that command. We could
write:
let run_with_message c =
do {
print " Star ting .";
res <- c ;
print " Done .";
return res ;
};
x <- r un _w it h_m ess age ( r eturn 3) ;
print x;
If we put this script in run.saw and run it with saw, we get something like:
Loading module Cryptol
Loadi ng file " run . saw "
Starting.
Done.
3
Note that it ran the first
print
command, then the caller-specified command, then the second
print
command.
The result stored in xat the end is the result of the return command passed in as an argument.
Other Basic Functions
Aside from the functions we have listed so far, there are a number of other operations for working with basic
data structures and interacting with the operating system.
The following functions work on lists:
concat : {a } [ a] -> [ a ] -> [a ]
head : {a} [ a] -> a
tail : {a} [ a] -> [a]
length : { a } [a ] -> Int
©2017 Galois, Inc. Page 5
Using the Software Analysis Workbench (SAW)
null : {a} [ a] -> Bool
nth : { a} [ a ] -> Int -> a
for : {m , a , b } [a ] -> (a -> m b) -> m [b]
The
concat
function takes two lists and returns the concatenation of the two. The
head
function returns
the first element of a list, and the
tail
function returns everything except the first element. The
length
function counts the number of elements in a list, and the
null
function indicates whether a list is empty (has
zero elements). The
nth
function returns the element at the given position, with
nth l 0
being equivalent
to
head l
. The
for
command takes a list and a function that runs in some command context. The passed
command will be called once for every element of the list, in order, and
for
will ultimately return a list of all
of the results produced by the command.
For interacting with the operating system, we have:
get_opt : Int -> String
exec : String -> [ String ] -> String -> To pLeve l String
exit : Int -> TopLevel ()
The
get_opt
function returns the command-line argument to
saw
at the given index. Argument 0 is always
the name of the
saw
executable itself, and higher indices represent later arguments. The
exec
command runs
an external program given, respectively, an executable name, a list of arguments, and a string to send to
the standard input of the program. The
exec
command returns the standard output from the program it
executes and prints standard error to the screen. Finally, the
exit
command stops execution of the current
script and returns the given exit code to the operating system.
Finally, there are a few miscellaneous functions and commands. The
show
function computes the textual
representation of its argument in the same way as
print
, but instead of displaying the value it returns it as a
String
value for later use in the program. This can be useful for constructing more detailed messages later.
The str_concat function, which concatenates two String values, can also be useful in this case.
The
time
command runs any other
TopLevel
command and prints out the time it took to execute. If you
want to use the time value later in the program, the
with_time
function returns both the original result of
the timed command and the time taken to execute it (in milliseconds), without printing anything in the
process.
show : {a} a -> String
str_ conc at : String -> String -> String
time : {a} TopL evel a -> TopLe vel a
with_time : {a} TopLevel a -> Top Level ( Int , a)
The Term Type
Perhaps the most important type in SAWScript, and the one most unlike the built-in types of most other
languages, is the
Term
type. Essentially, a value of type
Term
precisely describes all possible computations
performed by some program. In particular, if two
Term
values are equivalent, then the programs that they
represent will always compute the same results given the same inputs. We will say more later about exactly
what it means for two terms to be equivalent, and how to determine whether two terms are equivalent.
Before exploring the Term type more deeply, it is important to understand the role of the Cryptol language
©2017 Galois, Inc. Page 6
Using the Software Analysis Workbench (SAW)
in SAW.
Cryptol and its Role in SAW
Cyptol is a domain-specific language originally designed for the high-level specification of cryptographic
algorithms. It is general enough, however, to describe a wide variety of programs, and is particularly applicable
to describing computations that operate on streams of data of some fixed size.
In addition to being integrated into SAW, Cryptol is a standalone language with its own manual:
http :// c ry pt ol . net / files / P rog ra mmi ngC ry pto l . pdf
SAW includes deep support for Cryptol, and in fact requires the use of Cryptol for most non-trivial tasks.
To fully understand the rest of this manual and to effectively use SAW, you will need to develop at least a
rudimentary understanding of Cryptol.
The primary use of Cryptol within SAWScript is to construct values of type
Term
. Although
Term
values can
be constructed from various sources, inline Cryptol expressions are the most direct and convenient way to
create them.
Specifically, a Cryptol expression can be placed inside double curly braces (
{{
and
}}
), resulting in a value of
type
Term
. As a very simple example, there is no built-in integer addition operation in SAWScript. However,
we can use Cryptol’s built-in integer addition operator within SAWScript as follows:
sawscr ipt > let t = {{ 0 x22 + 0 x33 }}
sawscript > print t
85
Although it printed out in the same way as an
Int
, it is important to note that
t
actually has type
Term
. We
can see how this term is represented internally, before being evaluated, with the print_term function.
sawscript > prin t_te rm t
Cryptol . ecPlus
( Prelude . Vec 8 Prelude . Bool )
( Cryptol . OpsSeq
( Cryptol . TCNum 8)
Prelud e . Bool
Cryptol . OpsBit )
( Prelude . bvNat 8 34)
( Prelude . bvNat 8 51)
For the moment, it’s not important to understand what this output means. We show it only to clarify that
Term
values have their own internal structure that goes beyond what exists in SAWScript. The internal
representation of
Term
values is in a language called SAWCore. The full semantics of SAWCore are beyond
the scope of this manual.
The text constructed by
print_term
can also be accessed programmatically (instead of printing to the screen)
using the
show_term
function, which returns a
String
. The
show_term
function is not a command, so it
executes directly and does not need <- to bind its result. Therefore, the following will have the same result
as the print_term command above:
sawscript > let s = show_term t
sawscript > print s
Numbers are printed in decimal notation by default when printing terms, but the following two commands
can change that behavior.
set_ascii : Bool -> TopL evel ()
©2017 Galois, Inc. Page 7
Using the Software Analysis Workbench (SAW)
set_base : Int -> TopLevel ()
The
set_ascii
command, when passed
true
, makes subsequent
print_term
or
show_term
commands print
sequences of bytes as ASCII strings (and doesn’t affect printing of anything else). The
set_base
command,
which supports any base from 2 through 36 (inclusive), prints all bit vectors in the given base.
A
Term
that represents an integer (any bit vector, as affected by
set_base
) can be translated into a SAWScript
Int
using the
eval_int : Term -> Int
function. This function returns an
Int
if the
Term
can be represented
as one, and fails at runtime otherwise.
sawscript > print ( eval_i nt t )
85
sawscript > print ( eval_i nt {{ True }})
" eval_int " (< stdin >:1:1) :
eval_int : a rgume nt is not a finite bit vect or
sawscript > print ( eval_i nt {{ [True ] }})
1
Similarly, values of type
Bit
in Cryptol can be translated into values of type
Bool
in SAWScript using the
eval_bool : Term -> Bool function:
sawscript > let b = {{ True }}
sawscript > prin t_te rm b
Prelud e . True
sawscript > print ( eval_ bool b )
true
In addition to being able to extract integer and Boolean values from Cryptol expressions,
Term
values can
be injected into Cryptol expressions. When SAWScript evaluates a Cryptol expression between
{{
and
}}
delimiters, it does so with several extra bindings in scope:
Any value in scope of SAWScript type Bool is visible in Cryptol expressions as a value of type Bit.
Any value in scope of SAWScript type
Int
is visible in Cryptol expressions as a type variable. Type
variables can be demoted to numeric bit vector values using the backtick (`) operator.
Any value in scope of SAWScript type
Term
is visible in Cryptol expressions as a value with the Cryptol
type corresponding to the internal type of the term. The power of this conversion is that the
Term
does
not need to have originally been derived from a Cryptol expression.
In addition to these rules, bindings created at the Cryptol level, either from included files or inside Cryptol
quoting brackets, are visible only to later Cryptol expressions, and not as SAWScript variables.
To make these rules more concrete, consider the following examples. If we bind a SAWScript
Int
, we can use
it as a Cryptol type variable. If we create a
Term
variable that internally has function type, we can apply it
to an argument within a Cryptol expression, but not at the SAWScript level:
sawscript > let n = 8
sawscript > let {{ f ( x : [n ]) = x + 1 }}
sawscript > print {{ f 2 }}
3
sawscript > print (f 2)
unbound va riabl e : "f " (< stdin >:1:8)
©2017 Galois, Inc. Page 8
Using the Software Analysis Workbench (SAW)
If fwas a binding of a SAWScript variable to a Term of function type, we would get a different error:
sawscript > let f = {{ \( x : [n ]) -> x + 1 }}
sawscript > print {{ f 2 }}
3
sawscript > print (f 2)
type mism atch : Int -> t .0 and Term
at " _" ( REPL )
m is ma tc he d t yp e c on s tr u ct or s : ( - >) and Ter m
One subtlety of dealing with
Term
s constructed from Cryptol is that because the Cryptol expressions
themselves are type checked by the Cryptol type checker, and because they may make use of other
Term
values already in scope, they are not type checked until the Cryptol brackets are evaluated. So type errors
at the Cryptol level may occur at runtime from the SAWScript perspective (though they occur before the
Cryptol expressions are run).
So far, we have talked about using Cryptol value expressions. However, SAWScript can also work with
Cryptol types. The most direct way to refer to a Cryptol type is to use type brackets:
{|
and
|}
. Any
Cryptol type written between these brackets becomes a
Type
value in SAWScript. Some types in Cryptol are
size types, and isomorphic to integers. These can be translated into SAWScript integers with the
eval_size
function. For example:
sawscript > let {{ type n = 16 }}
sawscript > eval _siz e {| n |}
16
sawscript > eval_siz e {| 16 |}
16
For non-size types, eval_size fails at runtime:
sawscript > eval_s ize {| [16] |}
" eval_size " (< stdin >:1:1) :
eval_size : not a numeric type
In addition to the use of brackets to write Cryptol expressions inline, several built-in functions can extract
Term
values from Cryptol files in other ways. The
import
command at the top level imports all top-level
definitions from a Cryptol file and places them in scope within later bracketed expressions.
The
cryptol_load
command behaves similarly, but returns a
CryptolModule
instead. If any
CryptolModule
is in scope, its contents are available qualified with the name of the
CryptolModule
variable. To see how this
works, consider the
cryptol_prims
function, of type
() -> CryptolModule
. This function returns a built-in
module containing a collection of useful Cryptol definitions that are not available in the standard Cryptol
Prelude.
The definitions in this module include (in Cryptol syntax):
trunc : {m , n } ( fin m , fin n) = > [m + n ] -> [n ]
uext : {m , n} ( fin m , fin n ) = > [n ] -> [m + n ]
sgt : { n} ( fin n ) = > [n ] -> [n ] -> Bit
sge : { n} ( fin n ) = > [n ] -> [n ] -> Bit
©2017 Galois, Inc. Page 9
Using the Software Analysis Workbench (SAW)
slt : { n} ( fin n ) = > [n ] -> [n ] -> Bit
sle : { n} ( fin n ) = > [n ] -> [n ] -> Bit
These perform bit-vector operations of truncation (
trunc
), unsigned extension (
uext
), and signed comparison
(
sgt
,
sge
,
slt
, and
sle
). These definitions are typically accessed through binding
cryptol_prims
to a local
variable:
sawscript > set_b ase 16
sawscript > let m = cryptol_ pri ms ()
sawscr ipt > let x = {{ ( m :: trunc 0 x23 ) : [4] }}
sawscript > print x
0 x3
The 8-bit value 0x23 was truncated to a 4-bit value 0x3.
Finally, a specific definition can be extracted from a
CryptolModule
more explicitly using the
cryptol_extract command:
cry pto l_ ext ra ct : C ryp tol Module -> String -> TopLevel Term
Transforming Term Values
The three primary functions of SAW are extracting models (
Term
values) from programs, transforming those
models, and proving properties about models using external provers. So far we’ve shown how to construct
Term
values from Cryptol programs; later sections will describe how to extract them from other programs.
Now we show how to use the various term transformation features available in SAW.
Rewriting
Rewriting a
Term
consists of applying one or more rewrite rules to it, resulting in a new
Term
. A rewrite rule
in SAW can be specified in multiple ways:
as the definition of a function that can be unfolded,
as a term of Boolean type (or a function returning a Boolean) that is an equality statement, and
as a term of equality type with a body that encodes a proof that the equality in the type is valid.
Each of these forms is a
Term
of a different shape. In each case the term logically consists of two parts, each of
which may contain variables (bound by enclosing lambda expressions). By thinking of the variables as holes
that may match any sub-term, the two parts of each term can both be seen as patterns. The left-hand pattern
describes a term to match (which may be a sub-term of the full term being rewritten), and the right-hand
pattern describes a term to replace it with. Any variable in the right-hand pattern must also appear in the
left-hand pattern and will be instantiated with whatever sub-term matched that variable in the original term.
For example, say we have the following Cryptol function:
\( x :[8]) -> (x * 2) + 1
We might for some reason want to replace multiplication by a power of two with a shift. We can describe this
replacement using an equality statement in Cryptol:
\( y :[8]) - > (y * 2) == ( y << 1)
Interpreting this as a rewrite rule, it says that for any 8-bit vector (call it
y
for now), we can replace
y*2
with y << 1. Applying this rule to the earlier expression would then yield:
\( x : [8 ]) - > (x << 1) + 1
©2017 Galois, Inc. Page 10
Using the Software Analysis Workbench (SAW)
The general philosophy of rewriting is that the left and right patterns, while syntactically different, should be
semantically equivalent. Therefore, applying a set of rewrite rules should not change the fundamental meaning
of the term being rewritten. SAW is particularly focused on the task of proving that some logical statement
expressed as a
Term
is always true. If that is in fact the case, then the entire term can be replaced by the
term
True
without changing its meaning. The rewriting process can in some cases, by repeatedly applying
rules that themselves are known to be valid, reduce a complex term entirely to
True
, which constitutes a
proof of the original statement. In other cases, rewriting can simplify terms before sending them to external
automated provers that can then finish the job. Sometimes this simplification can help the automated provers
run more quickly, and sometimes it can help them prove things they would otherwise be unable to prove by
applying reasoning steps (rewrite rules) that are not available to the automated provers.
In practical use, rewrite rules can be aggregated into
Simpset
values in SAWScript. A few pre-defined
Simpset values exist:
empty_ss : Simpset
basic_ss : Simpset
cryp tol_ ss : () -> Simpset
The first is the empty set of rules. Rewriting with it should have no effect, but it is useful as an argument to
some of the functions that construct larger
Simpset
values. The
basic_ss
constant is a collection of rules
that are useful in most proof scripts. The
cryptol_ss
value includes a collection of Cryptol-specific rules.
Some of these simplify away the abstractions introduced in the translation from Cryptol to SAWCore, which
can be useful when proving equivalence between Cryptol and non-Cryptol code. Leaving these abstractions in
place is appropriate when comparing only Cryptol code, however, so
cryptol_ss
is not included in
basic_ss
.
The next set of functions add either a single rule or a list of rules to an existing Simpset.
addsimp ': Term -> Simpset -> Simpset
addsimps ': [ Term ] -> Simp set -> Simps et
Given a Simpset, the rewrite command applies it to an existing Term to produce a new Term.
rewrite : Simpset -> Term -> Term
To make this more concrete, we examine how the rewriting example sketched above, to convert multiplication
into shift, can work in practice. We simplify everything with
cryptol_ss
as we go along so that the
Term
s
don’t get too cluttered. First, we declare the term to be transformed:
sawscr ip t > let term = r ew rite ( c ryp to l_s s () ) {{ \( x : [8 ]) - > (x * 2) + 1
}};
sawscript > p ri nt _t er m term ;
\( x :: Prelude . Vec 8 Prelude . Bool ) ->
Prelude . bvAdd 8
( Prelude . bvMul 8 x
( Prelude . bvNat 8 2) )
( Prelude . bvNat 8 1)
Next, we declare the rewrite rule:
sawscr ip t > let rule = r ew rite ( c ryp to l_s s () ) {{ \( y : [8 ]) - > (y * 2) ==
(y << 1) }};
sawscript > p ri nt _t er m rule ;
let { x0 = Prelude . Vec 8 Prelude . Bool ;
}
in \( y :: x0 ) ->
©2017 Galois, Inc. Page 11
Using the Software Analysis Workbench (SAW)
Pre lu de . eq x0
( Prelude . bvMul 8 y
( Prelude . bvNat 8 2) )
( Prelude . b vShif tL 8 Prelude . Bool
1
Prelude . False
y
( Prelude . bvNat 1 1) )
Finally, we apply the rule to the target term:
sawscr ipt > let result = rew ri te ( addsimp 'rule empt y_s s ) term ;
sawscript > pr int_te rm result ;
\( x :: Prelude . Vec 8 Prelude . Bool ) ->
Prelude . bvAdd 8
( Prelude . b vShif tL 8 Prelude . Bool
1
Prelude . False
x
( Prelude . bvNat 1 1) )
( Prelude . bvNat 8 1)
Note that
addsimp'
and
addsimps'
take a
Term
or list of
Term
s; these could in principle be anything, and
are not necessarily terms representing logically valid equalities. They have
'
suffixes because they are not
intended to be the primary interface to rewriting. When using these functions, the soundness of the proof
process depends on the correctness of these rules as a side condition.
The primary interface to rewriting uses the
Theorem
type instead of the
Term
type, as shown in the signatures
for addsimp and addsimps.
addsimp : Theorem -> Simpset -> Simpset
addsimps : [ Theorem ] -> Simpset -> Simpset
A
Theorem
is essentially a
Term
that is proven correct in some way. In general, a
Theorem
can be any
statement, and may not be useful as a rewrite rule. However, if it has the shape described earlier, it can be
used for rewriting. In the “Proofs about Terms” section, we’ll describe how to construct
Theorem
values from
Term values.
In the absence of user-constructed
Theorem
values, there are some additional built-in rules that are not
included in either
basic_ss
and
cryptol_ss
because they are not always beneficial, but that can sometimes
be helpful or essential.
add _cr yp tol _e qs : [ String ] -> Simpset -> Simpset
add _p relud e_def s : [ String ] -> Simpset -> Simpset
add _pr el ude _e qs : [ String ] -> Simpset -> Simpset
The
cryptol_ss
simpset includes rewrite rules to unfold all definitions in the
Cryptol
SAWCore module,
but does not include any of the terms of equality type. The
add_cryptol_eqs
function adds the terms of
equality type with the given names to the given
Simpset
. The
add_prelude_defs
and
add_prelude_eqs
functions add definition unfolding rules and equality-typed terms, respectively, from the SAWCore
Prelude
module.
©2017 Galois, Inc. Page 12
Using the Software Analysis Workbench (SAW)
Finally, it’s possible to construct a theorem from an arbitrary SAWCore expression (rather than a Cryptol
expression), using the core_axiom function.
core _axi om : String -> Theorem
Any Theorem introduced by this function is assumed to be correct, so use it with caution.
Folding and Unfolding
A SAWCore term can be given a name using the
define
function, and is then by default printed as that
name alone. A named subterm can be “unfolded” so that the original definition appears again.
define : String -> Term -> TopLevel Term
unfo ld_ ter m : [ String ] -> Term -> Term
For example:
sawscr ipt > let t = {{ 0 x22 }}
sawscript > prin t_te rm t
Prelude . bvNat 8 34
sawscript > t'<- define " t" t
sawscr ip t > p ri nt _t erm t '
t
sawscr ip t > let t '' = un fo ld _t er m [" t "] t '
sawscr ip t > p ri nt _t erm t ''
Prelude . bvNat 8 34
This process of folding and unfolding is useful both to make large terms easier for humans to work with
and to make automated proofs more tractable. We’ll describe the latter in more detail when we discuss
interacting with external provers.
In some cases, folding happens automatically when constructing Cryptol expressions. Consider the following
example:
sawscr ipt > let t = {{ 0 x22 }}
sawscript > prin t_te rm t
Prelude . bvNat 8 34
sawscr ip t > let {{ t '= 0 x22 }}
sawscr ip t > p ri nt _t erm {{ t '}}
t
This illustrates that a bare expression in Cryptol braces gets translated directly to a SAWCore term. However,
a Cryptol definition gets translated into a folded SAWCore term. In addition, because the second definition
of
t
occurs at the Cryptol level, rather than the SAWScript level, it is visible only inside Cryptol braces.
Definitions imported from Cryptol source files are also initially folded and can be unfolded as needed.
Other Built-in Transformation and Inspection Functions
In addition to the Term transformation functions described so far, a variety of others also exist.
bet a_red uce_t erm : Term -> Term
replace : Term -> Term -> Term -> TopLevel Term
The
beta_reduce_term
function takes any sub-expression of the form
(\x -> t)v
in the given
Term
and
replaces it with a transformed version of tin which all instances of xare replaced by v.
©2017 Galois, Inc. Page 13
Using the Software Analysis Workbench (SAW)
The
replace
function replaces arbitrary subterms. A call to
replace x y t
replaces any instance of
x
inside
twith y.
Assessing the size of a term can be particularly useful during benchmarking. SAWScript provides two
mechanisms for this.
term_size : Term -> Int
term_tree_size : Term -> Int
The first,
term_size
, calculates the number of nodes in the Directed Acyclic Graph (DAG) representation
of a
Term
used internally by SAW. This is the most appropriate way of determining the resource use of a
particular term. The second,
term_tree_size
, calculates how large a
Term
would be if it were represented by
a tree instead of a DAG. This can, in general, be much, much larger than the number returned by
term_size
,
and serves primarily as a way of assessing, for a specific term, how much benefit there is to the term sharing
used by the DAG representation.
Finally, there are a few commands related to the internal SAWCore type of a Term.
chec k_te rm : Term -> TopLevel ()
type : Term -> Type
The
check_term
command checks that the internal structure of a
Term
is well-formed and that it passes all
of the rules of the SAWCore type checker. The
type
function returns the type of a particular
Term
, which
can then be used to, for example, construct a new fresh variable with fresh_symbolic.
Loading and Storing Terms
Most frequently,
Term
values in SAWScript come from Cryptol, JVM, or LLVM programs, or some transfor-
mation thereof. However, it is also possible to obtain them from various other sources.
pars e_co re : String -> Term
read_aig : String -> TopLevel Term
read _byt es : String -> TopLevel Term
read_core : String -> TopLevel Term
The
parse_core
function parses a
String
containing a term in SAWCore syntax, returning a
Term
. The
read_core
command is similar, but obtains the text from the given file and expects it to be in the simpler
SAWCore external representation format, rather than the human-readable syntax shown so far. The
read_aig
command returns a
Term
representation of an And-Inverter-Graph (AIG) file in AIGER format.
The
read_bytes
command reads a constant sequence of bytes from a file and represents it as a
Term
. Its
result will always have Cryptol type [n][8] for some n.
It is also possible to write
Term
values into files in various formats, including: AIGER (
write_aig
), CNF
(write_cnf), SAWCore external representation (write_core), and SMT-Lib version 2 (write_smtlib2).
write_aig : String -> Term -> To pLeve l ()
write_cnf : String -> Term -> To pLeve l ()
writ e_co re : String -> Term -> Top Level ()
wri te_ smt lib 2 : String -> Term -> TopLe vel ()
©2017 Galois, Inc. Page 14
Using the Software Analysis Workbench (SAW)
Proofs about Terms
The goal of SAW is to facilitate proofs about the behavior of programs. It may be useful to prove some small
fact to use as a rewrite rule in later proofs, but ultimately these rewrite rules come together into a proof of
some higher-level property about a software system.
Whether proving small lemmas (in the form of rewrite rules) or a top-level theorem, the process builds on the
idea of a proof script that is run by one of the top level proof commands.
prov e_p rin t : Pr oof Script Sa tResult -> Term -> TopLe vel Theorem
sat_print : ProofSc rip t SatR esul t -> Term -> To pLeve l ()
The
prove_print
command takes a proof script (which we’ll describe next) and a
Term
. The
Term
should be
of function type with a return value of
Bool
(
Bit
at the Cryptol level). It will then use the proof script to
attempt to show that the
Term
returns
True
for all possible inputs. If it is successful, it will print
Valid
and
return a Theorem. If not, it will abort.
The
sat_print
command is similar except that it looks for a single value for which the
Term
evaluates to
True and prints out that value, returning nothing.
A similar command to
prove_print
,
prove_core
, can produce a
Theorem
from a string containing a SAWCore
term.
prov e_co re : Proof Scr ipt SatResult -> String -> TopLevel Theorem
Automated Tactics
The simplest proof scripts just specify the automated prover to use. The
ProofScript
values
abc
and
z3
select the ABC and Z3 theorem provers, respectively, and are typically good choices.
For example, combining prove_print with abc:
sawscript > t <- pro ve_ pri nt abc {{ \( x :[8]) -> x+x == x*2 }}
Valid
sawscript > t
Theorem ( let { x0 = Cryptol . TCSeq ( Crypto l . TCNum 8) C ryptol . TCBit ;
x1 = C ryptol . e PA ri th x0 ;
}
in \( x:: Prelud e . Vec 8 Pre lude . Bool ) ->
Crypto l . ecEq x0
( Cryptol . ePCmp x0 )
( Cryptol . ecPlus x0 x1 x x)
( Cryptol . ecMul x0 x1 x
( Prelude . bvNat 8 2) ))
Similarly,
sat_print
will show that the function returns
True
for one specific input (which it should, since
we already know it returns True for all inputs):
sawscript > sat_prin t abc {{ \( x :[8]) -> x + x == x*2 }}
Sat : [x = 0]
In addition to these, the
boolector
,
cvc4
,
mathsat
, and
yices
provers are available. The internal decision
procedure
rme
, short for Reed-Muller Expansion, is an automated prover that works particularly well on the
Galois field operations that show up, for example, in AES.
In more complex cases, some pre-processing can be helpful or necessary before handing the problem off to an
automated prover. The pre-processing can involve rewriting, beta reduction, unfolding, the use of provers
©2017 Galois, Inc. Page 15
Using the Software Analysis Workbench (SAW)
that require slightly more configuration, or the use of provers that do very little real work.
Proof Script Diagnostics
During development of a proof, it can be useful to print various information about the current goal. The
following tactics are useful in that context.
prin t_go al : Pr oof Script ()
pri nt _go al _cons ts : Proof Scr ipt ()
pri nt_go al_de pth : Int -> ProofS cri pt ()
pri nt_ go al_ si ze : Pr oofScri pt ()
The
print_goal
tactic prints the entire goal in SAWCore syntax. The
print_goal_depth
is intended for
especially large goals. It takes an integer argument,
n
, and prints the goal up to depth
n
. Any elided subterms
are printed with a
...
notation. The
print_goal_consts
tactic prints a list of the names of subterms that
are folded in the current goal, and
print_goal_size
prints the number of nodes in the DAG representation
of the goal.
Rewriting in Proof Scripts
The
simplify
command works just like the
rewrite
command, except that it works in a
ProofScript
context and implicitly transforms the current (unnamed) goal rather than taking a Term as a parameter.
simplify : S impset -> Pro ofScrip t ()
Other Transformations
Some useful transformations are not easily specified using equality statements, and instead have special
tactics.
bet a_red uce_g oal : ProofSc rip t ()
unfolding : [ String ] -> Pr oo fScript ()
The
beta_reduce_goal
tactic takes any sub-expression of the form
(\x -> t)v
and replaces it with a
transformed version of tin which all instances of xare replaced by v.
The
unfolding
tactic works like
unfold_term
but on the current goal. Using
unfolding
is mostly valuable
for proofs based entirely on rewriting, since default behavior for automated provers is to unfold everything
before sending a goal to a prover. However, with Z3 and CVC4, it is possible to indicate that specific named
subterms should be represented as uninterpreted functions.
unin t_cvc4 : [ String ] -> ProofS cr ipt SatResult
unin t_ yices : [ String ] -> Pr oo fScript S atR esu lt
unint_z3 : [ String ] -> Proof Sc ript Sat Resu lt
The list of
String
arguments in these cases indicates the names of the subterms to leave folded, and therefore
present as uninterpreted functions to the prover. To determine which folded constants appear in a goal, use
the print_goal_consts function described above.
Ultimately, we plan to implement a more generic tactic that leaves certain constants uninterpreted in whatever
prover is ultimately used (provided that uninterpreted functions are expressible in the prover).
©2017 Galois, Inc. Page 16
Using the Software Analysis Workbench (SAW)
Other External Provers
In addition to the built-in automated provers already discussed, SAW supports more generic interfaces to
other arbitrary theorem provers supporting specific interfaces.
ext erna l_a ig_s ol ver : String -> [ String ] -> P ro ofScrip t Sat Res ult
ext erna l_c nf_s ol ver : String -> [ String ] -> P ro ofScrip t Sat Res ult
The
external_aig_solver
function supports theorem provers that can take input as a single-output AIGER
file. The first argument is the name of the executable to run. The second argument is the list of command-line
parameters to pass to that executable. Within this list, any element that consists of
%f
on its own is replaced
with the name of the temporary AIGER file generated for the proof goal. The output from the solver is
expected to be in DIMACS solution format.
The
external_cnf_solver
function works similarly but for SAT solvers that take input in DIMACS CNF
format and produce output in DIMACS solution format.
Offline Provers
For provers that must be invoked in more complex ways, or to defer proof until a later time, there are
functions to write the current goal to a file in various formats, and then assume that the goal is valid through
the rest of the script.
offl ine _ai g : String -> Pr oofS cri pt SatResult
offl ine _cn f : String -> Pr oofS cri pt SatResult
off lin e_ ext co re : String -> Proo fSc ript SatRe sult
off lin e_ smt li b2 : String -> Proo fSc ript SatRe sult
off line _u nint _smtl ib2 : [ String ] -> String -> Proo fScrip t Sat Result
These support the AIGER, DIMACS CNF, shared SAWCore, and SMT-Lib v2 formats, respectively. The
shared representation for SAWCore is described in the
saw-script
repository. The
offline_unint_smtlib2
command represents the folded subterms listed in its first argument as uninterpreted functions.
Miscellaneous Tactics
Some proofs can be completed using unsound placeholders, or using techniques that do not require significant
computation.
assume_unsat : P roofScr ipt S atResult
assume_valid : P roofScr ipt P roo fResult
quic kche ck : Int -> P roo fSc ript SatResult
trivial : Pr oofScri pt SatResult
The
assume_unsat
and
assume_valid
tactics indicate that the current goal should be considered unsatisfiable
or valid, depending on whether the proof script is checking satisfiability or validity. At the moment,
java_verify
and
llvm_verify
run their proofs in the a satisfiability-checking context, so
assume_unsat
is
currently the appropriate tactic. This is likely to change in the future.
The
quickcheck
tactic runs the goal on the given number of random inputs, and succeeds if the result of
©2017 Galois, Inc. Page 17
Using the Software Analysis Workbench (SAW)
evaluation is always
True
. This is unsound, but can be helpful during proof development, or as a way to
provide some evidence for the validity of a specification believed to be true but difficult or infeasible to prove.
The
trivial
tactic states that the current goal should be trivially true (i.e., the constant
True
or a function
that immediately returns True). It fails if that is not the case.
Proof Failure and Satisfying Assignments
The
prove_print
and
sat_print
commands print out their essential results (potentially returning a
Theorem
in the case of
prove_print
). In some cases, though, one may want to act programmatically on the result of
a proof rather than displaying it.
The
prove
and
sat
commands allow this sort of programmatic analysis of proof results. To allow this, they
use two types we haven’t mentioned yet:
ProofResult
and
SatResult
. These are different from the other
types in SAWScript because they encode the possibility of two outcomes. In the case of
ProofResult
, a
statement may be valid or there may be a counter-example. In the case of
SatResult
, there may be a
satisfying assignment or the statement may be unsatisfiable.
prove : ProofSc rip t SatR esul t -> Term -> To pLeve l ProofRe sul t
sat : ProofS cri pt SatResu lt -> Term -> TopLevel Sat Result
To operate on these new types, SAWScript includes a pair of functions:
cas ePr oo fRe su lt : { b} ProofR esu lt -> b -> ( Term -> b ) -> b
cas eSa tRe sul t : {b } S atRe sult -> b -> ( Term -> b) -> b
The
caseProofResult
function takes a
ProofResult
, a value to return in the case that the statement is
valid, and a function to run on the counter-example, if there is one. The
caseSatResult
function has the
same shape: it returns its first argument if the result represents an unsatisfiable statement, or its second
argument applied to a satisfying assignment if it finds one.
AIG Values and Proofs
Most SAWScript programs operate on
Term
values, and in most cases this is the appropriate representation.
It is possible, however, to represent the same function that a
Term
may represent using a different data
structure: an And-Inverter-Graph (AIG). An AIG is a representation of a Boolean function as a circuit
composed entirely of AND gates and inverters. Hardware synthesis and verification tools, including the ABC
tool that SAW has built in, can do efficient verification and particularly equivalence checking on AIGs.
To take advantage of this capability, a handful of built-in commands can operate on AIGs.
bitblast : Term -> TopLevel AIG
cec : AIG -> AIG -> TopL evel Pr oof Res ult
load_aig : String -> TopLevel AIG
save_aig : String -> AIG -> TopLevel ()
sav e_a ig _as _c nf : String -> AIG -> TopLevel ()
The
bitblast
command represents a
Term
as an
AIG
by “blasting” all of its primitive operations (things like
bit-vector addition) down to the level of individual bits. The
cec
command, for Combinational Equivalence
Check, will compare two AIGs, returning a
ProofResult
representing whether the two are equivalent. The
load_aig
and
save_aig
commands work with external representations of AIG data structures in the AIGER
©2017 Galois, Inc. Page 18
Using the Software Analysis Workbench (SAW)
format. Finally,
save_aig_as_cnf
will write an AIG out in CNF format for input into a standard SAT
solver.
Symbolic Execution
Analysis of Java and LLVM within SAWScript relies heavily on symbolic execution, so some background on
how this process works can help with understanding the behavior of the available built-in functions.
At the most abstract level, symbolic execution works like normal program execution except that the values of
all variables within the program can be arbitrary expressions, potentially containing free variables, rather than
concrete values. Therefore, each symbolic execution corresponds to some set of possible concrete executions.
As a concrete example, consider the following C program that returns the maximum of two values:
unsi g ned int max ( u nsig n ed int x , u n sign e d int y ) {
if ( y > x) {
return y;
}else {
return x;
}
}
If you call this function with two concrete inputs, like this:
int r = max (5 , 4) ;
then it will assign the value
5
to
r
. However, we can also consider what it will do for arbitrary inputs.
Consider the following example:
int r = max (a , b);
where
a
and
b
are variables with unknown values. It is still possible to describe the result of the
max
function
in terms of aand b. The following expression describes the value of r:
ite ( b > a) b a
where
ite
is the “if-then-else” mathematical function, which based on the value of the first argument
returns either the second or third. One subtlety of constructing this expression, however, is the treatment of
conditionals in the original program. For any concrete values of
a
and
b
, only one branch of the
if
statement
will execute. During symbolic execution, on the other hand, it is necessary to execute both branches, track
two different program states (each composed of symbolic values), and then merge those states after executing
the
if
statement. This merging process takes into account the original branch condition and introduces the
ite expression.
A symbolic execution system, then, is very similar to an interpreter that has a different notion of what
constitutes a value and executes all paths through the program instead of just one. Therefore, the execution
process is similar to that of a normal interpreter, and the process of generating a model for a piece of code is
similar to building a test harness for that same code.
More specifically, the setup process for a test harness typically takes the following form:
Initialize or allocate any resources needed by the code. For Java and LLVM code, this typically means
allocating memory and setting the initial values of variables.
Execute the code.
Check the desired properties of the system state after the code completes.
Accordingly, three pieces of information are particularly relevant to the symbolic execution process, and are
therefore needed as input to the symbolic execution system:
©2017 Galois, Inc. Page 19
Using the Software Analysis Workbench (SAW)
The initial (potentially symbolic) state of the system.
The code to execute.
The final state of the system, and which parts of it are relevant to the properties being tested.
In the following sections, we describe how the Java and LLVM analysis primitives work in the context of
these key concepts. We start with the simplest situation, in which the structure of the initial and final states
can be directly inferred, and move on to more complex cases that require more information from the user.
Symbolic Termination
In the previous section we described the process of executing multiple branches and merging the results
when encountering a conditional statement in the program. When a program contains loops, the branch that
chooses to continue or terminate a loop could go either way. Therefore, without a bit more information, the
most obvious implementation of symbolic execution would never terminate when executing programs that
contain loops.
The solution to this problem is to analyze the branch condition whenever considering multiple branches. If
the condition for one branch can never be true in the context of the current symbolic state, there is no reason
to execute that branch, and skipping it can make it possible for symbolic execution to terminate.
Directly comparing the branch condition to a constant can sometimes be enough to ensure termination. For
example, in simple, bounded loops like the following, comparison with a constant is sufficient.
for (int i = 0; i < 10; i ++) {
// do s omet hing
}
In this case, the value of
i
is always concrete, and will eventually reach the value
10
, at which point the
branch corresponding to continuing the loop will be infeasible.
As a more complex example, consider the following function:
uint8_t f(uint8_t i) {
int done = 0;
while (! done ) {
if ( i % 8 == 0) done = 1;
i += 5;
}
return i;
}
The loop in this function can only be determined to symbolically terminate if the analysis takes into account
algebraic rules about common multiples. Similarly, it can be difficult to prove that a base case is eventually
reached for all inputs to a recursive program.
In this particular case, however, the code is guaranteed to terminate after a fixed number of iterations (where
the number of possible iterations is a function of the number of bits in the integers being used). To show
that the last iteration is in fact the last possible one, it’s necessary to do more than just compare the branch
condition with a constant. Instead, we can use the same proof tools that we use to ultimately analyze the
generated models to, early in the process, prove that certain branch conditions can never be true (i.e., are
unsatisfiable).
Normally, most of the Java and LLVM analysis commands simply compare branch conditions to the constant
True
or
False
to determine whether a branch may be feasible. However, each form of analysis allows branch
satisfiability checking to be turned on if needed, in which case functions like fabove will terminate.
Now, we examine the details of the specific commands available to analyze JVM and LLVM programs.
©2017 Galois, Inc. Page 20
Using the Software Analysis Workbench (SAW)
Loading Code
The first step in analyzing any code is to load it into the system.
To load LLVM code, simply provide the location of a valid bitcode file to the llvm_load_module function.
llv m_loa d_mod ule : String -> TopLevel LLVMModu le
The resulting
LLVMModule
can be passed into the various functions described below to perform analysis of
specific LLVM functions.
Loading Java code is slightly more complex, because of the more structured nature of Java packages. First,
when running
saw
, two flags control where to look for classes. The
-j
flag takes the name of a JAR file as an
argument and adds the contents of that file to the class database. The
-c
flag takes the name of a directory
as an argument and adds all class files found in that directory (and its subdirectories) to the class database.
By default, the current directory is included in the class path. However, the Java runtime and standard
library (usually called
rt.jar
) is generally required for any non-trivial Java code, and can be installed in
a wide variety of different locations. Therefore, for most Java analysis, you must provide a
-j
argument
specifying where to find this file.
Once the class path is configured, you can pass the name of a class to the java_load_class function.
jav a_l oa d_c la ss : String -> TopLevel Jav aClass
The resulting
JavaClass
can be passed into the various functions described below to perform analysis of
specific Java methods.
Direct Extraction
In the case of the
max
function described earlier, the relevant inputs and outputs are immediately apparent.
The function takes two integer arguments, always uses both of them, and returns a single integer value,
making no other changes to the program state.
In cases like this, a direct translation is possible, given only an identification of which code to execute. Two
functions exist to handle such simple code:
java_extract : Java Clas s -> String -> JavaSetup () -> To pLeve l Term
llvm_extract : L LVMModu le -> String -> LLVMSe tup () -> TopLevel Term
The structure of these two functions is essentially identical. The first argument describes where to look for
code (in either a Java class or an LLVM module, loaded as described in the previous section). The second
argument is the name of the method or function to extract.
The third argument provides the ability to configure other aspects of the symbolic execution process. At the
moment, only one option is available: pass in
java_pure
or
llvm_pure
, for Java and LLVM respectively, and
the default extraction process is simply to set both arguments to fresh symbolic variables.
When the
..._extract
functions complete, they return a
Term
corresponding to the value returned by the
function or method.
These functions currently work only for code that takes some fixed number of integral parameters, returns
an integral result, and does not access any dynamically-allocated memory (although temporary memory
allocated during execution and not visible afterward is allowed).
Creating Symbolic Variables
The direct extraction process just discussed automatically introduces symbolic variables and then abstracts
over them, yielding a SAWScript
Term
that reflects the semantics of the original Java or LLVM code. For
©2017 Galois, Inc. Page 21
Using the Software Analysis Workbench (SAW)
simple functions, this is often the most convenient interface. For more complex code, however, it can be
necessary (or more natural) to specifically introduce fresh variables and indicate what portions of the program
state they correspond to.
The function fresh_symbolic is responsible for creating new variables in this context.
fre sh_ symbo lic : String -> Type -> TopLevel Term
The first argument is a name used for pretty-printing of terms and counter-examples. In many cases it makes
sense for this to be the same as the name used within SAWScript, as in the following:
x <- fresh_symbolic "x" ty;
However, using the same name is not required.
The second argument to
fresh_symbolic
is the type of the fresh variable. Ultimately, this will be a SAWCore
type; however, it is usually convenient to specify it using Cryptol syntax with the type quoting brackets
{|
and
|}
. For example, creating a 32-bit integer, as might be used to represent a Java
int
or an LLVM
i32
,
can be done as follows:
x <- f res h_s ym bol ic "x " {| [32] |};
Although symbolic execution works best on symbolic variables, which are “unbound” or “free”, most of the
proof infrastructure within SAW uses variables that are bound by an enclosing lambda expression. Given a
Term with free symbolic variables, we can construct a lambda term that binds them in several ways.
abs tr act _s ymbol ic : Term -> Term
lambda : Term -> Term -> Term
lambdas : [ Term ] -> Term -> Term
The
abstract_symbolic
function is the simplest, but gives you the least control. It finds all symbolic
variables in the
Term
and constructs a lambda expression binding each one, in some order. The result is a
function of some number of arguments, one for each symbolic variable.
sawscript > x <- fre sh_ sy mbo lic " x" {| [8] |}
sawscript > let t = {{ x + x }}
sawscript > prin t_te rm t
let { x0 = Cryptol . TCSeq ( Crypto l . TCNum 8) C ryptol . TCBit ;
}
in Cryptol . ecPlus x0
(Cryptol.ePArith x0)
x
x
sawscript > let f = ab strac t_ sym bo lic t
sawscript > prin t_te rm f
let { x0 = Cryptol . TCSeq ( Crypto l . TCNum 8) C ryptol . TCBit ;
}
in \( x:: Prelud e . Vec 8 Pre lude . Bool ) ->
Cryptol . ecPlus x0
(Cryptol.ePArith x0)
x
x
©2017 Galois, Inc. Page 22
Using the Software Analysis Workbench (SAW)
If there are multiple symbolic variables in the
Term
passed to
abstract_symbolic
, the ordering of parameters
can be hard to predict. In some cases (such as when a proof is the immediate next step, and it’s expected to
succeed) the order isn’t important. In others, it’s nice to have more control over the order.
The building block for controlled binding is
lambda
. It takes two terms: the one to transform, and the portion
of the term to abstract over. Generally, the first
Term
is one obtained from
fresh_symbolic
and the second
is a Term that would be passed to abstract_symbolic.
sawscript > let f = lambda x t
sawscript > prin t_te rm f
let { x0 = Cryptol . TCSeq ( Crypto l . TCNum 8) C ryptol . TCBit ;
}
in \( x:: Prelud e . Vec 8 Pre lude . Bool ) ->
Cryptol . ecPlus x0
(Cryptol.ePArith x0)
x
x
For Terms with more than one symbolic variable, lambdas allows you to list the order in which they should
be bound. Consider, for example, a Term which adds two symbolic variables:
sawscript > x1 <- f re sh_ sym bolic " x1 " {| [8] |}
sawscript > x2 <- f re sh_ sym bolic " x2 " {| [8] |}
sawscript > let t = {{ x1 + x2 }}
sawscript > prin t_te rm t
let { x0 = Cryptol . TCSeq ( Crypto l . TCNum 8) C ryptol . TCBit ;
x1 = Prelude . Vec 8 Prelud e . Bool ;
}
in Cryptol . ecPlus x0
(Cryptol.ePArith x0)
x1
x2
We can turn tinto a function that takes x1 followed by x2:
sawscr ip t > let f1 = l am bdas [ x1 , x2 ] t
sawscript > pri nt_t erm f1
let { x0 = Cryptol . TCSeq ( Crypto l . TCNum 8) C ryptol . TCBit ;
x1 = Prelude . Vec 8 Prelud e . Bool ;
}
in \( x1 :: x1 ) ->
\( x2 :: x1 ) ->
Cryptol . ecPlus x0
(Cryptol.ePArith x0)
x1
x2
Or we can turn tinto a function that takes x2 followed by x1:
sawscr ip t > let f1 = l am bdas [ x2 , x1 ] t
sawscript > pri nt_t erm f1
let { x0 = Cryptol . TCSeq ( Crypto l . TCNum 8) C ryptol . TCBit ;
x1 = Prelude . Vec 8 Prelud e . Bool ;
}
in \( x2 :: x1 ) ->
©2017 Galois, Inc. Page 23
Using the Software Analysis Workbench (SAW)
\( x1 :: x1 ) ->
Cryptol . ecPlus x0
(Cryptol.ePArith x0)
x1
x2
Monolithic Symbolic Execution
In many cases, the inputs and outputs of a function are more complex than supported by the direct extraction
process just described. In that case, it’s necessary to provide more information. In particular, following the
structure described earlier, we need:
For every pointer or object reference, how much storage space it refers to.
A list of (potentially symbolic) values for some elements of the initial program state.
A list of elements of the final program state to treat as outputs.
This capability is provided by the following built-in functions:
java_symexec : JavaCl ass ->
String ->
[( String , Term ) ] ->
[ String ] ->
Bool ->
TopLevel Term
llvm_symexec : LLVMM odul e ->
String ->
[( String , Int ) ] ->
[( String , Term , Int )] - >
[( String , Int ) ] ->
Bool ->
TopLevel Term
For both functions, the first two arguments are the same as for the direct extraction functions from the
previous section, identifying what code to execute. The final argument for both indicates whether or not to
do branch satisfiability checking.
The remaining arguments are slightly different for the two functions, due to the differences between JVM and
LLVM programs.
For
java_symexec
, the third argument, of type
[(String, Term)]
, provides information to configure the
initial state of the program. Each
String
is an expression describing a component of the state, such as the
name of a parameter, or a field of an object. Each
Term
provides the initial value of that component (which
may include symbolic variables returned by fresh_symbolic).
The syntax of these expressions is as follows:
Arguments to the method being analyzed can be referred to by name (if the
.class
file contains
debugging information, as it will if compiled with
javac -g
). The expression referring to the value
of the argument
x
in the
max
example is simply
x
. For Java methods that do not have debugging
information, arguments can be named positionally with
args[0]
,
args[1]
and so on. The name
this
refers to the same implicit parameter as the keyword in Java.
The expression form
pkg.C.f
refers to the static field
f
of class
C
in package
pkg
(and deeper nesting
of packages is allowed).
The expression return refers to the return value of the method under analysis.
©2017 Galois, Inc. Page 24
Using the Software Analysis Workbench (SAW)
For an expression eof object type, e.f refers to the instance field fof the object described by e.
The value of an expression of array type is the entire contents of the array. There is currently no way
to refer to individual elements of an array.
The fourth argument of java_symexec is a list of expressions describing the elements of the state to return
as outputs. The returned
Term
will be of tuple type if this list contains more than one element, or simply
the value of the one state element if the list contains only one. In addition to the expressions listed above,
this list can contain the special variable
$safety
, which refers to a
Term
describing the conditions under
which the result of symbolic execution is well-defined. It can be useful to obtain this
Term
and prove that it’s
always valid (that the program is always safe), or that it’s valid under the expected preconditions.
The
llvm_symexec
command uses an expression syntax similar to that for
java_symexec
, but not identical.
The syntax is as follows:
Arguments to the function being analyzed can be referred to by name (if the name is reflected in the
LLVM code, as it is with code generated by some versions of Clang). The expression referring to the
value of the argument
x
in the
max
example is simply
x
. For LLVM functions that do not have named
arguments (such as those generated by the Rust compiler), arguments can be named positionally with
args[0],args[1] and so on.
Global variables can be referred to directly by name.
The expression return refers to the return value of the function under analysis.
For any valid expression
e
referring to something with pointer type, the expression
*e
refers to the
value pointed to. There are some differences between this and the equivalent expression in C, however.
If, for instance,
e
has type
int *
, then
*e
will have type
int
. If
e
referred to a pointer to an array, the
C expression
*e
would refer to the first element of that array. In SAWScript, it refers to the entire
contents of the array, and there is no way to refer to individual elements of an array.
For any valid expression
e
referring to a pointer to a
struct
, the expression
e->n
, for some natural
number
n
, refers to the
n
th field of that
struct
. If the LLVM file contains debugging information, the
field names used in the original C types are also allowed.
For any valid expression
e
referring to a
struct
(directly, not via pointer), the expression
e.n
, for
some natural number
n
, refers to the
n
th field of that
struct
. This is particularly useful for fields of
nested structs, even if the outer struct is passed by pointer. As for indirect fields, names are allowed if
debugging information is present.
In addition to the different expression language, the arguments are similar but not identical. The third
argument, of type
[(String, Int)]
, indicates for each pointer how many elements it points to. Before
execution, SAW will allocate the given number of elements of the static type of the given expression. The
strings given here should be expressions identifying pointers rather than the values of those pointers.
The fourth argument, of type [(String, Term, Int)], indicates the initial values to write to the program
state before execution. The elements of this list should include value expressions. For example, if a function
has an argument named
p
of type
int *
, the allocation list might contain the element
("p", 1)
, whereas
the initial values list might contain the element
("*p", v, 1)
, for some value
v
. The
Int
portion of each
tuple indicates how large the term is: for a term with Cryptol type
[n]a
, the
Int
argument should be
n
. In
the future we expect this value to be inferred.
Finally, the fifth argument, of type
[(String, Int)]
, indicates the elements to read from the final state. For
each entry, the
String
should be a value expression and the
Int
parameter indicates how many elements to
read. The number of elements does not need to be the same as the number of elements allocated or written
in the initial state. However, reading past the end of an object or reading a location that has not been
initialized will lead to an error. In this list, the special name $safety works in the same way as for Java.
©2017 Galois, Inc. Page 25
Using the Software Analysis Workbench (SAW)
Examples
The following code is a complete example of using the java_symexec function.
// show that add (x ,y ) == add (y , x) for all x and y
cadd <- ja va_l oad_ clas s " Add ";
x <- f res h_s ym bol ic "x " {| [32] |};
y <- f res h_s ym bol ic "y " {| [32] |};
res <- j ava _sy mex ec cadd " add " [(" x" , x ) , (" y" , y)] [" return " , " $safety
"] true ;
let jadd = {{ res .0 }};
let safe = {{ res .1 }};
jadd '<- abstract_symbolic jadd;
prin t_te rm jadd ';
print " Proving com mu tat ivi ty :";
prov e_p rin t abc {{ \a b -> jadd 'a b == jadd 'b a }};
print " Proving safety :";
pro ve _p rint abc safe ;
print " Done .";
This code first loads the
Add
class and creates two 32-bit symbolic variables,
x
and
y
. It then symbolically
executes the
add
method with the symbolic variables just created as its two arguments, and returns the
symbolic expression denoting the method’s return value.
Once the script has a
Term
in hand (the variable
ja
), it prints the term and then translates the version
containing symbolic variables into a function that takes concrete values for those variables as arguments.
Finally, it proves that the resulting function is commutative.
Running this script through saw gives the following output:
% saw -j < pat h to > rt . jar j ava _sy me x ec . saw
Loadi ng file " j ava_ sy mexe c . saw "
let { x0 = Prelude . Vec 32 P relude . Bool ;
}
in \( x :: x0 ) ->
\( y :: x0 ) -> Pre lu de . b vA dd 32 x y
Proving c ommutiv ity :
Valid
Proving safety :
Valid
Done.
Limitations
Although the
symexec
functions are more flexible than the
extract
functions, they still have some limitations
and assumptions.
When allocating memory for objects or arrays, each allocation is done independently. There is currently
no way to create data structures that share sub-structures. No aliasing is possible. Therefore, it is
important to take into account that any proofs performed on the results of symbolic execution will
not necessarily reflect the behavior of the code being analyzed if it is run in a context where its inputs
involve aliasing or overlapping memory regions.
The sizes and pointer relationships between objects in the heap must be specified before doing symbolic
execution. Therefore, the results may not reflect the behavior of the code when called with, for example,
arrays of different sizes.
©2017 Galois, Inc. Page 26
Using the Software Analysis Workbench (SAW)
In Java, any variables of class type are initialized to refer to an object of that specific, statically-declared
type, while in general they may refer to objects of subtypes of that type. Therefore, the code under
analysis may behave differently when given parameters of more specific types.
Specification-Based Verification
The built-in functions described so far work by extracting models of code that can then be used for a variety
of purposes, including proofs about the properties of the code.
When the goal is to prove equivalence between some Java or LLVM code and a specification, however, a
more declarative approach is sometimes convenient. The following two functions allow for combined model
extraction and verification.
java _ve rif y : Java Clas s ->
String ->
[JavaMethodSpec] ->
JavaSetup () ->
TopLevel JavaMethodSpec
llvm _ve rif y : LLVM Mod ule ->
String ->
[LLVMMethodSpec] ->
LLVMSetup () ->
TopLevel LLVMMethodSpec
Like all of the functions for Java and LLVM analysis, the first two parameters indicate what code to analyze.
The third parameter is used for compositional verification, as described in the next section. For now, we will
use the empty list. The final parameter describes the specification of the code to be analyzed, comprised of
commands of type
JavaSetup
or
LLVMSetup
. In most cases, this parameter will be a
do
block containing
a sequence of commands of the appropriate type. Java and LLVM specifications are slightly different, but
make use of largely the same set of concepts.
Several commands are available to configure the contents of the initial state before symbolic execution.
Several commands are available to describe what to check in the final state after symbolic execution.
One final command describes how to prove that the code under analysis matches the specification.
The following sections describe the details of configuring initial states, stating the expected properties of the
final state, and proving that the final state actually satisfies those properties.
Configuring the Initial State
The first step in configuring the initial state is to specify which program variables are important, and to
specify the types of those variables more precisely. The symbolic execution system currently expects the
layout of memory before symbolic execution to be completely specified. As in
llvm_symexec
, SAW needs
information about how much space every pointer or reference variable points to. With one exception, SAW
assumes that every pointer points to a distinct region of memory.
Because of this structure, separate functions are used to describe variables with values of base types and
variables of pointer type.
For simple integer values, we use java_var or llvm_var.
java_var : String -> JavaType -> JavaSetup Term
llvm_var : String -> LLVMType -> LLVMSetup Term
©2017 Galois, Inc. Page 27
Using the Software Analysis Workbench (SAW)
These functions both take a variable name and a type. The variable names use the same syntax described
earlier for java_symexec and llvm_symexec. The types are built up using the following functions:
java_bool : Jav aType
java_byte : Jav aType
java_char : Jav aType
java _sho rt : JavaType
java_int : JavaType
java_long : Jav aType
java _flo at : JavaType
java _do ubl e : JavaType
java _cla ss : String -> JavaType
java _arr ay : Int -> Jav aType -> Java Type
llvm_int : Int -> LLVMType
llvm _arr ay : Int -> LLV MType -> LLVM Type
llvm _st ruc t : String -> LL VMTyp e
llvm _flo at : LLVMType
llvm _do ubl e : LLVMType
Most of these types are straightforward mappings to the standard Java and LLVM types. The one key
difference is that arrays must have a fixed, concrete size. Therefore, all analysis results are valid only under
the assumption that any arrays have the specific size indicated, and may not hold for other sizes. The
llvm_int function also takes an Int parameter indicating the variable’s bit width.
LLVM types can also be specified in LLVM syntax directly by using the llvm_type function.
llvm_type : String -> LLVMType
For example, llvm_type "i32" yields the same result as llvm_int 32.
The
Term
returned by
java_var
and
llvm_var
is a representation of the initial value of the variable being
declared. It can be used in any later expression.
While
java_var
and
llvm_var
declare elements of the program state that have values representable in the
logic of SAW, pointers and references exist only inside the simulator; they are not representable before or
after symbolic execution. Different functions are used to declare variables of pointer or reference type.
jav a_c lass_ var : String -> Jav aType -> JavaSetu p ()
llvm_ptr : String -> LLVMType -> LLVMSetup ()
The first argument of each function is the name of the state element it refers to. For
java_class_var
, the
second argument is the type of the object, which should always be the result of the
java_class
function
called with an appropriate class name. Arrays in Java are treated as if they were values, rather than references,
since their values are directly representable in SAWCore. For
llvm_ptr
, the second argument is the type of
the value pointed to. Both functions return no useful value (the unit type
()
), since the values of pointers
are not meaningful in SAWCore. In LLVM, arrays are represented as pointers; therefore, the pointer and the
value pointed to must be declared separately.
llvm _p tr " a" ( l lv m_arr ay 10 ( l lv m_int 32) );
a <- llvm_v ar "*a" ( llvm_ arra y 10 ( llvm_int 32) );
The
java_assert
and
llvm_assert
functions take a
Term
of boolean type as an argument, which states a
condition that must be true in the initial state before the function under analysis executes. The term can
refer to the initial values of any declared program variables.
©2017 Galois, Inc. Page 28
Using the Software Analysis Workbench (SAW)
When the condition required of an initial state is that a variable always has a specific, concrete value,
optimized forms of these functions are available. The
java_assert_eq
and
llvm_assert_eq
functions take
two arguments: an expression naming a location in the program state, and a
Term
representing an initial value.
These functions work as destructive updates in the state of the symbolic simulator, and can make branch
conditions more likely to reduce to constants. This means that, although
..._assert
and
..._assert_eq
functions can be used to make semantically-equivalent statements, using the latter can make symbolic
termination more likely.
Finally, although the default configuration of the symbolic simulators in SAW is to make every pointer or
reference refer to a fresh region of memory separate from all other pointers, it is possible to override this
behavior for Java programs by declaring that a set of references can alias each other.
jav a_ ma y_ al ia s : [ String ] -> JavaSetup ()
This function takes a list of names referring to references, and declares that any element of this set may (or
may not) alias any other. Because this is a may-alias relationship, the verification process involves a separate
proof for each possible aliasing configuration. At the moment, LLVM heaps must be completely disjoint.
Another precondition relevant only to Java concerns the set of classes that are initialized before execution of
a particular method. To state that the proof of the method being specified assumes that a class
C
is already
initialized, use java_requires_class "C".
jav a_ re qui re s_ class : String -> Ja vaSe tup ()
During verification, the
java_requires_class
clause instructs the simulator to initialize the named class
before executing the method to be verified.
Finally, one more precondition is relevant only to LLVM programs. The
llvm_assert_null
function works
like llvm_assert_eq except that it works only on pointer variables and assigns the implicit value NULL.
llv m_ass ert_n ull : String -> LLVMSetup ()
Specifying the Final State
The simplest statement about the expected final state of the method or function under analysis is a declaration
of what value it should return (generally as a function of the variables declared as part of the initial state).
java _re tur n : Term -> Jav aSetup ()
llvm _re tur n : Term -> LLV MSetup ()
llv m_ re tu rn_ar bi tr ar y : LLVMSetup ()
The
llvm_return_arbitrary
command indicates that the function does return a value, but that we don’t
want to specify what value it returns.
To account for side effects, the following two functions allow declaration of the final expected value that the
program state should contain for a specific pvariable when execution finishes.
jav a_e nsure _eq : String -> Term -> J avaSetup ()
llv m_e nsure _eq : String -> Term -> L LVMSetup ()
For the most part, these two functions may refer to the same set of variables used to set up the initial state.
However, for functions that return pointers or objects, the special name
return
is also available. It can be
used in
java_class_var
and
llvm_ptr
calls to declare the more specific object or array type of a return
value, and in the
..._ensure_eq
function to declare the associated values. For LLVM arrays, the typical
usage is as follows.
©2017 Galois, Inc. Page 29
Using the Software Analysis Workbench (SAW)
llvm_ensure_eq "*return" v;
The return expression is also useful for fields of returned objects or structs.
jav a_ en su re _e q " return .f" v;
ll vm _en su re_e q " return - >0" v ;
Finally, for LLVM programs it is possible to state that the function being analyzed is expected to allocate a
new object, stored in a given location.
llv m_a lloca tes : String -> LLVMSet up ()
When executing an override containing
llvm_allocates
, the override allocates a new object of the appropriate
type and stores a pointer to it in the given location.
Running Proofs
Once the constraints on the initial and final states have been declared, what remains is to prove that the
code under analysis actually meets these specifications. The goal of SAW is to automate this proof as much
as possible, but some degree of input into the proof process is occasionally necessary, and can be provided
with the following functions.
jav a_ verif y_ tacti c : Pr oofScri pt SatResult -> JavaS etup ()
llv m_ verif y_ tacti c : Pr oofScri pt SatResult -> LLVMS etup ()
The proof script argument to these functions specifies how to perform the proof. If the setup block does not
call one of these functions, SAW prints a warning message and skips the proof; this can sometimes be a useful
behavior during debugging, or in compositional verification as described later.
The process of verification checks all user-specified postconditions, and also checks that the safety condition
(as referred to by
$safety
in
*_symexec
) is valid, and therefore that symbolic execution is always well defined
(under the supplied pre-conditions).
Compositional Verification
The primary advantage of the specification-based approach to verification is that it allows for compositional
reasoning. That is, when proving properties of a given method or function, we can make use of properties we
have already proved about its callees rather than analyzing them anew. This enables us to reason about
much larger and more complex systems than otherwise possible.
The
java_verify
and
llvm_verify
functions returns values of type
JavaMethodSpec
and
LLVMMethodSpec
,
respectively. These values are opaque objects that internally contain both the information provided in the
associated JavaSetup or LLVMSetup blocks and the results of the verification process.
Any of these
MethodSpec
objects can be passed in via the third argument of the
..._verify
functions.
For any function or method specified by one of these parameters, the simulator will not follow calls to the
associated target. Instead, it will perform the following steps:
Check that all ..._assert and ..._assert_eq statements in the specification are satisfied.
For Java programs, check that any aliasing is compatible with the aliasing restrictions stated with
java_may_alias.
For Java programs, check that all classes required by the target method have already been initialized.
Update the simulator state as described in the specification.
Normally, a
MethodSpec
is the result of both simulation and proof of the target code. However, in some
cases, it can be useful to use a
MethodSpec
to specify some code that either doesn’t exist or is hard to prove.
©2017 Galois, Inc. Page 30
Using the Software Analysis Workbench (SAW)
In those cases, the
java_no_simulate
or
llvm_no_simulate
function can be used to indicate not to even
try to simulate the code being specified, and instead return a
MethodSpec
that is assumed to be correct. A
MethodSpec
with
*_no_simulate
can be used to provide a specification for a function or method that is
declared but not defined within the code being analyzed.
The default behavior of
java_verify
disallows allocation within the method being analyzed. This restric-
tion makes it possible to reason about all possible effects of a method, since only effects specified with
java_ensure_eq
or
java_modify
are allowed. For many cryptographic applications, this behavior is ideal,
because it is important to know whether, for instance, temporary variables storing key material have been
cleared after use. Garbage on the heap that has been collected but not cleared could let confidential informa-
tion leak. If allocation is not a concern in a particular application, the
java_allow_alloc
function makes
allocation legal within the method being specified.
Controlling Symbolic Execution
One other set of commands is available to control the symbolic execution process. These commands control
the use of satisfiability checking to determine whether both paths are feasible when encountering branches in
the program, which is particularly relevant for branches controlling the iteration of loops.
jav a_ sat _b ranch es : Bool -> JavaSetup ()
llv m_ sat _b ranch es : Bool -> LLVMSetup ()
The
Bool
parameter has the same effect as the
Bool
parameter passed to
java_symexec
and
llvm_symexec
.
Finally, in some cases, pointers in LLVM can become what look like complex symbolic values during symbolic
simulation, even though they can be simplified down to constants. Using these complex pointers directly is
slow, and simplifying them can greatly speed up symbolic execution of some programs. For other programs,
however, the simplification is wasted effort. Therefore, the
llvm_simplify_addrs
command turns the
simplification of pointer expressions on (with parameter true) or off (with parameter false).
llv m_ si mpl if y_ addrs : Bool -> LLV MSet up ()
LLVM Verification Using Crucible
The verification commands presented for Java and LLVM so far use language-specific symbolic execution
infrastructure. More recently, we have developed a new library for symbolic execution of imperative programs
that is intended to be relatively agnostic to the specific source language in question. It exposes an intermediate
representation based on control-flow graphs containing relatively simple instructions that can be used as the
target of translation from a variety of source languages. We have successfully used it for LLVM, Matlab, and
a variety of machine code ISAs, and have ongoing efforts to use it for several other languages.
In addition to being language-agnostic, Crucible has a larger feature set and generally better performance
than the previous symbolic execution engines for Java and LLVM.
As an alternative to the LLVM verification commands presented earlier, an experimental set of commands for
doing verification using Crucible also exist. At the moment, the key command is
crucible_llvm_verify
,
with roughly similar functionality to
llvm_verify
. Counterparts of
llvm_extract
and
llvm_symexec
do
not currently exist, but are planned.
As with
llvm_verify
,
crucible_llvm_verify
requires a specification as input, describing what the function
under analysis is intended to do. The mechanism for setting up a specification is similar to that for
llvm_verify, but uses a slightly different set of commands.
The most significant difference is that creating fresh symbolic values, describing allocation, and describing
the initial value of allocated memory are distinct operations. This can sometimes result in more verbose
specifications, but is more flexible and more amenable to abstraction. So, with a good set of common patterns
encapsulated in functions, specifications can ultimately become more concise and understandable.
©2017 Galois, Inc. Page 31
Using the Software Analysis Workbench (SAW)
Running a Verification
Verification with Crucible is controlled by the crucible_llvm_verify command.
crucible_llvm_verify : LLVMModule ->
String ->
[ Cru ci bleMe th odSpe c ] ->
Bool ->
Cru cib leS etu p () ->
Proo fSc rip t SatResu lt ->
TopLevel Cr uc ibl eM et hod Sp ec
The first two arguments specify the module and function name to verify, as with
llvm_verify
. The third
argument specifies the list of already-verified specifications to use as overrides for compositional verification
(though note that the types of specifications used by
llvm_verify
and
crucible_llvm_verify
are different,
so they can’t be used interchangeably). The fourth argument specifies whether to do path satisfiability
checking, and the fifth gives the specification of the function to be verified. Finally, the last argument gives
the proof script to use for verification (which is separated from the specification itself, unlike
llvm_verify
).
The result is a proved specification that can be used to simplify verification of functions that call this one.
Now we describe how to construct a value of type CrucibleSetup ().
Structure of a Specification
A specifications for Crucible consists of three logical components:
A specification of the initial state before execution of the function.
A description of how to call the function within that state.
A specification of the expected final value of the program state.
These three portions of the specification are written in sequence within a
do
block of
CrucibleSetup
type.
The command
crucible_execute_func
separates the specification of the initial state from the specification
of the final state, and specifies the arguments to the function in terms of the initial state. Most of the
commands available for state description will work either before or after
crucible_execute_func
, though
with slightly different meaning.
Creating Fresh Variables
In any case where you want to prove a property of a function for an entire class of inputs (perhaps all inputs)
rather than concrete values, the initial values of at least some elements of the program state must contain
fresh variables. These are created in a specification with the crucible_fresh_var command.
cru ci ble_f re sh_va r : String -> LLVMTy pe -> C rucib leS etu p Term
The first parameter is a name, used only for presentation. It’s possible (though not recommended) to create
multiple variables with the same name, but SAW will distinguish between them internally. The second
parameter is the LLVM type of the variable. The resulting
Term
can be used in various subsequent commands.
The SetupValue Type
Many specifications require reasoning about both pure values and about the configuration of the heap. The
SetupValue
type corresponds to values that can occur during symbolic execution, which includes both
Term
values, pointers, and composite types consisting of either of these (both structures and arrays).
The crucible_term function creates a SetupValue from a Term:
cru cib le_ ter m : Term -> SetupV alue
©2017 Galois, Inc. Page 32
Using the Software Analysis Workbench (SAW)
Executing
Once the initial state has been configured, the
crucible_execute_func
command specifies the parameters
of the function being analyzed in terms of the state elements already configured.
cru ci bl e_ ex ecute _f un c : [ Se tup Valu e ] -> Cr uci bleSe tup ()
Return Values
The
crucible_points_to
command can be used to specify changes to portions of the memory accessed by
pointer. For return values, however, use the crucible_return command instead.
cru cib le _re tu rn : SetupV alue -> Cr uci ble Set up ()
A First Simple Example
The commands introuduced so far are sufficient to verify simple programs that do not use pointers (or that
use them only internally). Consider, for instance the program that adds its two arguments together:
ui nt 32_t add ( u in t32_ t x , ui nt 32_t y ) {
return x + y ;
}
We can specify this function’s expected behavior as follows:
let add_s etup = do {
x <- c ruci ble _f res h_va r " x" ( ll vm _i nt 32) ;
y <- c ruci ble _f res h_va r " y" ( ll vm _i nt 32) ;
cru ci bl e_ execu te _f un c [ cr uci ble _te rm x , cru cib le_te rm y ];
cru cib le _re tu rn ( cruci ble _te rm {{ x + y : [32] }}) ;
};
We can then compile the C file add.c into the bitcode file add.bc and verify it with ABC:
m <- l lv m_l oad_ modu le " add . bc ";
add_ms < - cru cibl e_l lvm_ ver if y m " add " [] false add _s et up abc ;
Now say we have a doubling function written in terms of add:
uint 32 _t dbl ( u in t3 2_ t x ) {
retur n add (x , x ) ;
}
It has a similar specification:
let dbl_s etup = do {
x <- c ruci ble _f res h_va r " x" ( ll vm _i nt 32) ;
c ruc ibl e_e xe cu t e_f unc [ c ru c ib le _t er m x ];
cru cib le _re tu rn ( cruci ble _te rm {{ x + x : [32] }}) ;
};
And we can verify it using what we’ve already proved about add:
cr ucib le_ llvm _ve ri fy m " dbl " [ a dd _ms ] f alse dbl _s etup abc ;
In this case, doing the verification compositionally doesn’t save much, since the functions are so simple, but
it illustrates the approach.
©2017 Galois, Inc. Page 33
Using the Software Analysis Workbench (SAW)
Specifying Heap Layout
Most functions that operate on pointers expect that certain pointers point to allocated memory before they
are called. The
crucible_alloc
command allows you to specify that a function expects a particular pointer
to refer to an allocated region appropriate for a specific type.
cru cib le_al loc : LL VMTyp e -> Crucibl eSe tup S etu pVal ue
This command returns a
SetupValue
consisting of a pointer to the allocated space, which can be used
wherever a pointer-valued SetupValue can be used.
In the initial state,
crucible_alloc
specifies that the function expects a pointer to allocated space to exist.
In the final state, it specifies that the function itself performs an allocation.
It’s also possible to construct fresh pointers that do not point to allocated memory (which can be useful for
functions that manipulate pointers but not the values they point to):
cru ci bl e_ fr esh_p oi nt er : LL VMTyp e -> Crucibl eSe tup S etu pVal ue
The NULL pointer is called crucible_null:
cru cib le_ nul l : Set upValue
Pointers to global variables or functions can be accessed with crucible_global:
cru cib le _gl ob al : String -> Setu pVal ue
Specifying Heap Values
Pointers returned by
crucible_alloc
don’t, initially, point to anything. So if you pass such a pointer
directly into a function that tried to dereference it, symbolic execution will fail with a message about an
invalid load. For some functions, such as those that are intended to initialize data structures (writing to the
memory pointed to, but never reading from it), this sort of uninitialized memory is appropriate. In most
cases, however, it’s more useful to state that a pointer points to some specific (usually symbolic) value, which
you can do with the crucible_points_to command.
cru ci ble_p oi nts_t o : Setup Val ue -> Set upV alue -> Cr uci bleSetup ()
This command takes two
SetupValue
arguments, the first of which must be a pointer, and states that the
memory specified by that pointer should contain the value given in the second argument (which may be any
type of SetupValue).
When used in the final state,
crucible_points_to
specifies that the given pointer should point to the given
value when the function finishes.
Working with Compound Types
The commands mentioned so far give us no way to specify the values of compound types (arrays or
struct
s).
Compound values can be dealt with either piecewise or in their entirety. To access them piecewise, the
crucible_elem function yields a pointer to an internal element of a compound value.
cru cib le_ ele m : Se tupV alue -> Int -> Se tupValu e
For arrays, the
Int
parameter is the array index. For
struct
values, it is the field index. For
struct
values,
it can be more convenient to use field names. If debugging information is available in the bitcode file, the
crucible_field function yields a pointer to a particular named field:
cru cib le_fi eld : Setu pValue -> String -> Setu pVal ue
©2017 Galois, Inc. Page 34
Using the Software Analysis Workbench (SAW)
Either of these functions can be used with
crucible_points_to
to specify the value of a particular array
element or
struct
field. Sometimes, however, it is more convenient to specify all array elemnts or field values
at onces. The
crucible_array
and
crucible_struct
functions construct compound values from lists of
element values.
cru cib le_ar ray : [ SetupV alu e ] -> Set upValue
cru cib le _st ru ct : [ S etup Val ue ] -> Se tupValu e
Preconditions and Postconditions
Sometimes a function is only well-defined under certain conditions, or sometimes you may be interested
in certain initial conditions that give rise to specific final conditions. For these cases, you can specify an
arbitrary predicate as a pre-condition or post-condition, using any values in scope at the time.
cru cible _prec ond : Term -> Cr uci bleSe tup ()
cru ci ble _p ostco nd : Term -> Cru cib leSetup ()
These two commands take
Term
arguments, and therefore cannot describe the values of pointers. The
crucible_equal
command states that two
SetupValue
s should be equal, and can be used in either the initial
or the final state.
cru cib le_eq ual : SetupV alue -> Se tupV alue -> Cru cib leSet up ()
The use of
crucible_equal
can also sometimes lead to more efficient symbolic execution when the predicate
of interest is an equality.
A Heap-Based Example
To tie all of the command descriptions from the previous sections together, consider the case of verifying the
correctness of a C program that computes the dot product of two vectors, where the length and value of each
vector are encapsulated together in a struct.
The dot product can be concisely specified in Cryptol as follows:
dotprod : {n , a } ( fin n , fin a) = > [n ][ a] -> [n ][ a ] -> [a ]
dotprod xs ys = sum ( zip (*) xs ys )
To implement this in C, let’s first consider the type of vectors:
typedef struct {
uint3 2_ t * elts ;
uint3 2_ t size ;
} vec_t ;
This struct contains a pointer to an array of 32-bit elements, and a 32-bit value indicating how many elements
that array has.
We can compute the dot product of two of these vectors with the following C code (which uses the size of the
shorter vector if they differ in size).
uint32_t do tpr od _st ruc t ( vec_t *x , vec_t *y ) {
ui nt 32_t s iz e = M IN ( x -> size , y -> s iz e );
uint32_t res = 0;
for ( siz e_t i = 0; i < si ze ; i ++) {
res += x - > el ts [ i] * y - > elts [ i ];
}
return res ;
}
©2017 Galois, Inc. Page 35
Using the Software Analysis Workbench (SAW)
The entirety of this implementation can be found in the
examples/llvm/dotprod_struct.c
file in the
saw-script repository.
To verify this program in SAW, it will be convenient to define a couple of utility functions (which are generally
useful for many heap-manipulating programs). First, combining allocation and initialization to a specific
value can make many scripts more concise:
let allo c_in it ty v = do {
p <- crucible_alloc ty;
cru ci ble_p oi nts_t o p v;
return p;
};
This creates a pointer
p
pointing to enough space to store type
ty
, and then indicates that the pointer points
to value v(which should be of that same type).
A common case for allocation and initialization together is when the initial value should be entirely symbolic.
let ptr _to _fr esh n ty = do {
x <- c ru ci b le _ fre sh_ var n ty ;
p <- alloc_in it ty ( c ruc ibl e_t erm x );
retur n (x , p );
};
This function returns the pointer just allocated along with the fresh symbolic value it points to.
Given these two utility functions, the dotprod_struct function can be specified as follows:
let dot pro d_s pec n = do {
let nt = cr uci ble _term {{ `n : [32] }};
(xs , xsp ) < - ptr_ to _fre sh " xs " ( llv m_ array n ( l lvm_int 32) ) ;
(ys , ysp ) < - ptr_ to _fre sh " ys " ( llv m_ array n ( l lvm_int 32) ) ;
let xval = cruc ib le_ struc t [ xsp , nt ];
let yval = cruc ib le_ struc t [ ysp , nt ];
xp <- alloc_in it ( llv m_s truct " struct . vec_t ") xval ;
yp <- alloc_in it ( llv m_s truct " struct . vec_t ") yval ;
crucible_execute_func [xp, yp];
cru cib le _re tu rn ( cru cib le_te rm {{ dotprod xs ys }}) ;
};
Any instantiation of this specification is for a specific vector length
n
, and assumes that both input vectors
have that length. That length
n
automatically becomes a type variable in the subsequent Cryptol expressions,
and the backtick operator is used to reify that type as a bit vector of length 32.
The entire script can be found in the dotprod_struct-crucible.saw file alongside dotprod_struct.c.
©2017 Galois, Inc. Page 36

Navigation menu