ChatScript Engine and Private Code Manual
© Bruce Wilcox, gowilcox@gmail.com brilligunderstanding.com
Revision 5/18/2016 cs6.5a
This does not cover ChatScript the scripting language. It covers how the internals of the engine work
and how to extend it with private code.
Data Structures & Data types
Part of understanding CS engine programming is understanding the basic data available to CS. The
fundamental datatype is the text string. Text strings represent words, phrases, things to say. They even
represent numbers (converted on the fly when needed to float and int64 values for computation). Text
strings represent the names of functions and concepts and topics. Some of these text strings have
attributes attached to them and are kept in a dictionary for easy lookup.
The other fundamental datatype is the fact, a triple of data. Each field of a fact can be a text string
(stored in the dictionary) or a direct reference to another fact. Facts represent relationships among text
strings. When stored in variables, a fact reference is a number, which in turn is a text string.
User variables hold text strings. Match variables hold text strings, but a match variable is more
complex because usually it holds data found in an input sentence. As such a match variable holds the
original value seen in the sentence, a canonical form of that value, and the position reference for where
in the sentence the data came from. Match variables pass all that information along when assigned to
other match variables, but lose all but one of them when assigned to a user variable or stored as the
field of a fact.
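As a rough illustration of the paragraph above, a match variable can be pictured as a small record. This is a minimal sketch, not the engine's actual structure: the type MatchVar, its field names, and both helper functions are hypothetical.

```cpp
#include <string>

// Hypothetical picture of a match variable; names are illustrative only.
struct MatchVar {
    std::string original;   // text exactly as seen in the input sentence
    std::string canonical;  // canonical form of that text
    int start;              // position of the first matched word in the sentence
    int end;                // position of the last matched word in the sentence
};

// Match-to-match assignment passes all of the information along.
inline MatchVar AssignToMatch(const MatchVar& source) { return source; }

// Assigning to a user variable (or a fact field) keeps a single text string.
// Letting the caller pick the form is an assumption made for this sketch.
inline std::string AssignToUserVar(const MatchVar& source, bool wantCanonical) {
    return wantCanonical ? source.canonical : source.original;
}
```

For the input word “dogs” at position 3, a copy made with AssignToMatch still knows the position, while the string returned by AssignToUserVar is just “dogs” or “dog”.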
Words are the fundamental unit of information in CS. The original words came from WordNet, and
then were either reduced or expanded. Words are reduced when some or all meanings of them are
removed because they are too difficult to manage. “I”, for example, has a WordNet meaning of the
chemical iodine, and because that is so rare in usage and causes major headaches for ChatScript, that
definition has been expunged along with some 500 other meanings of words. Additional words have
been added, including things that WordNet doesn't cover like pronouns, prepositions, determiners, and
conjunctions, as well as more recent words like “animatronic” and “beatbox”. Every word in a pattern
has a value in the dictionary. Even phrases like “read my lips” in a pattern or keyword list are treated as
words in the dictionary.
Words have zillions of bits representing language properties of the word (well, maybe not zillions, but
3x64 bits' worth). Many are permanent core properties: it can be a noun, it can be a singular noun, it
refers to a unit of time (like “month”), it refers to an animate being, or it is a word typically learned in
first grade. Other properties result from compiling your script (this word is found in a pattern
somewhere in your script). All of these properties could have been represented as facts, but it would
have been inefficient in either cpu time or memory to have done so.
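The efficiency argument is easier to see with a sketch of properties held as bits in a 64-bit word. The bit names and positions below are hypothetical and chosen only for illustration; they are not the engine's real definitions.

```cpp
#include <cstdint>

// Hypothetical property bits (illustrative names and positions).
const std::uint64_t NOUN_BIT          = std::uint64_t(1) << 0;
const std::uint64_t NOUN_SINGULAR_BIT = std::uint64_t(1) << 1;
const std::uint64_t TIME_UNIT_BIT     = std::uint64_t(1) << 2;
const std::uint64_t IN_PATTERN_BIT    = std::uint64_t(1) << 63;  // set when scripts are compiled

struct WordBits {
    std::uint64_t properties;
};

// True when every requested bit is set: one AND and one compare,
// with no fact lookup involved.
inline bool HasAll(const WordBits& word, std::uint64_t bits) {
    return (word.properties & bits) == bits;
}

// Turn on additional bits, as script compilation might.
inline void AddBits(WordBits& word, std::uint64_t bits) {
    word.properties |= bits;
}
```

Testing whether “month” is a unit of time is then a single mask operation, where a fact-based representation would need a lookup.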
Even things that are not words, including phrases, can reside in the dictionary and have properties, even
if the property is merely “this is a keyword of some pattern or concept somewhere”.
Some dictionary items are “permanent”, meaning they are loaded when the system starts up, either
from the dictionary or from data in layer 0 and layer 1. Other dictionary items are “transient”. They
come into existence as a result of user input and will disappear when that volley is complete. They may
live on in text as data stored in the user's topic file and will reappear again during the next volley when
the user data is reloaded. Words like “dogs” are not in the permanent dictionary but will get created as
transient entries if they show up in the user's input.
Facts are simply triples of words that represent relationships between words. The ontology structure of
CS is represented as facts (which allows it to be queried). Words are hierarchically linked using
facts (using the “is” verb). Words are conceptually linked using facts with the verb “member”. Word
entries have lists of facts that use them as either subject or verb or object so that when you do a query
like query(direct_ss dog love ?) CS will retrieve the list of facts that have dog as a subject and consider
those. And all those values of fields of a fact are words in the dictionary so that they will be able to be
queried.
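The idea of per-word fact lists can be sketched as follows. This is a simplified illustration, not the engine's code: the FactStore class and its method names are hypothetical, and only the subject list is indexed here, whereas the text above describes lists for subject, verb, and object.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct Fact {
    std::string subject, verb, object;
};

// Hypothetical fact store that keeps, for each word, the facts using it as subject.
class FactStore {
public:
    void Add(const std::string& s, const std::string& v, const std::string& o) {
        Fact fact = { s, v, o };
        facts_.push_back(fact);
        bySubject_[s].push_back(facts_.size() - 1);
    }

    // Rough analogue of query(direct_ss dog love ?): walk only the facts on
    // the subject's list, keep those whose verb matches, return their objects.
    std::vector<std::string> ObjectsOf(const std::string& s, const std::string& v) const {
        std::vector<std::string> found;
        std::map<std::string, std::vector<std::size_t> >::const_iterator it = bySubject_.find(s);
        if (it == bySubject_.end()) return found;
        for (std::size_t i = 0; i < it->second.size(); ++i) {
            const Fact& fact = facts_[it->second[i]];
            if (fact.verb == v) found.push_back(fact.object);
        }
        return found;
    }

private:
    std::vector<Fact> facts_;
    std::map<std::string, std::vector<std::size_t> > bySubject_;
};
```

Because the search starts from the subject's own list, its cost depends on how many facts mention that word rather than on the total number of facts.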
ChatScript supports user variables for reasons of efficiency and ease of reference by scripters.
Variables could have been represented as facts, but that would have increased processing time, local
memory use, and user file sizes, not to mention made scripts harder to read.
Memory Management
Many programs use malloc and free extensively upon demand. These functions are not particularly fast.
And they lead to memory fragmentation, whereupon one might fail a malloc even though overall the
space exists. ChatScript follows video game design principles and manages its own memory. It
allocates everything in advance and then (with rare exception) it never dynamically allocates memory
again, so it cannot fail by calling the OS for memory. And you have control over the allocations upon
startup via command line parameters.
This does not mean CS has a perfect memory management system. Merely that it is extremely fast. It is
based on mark/release, so it allocates space rapidly, and at the end of the volley, it releases all the space
it used back into its own pool.
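A mark/release scheme of the kind described above can be sketched in a few lines. This is a generic illustration of the technique under assumed names (MarkReleasePool, Allocate, Mark, Release); it is not the engine's allocator.

```cpp
#include <cstddef>
#include <vector>

const std::size_t kAlign = 8;  // round each request up to an 8-byte boundary

// One block is reserved up front; nothing is requested from the OS afterwards.
class MarkReleasePool {
public:
    explicit MarkReleasePool(std::size_t bytes) : pool_(bytes), top_(0) {}

    // Allocation just advances an offset. Returns nullptr when the pool is full.
    void* Allocate(std::size_t bytes) {
        std::size_t rounded = (bytes + kAlign - 1) / kAlign * kAlign;
        if (rounded > pool_.size() - top_) return nullptr;
        void* block = pool_.data() + top_;
        top_ += rounded;
        return block;
    }

    // Remember the current high-water point (for example at the start of a volley).
    std::size_t Mark() const { return top_; }

    // Give back everything allocated since the mark in a single step.
    void Release(std::size_t mark) {
        if (mark <= top_) top_ = mark;
    }

    std::size_t Used() const { return top_; }

private:
    std::vector<unsigned char> pool_;
    std::size_t top_;
};
```

Taking a mark at the start of a volley and releasing to it at the end returns all the volley's space at once, which is why the approach is fast and cannot fragment.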
You might run out of memory allocated to dictionary items while still having memory available for
facts. This means you need to rebalance your allocations. But most people never run into these
problems unless they are on mobile versions of CS.
The other problem is that memory is not released until the volley is over. So conceivably memory is
no longer needed but hasn't been freed. But CS supports planning, which means backtracking, which
means memory is really not free along the way because the system might revert things back to some
earlier state. This problem of unreleased memory mostly shows up in document mode, where reading
long paragraphs of text is all treated as a single volley and therefore one might run out of memory. CS
provides a memory mark and memory free function so you can explicitly control this while reading a
document.
Run-time Model
The fundamental units of computation in ChatScript are functions (system functions and user
outputmacros) and topics.
System functions are predefined C code that performs some activity. Most take arguments that
are evaluated in advance (but some wait until they get them to decide whether to evaluate or not).
Outputmacros are scripter-written code that CS dynamically processes at execution time, treating it as a
mixture of script statements and user output words. They can have arguments passed to them, but these
arguments are typically not evaluated. This is not pass by value. Outputmacro code is executed as
though it were directly spliced into the original caller's code. ^args are processed by converting them
into what the caller was using, except for format strings ^"xxx", which are evaluated in the caller's
context before being passed as an argument to a macro.
Script Execution
The scripting language is heavily dependent upon the prefix character to tell the system how to behave.
The script compiler normally forces separation of things into separate tokens to allow fast uniform
handling. E.g., “^call(bob hello)” becomes “^call ( bob hello )”. This predictability allows the system to
avoid all the logic involved in knowing where some tokens end and others begin. The other trick the
script compiler uses is to put in characters indicating how far something extends. This jump value is
used for things like if statements to skip over failing segments of the if.
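The token separation step can be illustrated with a small function. This is a minimal sketch that handles only parentheses and whitespace; the function name SpaceTokens is hypothetical, and the real compiler also has to deal with quoted strings, other brackets, and jump data.

```cpp
#include <string>

// Put exactly one space between tokens, treating '(' and ')' as tokens of their own.
std::string SpaceTokens(const std::string& script) {
    std::string out;
    bool needBreak = false;  // a token boundary is pending before the next character
    for (std::string::size_type i = 0; i < script.size(); ++i) {
        char c = script[i];
        if (c == ' ' || c == '\t' || c == '\n') {
            needBreak = true;  // any run of whitespace collapses to one boundary
            continue;
        }
        bool bracket = (c == '(' || c == ')');
        if ((needBreak || bracket) && !out.empty()) out += ' ';
        out += c;
        needBreak = bracket;  // a bracket also ends the token it forms
    }
    return out;
}
```

With this normalization done once at compile time, run-time code can find the end of any token by scanning for the next space.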
Pos-parsing
The system runs pos-parsing in two passes. The first pass is execution of rules from
LIVEDATA/SYSTEM/ENGLISH, which help it prune out possible meanings of words. The goal of
these rules is to reduce ambiguity without ever throwing out actual possible pos values while reducing
incorrect meanings as much as possible. The second pass tries to determine the parse of the sentence,
forcing various pos choices as it goes and altering them if it finds it has made a mistake. It uses a
“garden path” algorithm. It presumes the words form a sentence, and tries to directly find pos values
that make it so in a simple way, changing things if it discovers anomalies.
Queries
Queries like “^query(direct_v ? walk ?)” work by having byte code in a scripting language stored on the
query name “direct_v”. This byte code is executed to perform the query.
The Dictionary
The dictionary consists of WORD entries, stored in hash buckets when the system starts up. Once
system startup is complete, those buckets are closed and new entries created from user input are all
stored in bucket 0, where they can be undone easily. Linear search within this bucket is unimportant
because the user won't create many new entries. So when the dictionary performs word lookup, it uses
the hash table first, and if it finds nothing there, it uses bucket 0. The hash code is the same for lower
and upper case words, but upper case adds 1 to the bucket it stores in. This means all upper case
forms of a word hash the same, so there is only 1 actual print form of an upper case word available.
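The lookup order and the case handling described above can be sketched as follows. Everything here is an illustrative assumption rather than the engine's code: the class and method names are hypothetical, the hash function is arbitrary, and a word counts as “upper case” if it contains any capital letter.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

class Dictionary {
public:
    explicit Dictionary(std::size_t hashSize)
        : hashSize_(hashSize), buckets_(hashSize + 2), startupDone_(false) {
        assert(hashSize > 0);
    }

    void FinishStartup() { startupDone_ = true; }  // permanent buckets are now closed

    // Returns the stored print form, adding the word if it is new.
    std::string Store(const std::string& word) {
        if (const std::string* found = Find(word)) return *found;
        buckets_[startupDone_ ? 0 : BucketFor(word)].push_back(word);
        return word;
    }

    // Hash table first, then bucket 0 (entries created from user input).
    const std::string* Find(const std::string& word) const {
        const std::string* hit = Search(buckets_[BucketFor(word)], word);
        return hit ? hit : Search(buckets_[0], word);
    }

    void EndVolley() { buckets_[0].clear(); }  // undo all transient entries at once

private:
    static bool HasUpper(const std::string& w) {
        for (std::size_t i = 0; i < w.size(); ++i)
            if (std::isupper(static_cast<unsigned char>(w[i]))) return true;
        return false;
    }
    // Same letters ignoring case, and the same lower/upper class.
    static bool SameWord(const std::string& a, const std::string& b) {
        if (a.size() != b.size() || HasUpper(a) != HasUpper(b)) return false;
        for (std::size_t i = 0; i < a.size(); ++i)
            if (std::tolower(static_cast<unsigned char>(a[i])) !=
                std::tolower(static_cast<unsigned char>(b[i]))) return false;
        return true;
    }
    static const std::string* Search(const std::vector<std::string>& bucket,
                                     const std::string& w) {
        for (std::size_t i = 0; i < bucket.size(); ++i)
            if (SameWord(bucket[i], w)) return &bucket[i];
        return nullptr;
    }
    // Hash on the lower-case letters; upper-case words land one bucket higher.
    std::size_t BucketFor(const std::string& w) const {
        std::size_t h = 0;
        for (std::size_t i = 0; i < w.size(); ++i)
            h = h * 31 + static_cast<std::size_t>(std::tolower(static_cast<unsigned char>(w[i])));
        return 1 + h % hashSize_ + (HasUpper(w) ? 1 : 0);
    }

    std::size_t hashSize_;
    std::vector<std::vector<std::string> > buckets_;
    bool startupDone_;
};
```

In this sketch, storing “PARIS” after “Paris” returns the existing print form “Paris”, and clearing bucket 0 removes everything added after startup without touching the permanent entries.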
Marking
The system takes the input and splits it into the original input and a canonical one. Both are “marked”.
Marking means taking the words of the sentence in order (where they may have pos-specific values)
and noting on each word where they occur in the sentence (they may occur more than once). From
specific words the system follows the member links to concepts they are members of, and marks those
concepts as occurring at that location in the sentence. And concepts may be members of other concepts,
and so on up the hierarchy. There exist system functions that allow you, from script, to also mark and
unmark words. This allows you to correct or augment meanings.
In addition to marking words, the system generates sequences of 5 contiguous words (phrases), and if it
finds them in the dictionary, they too are marked.
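Marking of single words can be sketched as a walk up the member links. This is a simplified illustration under assumed names (MemberOf, Marks, MarkAt, MarkSentence); it covers neither canonical forms nor the multi-word phrases mentioned above.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// word or concept -> the concepts it is a member of
typedef std::map<std::string, std::vector<std::string> > MemberOf;
// word or concept -> sentence positions where it was marked
typedef std::map<std::string, std::vector<int> > Marks;

// Mark an item at a position, then mark every concept above it at the same position.
void MarkAt(const std::string& item, int position, const MemberOf& memberOf, Marks& marks) {
    std::vector<int>& where = marks[item];
    for (std::size_t i = 0; i < where.size(); ++i)
        if (where[i] == position) return;  // already marked here; also stops cycles
    where.push_back(position);

    MemberOf::const_iterator it = memberOf.find(item);
    if (it == memberOf.end()) return;
    for (std::size_t i = 0; i < it->second.size(); ++i)
        MarkAt(it->second[i], position, memberOf, marks);
}

// Mark each word of the sentence in order, using 1-based positions.
Marks MarkSentence(const std::vector<std::string>& words, const MemberOf& memberOf) {
    Marks marks;
    for (std::size_t i = 0; i < words.size(); ++i)
        MarkAt(words[i], static_cast<int>(i) + 1, memberOf, marks);
    return marks;
}
```

If “dog” is a member of ~pets and ~pets is a member of ~animals, then a sentence with “dog” at positions 2 and 4 leaves ~animals marked at 2 and 4 as well, so a pattern naming the concept can match at either place.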
Spell Checking
Spell checking takes a word it doesn't recognize and tries a variety of repairs. These include
merging it with adjacent words, splitting it into two words, adding/removing hyphens, or hunting
among words whose length is plus or minus one letter for the one at minimal edit distance.
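The last of these strategies can be sketched with the classic Levenshtein edit distance. The text above does not say which distance measure the engine uses, so that choice is an assumption, as are the function names EditDistance and BestGuess.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Levenshtein distance: the number of single-letter insertions, deletions,
// and substitutions needed to turn a into b.
int EditDistance(const std::string& a, const std::string& b) {
    std::vector<int> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = static_cast<int>(j);
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = static_cast<int>(i);
        for (std::size_t j = 1; j <= b.size(); ++j) {
            int substitute = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min(substitute, std::min(prev[j] + 1, cur[j - 1] + 1));
        }
        prev.swap(cur);
    }
    return prev[b.size()];
}

// Among known words whose length is within one letter of the unknown word,
// return the closest one (the first such word on ties), or "" if there is none.
std::string BestGuess(const std::string& unknown, const std::vector<std::string>& known) {
    std::string best;
    int bestDistance = -1;
    for (std::size_t i = 0; i < known.size(); ++i) {
        long diff = static_cast<long>(known[i].size()) - static_cast<long>(unknown.size());
        if (diff < -1 || diff > 1) continue;
        int d = EditDistance(unknown, known[i]);
        if (bestDistance < 0 || d < bestDistance) {
            bestDistance = d;
            best = known[i];
        }
    }
    return best;
}
```

Restricting candidates to lengths within one letter keeps the search small, at the cost of missing corrections that need two or more insertions or deletions.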
Script Compiler
In large measure what the compiler does is verify the legality of your script and smooth out the tokens
so there is a clean single space between each token. In addition, it inserts “jump” data that allows it to
quickly move from one rule to another, and from an “if” test to the start of each branch so if the test
fails, it doesn't have to read all the code involved in the failing branch. It also sometimes inserts a
character at the start of a pattern element to identify what kind of element it is. E.g., = before a
comparison token or * before a word that has wildcard spelling.
Private Code
You can add code to the engine without modifying its source files directly. To do this, you create a
directory called privatecode at the top level of ChatScript. You must enable the PRIVATE_CODE
define.
Inside it you place files:
privatesrc.cpp - code you want to add to functionexecute.cpp (your own CS engine functions).
    Classic definitions compatible with invocation from script look like this:
        static FunctionResult YourFunction(char* buffer)
    where ARGUMENT(1) is the first argument passed in. Answers are returned as text in buffer,
    and success/failure codes as the FunctionResult.
privatetable.cpp - listing of the functions made visible to CS.
    Table entries to connect your functions to script look like this:
        { (char*) "^YourFunction", YourFunction, 1, 0, (char*) "help text of your function"},
    Here 1 is the number of evaluated arguments to be passed in. Other values for that field:
        VARIABLE_ARGUMENT_COUNT - args are evaluated but you have to detect the end;
            ARGUMENT(n) will be ?
        STREAM_ARG - raw text is sent. You have to break it apart and do whatever.
privatesrc.h - header file. It must at least declare:
        void PrivateInit(char* params); - called on startup of CS, passed param: private=
        void PrivateRestart(); - called when CS is restarting
        void PrivateShutdown(); - called when CS is exiting
privatetestingtable.cpp - listing of :debug functions made visible to CS.
    Debug table entries look like this:
        {(char*) ":endinfo", EndInfo, (char*) "Display all end information"},
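To make the privatesrc.cpp and privatetable.cpp pieces concrete, here is a sketch of one private function. The function ^WordLength is hypothetical. The stand-in definitions at the top exist only so the sketch compiles on its own: in a real build FunctionResult and ARGUMENT come from the engine's headers, and the result-code values shown are placeholders, not the engine's actual values.

```cpp
#include <cstdio>
#include <cstring>

// Stand-ins (assumptions) so this compiles alone; the engine supplies the real ones.
enum FunctionResult { NOPROBLEM_BIT = 0, FAILRULE_BIT = 1 };
static const char* g_args[4] = { "", "", "", "" };
#define ARGUMENT(n) g_args[n]

// What would go in privatesrc.cpp:
// hypothetical ^WordLength(word) returns the number of characters in its argument.
static FunctionResult WordLength(char* buffer) {
    const char* word = ARGUMENT(1);             // first argument, already evaluated
    if (!*word) return FAILRULE_BIT;            // nothing passed in: report failure
    std::sprintf(buffer, "%d", static_cast<int>(std::strlen(word)));  // answer as text
    return NOPROBLEM_BIT;
}

// The matching line in privatetable.cpp would follow the table format shown above:
//   { (char*) "^WordLength", WordLength, 1, 0, (char*) "returns the length of a word"},
```

The answer goes back to script as text in buffer, in keeping with text strings being the engine's fundamental datatype, while the return code tells the engine whether the call succeeded.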
