Libcsv Manual En

CSV(3) CSV(3)
NAME
csv - CSV parser and writer library
SYNOPSIS
#include <libcsv/csv.h>
int csv_init(struct csv_parser *p, unsigned char options);
size_t csv_parse(struct csv_parser *p,
                 const void *s,
                 size_t len,
                 void (*cb1)(void *, size_t, void *),
                 void (*cb2)(int, void *),
                 void *data);
int csv_fini(struct csv_parser *p,
             void (*cb1)(void *, size_t, void *),
             void (*cb2)(int, void *),
             void *data);
void csv_free(struct csv_parser *p);
void csv_set_delim(struct csv_parser *p, unsigned char c);
void csv_set_quote(struct csv_parser *p, unsigned char c);
unsigned char csv_get_delim(struct csv_parser *p);
unsigned char csv_get_quote(struct csv_parser *p);
void csv_set_space_func(struct csv_parser *p, int (*f)(unsigned char));
void csv_set_term_func(struct csv_parser *p, int (*f)(unsigned char));
int csv_get_opts(struct csv_parser *p);
int csv_set_opts(struct csv_parser *p, unsigned char options);
int csv_error(struct csv_parser *p);
char * csv_strerror(int error);
size_t csv_write(void *dest, size_t dest_size, const void *src,
                 size_t src_size);
int csv_fwrite(FILE *fp, const void *src, size_t src_size);
size_t csv_write2(void *dest, size_t dest_size, const void *src,
                  size_t src_size, unsigned char quote);
int csv_fwrite2(FILE *fp, const void *src, size_t src_size, unsigned char quote);
void csv_set_realloc_func(struct csv_parser *p, void *(*func)(void *, size_t));
void csv_set_free_func(struct csv_parser *p, void (*func)(void *));
void csv_set_blk_size(struct csv_parser *p, size_t size);
size_t csv_get_blk_size(struct csv_parser *p);
size_t csv_get_buffer_size(struct csv_parser *p);
DESCRIPTION
The CSV library provides a flexible, intuitive interface for parsing and writing CSV data.
OVERVIEW
The idea behind parsing with libcsv is straightforward: you initialize a parser object with csv_init() and
feed data to the parser over one or more calls to csv_parse(), providing callback functions that handle
end-of-field and end-of-row events. csv_parse() parses the data provided, calling the user-defined callback
functions as it reads fields and rows. When complete, csv_fini() is called to finish processing the current
field and make a final call to the callback functions if necessary. csv_free() is then called to free the
parser object. csv_error() and csv_strerror() provide information about errors encountered by the functions.
csv_write() and csv_fwrite() provide a simple interface for converting raw data into CSV data and storing
the result in a buffer or file respectively.
9 January 2013
CSV is a binary format that allows the storage of arbitrary binary data, so files opened for reading or
writing CSV data should be opened in binary mode.
libcsv provides a default mode in which the parser will process any data as CSV without complaint; this is
useful for parsing files which don’t adhere to all the traditional rules. A strict mode is also supported,
in which any violation of the imposed rules causes a parsing failure.
ROUTINES
PARSING DATA
csv_init() initializes the csv_parser structure that its first argument points to. This structure contains
housekeeping information such as the current state of the parser, the buffer, current size and position, etc.
The csv_init() function returns 0 on success and a non-zero value upon failure. csv_init() will fail if the
pointer passed to it is a null pointer. The options argument specifies the parser options; these may be
changed later with the csv_set_opts() function.
OPTIONS
CSV_STRICT
Enables strict mode.
CSV_REPALL_NL
Causes each instance of a carriage return or linefeed outside of a record to be reported.
CSV_STRICT_FINI
Causes unterminated quoted fields encountered in csv_fini() to cause a parsing error (see
below).
CSV_APPEND_NULL
Will cause all fields to be nul-terminated when provided to cb1. Introduced in 3.0.0.
CSV_EMPTY_IS_NULL
Will cause NULL to be passed as the first argument to cb1 for empty, unquoted fields.
Empty means consisting only of either spaces and tabs or the characters defined by a custom
function registered via csv_set_space_func(). Added in 3.0.3.
Multiple options can be specified by OR-ing them together.
csv_parse() is the function that does the actual parsing. It takes 6 arguments:
p is a pointer to an initialized struct csv_parser.
s is a pointer to the data to read in, such as a dynamically allocated region of memory containing
data read in from a call to fread().
len is the number of bytes of data to process.
cb1 is a pointer to the callback function that will be called from csv_parse() after an entire field
has been read. cb1 will be called with a pointer to the parsed data (which is NOT nul-terminated
unless the CSV_APPEND_NULL option is set), the number of bytes in the data, and the data pointer
that was passed to csv_parse().
cb2 is a pointer to the callback function that will be called when the end of a record is encountered.
It will be called with the character that caused the record to end, cast to an unsigned char, or
-1 if called from csv_fini(), and the data pointer that was passed to csv_parse() or csv_fini().
data is a pointer to user-defined data that will be passed to the callback functions when invoked.
cb1 and/or cb2 may be NULL, in which case no function will be called for the associated actions.
data may also be NULL, but the callback functions must then be prepared to handle receiving a null
pointer.
By default cb2 is not called when rows that do not contain any fields are encountered. This behavior allows
files that use only a linefeed or only a carriage return as a record separator to be parsed properly, while
preventing files whose rows are terminated by multiple characters from producing a blank row after each
actual row of data (for example, when a text CSV file created on a Windows machine is processed on a Unix
machine). The CSV_REPALL_NL option will cause cb2 to be called once for every carriage return or linefeed
encountered outside of a field. cb2 is called with the character that prompted the call to the function,
cast to an unsigned char: either CSV_CR for carriage return, CSV_LF for linefeed, or -1 for record
termination from a call to csv_fini() (see below). A carriage return or linefeed within a non-quoted field
always marks both the end of the field and the row. Other characters can be used as row terminators, and
thus be provided as an argument to cb2, by using csv_set_term_func().
Note: The first parameter of the cb1 function is void *, not const void *; the pointer passed to the callback
function is actually a pointer to the entry buffer inside the csv_parser struct. This data may safely be
modified from the callback function (or any function that the callback function calls), but you must not
attempt to access more than len bytes and you should not access the data after the callback function
returns, as the buffer is dynamically allocated and its location and size may change during calls to
csv_parse().
Note: Different callback functions may safely be specified during each call to csv_parse(), but keep in mind
that the callback functions may be called many times during a single call to csv_parse() depending on the
amount of data being processed in a given call.
csv_parse() returns the number of bytes processed. On a successful call this will be len; if it is less than
len, an error has occurred. An error can occur, for example, if there is insufficient memory to store the
contents of the current field in the entry buffer. An error can also occur if malformed data is encountered
while running in strict mode.
The csv_error() function can be used to determine what the error is and the csv_strerror() function can be
used to provide a textual description of the error. csv_error() takes a single argument, a pointer to a struct
csv_parser, and returns one of the following values defined in csv.h:
CSV_EPARSE    A parse error has occurred while in strict mode
CSV_ENOMEM    There was not enough memory while attempting to increase the entry buffer
              for the current field
CSV_ETOOBIG   Continuing to process the current field would require a buffer of more than
              SIZE_MAX bytes
The value passed to csv_strerror() should be one returned from csv_error(). The return value of
csv_strerror() is a pointer to a static string. The pointer may be used for the entire lifetime of the
program and the contents will not change during execution, but you must not attempt to modify the string
it points to.
When you have finished submitting data to csv_parse(), you need to call the csv_fini() function. This
function will call the cb1 function with any remaining data in the entry buffer (if there is any) and call
the cb2 function unless the parser is already at the end of a row (the last byte processed was a newline
character, for example). It is necessary to call this function because the file being processed might not
end with a carriage return or newline, but the data that has been read in to this point still needs to be
submitted to the callback routines. If cb2 is called from within csv_fini(), it will be because the row was
not terminated with a newline sequence; in this case cb2 will be called with an argument of -1.
Note: A call to csv_fini() implicitly ends the current field and row. If the last field processed is a quoted
field that ends before a closing quote is encountered, no error will be reported by default, even if
CSV_STRICT is specified. To cause csv_fini() to report an error in such a case, set the
CSV_STRICT_FINI option (new in version 1.0.1) in addition to the CSV_STRICT option.
csv_fini() also reinitializes the parser state so that it is ready to be used on the next file or set of data.
csv_fini() does not alter the current buffer size. If the last set of data that was being parsed contained a
very large field that increased the size of the buffer, and you need to free that memory before continuing,
you must call csv_free(); you do not need to call csv_init() again after csv_free(). Like csv_parse(), the
callback functions provided to csv_fini() may be NULL. csv_fini() returns 0 on success and a non-zero value
if you pass it a null pointer.
After calling csv_fini() you may continue to use the same struct csv_parser pointer without reinitializing it
(in fact you must not call csv_init() with an initialized csv_parser object or the memory allocated for the
original structure will be lost).
When you are finished using the csv_parser object you can free any dynamically allocated memory associated
with it by calling csv_free(). You may call csv_free() at any time; it need not be preceded by a call to
csv_fini(). You must only call csv_free() on a csv_parser object that has been initialized with a successful
call to csv_init().
WRITING DATA
libcsv provides two functions to transform raw data into CSV formatted data: the csv_write() function,
which writes the result to a provided buffer, and the csv_fwrite() function, which writes the result to a file.
The functionality of both functions is straightforward: they write out a single field, including the opening
and closing quotes, and escape each encountered quote with another quote.
The csv_write() function takes a pointer to a source buffer (src) and processes at most src_size characters
from src. csv_write() will write at most dest_size characters to dest and returns the number of characters
that would have been written if dest was large enough. This can be used to determine if all the characters
were written and, if not, how large dest needs to be to write out all of the data. csv_write() may be called
with a null pointer for the dest argument, in which case no data is written but the size required to write out
the data will be returned. The space needed to write out the data is the size of the data + number of quotes
appearing in data (each one will be escaped) + 2 (the leading and terminating quotes). csv_write() and
csv_fwrite() always surround the output data with quotes. If src_size is very large (SIZE_MAX/2 or
greater) it is possible that the number of bytes needed to represent the data, after inserting escaping quotes,
will be greater than SIZE_MAX. In such a case, csv_write() will return SIZE_MAX, which should be
interpreted as meaning the data is too large to write to a single field. The csv_fwrite() function is not
similarly limited.
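The size rule stated above (data size + number of quotes + 2) can be worked through with a small standalone function. This is only an illustration of the arithmetic, not libcsv's implementation, and the name quoted_field_size() is hypothetical; in practice, calling csv_write() with a null dest returns the required size directly.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: bytes needed to hold src as one quoted CSV field,
 * i.e. src_size + number of quote characters + 2 enclosing quotes.
 * Saturates at SIZE_MAX if the total cannot be represented. */
size_t quoted_field_size(const void *src, size_t src_size)
{
    const unsigned char *s = src;
    size_t needed = 2;   /* opening and closing quote */
    size_t i;

    for (i = 0; i < src_size; i++) {
        size_t cost = (s[i] == '"') ? 2 : 1;   /* a quote is doubled */
        if (needed > SIZE_MAX - cost)
            return SIZE_MAX;
        needed += cost;
    }
    return needed;
}
```

For the three bytes a"b this gives 3 + 1 + 2 = 6, matching the output "a""b" with its enclosing quotes.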
csv_fwrite() takes a FILE pointer (which should have been opened in binary mode) and converts and writes
the data pointed to by src of size src_size. It returns 0 on success and EOF if there was an error writing to
the file. csv_fwrite() doesn’t provide the number of characters processed or written. If this functionality is
required, use the csv_write() function combined with fwrite().
csv_write2() and csv_fwrite2() work similarly but take an additional argument, the quote character to use
when composing the field.
CUSTOMIZING THE PARSER
The csv_set_delim() and csv_set_quote() functions provide a means to change the characters that the
parser will consider the delimiter and quote characters respectively, cast to unsigned char. csv_get_delim()
and csv_get_quote() return the current delimiter and quote characters respectively. When csv_init() is
called the delimiter is set to CSV_COMMA and the quote to CSV_QUOTE. Note that the rest of the
CSV conventions still apply when these functions are used to change the delimiter and/or quote characters:
fields containing the new quote character or delimiter must be quoted, and quote characters must be escaped
with an immediately preceding instance of the same character. Additionally, the csv_set_space_func()
and csv_set_term_func() functions allow a user-defined function to be provided which will be used to
determine what constitutes a space character and what constitutes a record terminator character. The space
characters determine which characters are removed from the beginning and end of non-quoted fields, and the
terminator characters govern when a record ends. When csv_init() is called, the effect is as if these
functions were each called with a NULL argument, in which case no function is called and CSV_SPACE and
CSV_TAB are used for space characters, and CSV_CR and CSV_LF are used for terminator characters.
csv_set_realloc_func() can be used to set the function that is called when the internal buffer needs to be
resized. Only realloc, not malloc, is used internally; the default is to use the standard realloc function.
Likewise, csv_set_free_func() is used to set the function called to free the internal buffer; the default is
the standard free function.
csv_get_blk_size() and csv_set_blk_size() can be used to get and set the block size of the parser
respectively. The block size is the amount of extra memory allocated every time the internal buffer needs to
be increased; the default is 128. csv_get_buffer_size() will return the current number of bytes allocated for
the internal buffer.
THE CSV FORMAT
Although quite prevalent, there is no standard for the CSV format. There is, however, a set of traditional
conventions used by many applications. libcsv follows the conventions described at
http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm which seem to reflect the most common usage of the
format, namely:
Fields are separated with commas.
Rows are delimited by newline sequences (see below).
Fields may be surrounded with quotes.
Fields that contain comma, quote, or newline characters MUST be quoted.
Each instance of a quote character must be escaped with an immediately preceding quote character.
Leading and trailing spaces and tabs are removed from non-quoted fields.
The final line need not contain a newline sequence.
In strict mode, any detectable violation of these rules results in an error.
RFC 4180 is an informational memo which attempts to document the CSV format, especially with regard
to its use as a MIME type. There are several parts of the description documented in this memo which
either do not accurately reflect widely used conventions or artificially limit the usefulness of the format.
The differences between the RFC and libcsv are:
"Each line should contain the same number of fields throughout the file"
libcsv doesn’t care if every record contains a different number of fields; such a restriction
could easily be enforced by the application itself if desired.
"Spaces are considered part of a field and should not be ignored"
Leading and trailing spaces that are part of non-quoted fields are ignored, as this is by far
the most common behavior and expected by many applications.
abc , def
is considered equivalent to:
"abc", "def"
"The last field in the record must not be followed by a comma"
The meaning of this statement is not clear, but if the last character of a record is a comma,
libcsv will interpret that as a final empty field, i.e.:
"abc", "def",
will be interpreted as 3 fields, equivalent to:
"abc", "def", ""
RFC 4180 limits the allowable characters in a CSV field; libcsv allows any character to be present
in a field provided it adheres to the conventions mentioned above. This makes it possible to store
binary data in CSV format, an attribute that many applications rely on.
RFC 4180 states that a Carriage Return plus Linefeed combination is used to delimit records;
libcsv allows any combination of Carriage Returns and Linefeeds to signify the end of a record.
This is to increase portability among systems that use different combinations to denote a newline
sequence.
PARSING MALFORMED DATA
libcsv should correctly parse any CSV data that conforms to the rules discussed above. By default, however,
libcsv will also attempt to parse malformed CSV data such as data containing unescaped quotes or
quotes within non-quoted fields. For example:
a"c, "d"f"
would be parsed equivalently to the correct form:
"a""c", "d""f"
This is often desirable as there are some applications that do not adhere to the specifications previously
discussed. However, there are instances where malformed CSV data is ambiguous, namely when a comma or
newline is the next non-space character following a quote, such as:
"Sally said "Hello", Wally said "Goodbye""
This could either be parsed as a single field containing the data:
Sally said "Hello", Wally said "Goodbye"
or as 2 separate fields:
Sally said "Hello and Wally said "Goodbye""
Since the data is malformed, there is no way to know if the quote before the comma is meant to be a literal
quote or if it signifies the end of the field. This is of course not an issue for properly formed data, as all
quotes must be escaped. libcsv will parse this example as 2 separate fields.
libcsv provides a strict mode that will return with a parse error if a quote is seen inside a non-quoted field
or if a non-escaped quote is seen whose next non-space character isn’t a comma or newline sequence.
PARSER DETAILS
A field is considered quoted if the first non-space character for a new field is a quote.
If a quote is encountered in a quoted field and the next non-space character is a comma, the field ends at the
closing quote and the field data is submitted when the comma is encountered. If the next non-space character
after a quote is a newline character, the row has ended; the field data is submitted and the end of row
is signalled (via the appropriate callback function). If two quotes are immediately adjacent, the first one is
interpreted as escaping the second one and one quote is written to the field buffer. If the next non-space
character following a quote is anything else, the quote is interpreted as a non-escaped literal quote and it
and what follows are written to the field buffer; this would cause a parse error in strict mode.
Example 1
"abc"""
Parses as: abc"
The first quote marks the field as quoted, the second quote escapes the following quote and the last quote
ends the field. This is valid in both strict and non-strict modes.
Example 2
"ab"c"
Parses as: ab"c
The first quote marks the field as quoted; the second quote is taken as a literal quote since the next
non-space character is not a comma or newline and the quote is not escaped. The last quote ends the field
(assuming there is a newline character following). A parse error would result upon seeing the character c in
strict mode.
Example 3
"abc" "
Parses as: abc" (followed by a single space)
In this case, since the next non-space character following the second quote is not a comma or newline
character, a literal quote is written, the space character after it is part of the field, and the last quote
terminates the field. This demonstrates the fact that a quote must immediately precede another quote to
escape it. This would be a strict-mode violation as all quotes are required to be escaped.
If the field is not quoted, any quote character is taken as part of the field data, any comma terminates the
field, and any newline character terminates the field and the record.
Example 4
ab""c
Parses as: ab""c
Quotes are not considered special in non-quoted fields. This would be a strict mode violation since quotes
may not exist in non-quoted fields in strict mode.
EXAMPLES
The following example prints the number of fields and rows in a file. This is a simplified version of the
csvinfo program provided in the examples directory. Error checking not related to libcsv has been removed
for clarity; the csvinfo program also provides an option for enabling strict mode and handles multiple files.
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <stdlib.h>
#include "libcsv/csv.h"

struct counts {
    long unsigned fields;
    long unsigned rows;
};

/* Field callback: count every field reported by the parser. */
void cb1 (void *s, size_t len, void *data) {
    ((struct counts *)data)->fields++;
}

/* Row callback: count every completed row. */
void cb2 (int c, void *data) {
    ((struct counts *)data)->rows++;
}

int main (int argc, char *argv[]) {
    FILE *fp;
    struct csv_parser p;
    char buf[1024];
    size_t bytes_read;
    struct counts c = {0, 0};

    if (csv_init(&p, 0) != 0) exit(EXIT_FAILURE);
    fp = fopen(argv[1], "rb");   /* binary mode, as CSV is a binary format */
    if (!fp) exit(EXIT_FAILURE);

    /* Feed the file to the parser one block at a time. */
    while ((bytes_read=fread(buf, 1, 1024, fp)) > 0)
        if (csv_parse(&p, buf, bytes_read, cb1, cb2, &c) != bytes_read) {
            fprintf(stderr, "Error while parsing file: %s\n",
                    csv_strerror(csv_error(&p)) );
            exit(EXIT_FAILURE);
        }

    /* Submit any final field and row not terminated by a newline. */
    csv_fini(&p, cb1, cb2, &c);
    fclose(fp);
    printf("%lu fields, %lu rows\n", c.fields, c.rows);
    csv_free(&p);
    exit(EXIT_SUCCESS);
}
See the examples directory for several complete example programs.
AUTHOR
Written by Robert Gamble.
BUGS
Please send questions, comments, bugs, etc. to:
rgamble@users.sourceforge.net