Saturday, June 10, 2017

Parsing Expression Grammars

Parsing Expressions Grammar


PEG are a (no longer so) new approach to parsing.

Clibutl offers a set of functions to write parsers in C without having to use an external "parser generator".

Rather than a dry list of functions, I'll present here a short tutorial on how to build a simple parser.
The approach to PEG used by Clibutl is based on the article:
    R. R. Redziejowski. Parsing Expression Grammar as a Primitive
    Recursive-Descent Parser with Backtracking.
    Fundamenta Informaticae 79, 3-4 (2007), pp. 513 - 524. 

Define the grammar

First of all, let's define a grammar. I'll use the classical example of numerical expressions. A full example can be found here: https://github.com/rdentato/clibutl/blob/master/examples/calculator/calculator.c . Let's start simple, just additions and subtractions:

   expr  ::= term (op term)*
   term  ::= num
   op    ::= '+' | '-'
   num   ::= digit+
   digit ::= '0' | '1' | ... | '9' 

Each non-terminal is represented by a pegrule. This is how term looks like:

   pegrule(expr) {
     pegref(term);
     pegstar {
       pegref(op);
       pegref(term);
     }
   }

The pegref() function allows referencing other non-terminals, the pegstar instruction represent unbounded optional repetitions. It is important to notice that the code above is valid C code. Of course after having included utl.h!
Let's write down the entire grammar:
pegrule(expr) {
  pegref(term);
  pegstar {
    pegref(op);
    pegref(term);
  }
}

pegrule(term) {
  pegref(num);
}

pegrule(op) {
  pegchoice {
    pegeither { pegstring("+"); }
    pegor     { pegstring("-"); }
  }
}

pegrule(num) {
  pegplus {
    pegref(digit);
  }
}

pegrule(digit) {
  pegoneof("0123456789");
}
As you can see, this is an almost one-to-one translation of the original grammar. I hope this feature will make peg functions easier to use.
The rule for op shows how to express the PEG ordered choice. The pegchoice instruction reminds the switch statment where the either/or branches are similar to the the case clauses.
The pegplus is similar to pegstar but requires at least one match. You can guess there's also a pegopt instruction (not shown in this example) for optional match.
The pegstring() and pegoneof() functions are atomic recognizers that, respectively, match that specified string or one of the characters in the string.


Do the parse


With those simple instructions we have now defined a recognizing parser, i.e. a parser that can tell if a string abides to the specified grammar or not. Let's see how to invoke it:
#define MAX_EXPR_LEN 256
char expression[MAX_EXPR_LEN];

int main(int argc, char *argv [])
{
  peg_t parser;
  
  parser = pegnew();

  while (fgets(expression,MAX_EXPR_LEN,stdin)) {
    if (pegparse(parser,expr,expression))
      printf("(%.*s) is a valid expression\n",pegpos(parser)-expression,expression);
    else
      printf("(%s) is not a valid expression\n",expression);
    fflush(stdout);
  }  
  
  parser = pegfree(parser);
  exit(0);
}
The pegparse() does all the magic. It takes a parser created with pegnew(), a peg rule to be matched and a string and returns a true value if the initial portion of the string matches the rule or 0 if the string doesn't match at all.
Here some examples of the program output:
"5+4"     -> (5+4) is a valid expression
"1+3XX"   -> (1+3) is a valid expression
"YZX"     -> (YZX) is not a valid expression

"3 + 2"   -> (3) is a valid expression
The last example shows that we need to explicitly take care of white spaces. The recognizer pegwspace matches a possibly empty sequence of white spaces ('\t' and ' '), while pegvspace also matches vertical space like '\r' and '\n'. Look at the expr2.c listing to see how simply you can handle spaces.

Next Steps

Now is the time to make this parser evolve and do something more useful than just recognizing expressions, for example evaluating them. This will require some more detailed discussion and I will deal with it in a future post.

Thursday, January 19, 2017

Exceptions Handling - Mechanics

I've posted (quite some time ago) a description of the try/catch blocks provided by clibutl. I guess it's now the time to look under the hood and see at the implementation.
Below is the code, we'll go through it in the following sections.
typedef struct utl_jb_s {
  jmp_buf           jmp;
  struct utl_jb_s  *prv;
  int               err;
  int32_t           flg;
} utl_jb_t;

extern utl_jb_t *utl_jmp_list; // Defined in utl_hdr.c

#define try  for ( utl_jb_t utl_jb = {.flg=0, .prv=utl_jmp_list}; \
                  !utl_jb.flg && (utl_jmp_list=&utl_jb); \
                   utl_jmp_list=utl_jb.prv, utl_jb.flg=1) \
              if ((utl_jb.err = setjmp(utl_jb.jmp)) == 0)
                 
#define catch(x) else if (   (utl_jb.err == (x)) \
                          && (utl_jmp_list=utl_jb.prv, utl_jb.flg=1)) 

#define catchall else for ( utl_jmp_list=utl_jb.prv; \
                           !utl_jb.flg; \
                            utl_jb.flg=1) 

#define throw(x) do { \
                   int ex_ = x; \
                   if (ex_ > 0 && utl_jmp_list) \
                     longjmp(utl_jmp_list->jmp,ex_); \
                 } while (0)

#define thrown()  utl_jb.err
  
#define rethrow() throw(thrown())

setjmp()/longjmp()


The magic that allow a C program to (almost) freely jump from a function to another is provided by the two functions setjmp() and longjmp() provided in the setjmp.h standard header.
When setjmp() is called it stores the current state of the function and returns 0. When longjmp() is executed, it restores the previously saved state and the execution proceeds as if the setjmp() had returned a non-zero value.
In the minimal example below:
#include <stdio.h>
#include <setjmp.h>

static jmp_buf jbuf;

void do_something(void) {
  printf("Almost there ... ");
  longjmp(jbuf,1);  // jumps back
  printf("done\n"); // not executed!
}

int main() {
  if (setjmp(jbuf) == 0)
    do_something();
  else // only if longjmp() is called
    printf("not done!\n");
  return 0;
}
Executing the program above will result in printing:
    Almost there ... not done!
This is what happens:
  • The call to setjmp(jbuf) returns 0 and the do_something() function is invoked
  • The string "Almost there ... " is printed.
  • The longjump(jbuf,1) is executed, jumping back to the if where setjmp() was called but, this time, returning 1
  • The string "not done!" is printed.
Using setjmp() and longjmp() is subject to limitations and caveats and may cause subtle bugs if not managed properly.

The jmp_buf list


Since try/catch blocks can be nested (directly or indirectly via function call) we need as many jmp_buf buffers as nested levels. Those buffers are kept in a linked list whose head is stored in the global variable utl_jmp_list:
typedef struct utl_jb_s {
  jmp_buf           jmp;
  struct utl_jb_s  *prv;
  int               err;
  int32_t           flg;
} utl_jb_t;

extern utl_jb_t *utl_jmp_list;
Actually, as you have surely noticed, this is a list of utl_jb_t structures, each one containing a jmp_buf element plus what is needed to handle the exceptions.
During the execution, if there are nested try blocks (or if there is a try block in a function called from within another try block) the list is as shown in the following image:


Within inner try block, the head of the list is the structure marked with C.

The try macro


Let's see how the try macro works:
#define try  for ( utl_jb_t utl_jb = {.flg=0, .prv=utl_jmp_list}; \
                  !utl_jb.flg && (utl_jmp_list=&utl_jb); \
                   utl_jmp_list=utl_jb.prv, utl_jb.flg=1) \
              if ((utl_jb.err = setjmp(utl_jb.jmp))== 0)
Using for to create statement-like macros is a trick similar to the well known do { } while(0). It is appropriate when you have to perform something at the end of the pseudo-statement.
The for initialization part:
utl_jb_t utl_jb = {.flg=0, .prv=utl_jmp_list};
creates a new local variable of type utl_jb_t and initializes it so that:
  • The current list is appended as its tail: .prv=utl_jmp_list
  • The flg field, which controls the flow of the for loop, is set to 0
Then we check if we have to enter the loop or not:
!utl_jb.flg && (utl_jmp_list=&utl_jb); 
The first time we execute the loop, the two expressions in AND are true:
  • !utl_jb.flg is true since we just initialized utl_jb.flg to 0;
  • (utl_jmp_list=&utl_jb) is true because the address of utl_jb is surely not NULL.
So, we will enter the body of the loop! Loot at how the newly created variable structure utl_jb is made the head of the list by the assignment utl_jmp_list = &utl_jb.

At the end of the try/catch blocks, the third part of the for statement will be executed:
     utl_jmp_list=utl_jb.prv, utl_jb.flg=1
 removing the local utl_jb structure from the list and setting utl_jb.flg to 1 to ensure that we will not re-enter the for loop. .

When we'll re-execute the test to see if we have to re-enter the for loop:
    !utl_jb.flg && (utl_jmp_list=&utl_jb); 
we'll have utl_jb.flg set to 1, the entire expression is false and we will not re-enter the for loop body. Thanks to the short-circuit behaviour of && the other expression won't be evaluated (which would cause the jump buffer list to be messed up!).


The try/catch blocks


The setjmp() function needs to be in the scope of a conditional statement so to distinguish the normal path (i.e. when the setjmp() is executed) from the exceptional path (i.e. when the longjmp() is executed). In our case the body of the for is a single if statement which guards the the try body:
    if ((utl_jb.err = setjmp(utl_jb.jmp))== 0)
In essence, a try block is an if statement and a catch() block is its else part:
#define catch(x) else if (   (utl_jb.err == (x)) \
                          && (utl_jmp_list = utl_jb.prv, utl_jb.flg = 1))
A catch is an if as well so to be able to accept the next catch.

Looking at the macro expansions should make things clearer:
  try {               |   if ((utl_jb.err = setjmp(utl_jb.jmp))== 0) {
    do_something();   |     do_something();
  }                   |   }
  catch(4) {         ==>  else if ((utl_jb.err == (4)) &&
                      |            (utl_jmp_list = utl_jb.prv, utl_jb.flg = 1)) { 
    // exception 4    |     // exception 4
  }                   |   }
I've omitted the for part for clarity. You can see that each catch is just a branch of a set of nested if. The conditition (utl_jmp_list = utl_jb.prv, utl_jb.flg = 1) is an always true expression that removes the current jump buffer from the head of the list (the assignement to to utl_jb.flg is there only to ensure that the entire expressions evaluates to true).
Removing utl_jb from the list ensures that if we throw() an exception within a catch block, it will be handled by the higher-level try block (and not the same one the catch block belongs to).

Throwing exceptions


The throw() macro will call longjmp() to jump back:
#define throw(x) do { \
                   int ex_ = x; \
                   if (ex_ > 0 && utl_jmp_list) longjmp(utl_jmp_list->jmp,ex_); \
                 } while (0)
The variable ex_ is used to ensure that the macro argument won't be evaluated multiple times. Exceptions are positive integers (so that the setjmp() function won't be fooled).
Testing utl_jmp_list ensures that we are indeed within a try block (or in a function called from within a try block). Note how throw() take the head of the jump buffers list.

If, within a catch block, one needs to know which exception has been thrown he can use the thrown() function which is actully just a macro to access the value of utl_jb.err. This is most probably only useful within a catchall block that handles any exception that has not been caught by a previous catch block.

The rethrow() function is simply defined in terms of throw() and thrown():
#define rethrow() throw(thrown())

Catch all


The last block should be a catchall block for two reasons:
  • It's good practice to ensure that no exception is left unhandled. Even if, at first, one could cover every possible expection in its specific catch(), there is the risk that at a later time, a new exception is introduced and no specific handler for it is defined.
    Having a catchall block at the end, possibly with an error message if the event is really unexpected, ensures that no such problem will arise.
  • The try/catch macros, unfortunately, are not 100% safe. Consider the following code:
    if (some_test()) 
        try 
            do_something();
        catch(EXCEPTION)
            handle_it();
    else
        printf("Not even tried!\n");
    
    It's clear that the intention is to have it printing "Not even tried!" if some_test() fails.
    Too bad that the hidden if in the catch macro will claim the else statement as its own!
An obvious way to solve this is to enclose the try/catch block within braces:
if (some_test()) {
    try 
        do_something();
    catch(EXCEPTION)
        handle_it();
}
else
    printf("Not even tried!\n");
 However, had the catchall block being present, the problem would disappear entirely:
if (some_test()) 
    try 
        do_something();
    catch(EXCEPTION)
        handle_it();
    catchall
        handle_others();
else
    printf("Not even tried!\n");
The code above correctly prints "Not even tried!" if some_test() fails.

To achieve that, the catchall macro is defined as:
#define catchall else for ( utl_jmp_list=utl_jb.prv; \
                           !utl_jb.flg; \
                            utl_jb.flg=1) 
The else statement will close the set of nested if. The for statement removes the jump list head (as for the other catch blocks) and uses utl_jb.flg to ensure it is executed exactly once.

*  *  *  *
Adding exception handling to C is surprisingly easy and it has proven itself an enjoyable and instructive task. I hope you can find it useful as well.

Saturday, January 14, 2017

Tracing Tests

Having read an interesting blog article by Kartik Agaram about using traces for testing, I decided to add this capability to c-libutl.

On top of the existing logging functions:

  • logerror() for messages that report about critical failures;
  • logwarning() for messages that report about potential issues (but are not real errors);
  • loginfo() for messages about the status of the program (e.g. for auditing purposes);
  • logdebug() for debugging messages;
  • logtrace() for detailed traces of program executions (for debugging or to better understanding of software internals);
and the checking functions:
  • logcheck() for checking a boolean condition;
  • logexpect() for checking and print a failure notice;
  • logassert() for checking and aborting on failure;

I've intrduced a logwatch() statement that will check whether a specific string appears or not in the log file:

       62:  logopen("mylog.log","w");
       63:  for (k=0; k &lt NUM_VALVES; k++) {
       64:    logwatch ("[open]","[closed]","![failed]") {
       65:      vlv = opennextvalve();
       65:      if (vlv ! NULL) {
       66:        loginfo("The valve %d is [open]",vlv);
       67:        flowthrough(vlv);
       68:      ]
       69:      else logerror("Valve [failed] to open at step %d",k);
       70:      if (closevalve(vlv))
       71:        loginfo("Valve %d is now [closed]",vlv);
       72:      else 
       73:        logerror("Valve %d [failed] to close",vlv);
       74:    }
       75:  }
       75:  logclose();
In the example above, we are checking that, within the logwatch() scope, the strings "[open]" and "[closed]" appear in the log at least once. We are also checking that the string "[failed]" does not appear at all.

Say that everything goes well, the log file will look like this:
       ...     
       2017-01-10 21:47:25 WCH START  :mytest.c:64 
       2017-01-10 21:47:25 INF The valve 321 is [open]  :mytest.c:66 
       2017-01-10 21:47:25 CHK PASS ([open])?  :mytest.c:66 
       2017-01-10 21:47:25 INF Valve 321 is now [closed] :mytest.c:71 
       2017-01-10 21:47:25 CHK PASS ([closed])?  :mytest.c:71 
       2017-01-10 21:47:25 CHK PASS (![failed])? :mytest.c:64 
       2017-01-10 21:47:25 WCH START  :mytest.c:64 
       ...
       2017-01-10 21:47:25 CHK #KO: 0 (of 12)
       2017-01-10 21:47:25 LOG STOP
The string "[failed]" is expected not to appear being preceeded, in the logwatch(), by the character '!'.
In our example if a valve can't be closed, we'll have a log like:
       ...     
       2017-01-10 21:47:25 WCH START  :mytest.c:64 
       2017-01-10 21:47:25 INF The valve 321 is [open]  :mytest.c:66 
       2017-01-10 21:47:25 CHK PASS ([open])?  :mytest.c:66 
       2017-01-10 21:47:25 ERR Valve 321 [failed] to close :mytest.c:73 
       2017-01-10 21:47:25 CHK FAIL (![failed])? :mytest.c:73 
       2017-01-10 21:47:25 CHK FAIL ([closed])?  :mytest.c:64 
       2017-01-10 21:47:25 WCH STOP  :mytest.c:64 
       ...
       2017-01-10 21:47:25 CHK #KO: 2 (of 12)
       2017-01-10 21:47:25 LOG STOP
and we'll get two failed checks: one for the unexpected presence of the string "[failed]" and one for the lack of the string "[closed]" in the log.

*   *   *   *

This mechanism is made very flexibile by the fact that the strings to specify are, actually, patterns that follow the pmx syntax specified here. I'll write soon a post on pmx, the c-libutl pattern matching expressions, for more details.

As Kartik pointed out in his post, writing unit tests based based on traces can prove itself more effective than the usual approach of directly checking the functions return value (or the variables value). The need to restructure tests should decrease as traces would contain informaton about high-level events whereas variables and functions could change for
Since the events reported in the log files are at an higher level of abstraction, they are more resilient to changes in the code. The drawback is that this is an indirect test and that it may be difficult to ensure that the result is indeed what one expects.

In the end, I believe this is a very useful tool that could help in writing robust and effective unit tests and I hope you'll find it useful too.

Saturday, June 23, 2012

Finite State Machines in C

FSM in C are often implemented through loop (while or for), switch and control variables.
This approach has the biggest disadvantage that is almost impossible from the code to understand the original FSM diagram.
That's why c-libutl offers simple macros to implement FSM with a 1-to-1 relationship with the original diagram.

I'm pretty sure you will immediately relate this code:

  fsm {
    fsmSTATE(S1) {
      printf("S1: %d\n",x); 
      x++;
      if (x > 3) fsmGOTO(S2); 
      fsmGOTO(S1);
    }
    
    fsmSTATE(S3) {
      printf("S3: %d\n",x); 
      x++;
      if (x > 9) fsmGOTO(exit); 
      fsmGOTO(S3);
    }
    
    fsmSTATE(S2) {
      printf("S2: %d\n",x); 
      x++;
      if (x > 6) fsmGOTO(S3); 
      fsmGOTO(S2);
    }

    fsmSTATE(exit) {
    }
  } 

to this diagram:



The C code for the fsm macros is extremely simple:
  #define fsm           
  #define fsmGOTO(x)    goto fsm_state_##x
  #define fsmSTATE(x)   fsm_state_##x :
  #define fsmSTART      
This simple technique was explained by Tim Cooper in the the article [*Goto? Yes Goto!*](http://ftp.math.utah.edu/pub/tex/bib/complang.html#Cooper:1991:GYG) published on the May 1991 issue of the *Computer Language* magazine and, since then, I've found many uses for it.

Monday, June 18, 2012

Exception handling in C

Adding exception handling in C is a rather controversial topic. Some think that exceptions are pure evil and should never be considered, others say that those wanting to use exceptions should leave the realm of pure C and fly to C++ (or, even, Java). Others suggest that goto is a perfectly fine way to deal with exceptions.

I think that there are quite some occasions when the try/catch mechanism offered by C++ (and other languages) can make code clearer.
Let me go back to what I believe is the the #1 job of programmers:

To write code as simple and as clean as possible so that the code itself conveys its purpose.

We all know how much this is easier said than done. I believe that anything that could help in expressing oneself clearer is beneficial.

The first scenario that comes to mind is the need of acquire a set of resources (e.g. memory, files, connections,...). If a resource is not available, the ones previously acquired should be released. This is usually solved using nested if or using a goto to a specific cleanup section of the code. Personally I don't like neither one or the other and I believe that using try/catch would make the intent clearer.

If C had them, that is!

Actually, it turns out it's not so difficult to put together few macros to do it. C-libutl has them, they are very easy to use and only require a C99 compiler. Here is an example:

   #define EXCEPTION_NOMEMORY    1
   #define EXCEPTION_NORESOURCE  2
   #define EXCEPTION_NOFILE      3
   #define EXCEPTION_FATALERROR  5

   int main(int argc, char *argv[])
   {
     ...
     try {
        printf("Do something\n");
        f = fopen("myfile","r");
        if (f == NULL) throw(EXCEPTION_NOFILE);
        printf("Continue doing\n");
      } 
      catch (EXCEPTION_NOMEMORY) {
         printf("Not enough memory\n");
      }
      catch (EXCEPTION_NOFILE) {
         printf("Unable to find file\n");
      }
      ...
   }

I'm sure you guessed that if "myfile" doesn't exist this fragment of code will print:

       Do Something
       Unable to find file

So far it doesn't seem too useful: it's just a sort of goto in disguise! Or, even better, this could be handled with a simple if.
Actually, having try/catch blocks become more useful when you consider that you can throw an exception from a function and having it handled in the calling function:

      void do_it(void)
      {
         ...
         m = malloc(128);
         if ( m == NULL) throw(EXCEPTION_NOMEMORY);
         ...
      }

      int main(int argc, char *argv[])
      {
        ...
        try {
          printf("Do something\n");
          do_it();
          printf("Completed\n");
        } 
        catch (EXCEPTION_NOMEMORY) {
           printf("Not enough memory\n");
        }
        catch (EXCEPTION_NOFILE) {
           printf("Unable to find file\n");
        }
        catchall {
           printf("Something unexpected happened\n");
        }
        ...
      }

Now, if whitin the do_it() function the system runs out of memory, the fragment will print:

       Do Something
       Not enough memory

A rather useful mechanism to handle herrors neatly ... or for messing everything up! As a C programmer you're used to sharp tools, I'm sure you'll find a responsible way of using it.

The catchall block at the end will be executed if an exception is thrown but is not managed earlier.

In a catch block, you can retrieve the exception code through the function thrown() (actually this is useful only in a catchall block.

Try/catch blocks can be nested but, usually, this is a rather bad idea. What is useful, instead, is the ability to pass the exception up to have it handled from the calling function like in this example:

      void do_it(void)
      {
        ...
        try {
          ...
          m = malloc(128);
          if ( m == NULL) throw(EXCEPTION_NOMEMORY);
          ...
          f = fopen("myfile","r");
          if (f == NULL) throw(EXCEPTION_NOFILE);
          ...
        } 
        catch (EXCEPTION_NOFILE) {
          f = stdin;
          free(m); // Continue in this function
        }
        catchall {
          rethrow(); // Excpetion will be handled in the calling function
        }
        ...
      }

      int main(int argc, char *argv[])
      {
        ...
        try {
          printf("Do something\n");
          do_it();
          printf("Completed\n");
        } 
        catch (EXCEPTION_NOMEMORY) {
           printf("Not enough memory\n");
        }
        catch (EXCEPTION_NOFILE) {
           printf("Unable to find file\n");
        }
        ...
      }
In the do_it() function, the exception NOFILE is handled directly but the exception NOMEMORY is passed to the upper try/catch block via the rethrow() function.

*  *  *  *

Using exceptios can be a neat way to propagate errors through function calls. The alternative is to define specific error codes to be returned by the functions, or to pass additional parameters to functions for error handling (or use global variables!). All these solutions could lead to messy code, where the error handling logic obscures the actual program logic. In such cases, a well thought try/catch blocks could save the day.

The only thing that is left is to see how it is implemented. You can have a look at the utl_try.h and associated test cases ut_try.c file or read a more detailed post that explain the inner working mechanism of this flavour of try/catch.