Chapter 1: Introduction

This manual describes flexc++, a tool for generating lexical scanners: programs recognizing patterns in text. Usually, scanners are used in combination with parsers which can be generated by, e.g., bisonc++.

Flexc++ reads one or more input files (called `lexer' in this manual), containing rules: regular expressions, optionally associated with C++ code. From this Flexc++ generates several files, containing the declaration and implementation of a class (Scanner by default). The member function lex is used to analyze input: it looks for text matching the regular expressions. Whenever it finds a match, it executes the associated C++ code.

Flexc++ closely resembles the programs flex and flex++, written by Vern Paxson. Our goal was to create a similar program, implemented completely in C++ and generating only C++ code. Most flex / flex++ grammars should be usable with flexc++ after minor adjustments (see also chapter 2, `differences with flex/flex++').

This edition of the manual documents version 2.02.00 and provides detailed information on flexc++'s use and inner workings. Some texts are adapted from the flex manual. The manual page flexc++(1) provides an overview of the command line options and option directives, flexc++api(3) provides an overview of the application programmer's interface, and flexc++input(7) describes the organization of flexc++'s input files.

The most recent version of both this manual and flexc++ itself can be found at http://flexcpp.sourceforge.net/. If you find bugs in flexc++ or mistakes in the documentation, please report them to the authors.

Flexc++ was designed and written by Frank B. Brokken, Jean-Paul van Oosten, and (up to version 0.5.3) Richard Berendsen.

1.1: Running Flexc++

Flexc++(1) was designed after flex(1) and flex++(1). Like these two programs, flexc++ generates code performing pattern-matching on text, possibly executing actions when the input matches its regular expressions.

Contrary to flex and flex++, flexc++ generates code that is explicitly intended for use by C++ programs. The well-known flex(1) program generates C source-code and flex++(1) merely offers a C++-like shell around the yylex function generated by flex(1) and hardly supports present-day ideas about C++ software design.

Flexc++ creates a C++ class offering a predefined member function lex which matches input against regular expressions and possibly executes C++ code once regular expressions are matched. The code generated by flexc++ is pure C++, allowing its users to apply all of the features offered by that language.

Flexc++'s synopsis is:

flexc++ [OPTIONS] rules-file

Its options are covered in section 1.1.1; the format of its rules-file is discussed in chapter 3.

1.1.1: Flexc++ options

Where available, single letter options are listed between parentheses following their associated long-option variants. Single letter options require arguments if their associated long options require arguments as well. Options affecting the class header or implementation header file are ignored if these files already exist. Options accepting a `filename' do not accept path names, i.e., they cannot contain directory separators (/); options accepting a `pathname' may contain directory separators.

Some options may generate errors. This happens when an option conflicts with the contents of an existing file which flexc++ cannot modify (e.g., a --namespace option was provided, but an existing scanner class header file doesn't define a namespace). To resolve the error, omit the offending option, remove the existing file, or hand-edit the existing file according to the option's specification. Note that flexc++ currently does not handle the opposite error condition: if a previously used option is omitted, flexc++ does not detect the inconsistency. In those cases you may encounter compilation errors.

--baseclass-header=filename (-b)
Use filename as the name of the file to contain the scanner class's base class. Defaults to the name of the scanner class plus base.h.
It is an error if this option is used and an already existing scanner-class header file does not include `filename'.
--baseclass-skeleton=pathname (-C)
Use pathname as the path to the file containing the skeleton of the scanner class's base class. Its filename defaults to flexc++base.h.
--case-insensitive
Use this option to generate a scanner that matches regular expressions case insensitively. All regular expressions specified in flexc++'s input file are interpreted case insensitively, and the resulting scanner object interprets its input case insensitively as well.
When this option is specified the resulting scanner does not distinguish between the following rules:
```
        First       // initial F is transformed to f
        first
        FIRST       // all capitals are transformed to lower case chars
                
```
With a case-insensitive scanner only the first rule can be matched, and flexc++ will issue warnings for the second and third rule about rules that cannot be matched.
Input processed by a case-insensitive scanner is also handled case insensitively. The above mentioned First rule is matched for all of the following input words: first First FIRST firST.
Although the matching process proceeds case insensitively, the matched text (as returned by the scanner's matched() member) always contains the original, unmodified text. So, with the above input matched() returns, respectively first, First, FIRST and firST, while matching the rule First.
--class-header=filename (-c)
Use filename as the name of the file to contain the scanner class. Defaults to the name of the scanner class plus the suffix .h.
--class-name=className
Use className (rather than Scanner) as the name of the scanner class. Unless overridden by other options, the names of generated files start with className (transformed to lower case) instead of scanner.
It is an error if this option is used and an already existing scanner-class header file does not define class `className'.
--class-skeleton=pathname (-C)
Use pathname as the path to the file containing the skeleton of the scanner class. Its filename defaults to flexc++.h.
--construction (-K)
Write details about the lexical scanner to the file `rules-file'.output. Details cover the used character ranges, information about the regexes, the raw NFA states, and the final DFAs.
--debug (-d)
Provide lex and its support functions with debugging code, showing the actual parsing process on the standard output stream. When included, the debugging output is active by default, but its activity may be controlled using the setDebug(bool on-off) member. Note that #ifdef DEBUG macros are not used anymore. By rerunning flexc++ without the --debug option an equivalent scanner is generated not containing the debugging code.
--filenames=genericName (-f)
Generic name of generated files (header files, not the lex-function source file, see the --lex-source option for that). By default the header file names will be equal to the name of the generated class.
--help (-h)
Write basic usage information to the standard output stream and terminate.
--implementation-header=filename (-i)
Use filename as the name of the file to contain the implementation header. Defaults to the name of the generated scanner class plus the suffix .ih. The implementation header should contain all directives and declarations only used by the implementations of the scanner's member functions. It is the only header file that is included by the source file containing lex()'s implementation. User defined implementation of other class members may use the same convention, thus concentrating all directives and declarations that are required for the compilation of other source files belonging to the scanner class in one header file.
It is an error if this option is used and an already existing file `filename' does not include the scanner class header file.
--implementation-skeleton=pathname (-I)
Use pathname as the path to the file containing the skeleton of the implementation header. Its filename defaults to flexc++.ih.
--lex-skeleton=pathname (-L)
Use pathname as the path to the file containing the lex() member function's skeleton. Its filename defaults to flexc++.cc.
--lex-function-name=funname
Use funname rather than lex as the name of the member function performing the lexical scanning.
--lex-source=filename (-l)
Define filename as the name of the source file to contain the scanner member function lex. Defaults to lex.cc.
--matched-rules (-R)
The generated scanner will write the numbers of matched rules to the standard output. It is implied by the --debug option. Displaying the matched rules can be suppressed by calling the generated scanner's member setDebug(false) (or, of course, by re-generating the scanner without specifying --matched-rules).
--max-depth=depth (-m)
Set the maximum inclusion depth of the lexical scanner's specification files to depth. By default the maximum depth is set to 10. When more than depth specification files are used, the scanner throws a std::length_error exception reporting `Max stream stack size exceeded'.
--namespace=identifier
Define the scanner class in the namespace identifier. By default no namespace is used. If this option is used, the implementation header is provided with a commented out using namespace declaration for the requested namespace. In addition, the scanner and scanner base class header files also use the specified namespace to define their include guard directives.
It is an error if this option is used and an already existing scanner-class header file does not define namespace identifier.
--no-baseclass-header
Do not write the file containing the scanner's base class interface even if it doesn't yet exist. By default the file containing the scanner's base class interface is (re)written each time flexc++ is called.
--no-lines
Do not put #line preprocessor directives in the file containing the scanner's lex function. By default #line directives are entered at the beginning of the action statements in the generated lex.cc file, allowing the compiler and debuggers to associate errors with lines in your grammar specification file, rather than with the source file containing the lex function itself.
--no-lex-source
Do not write the file containing the scanner's predefined member functions, even if that file doesn't yet exist. By default the file containing the scanner's lex member function is (re)written each time flexc++ is called. This option should normally be avoided, as this file contains parsing tables which are altered whenever the grammar definition is modified.
--own-tokens (-T)
When this option is used, the tokens returned and the text matched while flexc++ reads its own input file(s) are shown.
This option does not result in the generated program displaying returned tokens and matched text. If that is what you want, use the --print-tokens option.
--print-tokens (-t)
The tokens returned as well as the text matched by the generated lex function are displayed on the standard output stream, just before returning the token to lex's caller. Displaying tokens and matched text is suppressed again when the lex.cc file is generated without using this option. The function showing the tokens (ScannerBase::print__) is called from Scanner::printTokens, which is defined in-line in Scanner.h. Calling ScannerBase::print__, therefore, can also easily be controlled by an option controlled by the program using the scanner object.
This option does not show the tokens returned and text matched by flexc++ itself when reading its input files. If that is what you want, use the --own-tokens option.
--regex-calls
Show the function call order when parsing regular expressions. This option is normally not required: its main purpose is to help developers understand what happens when regular expressions are parsed.
--show-filenames (-F)
Write the names of the files that are generated to the standard error stream.
--skeleton-directory=pathname (-S)
Defines the directory containing the skeleton files. This option can be overridden by the specific skeleton-specifying options (-B, -C, -H, and -I).
--target-directory=pathname
Specifies the directory where generated files should be written. By default this is the directory where flexc++ is called.
--usage (-h)
Write basic usage information to the standard output stream and terminate.
--verbose (-V)
The verbose option generates on the standard output stream various pieces of additional information, not covered by the --construction and --show-filenames options.
--version (-v)
Display flexc++'s version number and terminate.
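
The behavior described for the --case-insensitive option (matching ignores case, while the matched text keeps its original spelling) can be imitated with the standard library. The following is only a minimal sketch, not code generated by flexc++: matchFirst is a hypothetical helper, and std::regex merely stands in for the generated scanner.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not part of flexc++): returns all occurrences of the
// rule `First` in `input`, comparing case insensitively.
std::vector<std::string> matchFirst(std::string const &input)
{
    // std::regex::icase makes the comparison case insensitive
    static std::regex const rule("First", std::regex::icase);

    std::vector<std::string> matched;
    std::sregex_iterator const end;

    for (std::sregex_iterator it(input.begin(), input.end(), rule); it != end; ++it)
        matched.push_back(it->str());   // the original, unmodified text

    return matched;
}
```

Calling matchFirst("first First FIRST firST") returns the four words in their original spelling, mirroring what the text above says about matched().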

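The depth limit described for the --max-depth option can be pictured as a bounded stack of input files. The sketch below is an illustration under assumptions, not flexc++'s implementation: the class StreamStack and its members are hypothetical, and whether flexc++ counts the initial specification file in the same way is not specified here. Only the exception type and its message are taken from the option's description.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the bookkeeping of nested specification files.
class StreamStack
{
    std::size_t d_maxDepth;
    std::vector<std::string> d_names;

    public:
        explicit StreamStack(std::size_t maxDepth = 10)   // 10: documented default
        :
            d_maxDepth(maxDepth)
        {}

        // Register a newly opened file; refuse to exceed the maximum depth.
        void push(std::string const &name)
        {
            if (d_names.size() == d_maxDepth)
                throw std::length_error("Max stream stack size exceeded");
            d_names.push_back(name);
        }

        // Leave the current file (no-op when the stack is already empty).
        void pop()
        {
            if (!d_names.empty())
                d_names.pop_back();
        }

        std::size_t depth() const
        {
            return d_names.size();
        }
};
```

With StreamStack stack(2), two calls of push succeed and the third one throws std::length_error.
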
1.2: Some simple examples

1.2.1: A simple lexer file and main function

The following lexer file detects identifiers:

%%
[_a-zA-Z][_a-zA-Z0-9]* return 1;

The main() function below defines a Scanner object and calls lex() as long as it does not return 0. lex() returns 0 when the end of the input stream is reached. By default, input is read from std::cin.

#include <iostream>
#include "Scanner.h"

using namespace std;

int main()
{
	Scanner scanner;
	while (scanner.lex())
		cout << "[Identifier: " << scanner.matched() << "]";
}

Each identifier on the input stream is replaced by itself and some surrounding text. By default, flexc++ echoes all characters it cannot match to cout. If you do not want this, simply use the following pattern:

%%
[_a-zA-Z][_a-zA-Z0-9]*		return 1;
.|\n						// ignore

The second pattern causes the generated scanner to ignore all characters on the input stream that are not part of an identifier. The first pattern still matches all identifiers, even those consisting of only one letter, but everything else is ignored. The second pattern has no associated action, and that is precisely what happens in lex: nothing. The stream is simply scanned for more characters.
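
The difference between the two lexer files can be imitated by hand. The sketch below is not flexc++ output: the function scan and its parameter echoUnmatched are hypothetical names, and the result is collected in a string instead of being written to cout. With echoUnmatched set to true it behaves like the first lexer file (unmatched characters are echoed); with false it behaves like the second one (they are ignored).

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helpers implementing [_a-zA-Z] and [_a-zA-Z0-9]
// (assuming the default "C" locale).
bool isIdentStart(char ch)
{
    return ch == '_' || std::isalpha(static_cast<unsigned char>(ch));
}

bool isIdentChar(char ch)
{
    return ch == '_' || std::isalnum(static_cast<unsigned char>(ch));
}

// Wraps each identifier in "[Identifier: ...]"; all other characters are
// either echoed unchanged or dropped.
std::string scan(std::string const &input, bool echoUnmatched)
{
    std::string out;
    std::size_t idx = 0;

    while (idx < input.size())
    {
        if (isIdentStart(input[idx]))
        {
            std::size_t const begin = idx;      // longest possible match
            while (idx < input.size() && isIdentChar(input[idx]))
                ++idx;
            out += "[Identifier: " + input.substr(begin, idx - begin) + "]";
        }
        else
        {
            if (echoUnmatched)
                out += input[idx];
            ++idx;
        }
    }
    return out;
}
```

For the input "int x1 = 42;" the echoing variant produces "[Identifier: int] [Identifier: x1] = 42;", whereas the ignoring variant produces "[Identifier: int][Identifier: x1]".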

It is also possible to let the generated lexer do all the work. The simple lexer below shows all encountered identifiers.

%%
[_a-zA-Z][_a-zA-Z0-9]*      {
            std::cout << "[Identifier: " << matched() << "]\n";
        }
.|\n                        // ignore

Note how a compound statement may be used instead of a one-line statement at the end of the line. The opening brace must appear on the same line as the pattern, however. Also note that Scanner's members can be used inside an action. E.g., matched() returns the text of the token that was last matched. The following main function can be used to activate the generated scanner.

#include "Scanner.h"

int main()
{
	Scanner scanner;
	scanner.lex();
}

Note how simple this function is. Scanner::lex() does not return until the entire input stream has been processed, because none of the patterns has an associated action using a return statement.
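
Why lex() keeps running when no action returns can be shown with a small hand-written model. This is a sketch under assumptions, not the code flexc++ generates: MiniScanner, addRule and Action are hypothetical names, std::regex replaces the generated DFA, and an action signals "no return statement" by returning 0.

```cpp
#include <cstddef>
#include <functional>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model of a scanner: rules are (regex, action) pairs.
class MiniScanner
{
    public:
        // returns 0 to continue scanning, or a token to make lex() return
        using Action = std::function<int (std::string const &)>;

    private:
        struct Rule
        {
            std::regex regex;
            Action action;
        };

        std::string d_input;
        std::size_t d_pos = 0;
        std::string d_matched;
        std::vector<Rule> d_rules;

    public:
        explicit MiniScanner(std::string input)
        :
            d_input(std::move(input))
        {}

        void addRule(std::string const &pattern, Action action)
        {
            d_rules.push_back({std::regex(pattern), std::move(action)});
        }

        int lex()
        {
            while (d_pos < d_input.size())
            {
                Rule const *best = nullptr;
                std::string bestText;

                for (Rule const &rule: d_rules)     // longest match, first rule on ties
                {
                    std::smatch match;
                    if (std::regex_search(d_input.cbegin() + d_pos, d_input.cend(),
                                match, rule.regex,
                                std::regex_constants::match_continuous)
                        && static_cast<std::size_t>(match.length()) > bestText.size())
                    {
                        best = &rule;
                        bestText = match.str();
                    }
                }

                if (best == nullptr)                // no rule matches: skip one char
                {
                    ++d_pos;
                    continue;
                }

                d_matched = bestText;
                d_pos += bestText.size();

                if (int token = best->action(d_matched))
                    return token;                   // the action `returned'
            }
            return 0;                               // end of input
        }

        std::string const &matched() const
        {
            return d_matched;
        }
};
```

When every action returns 0, a single call of lex() consumes the whole input and then returns 0; when the identifier action returns 1 instead, lex() returns once per identifier, as in the first example of this section.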

1.2.2: An interactive scanner supporting command-line editing

The flexc++(1) manual page contains an example of an interactive scanner. Let's add command-line editing and command-line history to that scanner.

Command-line editing and history are provided by the GNU readline library. The bobcat library offers a class FBB::ReadLineStream encapsulating the facilities of GNU's readline library. This class is used by the following example to implement the required features.

The lexical scanner is a simple one. It recognizes C++ identifiers and \n characters, and ignores all other characters. Here is its specification:


%class-name Scanner
%interactive

%%

[[:alpha:]_][[:alnum:]_]*   return 1;
\n                          return '\n';
.

Create the lexical scanner from this specification file:


    flexc++ lexer

Assuming that the directory containing the specification file also contains the file main.cc whose implementation is shown below, then execute the following command to create the interactive scanner program:


    g++ --std=c++11 *.cc -lbobcat

This completes the construction of the interactive scanner. Here is the file main.cc:


#include <iostream>
#include <bobcat/readlinestream>

#include "Scanner.h"

using namespace std;
using namespace FBB;

int main()
{
    ReadLineStream rls("? ");       // create the ReadLineStream, using "? "
                                    // as a prompt before each line
                                    
    Scanner scanner(rls);           // pass `rls' to the interactive scanner

                                    // process all the line's tokens
                                    // (the prompt is provided by `rls')
    while (int token = scanner.lex())
    {                                   
        if (token == '\n')          // end of line: new prompt
            continue;
                                    // process other tokens
        cout << scanner.matched() << '\n';
        if (scanner.matched()[0] == 'q')
            return 0;
    }
}

An interactive session with the above program might look like this (the end-of-line comments were not entered, but were added for documentary purposes):

   
    $ a.out
    ? hello world               // enter some words
    hello 
    world                       // echoed after pressing Enter
    ? hello world               // this is shown after pressing up-arrow
    ? hello world^H^H^Hman      // do some editing and press Enter
    hello                       // the tokens as edited are returned 
    woman
    ? q                         // end the program
    $

The interactive scanner supports only one constructor, which by default reads from std::cin and writes to std::cout:


    explicit Scanner(std::istream &in = std::cin,
                     std::ostream &out = std::cout);

Interactive scanners only support switching output streams (through switchOstream members).