This manual describes flexc++, a tool for generating lexical scanners: programs recognizing patterns in text. Usually, scanners are used in combination with parsers which can be generated by, e.g., bisonc++.
Flexc++ reads one or more input files (called `lexer' in this manual), containing rules: regular expressions, optionally associated with C++ code. From these, flexc++ generates several files containing the declaration and implementation of a class (Scanner by default). The member function lex analyzes input: it looks for text matching the regular expressions. Whenever it finds a match, it executes the associated C++ code.
Flexc++ is highly comparable to the programs flex and flex++, written by Vern Paxson. Our goal was to create a similar program, implemented completely in C++ and generating only C++ code. Most flex / flex++ grammars should be usable with flexc++ after minor adjustments (see also `differences with flex/flex++ 2').
This edition of the manual documents version 2.02.00 and provides detailed information on flexc++'s use and inner workings. Some texts are adapted from the flex manual. The manual page flexc++(1) provides an overview of the command line options and option directives, flexc++api(3) provides an overview of the application programmer's interface, and flexc++input(7) describes the organization of flexc++'s input files.
The most recent version of both this manual and flexc++ itself can be found at http://flexcpp.sourceforge.net/. If you find a bug in flexc++ or mistakes in the documentation, please report them to the authors.
Flexc++ was designed and written by Frank B. Brokken, Jean-Paul van Oosten, and (up to version 0.5.3) Richard Berendsen.
Contrary to flex and flex++, flexc++ generates code that is
explicitly intended for use by C++ programs. The well-known flex(1)
program generates C source-code and flex++(1) merely offers a
C++-like shell around the yylex
function generated by flex(1) and
hardly supports present-day ideas about C++ software design.
Flexc++ creates a C++ class offering a predefined member function lex which matches input against regular expressions and possibly executes C++ code once regular expressions are matched. The code generated by flexc++ is pure C++, allowing its users to apply all of the features offered by that language.
Flexc++'s synopsis is:
flexc++ [OPTIONS] rules-file
Its options are covered in section 1.1.1, the format of its
rules-file is discussed in chapter 3.
Options accepting a 'pathname' may contain directory separators (/).
Some options may generate errors. This happens when an option conflicts with the contents of an existing file which flexc++ cannot modify (e.g., a scanner class header file exists and doesn't define a namespace, but a --namespace option was provided). To resolve the error, the offending option can be omitted, the existing file can be removed, or the existing file can be hand-edited according to the option's specification. Note that flexc++ currently does not handle the opposite error condition: if a previously used option is omitted, then flexc++ does not detect the inconsistency. In those cases you may encounter compilation errors.
filename (-b)
Use filename as the name of the file to contain the scanner class's base class. Defaults to the name of the scanner class plus base.h. It is an error if this option is used and an already existing scanner-class header file does not include `filename'.
pathname (-C)
Use pathname as the path to the file containing the skeleton of the scanner class's base class. Its filename defaults to flexc++base.h.
When this option is specified the resulting scanner does not distinguish between the following rules:

    First       // initial F is transformed to f
    first
    FIRST       // all capitals are transformed to lower case chars

With a case-insensitive scanner only the first rule can be matched, and flexc++ issues warnings for the second and third rules, because they cannot be matched.
Input processed by a case-insensitive scanner is also handled case insensitively. The above-mentioned First rule is matched for all of the following input words: first First FIRST firST.
Although the matching process proceeds case insensitively, the matched text (as returned by the scanner's matched() member) always contains the original, unmodified text. So, with the above input, matched() returns, respectively, first, First, FIRST and firST, while matching the rule First.
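The distinction between case-insensitive matching and the unmodified matched text can be illustrated with a small stand-alone sketch. It does not show how flexc++ implements this; the helper matchesIgnoringCase is a hypothetical name introduced only for this illustration. The comparison folds both sides to lower case on the fly, so the input string is never altered:

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper (not part of flexc++): compares `input' with `rule'
// while ignoring case. The input string itself is never modified, so the
// caller still has the original text available afterwards.
bool matchesIgnoringCase(std::string const &rule, std::string const &input)
{
    return rule.size() == input.size() &&
           std::equal(rule.begin(), rule.end(), input.begin(),
                      [](unsigned char lhs, unsigned char rhs)
                      {
                          return std::tolower(lhs) == std::tolower(rhs);
                      });
}
```

Called with the rule First, this helper accepts first, First, FIRST and firST alike, while the caller keeps the text exactly as it was entered.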
filename (-c)
Use filename as the name of the file to contain the scanner class. Defaults to the name of the scanner class plus the suffix .h.
className
Use className (rather than Scanner) as the name of the scanner class. Unless overridden by other options, generated files will be given the (transformed to lower case) className* name instead of scanner*.
It is an error if this option is used and an already existing scanner-class header file does not define class `className'.
pathname (-C)
Use pathname as the path to the file containing the skeleton of the scanner class. Its filename defaults to flexc++.h.
`rules-file'.output. Details cover the character ranges used, information about the regexes, the raw NFA states, and the final DFAs.
lex and its support functions with debugging code, showing the actual parsing process on the standard output stream. When included, the debugging output is active by default, but its activity may be controlled using the setDebug(bool on-off) member. Note that #ifdef DEBUG macros are no longer used. By rerunning flexc++ without the --debug option an equivalent scanner is generated that does not contain the debugging code.
genericName
(-f)lex
-function source file, see the --lex-source
option for
that). By default the header file names will be equal to the name
of the generated class.
filename (-i)
Use filename as the name of the file to contain the implementation header. Defaults to the name of the generated scanner class plus the suffix .ih. The implementation header should contain all directives and declarations only used by the implementations of the scanner's member functions. It is the only header file that is included by the source file containing lex()'s implementation. User-defined implementations of other class members may use the same convention, thus concentrating all directives and declarations that are required for the compilation of other source files belonging to the scanner class in one header file.
It is an error if this option is used and an already existing 'filename' file does not include the scanner class header file.
pathname (-I)
Use pathname as the path to the file containing the skeleton of the implementation header. Its filename defaults to flexc++.ih.
pathname (-L)
Use pathname as the path to the file containing the lex() member function's skeleton. Its filename defaults to flexc++.cc.
funname
Use funname rather than lex as the name of the member function performing the lexical scanning.
filename (-l)
Use filename as the name of the source file to contain the scanner member function lex. Defaults to lex.cc.
--debug option.
Displaying the matched rules can be suppressed by calling the generated scanner's member setDebug(false) (or, of course, by re-generating the scanner without specifying --matched-rules).
depth (-m)
Set the maximum depth of nested specification files to depth. By default the maximum depth is set to 10. When more than depth specification files are used, the scanner throws a `Max stream stack size exceeded' std::length_error exception.
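The depth limit can be pictured as a bounded stack of input files. The sketch below is only an illustration of that idea, not the code flexc++ generates: the class StreamStack and its members are hypothetical names, and only the default limit of 10, the exception type and the message text are taken from the description above.

```cpp
#include <cstddef>
#include <stack>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a scanner's stack of specification files.
class StreamStack
{
    std::size_t d_maxDepth;
    std::stack<std::string> d_files;

public:
    explicit StreamStack(std::size_t maxDepth = 10)
    :
        d_maxDepth(maxDepth)
    {}

    // Refuses to nest deeper than d_maxDepth files.
    void push(std::string const &filename)
    {
        if (d_files.size() == d_maxDepth)
            throw std::length_error("Max stream stack size exceeded");

        d_files.push(filename);
    }

    std::size_t depth() const
    {
        return d_files.size();
    }
};
```

A program that wants to survive overly deep nesting would wrap the scanning call in a try block catching std::length_error.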
identifier
Define the scanner class in the namespace identifier. By default no namespace is used. If this option is used the implementation header is provided with a commented out using namespace declaration for the requested namespace. In addition, the scanner and scanner base class header files also use the specified namespace to define their include guard directives.
It is an error if this option is used and an already existing scanner-class header file does not define namespace identifier.
lex function. By default #line directives are entered at the beginning of the action statements in the generated lex.cc file, allowing the compiler and debuggers to associate errors with lines in your grammar specification file, rather than with the source file containing the lex function itself.
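The effect of a #line directive can be seen in isolation with the standard preprocessor. The following sketch is independent of flexc++; the file name "lexer" and the line number 42 are made up for the illustration. After the directive, the compiler numbers the next source line 42 and attributes it to the named file, which is what lets diagnostics point back at a specification file:

```cpp
// Returns the line number the compiler associates with the return statement.
int reportedLine()
{
#line 42 "lexer"
    return __LINE__;    // this line is now numbered 42 in the file "lexer"
}
```

Any compiler message about the return statement would likewise mention line 42 of "lexer" rather than the position in the actual source file.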
lex member function is (re)written each time flexc++ is called. This option should normally be avoided, as this file contains parsing tables which are altered whenever the grammar definition is modified.
This option does not result in the generated program displaying returned tokens and matched text. If that is what you want, use the --print-tokens option.
lex function are displayed on the standard output stream, just before returning the token to lex's caller. Displaying tokens and matched text is suppressed again when the lex.cc file is generated without using this option. The function showing the tokens (ScannerBase::print__) is called from Scanner::printTokens, which is defined in-line in Scanner.h. Calling ScannerBase::print__ can therefore also easily be controlled by an option of the program using the scanner object.
This option does not show the tokens returned and text matched by flexc++ itself when reading its input files. If that is what you want, use the --own-tokens option.
pathname
(-S)-B -C, -H,
and -I
).
pathname
--construction
and --show-filenames
options.
    %%
    [_a-zA-Z][_a-zA-Z0-9]*      return 1;
The main() function below defines a Scanner object, and calls lex() as long as it does not return 0. lex() returns 0 if the end of the input stream is reached. By default, input is read from std::cin.
    #include <iostream>
    #include "Scanner.h"

    using namespace std;

    int main()
    {
        Scanner scanner;
        while (scanner.lex())
            cout << "[Identifier: " << scanner.matched() << "]";
    }
Each identifier on the input stream is replaced by itself and some surrounding text. By default, the generated scanner echoes all characters it cannot match to cout. If you do not want this, simply use the following pattern:
    %%
    [_a-zA-Z][_a-zA-Z0-9]*      return 1;
    .|\n                        // ignore
The second pattern causes the scanner to ignore all characters on the input stream that are not part of an identifier. The first pattern still matches all identifiers, even those that consist of only one letter, but everything else is ignored. The second pattern has no associated action, and that is precisely what happens in lex: nothing. The stream is simply scanned for more characters.
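To experiment with the combined effect of these two patterns without generating a scanner, the same identifier expression can be tried with std::regex. This is merely a stand-alone sketch: it is unrelated to the code flexc++ produces, and the function name identifiers is a hypothetical helper. Everything that is not part of an identifier is skipped, mirroring the `.|\n' rule:

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects all identifiers found in `text', skipping
// every character that is not part of an identifier.
std::vector<std::string> identifiers(std::string const &text)
{
    static std::regex const pattern("[_a-zA-Z][_a-zA-Z0-9]*");

    std::vector<std::string> result;
    for (std::sregex_iterator it(text.begin(), text.end(), pattern), end;
         it != end; ++it)
        result.push_back(it->str());

    return result;
}
```

For the text `int x1 = 42; _y' the helper yields int, x1 and _y; the digits, operators and white space are dropped.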
It is also possible to let the generated lexer do all the work. The simple lexer below shows all encountered identifiers.
    %%
    [_a-zA-Z][_a-zA-Z0-9]*      {
                                    std::cout << "[Identifier: " << matched() << "]\n";
                                }
    .|\n                        // ignore
Note how a compound statement may be used instead of a one-line statement at the end of the line. The opening brace must appear on the same line as the pattern, however. Also note that inside an action, we can use Scanner's members. E.g., matched() returns the text of the token that was last matched. The following main function can be used to activate the generated scanner.
    #include "Scanner.h"

    int main()
    {
        Scanner scanner;
        scanner.lex();
    }
Note how simple this function is. Scanner::lex() does not return until the entire input stream has been processed, because none of the patterns has an associated action using a return statement.
Command-line editing and history are provided by the GNU readline library. The bobcat library offers a class FBB::ReadLineStream encapsulating the GNU readline library's facilities. This class is used by the following example to implement the required features.
The lexical scanner is a simple one. It recognizes C++ identifiers and \n characters, and ignores all other characters. Here is its specification:
    %class-name Scanner
    %interactive

    %%

    [[:alpha:]_][[:alnum:]_]*   return 1;
    \n                          return '\n';
    .

Create the lexical scanner from this specification file:
flexc++ lexer
Assuming that the directory containing the specification file also contains the file main.cc whose implementation is shown below, execute the following command to create the interactive scanner program:
    g++ --std=c++11 *.cc -lbobcat

This completes the construction of the interactive scanner. Here is the file main.cc:
    #include <iostream>
    #include <bobcat/readlinestream>

    #include "Scanner.h"

    using namespace std;
    using namespace FBB;

    int main()
    {
        ReadLineStream rls("? ");       // create the ReadLineStream, using "? "
                                        // as a prompt before each line

        Scanner scanner(rls);           // pass `rls' to the interactive scanner

                                        // process all the line's tokens
                                        // (the prompt is provided by `rls')
        while (int token = scanner.lex())
        {
            if (token == '\n')          // end of line: new prompt
                continue;
                                        // process other tokens
            cout << scanner.matched() << '\n';
            if (scanner.matched()[0] == 'q')
                return 0;
        }
    }

An interactive session with the above program might look like this (the end-of-line comments are not entered, but were added by us for documentary purposes):
    $ a.out
    ? hello world               // enter some words
    hello
    world                       // echoed after pressing Enter
    ? hello world               // this is shown after pressing up-arrow
    ? hello world^H^H^Hman      // do some editing and press Enter
    hello                       // the tokens as edited are returned
    woman
    ? q                         // end the program
    $

The interactive scanner only supports one constructor, which by default reads from std::cin and by default writes to std::cout:
    explicit Scanner(std::istream &in = std::cin, std::ostream &out = std::cout);

Interactive scanners only support switching output streams (through switchOstream members).
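Because the constructor accepts stream references, any std::istream and std::ostream can be passed, which is convenient when testing. The sketch below demonstrates this with a hypothetical stand-in class, WordEcho, that merely copies the constructor's signature; it is not the class generated by flexc++ and performs no lexical scanning.

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical stand-in mirroring the constructor's signature shown above.
class WordEcho
{
    std::istream &d_in;
    std::ostream &d_out;

public:
    explicit WordEcho(std::istream &in = std::cin,
                      std::ostream &out = std::cout)
    :
        d_in(in),
        d_out(out)
    {}

    // Copies whitespace-separated words to the output stream, one per
    // line, and returns the number of words seen.
    int run()
    {
        int count = 0;
        std::string word;
        while (d_in >> word)
        {
            d_out << word << '\n';
            ++count;
        }
        return count;
    }
};
```

Constructing the object with a std::istringstream and a std::ostringstream lets a test feed fixed input and inspect the output; omitting both arguments falls back to std::cin and std::cout.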