Flexc++(1) was designed after flex(1) and flex++(1). Like these latter two programs flexc++ generates code performing pattern-matching on text, possibly executing actions when certain regular expressions are recognized.
Flexc++, contrary to flex and flex++, generates code that is explicitly intended for use by C++ programs. The well-known flex(1) program generates C source-code and flex++(1) merely offers a C++-like shell around the yylex function generated by flex(1) and hardly supports present-day ideas about C++ software development.
Contrary to this, flexc++ creates a C++ class offering a predefined member function lex matching input against regular expressions and possibly executing C++ code once regular expressions were matched. The code generated by flexc++ is pure C++, allowing its users to apply all of the features offered by that language.
Below, the following sections may be consulted for specific details:
A bare-bones, no-frills scanner is generated as follows:
    %%
    [ \t\n]+                            // skip white space chars.
    [0-9]+                              return NUMBER;
    [[:alpha:]_][[:alpha:][:digit:]_]*  return IDENTIFIER;
    .                                   return matched()[0];
    flexc++ lexer
This generates four files
    class Scanner: public ScannerBase
    {
        public:
            enum Tokens
            {
                IDENTIFIER = 0x100,
                NUMBER
            };
            // ... (etc, as generated by flexc++)
    #include <iostream>
    #include <string>
    #include "Scanner.h"

    using namespace std;

    int main()
    {
        Scanner scanner;                    // define a Scanner object

        while (int token = scanner.lex())   // get all tokens
        {
            string const &text = scanner.matched();
            switch (token)
            {
                case Scanner::IDENTIFIER:
                    cout << "identifier: " << text << '\n';
                break;

                case Scanner::NUMBER:
                    cout << "number: " << text << '\n';
                break;

                default:
                    cout << "char. token: `" << text << "'\n";
                break;
            }
        }
    }
g++ --std=c++11 *.cc
a.out < main.cc
To interface flexc++ to the bisonc++(1) parser generator proceed as follows:
    %scanner                ../scanner/Scanner.h
    %scanner-token-function d_scanner.lex()
    %token IDENTIFIER NUMBER CHAR

    %%

    startrule:
        startrule tokenshow
    |
        tokenshow
    ;

    tokenshow:
        token
        {
            std::cout << "matched: " << d_scanner.matched() << '\n';
        }
    ;

    token:
        IDENTIFIER
    |
        NUMBER
    |
        CHAR
    ;
    %%
    [ \t\n]+                            // skip white space chars.
    [0-9]+                              return Parser::NUMBER;
    [[:alpha:]_][[:alpha:][:digit:]_]*  return Parser::IDENTIFIER;
    .                                   return Parser::CHAR;
This causes the scanner to return Parser tokens to the generated parser.
Add the line
    #include "../parser/Parserbase.h"
to the file scanner/Scanner.ih
    #include "parser/Parser.h"

    int main(int argc, char **argv)
    {
        Parser parser;
        parser.parse();
    }
flexc++ lexer
bisonc++ grammar
g++ --std=c++0x *.cc */*.cc
a.out < main.cc
Flexc++ generates four files from a well-formed input file:
Where available, single letter options are listed between parentheses following their associated long-option variants. Single letter options require arguments if their associated long options require arguments as well. Options affecting the class header or implementation header file are ignored if these files already exist. Options accepting a `filename' do not accept path names, i.e., they cannot contain directory separators (/); options accepting a 'pathname' may contain directory separators.
Some options may generate warnings. This happens when an option conflicts with the contents of a file which flexc++ cannot modify (e.g., a scanner class header file exists, but doesn't define a name space, but a --namespace option was provided). In those cases the option is ignored, and hand-editing may then be required to effectuate the option.
A warning is issued if this option is used and an already existing scanner-class header file does not include `filename'.
When this option is specified the resulting scanner does not distinguish between the following rules:
    First       // initial F is transformed to f
    first
    FIRST       // all capitals are transformed to lower case chars
With a case-insensitive scanner only the first rule can be matched, and flexc++ will issue warnings for the second and third rule about rules that cannot be matched.
Input processed by a case-insensitive scanner is also handled case insensitively. The above mentioned First rule is matched for all of the following input words: first First FIRST firST.
Although the matching process proceeds case insensitively, the matched text (as returned by the scanner's matched() member) always contains the original, unmodified text. So, with the above input matched() returns, respectively first, First, FIRST and firST, while matching the rule First.
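The principle can be imitated with the standard library's std::regex. This is not the code flexc++ generates, merely a minimal sketch under that assumption: matching ignores case, while the matched text is returned unmodified. The function name matchFirst is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not part of flexc++): matches `input` against the
// rule `First`, ignoring case. Returns the matched text, or an empty
// string if the input does not match.
std::string matchFirst(std::string const &input)
{
    static std::regex const rule("first", std::regex::icase);

    std::smatch result;
    if (std::regex_match(input, result, rule))
        return result.str();        // the original, unmodified text

    return "";
}
```

Calling matchFirst("firST") returns firST rather than a lower-cased copy, mirroring what matched() is described to return.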
A warning is issued if this option is used and an already existing scanner-class header file does not define class `className'.
A warning is issued if this option is used and an already existing 'filename' file does not include the scanner class header file.
A warning is issued if this option is used and an already existing scanner-class header file does not define namespace identifier.
This option does not result in the generated program displaying returned tokens and matched text. If that is what you want, use the --print-tokens option.
This option does not show the tokens returned and text matched by flexc++ itself when reading its input. If that is what you want, use the --own-tokens option.
An interactive scanner is characterized by the fact that scanning is postponed until an end-of-line character has been received, after which all information read so far on that line is processed. Flexc++ supports the %interactive directive, generating an interactive scanner. Here it is assumed that Scanner is the name of the scanner class generated by flexc++.
Caveat: generating interactive and non-interactive scanners should not be mixed as their class organizations fundamentally differ, and several of the Scanner class's members are only available in the non-interactive scanner. As the Scanner.h file contains the Scanner class's interface, which is normally left untouched by flexc++, flexc++ cannot adapt the Scanner class when requested to change the interactivity of an existing Scanner class. Because of this, support for the --interactive option was discontinued at flexc++'s 1.01.00 release.
The interactive scanner generated by flexc++ has the following characteristics:
- If the token returned by the scanner is not equal to 0 it is returned as the next token;
- Otherwise the next line is retrieved from the input stream passed to the Scanner's constructor (by default std::cin). If this fails, 0 is returned.
- A '\n' character is appended to the just read line, and the scanner's std::istringstream base class object is re-initialized with that line;
- The member lex__ returns the next token.
Here is an example of how such a scanner could be used:
    // scanner generated using 'flexc++ lexer' with lexer containing
    // the %interactive directive
    #include <iostream>
    #include "Scanner.h"

    using namespace std;

    int main()
    {
        Scanner scanner;            // by default: read from std::cin

        while (true)
        {
            cout << "? ";           // prompt at each line

            while (true)            // process all the line's tokens
            {
                int token = scanner.lex();

                if (token == '\n')  // end of line: new prompt
                    break;

                if (token == 0)     // end of input: done
                    return 0;

                                    // process other tokens
                cout << scanner.matched() << '\n';
                if (scanner.matched()[0] == 'q')
                    return 0;
            }
        }
    }
Flexc++ expects an input file containing directives and the regular expressions that should be recognized by objects of the scanner class generated by flexc++. In this man page the elements and organization of flexc++'s input file are described.
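The line handling listed among the characteristics above (retrieve a line, append '\n', re-initialize a std::istringstream) can be sketched without flexc++. The class LineFeeder and its member next are hypothetical names; a generated scanner returns tokens, whereas this sketch merely returns characters to keep it short.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical stand-in for the line handling of an interactive scanner:
// characters are taken from the current line; once the line is exhausted
// the next line is read and the line buffer is re-initialized.
class LineFeeder
{
    std::istream &d_in;             // source of the lines (e.g., std::cin)
    std::istringstream d_line;      // the line currently being processed

    public:
        explicit LineFeeder(std::istream &in)
        :
            d_in(in)
        {}

        // Returns the next character, or 0 if no further line is available.
        int next()
        {
            int ch = d_line.get();
            if (ch != std::istringstream::traits_type::eof())
                return ch;

            std::string line;
            if (!std::getline(d_in, line))
                return 0;               // retrieving the next line failed

            d_line.clear();             // reset the end-of-file state
            d_line.str(line + '\n');    // append '\n' and re-initialize
            return d_line.get();
        }
};
```

Because input is consumed one line at a time, a prompt can be shown before each line is requested, which is what makes this organization suitable for interactive use.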
Flexc++'s input file consists of two sections, separated from each other by a line merely containing two consecutive percent characters:
    %%
The section before this separator contains directives; the section following this separator contains regular expressions and possibly actions to perform when these regular expressions are matched by the object of the scanner class generated by flexc++.
White space is usually ignored, as is comment. Comment may be of the traditional C form (i.e., /*, followed by (possibly multi-line) comment text, followed by */), or it may be C++ end-of-line comment: two consecutive slashes (//) start the comment, which continues up to the next newline character.
Flexc++'s input file may be split into multiple files. This allows for the definition of logically separate elements of the specifications in different files. Include directives must be specified on a line of their own. To switch to another specification file the following stanza is used:
    //include file-location
The //include directive starts in the line's first column. File locations can be absolute or relative to the location of the file containing the //include directive. White space characters following //include and before the end of the line are ignored. The file specification may be surrounded by double quotes, but these double quotes are not required and are ignored (removed) if present. All remaining characters are expected to define the name of the file where flexc++'s rules specifications continue. Once end of file of a sub-file has been reached, processing continues at the line beyond the //include directive of the previously scanned file. The end-of-file of the file that was initially specified when flexc++ was called indicates the end of flexc++'s rules specification.
The first section of flexc++'s input file consists of directives. In addition it may associate regular expressions with symbolic names, allowing you to use these identifiers in the rules section. Each directive is defined on a line of its own. When available, directives are overridden by flexc++ command line options.
Some directives require arguments, which are usually provided following separating (but optional) = characters. Arguments of directives are text, surrounded by double quotes (strings). If a string must itself contain a double quote or a backslash, then precede these characters by a backslash. The exceptions are the %s and %x directives, which are immediately followed by name lists, consisting of identifiers separated by blanks. Here is an example of the definition of a directive:
%class-name = "MyScanner"
Directives accepting a `filename' do not accept path names, i.e., they cannot contain directory separators (/); directives accepting a 'pathname' may contain directory separators. A 'pathname' using blank characters should be surrounded by double quotes.
Some directives may generate warnings. This happens when a directive conflicts with the contents of a file which flexc++ cannot modify (e.g., a scanner class header file exists, but doesn't define a name space, but a %namespace directive was provided). In those cases the directive is ignored, and hand-editing may then be required to effectuate the directive.
A warning is issued if this directive is used and an already existing scanner-class header file does not include `filename'.
Corresponding command-line option: --case-insensitive.
When this directive is specified the resulting scanner does not distinguish between the following rules:
    First       // initial F is transformed to f
    first
    FIRST       // all capitals are transformed to lower case chars
With a case-insensitive scanner only the first rule can be matched, and flexc++ will issue warnings for the second and third rule about rules that cannot be matched.
Input processed by a case-insensitive scanner is also handled case insensitively. The above mentioned First rule is matched for all of the following input words: first First FIRST firST.
Although the matching process proceeds case insensitively, the matched text (as returned by the scanner's matched() member) always contains the original, unmodified text. So, with the above input matched() returns, respectively first, First, FIRST and firST, while matching the rule First.
A warning is issued if this directive is used and an already existing scanner-class header file does not define class `className'.
    %filenames = "scanner"
the names of the generated files are, respectively, scanner.h, scanner.ih, and scannerbase.h. Corresponding command-line option: --filenames. The name of the source file (by default lex.cc) is controlled by the %lex-source directive.
A warning is issued if this directive is used and an already existing 'filename' file does not include the scanner class header file.
A warning is issued if this directive is used and an already existing scanner-class header file does not define namespace identifier.
Mini scanners come in two flavors: inclusive mini scanners and exclusive mini scanners. The rules that apply to an inclusive mini scanner are the mini scanner's own rules as well as the rules which apply to no mini scanners in particular (i.e., the rules that apply to the default (or INITIAL) mini scanner). Exclusive mini scanners only use the rules that were defined for them.
To define an inclusive mini scanner use %s, followed by one or more identifiers specifying the name(s) of the mini-scanner(s). To define an exclusive mini scanner use %x, followed by one or more identifiers specifying the name(s) of the mini-scanner(s). The following example defines the names of two mini scanners: string and comment:
    %x string comment
Following this, rules defined in the context of the string mini scanner (see below) will only be used when that mini scanner is active.
A flexc++ input file may contain multiple %s and %x specifications.
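The difference between inclusive and exclusive mini scanners can be expressed as a small predicate. The following is a minimal sketch, not code generated by flexc++; the names Kind and ruleApplies are hypothetical, and start conditions are represented by plain strings for the sake of illustration.

```cpp
#include <string>

enum class Kind
{
    INCLUSIVE,      // declared using %s
    EXCLUSIVE       // declared using %x
};

// Returns true if a rule defined for start condition `ruleCondition` is
// used while the mini scanner `active`, which is of kind `kind`, is active.
bool ruleApplies(std::string const &ruleCondition,
                 std::string const &active, Kind kind)
{
    if (ruleCondition == active)    // a mini scanner's own rules always apply
        return true;

                                    // INITIAL rules: inclusive scanners only
    return kind == Kind::INCLUSIVE && ruleCondition == "INITIAL";
}
```

With the %x declaration of the example, ruleApplies("INITIAL", "string", Kind::EXCLUSIVE) is false: while the string mini scanner is active, only its own rules are considered.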
Definitions are of the form
    identifier  regular-expression
Each definition must be entered on a line of its own. Definitions associate identifiers with regular expressions, allowing the use of {identifier} as synonym for its regular expression in the rules section of the flexc++ input file. Once defined, the identifiers representing regular expressions can also be used in subsequent definitions.
Example:
    FIRST   [A-Za-z_]
    NAME    {FIRST}[-A-Za-z0-9_]*
Following directives and definitions a line merely containing two consecutive % characters is expected. Following this line the rules are defined. Rules consist of regular expressions which should be recognized, possibly followed by actions to be executed once a rule's regular expression has been matched.
The regular expressions defined in flexc++'s rules files are matched against the information passed to the scanner's lex function.
Regular expressions begin at the first non-blank character on a line. Comment is interpreted as comment as long as it isn't part of the regular expression. To define a regular expression starting with two slashes (at least) the first slash can be escaped or double quoted. (E.g., "//".* defines C++ comment to end-of-line).
Regular expressions end at the first blank character (to add a blank character, e.g., a space character, to a regular expression, prefix it by a backslash or put it in a double-quoted string).
Actions may be associated with regular expressions. At a match the action that is associated with the regular expression is executed, after which scanning continues when the lexical scanning function (e.g., lex) is called again. Actions are not required, and regular expressions can be defined without any actions at all. If such action-less regular expressions are matched then the match is performed silently, after which processing continues.
Flexc++ tries to match as many characters of the input file as possible (i.e., it uses `greedy matching'). Non-greedy matching is accomplished by a combination of a scanner and parser and/or by using the `lookahead' operator (/).
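The effect of matching as many characters as possible can be illustrated with std::regex. This is only a sketch of the principle and not flexc++'s matching algorithm: each rule is tried at the start of the input and the rule consuming most characters is selected. The function longestMatch is hypothetical, and resolving ties in favor of the earlier rule is an assumption of this sketch.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Returns the index of the rule matching the longest prefix of `input`
// and stores the number of matched characters in `length`. Ties are
// resolved in favor of the earlier rule (an assumption of this sketch).
// Returns rules.size() if none of the rules matches.
std::size_t longestMatch(std::vector<std::regex> const &rules,
                         std::string const &input, std::size_t &length)
{
    std::size_t best = rules.size();
    length = 0;

    for (std::size_t idx = 0; idx != rules.size(); ++idx)
    {
        std::smatch result;
        if (!std::regex_search(input, result, rules[idx],
                               std::regex_constants::match_continuous))
            continue;               // this rule doesn't match at the start

        std::size_t matched = static_cast<std::size_t>(result.length());
        if (best == rules.size() || matched > length)
        {
            best = idx;
            length = matched;
        }
    }
    return best;
}
```

Given the rules if and [a-z]+, the input ifx is matched by the second rule (three characters), even though the first rule matches its first two characters.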
The following regular expression `building blocks' are available. More complex regular expressions are created by combining them:
Inside a character class all regular expression operators lose their special meanings, except for the escape character (\) and the character class operators -, ], and, at the beginning of the class, ^. To add a closing bracket to a character class use []. To add a closing bracket to a negated character class use [^]. Once a character class has started, all subsequent characters (or character ranges) are added to the set, until the final closing bracket (]) has been reached.
The regular expressions listed above are grouped according to precedence, from highest precedence at the top to lowest at the bottom. From lowest to highest precedence, the operators are:
The lex standard defines concatenation as having a higher precedence than the interval expression. This is different from many other regular expression engines, and flexc++ follows these latter engines, giving all `multiplication operators' equal priority.
Name expansion has the same precedence as grouping (using parentheses to influence the precedence of the other operators in the regular expression). Since the name expansion is treated as a group in flexc++, it is not allowed to use the lookahead operator in a name definition (a named pattern, defined in the definition section).
Character classes can also contain character class expressions. These are expressions enclosed inside [: and :] delimiters, which themselves must appear between the [ and ] of the character class (other elements may occur inside the character class as well). The character class expressions are:
[:alnum:] [:alpha:] [:blank:] [:cntrl:] [:digit:] [:graph:] [:lower:] [:print:] [:punct:] [:space:] [:upper:] [:xdigit:]
Character class expressions designate a set of characters equivalent to the corresponding standard C isXXX function. For example, [:alnum:] designates those characters for which isalnum returns true - i.e., any alphabetic or numeric character. For example, the following character classes are all equivalent:
[[:alnum:]] [[:alpha:][:digit:]] [[:alpha:][0-9]] [a-zA-Z0-9]
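The equivalence of these classes can be checked against the standard C functions. The following minimal sketch assumes the default "C" locale and an ASCII character set; the function name classesAgree is hypothetical.

```cpp
#include <cctype>

// Returns true if, for all 7-bit characters, membership of [[:alnum:]],
// [[:alpha:][:digit:]] and [a-zA-Z0-9] coincides (default "C" locale,
// ASCII character set assumed).
bool classesAgree()
{
    for (int ch = 0; ch != 128; ++ch)
    {
        bool alnum = std::isalnum(ch) != 0;
        bool alphaOrDigit = std::isalpha(ch) != 0 || std::isdigit(ch) != 0;
        bool inRanges = (ch >= 'a' && ch <= 'z')
                        || (ch >= 'A' && ch <= 'Z')
                        || (ch >= '0' && ch <= '9');

        if (alnum != alphaOrDigit || alnum != inRanges)
            return false;
    }
    return true;
}
```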
A negated character class such as the example [^A-Z] above will match a newline unless \n (or an equivalent escape sequence) is one of the characters explicitly present in the negated character class (e.g., [^A-Z\n]). This differs from the way many other regular expression tools treat negated character classes, but unfortunately the inconsistency is historically entrenched. Matching newlines means that a pattern like [^"]* can match the entire input unless there's another quote in the input.
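The consequence for a pattern like [^"]* can be demonstrated with std::regex, whose negated character classes also match newlines. This is a sketch using the standard library and not flexc++-generated code; the function name upToQuote is hypothetical.

```cpp
#include <regex>
#include <string>

// Returns the longest prefix of `input` not containing a double quote,
// using the pattern [^"]* anchored at the start of the input. Newlines
// do not end the match.
std::string upToQuote(std::string const &input)
{
    static std::regex const noQuotes("[^\"]*");

    std::smatch result;
    std::regex_search(input, result, noQuotes,
                      std::regex_constants::match_continuous);
    return result.str();
}
```

For input consisting of two lines without any double quote, the complete input, including its newline, is returned.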
Flexc++ allows negation of character class expressions by prepending ^ to the POSIX character class name.
[:^alnum:] [:^alpha:] [:^blank:] [:^cntrl:] [:^digit:] [:^graph:] [:^lower:] [:^print:] [:^punct:] [:^space:] [:^upper:] [:^xdigit:]
The {-} operator computes the difference of two character classes. For example, [a-c]{-}[b-z] represents all the characters in the class [a-c] that are not in the class [b-z] (which in this case, is just the single character a). The {-} operator is left associative, so [abc]{-}[b]{-}[c] is the same as [a].
The {+} operator computes the union of two character classes. For example, [a-z]{+}[0-9] is the same as [a-z0-9]. This operator is useful when preceded by the result of a difference operation, as in, [[:alpha:]]{-}[[:lower:]]{+}[q], which is equivalent to [A-Zq] in the C locale.
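The two operators correspond to plain set operations, which can be sketched using std::bitset. The names CharSet, charRange, difference and unite are hypothetical and merely serve this illustration, which is restricted to 7-bit characters.

```cpp
#include <bitset>
#include <cassert>

typedef std::bitset<128> CharSet;   // one bit per 7-bit character

// Builds the set [first-last].
CharSet charRange(char first, char last)
{
    assert(first >= 0 && first <= last);

    CharSet set;
    for (int ch = first; ch <= last; ++ch)
        set.set(ch);
    return set;
}

// {-}: the characters in lhs that are not in rhs
CharSet difference(CharSet const &lhs, CharSet const &rhs)
{
    return lhs & ~rhs;
}

// {+}: the characters that are in lhs or in rhs
CharSet unite(CharSet const &lhs, CharSet const &rhs)
{
    return lhs | rhs;
}
```

Evaluating difference(charRange('a', 'c'), charRange('b', 'z')) leaves a set merely containing a, in line with the [a-c]{-}[b-z] example.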
A rule can have at most one instance of trailing context (the / operator or the $ operator). The start condition, ^, and <<EOF>> patterns can only occur at the beginning of a pattern, and cannot be surrounded by parentheses. The characters ^ and $ only have their special properties at, respectively, the beginning and end of regular expressions. In all other cases they are treated as normal characters.
    %option debug
    %x comment
    NAME    [[:alpha:]][_[:alnum:]]*
    %%
    "//".*              // ignore
    "/*"                begin(comment);
    <comment>.|\n       // ignore
    <comment>"*/"       begin(INITIAL);
    ^a                  return 1;
    a                   return 2;
    a$                  return 3;
    {NAME}              return 4;
    .|\n                // ignore
By default, flexc++ generates a file Scanner.h containing the initial interface of the scanner class performing the lexical scan according to the specifications given in flexc++'s input file. The name of the file that is generated can easily be changed using flexc++'s --class-header option. In this man-page we'll stick to using the default name.
The file Scanner.h is generated only once, unless an explicit request is made to rewrite it (using flexc++'s --force-class-header option).
The provided interface is very light-weight, primarily offering a link to the scanner's base class (see this manpage's sections 8.1 through 8.8).
Many of the facilities offered by the scanner class are inherited from the ScannerBase base class. Additional facilities offered by the Scanner class are covered below.
All symbols that are required by the generated scanner class end in two consecutive underscore characters (e.g., executeAction__). These names should not be redefined. As they are part of the Scanner and ScannerBase class their scope is immediately clear and confusion with identically named identifiers elsewhere is unlikely.
Some member functions do not use the underscore convention. These are the scanner class's constructors, or names that are similar or equal to names that have historically been used (e.g., length). Also, some functions offer hooks into the implementation (like preCode). The names of these latter functions do not end in underscores either.
With interactive scanners input stream switching or stacking is not available; switching output streams, however, is.
This constructor is not available with interactive scanners.
inline int Scanner::lex() { return lex__(); }
Caveat: with interactive scanners the lex function is defined in the generated lex.cc file. Once flexc++ has generated the scanner class header file this scanner class header file isn't automatically rewritten by flexc++. If, at some later stage, an interactive scanner must be generated, then the inline lex implementation must be removed `by hand' from the scanner class header file. Likewise, a lex member implementation (like the above) must be provided `by hand' if a non-interactive scanner is required after first having generated files implementing an interactive scanner.
    int Scanner::lex__()
    {
        ...
        preCode();

        while (true)
        {
            size_t ch = get__();            // fetch next char
            ...
            switch (actionType__(range))    // determine the action
            {
                ... maybe return
            }
            ... no return, continue scanning
            preCode();
        } // while
    }
Displaying is suppressed when the lex.cc file is (re)generated without using this directive. The function actually showing the tokens (ScannerBase::print__) is called from print, which is defined in-line in Scanner.h. Whether ScannerBase::print__ is called can therefore also easily be made to depend on an option of the program using the scanner object.
    #ifndef Scanner_H_INCLUDED_
    #define Scanner_H_INCLUDED_

    // $insert baseclass_h
    #include "Scannerbase.h"

    class Scanner: public ScannerBase
    {
        public:
            explicit Scanner(std::istream &in = std::cin,
                             std::ostream &out = std::cout);

            Scanner(std::string const &infile, std::string const &outfile);

            // $insert lexFunctionDecl
            int lex();

        private:
            int lex__();
            int executeAction__(size_t ruleNr);

            void preCode();     // re-implement this function for code to be
                                // exec'ed before the pattern matching starts
    };

    inline void Scanner::preCode()
    {
        // optionally replace by your own code
    }

    inline Scanner::Scanner(std::istream &in, std::ostream &out)
    :
        ScannerBase(in, out)
    {}

    inline Scanner::Scanner(std::string const &infile,
                            std::string const &outfile)
    :
        ScannerBase(infile, outfile)
    {}

    // $insert inlineLexFunction
    inline int Scanner::lex()
    {
        return lex__();
    }

    #endif // Scanner_H_INCLUDED_
By default, flexc++ generates a file Scannerbase.h containing the interface of the base class of the scanner class also generated by flexc++. The name of the file that is generated can easily be changed using flexc++'s --baseclass-header option. In this man-page we use the default name.
The file Scannerbase.h is generated at each new flexc++ run. It contains no user-serviceable or extensible parts. Rewriting can be prevented by specifying flexc++'s --no-baseclass-header option.
begin(StartCondition__::INITIAL);
There are no public constructors. ScannerBase is a base class for the Scanner class generated by flexc++. ScannerBase only offers protected constructors.
This member is not available with interactive scanners.
The current output stream is closed, and output is written to outfilename. If this file already exists, it is rewritten.
This member is not available with interactive scanners.
If outfilename == "-" then the standard output stream is used as the scanner's output medium; if outfilename == "" then the standard error stream is used as the scanner's output medium.
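The naming convention for outfilename can be sketched as a small helper function. This is not ScannerBase's implementation; the function selectOutput and its file parameter are hypothetical and merely show how the two special names could be distinguished from ordinary file names.

```cpp
#include <fstream>
#include <iostream>
#include <string>

// Hypothetical helper: "-" selects the standard output stream, an empty
// name selects the standard error stream. Any other name is opened as a
// file through `file`; an already existing file is rewritten.
std::ostream &selectOutput(std::string const &outfilename,
                           std::ofstream &file)
{
    if (outfilename == "-")
        return std::cout;

    if (outfilename.empty())
        return std::cerr;

    file.open(outfilename);         // truncates an existing file
    return file;
}
```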
This member is not available with interactive scanners.
This member is not available for interactive scanners.
All member functions ending in two underscore characters are for internal use only and should not be called by user-defined members of the Scanner class.
The following members, however, can safely be called by members of the generated Scanner class:
begin(StartCondition__::INITIAL);
    regex-to-match  {
                        if (int ret = memberFunction())
                            return ret;
                    }
The member leave removes the need for constructions like the above. The member leave can be called from within member functions encapsulating actions performed when a regular expression has been matched. It ends lex, returning retValue to its caller. The above rule can now be written like this:
    regex-to-match  memberFunction();
and memberFunction could be implemented as follows:
    void memberFunction()
    {
        if (someCondition())
        {
            // any action, e.g.,
            // switch mini-scanner
            begin(StartCondition__::INITIAL);

            leave(Parser::TOKENVALUE);  // lex returns TOKENVALUE
            // this point is never reached
        }
        pushStream(d_matched);          // switch to the next stream
                                        // lex continues
    }
The member leave should only (indirectly) be called (usually nested) from actions defined in the scanner's specification; calling leave outside of this context results in undefined behavior.
This member is not available with interactive scanners.
This member is not available with interactive scanners.
This member is not available with interactive scanners.
All protected data members are for internal use only, allowing lex__ to access them. All of them end in two underscore characters.
    Flex++ (old) | Flexc++ (new)
    lineno()     | lineNr()
    YYText()     | matched()
    less()       | accept()
Flexc++ generates a file Scannerbase.h defining the scanner class's base class, by default named ScannerBase (which is the name used in this man-page). The base class ScannerBase contains a nested class Input whose interface looks like this:
    class Input
    {
        public:
            Input();
            Input(std::istream *iStream, size_t lineNr = 1);
            size_t get();
            size_t lineNr() const;
            void reRead(size_t ch);
            void reRead(std::string const &str, size_t fmIdx);
            void close();
    };
The members of this class are all required and offer a level in between the operations of ScannerBase and flexc++'s actual input file that's being processed.
By default, flexc++ provides an implementation for all of Input's required members. Therefore, in most situations this man-page can safely be ignored.
However, users may define and extend their own Input class and provide flexc++'s base class with that Input class. To do so flexc++'s rules file must contain the following two directives:
    %input-implementation = "sourcefile"
    %input-interface = "interface"
Here, interface is the name of a file containing the class Input's interface. This interface is then inserted into ScannerBase's interface instead of the default class Input's interface. This interface must at least offer the above-mentioned members and constructors (their functions are described below). The class may contain additional members if required by the user-defined implementation. The implementation itself is expected in sourcefile. The contents of this file are inserted in the generated lex.cc file instead of Input's default implementation. The file sourcefile should probably not have a .cc extension to prevent its compilation by a program maintenance utility.
When the lexical scanner generated by flexc++ switches streams using the //include directive (see section 6.1. FILE SWITCHING) the input stream that's currently processed is pushed on an Input stack maintained by ScannerBase, and processing continues at the file named at the //include directive. Once the latter file has been processed, the previously pushed stream is popped off the stack, and processing of the popped stream continues. This implies that Input objects must be `stack-able'. The required interface is designed to satisfy this requirement.
The new input stream's line counter is set to lineNr, by default 1.
Flexc++'s default skeleton files are in /usr/share/flexc++.
By default, flexc++ generates the following files:
bisonc++(1)
Flexc++ was originally started as a programming project by Jean-Paul van Oosten and Richard Berendsen in the 2007-2008 academic year. After graduating, Richard left the project and moved to Amsterdam. Jean-Paul remained in Groningen, and after on-and-off activities on the project, in close cooperation with Frank B. Brokken, Frank undertook a rewrite of the project's code around 2010. During the development of flexc++, the lookahead-operator handling continuously threatened the completion of the project. By now, the project has evolved to a level that we feel it's defensible to publish the program, although we still tend to consider the program in its experimental stage; it will remain that way until we decide to move its version from the 0.9x.xx series to the 1.xx.xx series.