%%
'. The section above %%
contains option specifications and
definitions; the section below %%
contains the regular expressions (and
their (optional) actions). The general layout of flexc++'s input file,
therefore, looks like this:
    definitions
    %%
    rules
Optionally, a final line containing `%%
' may follow the rules. The
following sections cover the `definitions' and `rules' sections.
The definitions section may also contain declarations of named regular expressions. A named regular expression looks like this:
name pattern
Here, name
is an identifier, which may also contain hyphens (-
);
`pattern
' is a regular expression, see section 3.4. Patterns
start at the first non-blank character following the name, and end at the
line's last non-blank character. A named regular expression cannot contain
comment.
Finally, the definitions section may be used to declare mini-scanners (a.k.a. start conditions), cf. section 3.6. Start conditions are very useful for defining small `sub-languages' inside the language whose tokens must be recognized by the scanner. A commonly encountered example is the start condition recognizing C style multi-line comment.
=
characters. Arguments of directives are text,
surrounded by double quotes (strings), or embedded in raw string literals
(rawstrings). Double quotes or backslashes inside strings must themselves be
preceded by backslashes; these backslashes are not required when rawstrings
are used.
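The escaping rule for strings and rawstrings mirrors C++'s own string literals. The sketch below is a C++ analogy, not flexc++ input: it shows the same text written once with escaping backslashes and once as a raw string literal. The function names are hypothetical and serve only the illustration.

```cpp
#include <string>

// Ordinary literal: the double quotes and the backslash must each be
// preceded by a backslash.
std::string escapedForm()
{
    return "say \"hi\" \\ bye";
}

// Raw string literal: the same text, written without escaping backslashes.
std::string rawForm()
{
    return R"(say "hi" \ bye)";
}
```

Both functions return the same sequence of characters; only the notation differs.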
The %s
and %x
directives are immediately followed by name lists,
consisting of identifiers separated by blanks. Here is an example of the
definition of a directive:
%class-name = "MyScanner"
Directives accepting a `filename' do not accept path names, i.e., they
cannot contain directory separators (/
); options accepting a 'pathname'
may contain directory separators. A 'pathname' using blank characters should
be surrounded by double quotes.
Some directives may generate errors. This happens when a directive conflicts
with the contents of an existing file which flexc++ cannot modify (e.g., a
scanner class header file exists that doesn't define a namespace, while a
%namespace
directive was provided). To solve the error the offending
directive could be omitted, the existing file could be removed, or the
existing file could be hand-edited according to the directive's specification.
Note that flexc++ currently does not handle the opposite error condition: if a
previously used directive is omitted, then flexc++ does not detect the
inconsistency. In those cases you may encounter compilation errors.
= "filename"
--baseclass-header
.
It is an error if this directive is used and an already
existing scanner-class header file does not include
`filename'
.
Corresponding command-line option: --case-insensitive
.
When this directive is specified the resulting scanner does not distinguish between the following rules:
    First       // initial F is transformed to f
    first
    FIRST       // all capitals are transformed to lower case chars
With a case-insensitive scanner only the first rule can be matched, and flexc++ will issue warnings for the second and third rule about rules that cannot be matched.
Input processed by a case-insensitive scanner is also handled case
insensitively. The above mentioned First
rule is matched for
all of the following input words: first First FIRST firST
.
Although the matching process proceeds case insensitively, the
matched text (as returned by the scanner's matched()
member)
always contains the original, unmodified text. So, with the above
input matched()
returns, respectively first, First, FIRST
and firST
, while matching the rule First
.
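The behavior described above (case-insensitive comparison, unmodified matched text) can be pictured with a small stand-alone sketch. This is not how flexc++ implements matching; `equalsIgnoringCase` and `matchWord` are hypothetical helpers that only mimic the observable result for whole words.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Compare two words while ignoring case; neither argument is modified.
bool equalsIgnoringCase(std::string const &lhs, std::string const &rhs)
{
    if (lhs.size() != rhs.size())
        return false;

    for (std::size_t idx = 0; idx != lhs.size(); ++idx)
    {
        // cast to unsigned char: std::tolower requires a non-negative value
        if (std::tolower(static_cast<unsigned char>(lhs[idx])) !=
            std::tolower(static_cast<unsigned char>(rhs[idx])))
            return false;
    }
    return true;
}

// Hypothetical stand-in for a rule match: returns the original input word
// if it matches the rule text case-insensitively, otherwise an empty string.
std::string matchWord(std::string const &rule, std::string const &input)
{
    return equalsIgnoringCase(rule, input) ? input : std::string{};
}
```

Under these assumptions `matchWord("First", "firST")` returns `firST`: the comparison ignores case, but the returned text is the input as it was written.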
= "filename"
--class-header
.
= "className"
%name
directive used by
flex++(1). Contrary to flex++'s %name
declaration,
class-name
may appear anywhere in the first section of the
grammar specification file. It may be defined only once. If no
class-name
is specified the default class name (Scanner
)
is used. Corresponding command-line option:
--class-name
.
It is an error if this directive is used and an already
existing scanner-class header file does not define class
`className'
.
lex
and its support functions with debugging code,
showing the actual parsing process on the standard output
stream. When included, the debugging output is active by default,
but its activity may be controlled using the setDebug(bool
on-off)
member. Note that no #ifdef DEBUG
macros are used in
the generated code.
= "basename"
Scanner.h, Scanner.ih,
and
Scannerbase.h
files. E.g., when using the directive
    %filenames = "scanner"
the names of the generated files are, respectively,
scanner.h,
scanner.ih,
and scannerbase.h
. Corresponding command-line
option: --filenames
. The name of the source file (by default
lex.cc
) is controlled by the %lex-source
directive.
= "filename"
--implementation-header
.
It is an error if this directive is used and an already existing
'filename'
file does not include the scanner class header
file.
= "sourcefile"
Input
class.
= "interface"
Input
class. See section 17. THE CLASS INPUT
in the flexc++api(3) manual page for additional information
about user-defined Input
classes.
Scanner(std::istream &in, std::ostream &out)
constructor, by default assuming that input is read from
std::cin
. See also section 1. INTERACTIVE SCANNER
in the flexc++api(3) manual page.
= "funname"
lex
) is used. Corresponding command-line option:
--lex-function-name
.
= "filename"
lex
. Corresponding command-line option: --lex-source
.
#line
preprocessor directives in the file containing
the scanner's lex
function. If omitted #line
directives
are added to this file, unless overridden by the command line
options --lines
and --no-lines
.
= "identifier"
identifier
. By
default no namespace is used. If this directive is used the
implementation header is provided with a commented out using
namespace
declaration for the requested namespace. In addition,
the scanner and scanner base class header files also use the
specified namespace to define their include guard directives.
It is an error if this directive is used and an already existing
scanner-class header file does not define namespace
identifier
.
lex
's caller. Displaying is suppressed again when
the lex.cc
file is generated without using this directive. The
function showing the tokens (ScannerBase::print__
) is called
from Scanner::print()
, which is defined in-line in
Scanner.h
. Calling ScannerBase::print__
, therefore, can
also easily be controlled by an option controlled by the program
using the scanner object.
This directive does not show the tokens returned and text
matched by flexc++ itself when reading its input. If that is
what you want, use the --own-tokens
option.
namelist
%s
directive is followed by a list of one or more
identifiers, separated by blanks. Each identifier is the name of
an inclusive start condition.
= "pathname"
pathname
rather than the default (e.g.,
/usr/share/flexc++
) path when looking for flexc++'s skeleton
files. Corresponding command-line option:
--skeleton-directory
.
= "pathname"
Pathname
defines the directory where generated files should be
written. By default this is the directory where flexc++ is
called. This directive is overruled by the --target-directory
command-line option.
namelist
%x
directive is followed by a list of one or more
identifiers, separated by blanks. Each identifier is the name of
an exclusive start condition.
pattern action
Action is optional, and is separated from pattern by spaces and/or tabs. It consists of a single-line C++-statement, or it consists of a compound statement that may span several lines.
Alternatively, an action may consist of a vertical bar (`|'). A vertical bar indicates that pattern uses the same action as the next rule.
/* ... */
) and
C++ style end-of-line comment (i.e., // ...
) can be used. Indentation
is optional.
When comment is encountered outside of an action, flexc++ discards the comment, while all comment provided in the context of actions is copied verbatim to the generated source file.
Comment cannot be used when defining named regular expressions in the definitions section.
x
[xyz]
[abj-oZ]
[^A-Z]
[^A-Z\n]
[:predef:]
[[:alnum:]]
);
s1{+}s2
s1
and s2
are character classes: the union of the characters in
s1
and s2
;
s1{-}s2
s1
and s2
are character classes: the set-difference of the
characters in s1
minus the characters in s2
;
"[xyz]\"foo"
[xyz]"foo
';
R"([xyz]\"foo)"
[xyz]\"foo
' (using a raw string
literal). Raw string literals using labels (which must be identifiers,
e.g., R"label( labelled raw string )label")
are also supported;
R"label("(xyz"))label"
"(xyz")
' (using a labeled rawstring);
\X
\0
\123
\x2a
(r)
r
by itself. It is used to override precedence
(see below);
{name}
r*
r
s, where r is any regular expression;
r+
r
s;
r?
r
s (that is, an optional r);
rs
r{m, n}
1 <= m <= n
: match `r' at least m, but at most n times; called
interval expression;
r{m,}
1 <= m
: match `r' m or more times;
r{m}
1 <= m
: match `r' exactly m times;
r|s
r/s
/
-character is commonly referred to as the lookahead
operator.
A warning is generated when the r
-pattern may match no text. This is a
potentially dangerous situation. Consider this pattern
    a*/b
with input
b
. This input matches a*/b
, but b
is pushed back on
to the input stream. Then the process is repeated, resulting in a
continuous loop.
If flexc++ detects patterns potentially not matching any text it generates warnings like this:
    [Warning] input, line 7: null-matching regular expression
By placing the comment
    //%nowarn
on the line just before a regular expression that potentially does not match any text, the warning for that regular expression is suppressed;
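Why a null-matching lookahead pattern is dangerous can be made concrete with a hand-written simulation. The sketch below is not generated by flexc++; `matchAStarBeforeB` and `scanStalls` are hypothetical helpers that imitate a scanner whose only rule is a*/b, with an artificial step limit standing in for the endless loop.

```cpp
#include <cstddef>
#include <string>

// Hand-written stand-in for the pattern a*/b: returns the number of 'a'
// characters consumed at 'pos' if they are followed by a 'b', or
// std::string::npos if the pattern does not match there.
std::size_t matchAStarBeforeB(std::string const &input, std::size_t pos)
{
    std::size_t end = pos;
    while (end < input.size() && input[end] == 'a')
        ++end;

    if (end < input.size() && input[end] == 'b')
        return end - pos;       // the 'b' is lookahead: it is not consumed

    return std::string::npos;
}

// Repeatedly applies the pattern, as a scanner having only this rule would.
// Returns true if a match consumed no characters, so that the next attempt
// starts at the same position again.
bool scanStalls(std::string const &input, std::size_t maxSteps)
{
    std::size_t pos = 0;
    for (std::size_t step = 0; step != maxSteps; ++step)
    {
        std::size_t consumed = matchAStarBeforeB(input, pos);
        if (consumed == std::string::npos)
            return false;       // no match: another rule would take over
        if (consumed == 0)
            return true;        // matched nothing: no progress is made
        pos += consumed;
    }
    return false;
}
```

With the input `b` the very first match consumes nothing, so the scan position never advances. The same happens for `ab` once the leading `a` has been consumed.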
^r
r
appears
elsewhere on a line it isn't matched by this rule; if the ^
-character
is not the first character of a regular expression it is interpreted as a
plain ^
-character;
r$
r$
is equivalent
to the expression `r/\n
'. When r
appears
elsewhere on a line it isn't matched by this rule; if the $
-character
is not the last character of a regular expression it is interpreted as a
plain $
-character. A dollar-terminated regular expression, however,
may be followed by an action or vertical bar indicating that the regular
expression uses the same action as the next rule;
<s>r
<s1,s2,s3>r
<*>r
r
is used in any start condition;
<sc-list>{compound rules}
<<EOF>>
<sc-list><<EOF>>
Character classes
Inside a character class all regular expression operators lose their special
meanings, except for the escape character (\
), the character range
operator -
, the end of character class operator ]
, and, at the
beginning of the class, ^
. All ordinary escape sequences are supported,
all other escaped characters are interpreted as literal characters (e.g.,
\c
is a literal c
).
To add a closing bracket to a character class use []
or \]
. To add a
closing bracket to a negated character class use [^]
(or use [^
followed by \]
somewhere within the character class). Minus characters are
used to define character ranges (e.g., [a-d]
, defining [abcd]
) except
in the following cases, where flexc++ recognizes a literal minus character:
[-
, or [^-
(a minus at the very beginning of a character class);
-]
(a minus at the very end of a character class);
or \-
(an escaped minus character)
]
) has been reached.
Operator precedence
The operators used in specifying regular expressions have the following priorities (listed from lowest to highest):
|
^r
and r$
:^
: at the beginning of a regular expression r
: r
only matches
when encountered at the beginning of a line;$
: at the end of a regular expression r
: r
only matches when
encountered at the end of a line;
/
|
rs
r
and s
;
multipliers
*, +, ?
and the interval specification (i.e., {...}
);
{+}, {-}
(r)
Different from the lex-standard, but in line with most other regular
expression engines the interval operator is given higher precedence than
concatenation. To require two repetitions of the word hello
use
(hello){2}
rather than hello{2}
, which to flexc++ is identical to the
regular expression helloo
.
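Since this precedence matches that of most other engines, it can be tried out with C++'s `std::regex` (ECMAScript grammar). Note that this is a different engine than the one generated by flexc++, so the sketch only illustrates the shared precedence rule; `fullMatch` is a hypothetical helper.

```cpp
#include <regex>
#include <string>

// True if 'text' is matched in full by 'pattern' (ECMAScript grammar,
// which is std::regex's default).
bool fullMatch(std::string const &pattern, std::string const &text)
{
    std::regex re(pattern);
    return std::regex_match(text, re);
}
```

Here `fullMatch("hello{2}", "helloo")` and `fullMatch("(hello){2}", "hellohello")` both succeed, whereas `fullMatch("hello{2}", "hellohello")` fails, because the interval applies to the final `o` only.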
Named regular expressions have the same precedence as parenthesized regular expressions. So after
    WORD    xyz[a-zA-Z]
    %%
    {WORD}{2}
the input
xyzaxyzb
is matched, whereas xyzab
isn't.
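One way to picture this rule is to treat a named regular expression as if its definition were substituted between parentheses. The sketch below does exactly that and then checks the result with `std::regex`. It is an illustration under that assumption, not flexc++'s actual implementation, and `expandNames` and `matchesExpanded` are hypothetical helpers.

```cpp
#include <cstddef>
#include <map>
#include <regex>
#include <string>

// Replace every {name} that is a key of 'names' by its parenthesized
// definition; anything else (such as the interval {2}) is copied unchanged.
std::string expandNames(std::string const &pattern,
                        std::map<std::string, std::string> const &names)
{
    std::string result;
    std::size_t idx = 0;
    while (idx < pattern.size())
    {
        if (pattern[idx] == '{')
        {
            std::size_t close = pattern.find('}', idx);
            if (close != std::string::npos)
            {
                auto iter =
                    names.find(pattern.substr(idx + 1, close - idx - 1));
                if (iter != names.end())
                {
                    result += '(' + iter->second + ')';
                    idx = close + 1;
                    continue;
                }
            }
        }
        result += pattern[idx++];
    }
    return result;
}

// True if 'text' is matched in full by the expanded pattern.
bool matchesExpanded(std::string const &pattern,
                     std::map<std::string, std::string> const &names,
                     std::string const &text)
{
    std::regex re(expandNames(pattern, names));
    return std::regex_match(text, re);
}
```

With `WORD` defined as `xyz[a-zA-Z]`, the pattern `{WORD}{2}` expands to `(xyz[a-zA-Z]){2}`, which matches `xyzaxyzb` but not `xyzab`.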
In addition to characters and ranges of characters, character classes can also
contain predefined character sets. These consist of certain names between
[:
and :]
delimiters. The predefined character sets are:
    [:alnum:]   [:alpha:]   [:blank:]   [:cntrl:]
    [:digit:]   [:graph:]   [:lower:]   [:print:]
    [:punct:]   [:space:]   [:upper:]   [:xdigit:]
These predefined sets designate sets of characters equivalent to the corresponding standard C
isXXX
function. For example, [:alnum:]
defines all characters for which isalnum
returns true.
As an illustration, the following character classes are equivalent:
[[:alnum:]] [[:alpha:][:digit:]] [[:alpha:][0-9]] [a-zA-Z0-9]
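The equivalence can be checked directly with the C classification functions. The sketch below assumes the default "C" locale and an ASCII-compatible character set; `inExplicitRanges` and `classesAgree` are hypothetical helpers.

```cpp
#include <cctype>

// Explicit spelling of [a-zA-Z0-9].
bool inExplicitRanges(int ch)
{
    return (ch >= 'a' && ch <= 'z') ||
           (ch >= 'A' && ch <= 'Z') ||
           (ch >= '0' && ch <= '9');
}

// True if the spellings select the same characters for every character
// code in 0..255 (checked in the currently active C locale).
bool classesAgree()
{
    for (int ch = 0; ch != 256; ++ch)
    {
        bool alnum = std::isalnum(ch) != 0;         // [[:alnum:]]
        bool alphaOrDigit = std::isalpha(ch) != 0   // [[:alpha:][:digit:]]
                            || std::isdigit(ch) != 0;

        if (alnum != alphaOrDigit || alnum != inExplicitRanges(ch))
            return false;
    }
    return true;
}
```

In other locales `isalpha` may accept additional characters, in which case the explicit-range spelling is no longer equivalent.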
Note that a negated character class like [^A-Z]
matches a newline unless
\n
(or an equivalent escape sequence) is one of the characters explicitly
present in the negated character class (e.g., [^A-Z\n]
). This differs from
the way many other regular expression engines treat negated character classes.
Matching newlines means that a pattern like [^"]*
can match the entire
input unless there's another quote in the input.
Flexc++ allows negation of character class expressions by prepending ^
to
the name of a predefined character set. Here are the negated predefined
character sets:
[:^alnum:] [:^alpha:] [:^blank:] [:^cntrl:] [:^digit:] [:^graph:] [:^lower:] [:^print:] [:^punct:] [:^space:] [:^upper:] [:^xdigit:]
The `{+}
' operator computes the union of two character classes. For
example, [a-z]{+}[0-9]
is the same as [a-z0-9]
.
The `{-}
' operator computes the difference of two character classes. For
example, [a-c]{-}[b-z]
represents all the characters in the class
[a-c]
that are not in the class [b-z]
(which in this case, is just the
single character `a
').
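Both operators are ordinary set operations on sets of characters. A minimal sketch, assuming 8-bit characters and using `std::bitset` as the set representation (the names `CharSet`, `range`, `unite`, `subtract` and `members` are hypothetical and not part of flexc++):

```cpp
#include <bitset>
#include <string>

using CharSet = std::bitset<256>;

// Build the set containing the inclusive character range first-last.
CharSet range(char first, char last)
{
    CharSet result;
    for (int ch = static_cast<unsigned char>(first);
             ch <= static_cast<unsigned char>(last); ++ch)
        result.set(ch);
    return result;
}

// s1{+}s2: all characters that are in s1 or in s2.
CharSet unite(CharSet const &s1, CharSet const &s2)
{
    return s1 | s2;
}

// s1{-}s2: all characters of s1 that are not in s2.
CharSet subtract(CharSet const &s1, CharSet const &s2)
{
    return s1 & ~s2;
}

// List the members of a set in ascending character-code order.
std::string members(CharSet const &set)
{
    std::string result;
    for (int ch = 0; ch != 256; ++ch)
    {
        if (set.test(ch))
            result += static_cast<char>(ch);
    }
    return result;
}
```

In this model `members(subtract(range('a', 'c'), range('b', 'z')))` yields the single character `a`, in line with the example above.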
A rule can have at most one instance of trailing context (the /
operator
or the $
operator). The start condition, ^
, and <<EOF>>
patterns can only occur at the beginning of a pattern, and, as well as with
/
and $
, cannot be grouped inside parentheses. A ^
which does not
occur at the beginning of a rule or a $
which does not occur at the end of
a rule loses its special properties and is treated as a normal character.
The following are invalid:
    foo/bar$
    <sc1>foo<sc2>bar
Note that the first of these can be rewritten `foo/bar\n'.
If the desired meaning is a `foo' or a `bar'-followed-by-a-newline, the following could be used (the special | action is explained below, see section 3.5):
    foo     |
    bar$    /* action goes here */
A comparable definition can be used to match a `foo' or a `bar'-at-the-beginning-of-a-line.
Specifications of patterns end at the first unescaped white space character;
the action then starts at the first non-white space character. It usually
contains C++ code, with two exceptions: the empty and the bar (|
)
action (see below). If the C++ code starts with a brace ({
), the action
can span multiple lines until the matching closing brace (}
) is
encountered. Flexc++ correctly handles braces in strings and comments.
Actions can be empty (omitted). Empty actions discard the matched pattern. To avoid confusion it is advised to provide at least a simple comment stating that the matched input is ignored.
The bar action is an action containing only a single vertical bar (|
).
This tells flexc++ to use the action of the next rule. This can be repeated so
the following rules all use the same action:
    a   |
    b   |
    c   std::cout << "Matched " << matched() << "\n";
Actions can return an
int
value, which is usually interpreted as a
token by the program calling the scanner's lex
member. When lex
is
called after it has returned it continues its pattern-matching process just
beyond the last-matched point in the input stream.
For flexible handling of these sub-languages flexc++, like flex, offers start conditions, a.k.a. mini scanners. A start condition can be declared in the definition section of the lexer file:
    %x string
    %%
    ...
A
%x
is used to declare exclusive start conditions. Following
%x
a list (no commas) of start condition names is expected. Rules
specified for exclusive start conditions only apply to that particular mini
scanner. It is also possible to define inclusive start conditions using
%s
. Rules not explicitly associated with a start condition (or with the
(default) start condition StartCondition__::INITIAL
) also apply to
inclusive start conditions.
A start condition is used in the rules section of the lexical scanner specification file as indicated in section 3.4. Here is a concrete example:
    %x string
    %%
    \"          {
                    more();
                    begin(StartCondition__::string);
                }
    <string>{
        \"      {
                    begin(StartCondition__::INITIAL);
                    return Token::STRING;
                }
        \\.|.   more();
    }
This tells flexc++ that the double quote starts (begins) the
StartCondition__::string
start condition. The string
start condition's
rules then define what happens to double quoted strings. All its characters
are collected, and eventually the string's content is returned by
matched()
.
By default, scanners generated by flexc++ start in the
StartCondition__::INITIAL
start condition. When encountering a double
quote, the scanner switches to the StartCondition__::string
mini
scanner. Now, only the rules that are defined for the string
start
condition are active. Once flexc++ encounters an unescaped double quote, it
switches back to the StartCondition__::INITIAL
start condition and returns
Token::STRING
to its caller, indicating that it has seen a C string.
string
start condition once again,
now using explicit start condition specifications:
    %x string
    %%
    \"              {
                        more();
                        begin(StartCondition__::string);
                    }
    <string>\"      {
                        begin(StartCondition__::INITIAL);
                        return Token::STRING;
                    }
    <string>\\.|.   more();
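What the string mini scanner does can also be written out by hand as a two-state machine. The sketch below is not code generated by flexc++: `Condition` and `nextString` are hypothetical names, the accumulated text plays the role of the text kept by more(), and as a simplification newlines are treated like any other character.

```cpp
#include <cstddef>
#include <string>

enum class Condition { INITIAL, STRING };

// Scans 'input' from 'pos' for the next double-quoted string and returns
// it including its quotes, or returns an empty string if no complete
// string is found. 'pos' is left just beyond the last character read.
std::string nextString(std::string const &input, std::size_t &pos)
{
    Condition condition = Condition::INITIAL;
    std::string matched;

    while (pos < input.size())
    {
        char ch = input[pos++];

        if (condition == Condition::INITIAL)
        {
            if (ch == '"')                  // rule: \"
            {
                matched = ch;               // keep the quote, like more()
                condition = Condition::STRING;
            }
            // other characters are skipped in this sketch
        }
        else if (ch == '"')                 // rule: <string>\"
        {
            matched += ch;
            return matched;                 // back to INITIAL, return token
        }
        else if (ch == '\\' && pos < input.size())
        {                                   // rule: <string>\\.
            matched += ch;
            matched += input[pos++];
        }
        else                                // rule: <string>.
            matched += ch;
    }
    return std::string{};                   // no (terminated) string found
}
```

An escaped double quote inside the string is consumed by the `\\.` alternative, so it does not end the string.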
The Scanner
class offers the following members, which can be called from
within actions (or by members called from
those actions):
accept(n)
returns all but the first `nChars' characters of the
current token back to the input stream, where they will be rescanned
when the scanner looks for the next match. So, it matches `nChars' of
the characters in the input buffer, rescanning the rest. This function
effectively sets length
's return value to nChars
(note: with
flex++ this function was called less
);
StartCondition__ startCondition
. As this enumeration is a strongly
typed enum the StartCondition__
scope must be specified as
well. E.g.,
begin(StartCondition__::INITIAL);
true
if --debug
or %debug
was specified, otherwise
false
.
matched
) is inserted into the scanner object's output stream;
lex
. With
flex++ this function was called leng
.
lineno
) after using the %lineno
option).
lex
(note: flex++ offers a similar
member called YYText
).
true
is
returned, otherwise (e.g., when the stream stack is empty) false
is returned;
Scanner.h
. It can safely be replaced by a user-defined
implementation. This function is called by lex__
, just before it
starts to match input characters against its rules: preCode
is called
by lex__
when lex__
is called and also after having executed the
actions of a rule which did not execute a return
statement. The
outline of lex__
's implementation looks like this:
    int Scanner::lex__()
    {
        ...
        preCode();

        while (true)
        {
            size_t ch = get__();            // fetch next char
            ...
            switch (actionType__(range))    // determine the action
            {
                ... maybe return
            }
            ... no return, continue scanning
            preCode();
        } // while
    }
Scanner.h
. It can safely be replaced by a user-defined
implementation. This function is called by lex__
, just after a rule
has been matched, where PostEnum__
's value indicates the
characteristic of the matched rule. PostEnum__
has four values:
lex__
immediately returns 0
once postCode
returns, indicating the end of the input was
reached;
lex__
doesn't return, it simply
continues processing the previously pushed stream;
lex__
immediately returns
once postCode
returns, returning the next token;
lex__
has matched a
non-returning rule, and continues its rule-matching process.
ch
is pushed back onto the input stream. I.e., it will be
the character that is retrieved at the next attempt to obtain a
character from the input stream;
txt
are pushed back onto the input
stream. I.e., they will be the characters that are retrieved at the
next attempt to obtain characters from the input stream. The
characters in txt
are retrieved from the first character to the
last. So if txt == "hello"
then the 'h'
will be the character
that's retrieved next, followed by 'e'
, etc, until 'o'
;
curStream
on the stream stack;
This member is not available with interactive scanners.
curName
is opened first, and the resulting
istream
is pushed on the stream stack;
This member is not available with interactive scanners.
accept
but its argument counts backward from
the end of the matched text. All but these nChars
characters are
kept and the last nChars
characters are rescanned. This function
effectively reduces length
's return value by nChars
;
true
or
false
. Switching on debugging output only has visible effects if the
debug
option has been specified when generating lex.cc
;
filename
to
name
;
text
in the matched text buffer. Following a
call to this function matched
returns text
;
in
, writing output to out
. This is
not a stack-operation: after processing in
processing
does not return to the original stream.
When flexc++ generates an interactive scanner, this member is available (as a protected member). However, it should be considered an internal use only member;
infilename
processing does not return to the original stream.
infilename
processing does not return to the original stream.
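The effect of the accept, redo and push members on the matched text and on the characters that are read next can be pictured with a toy model. `ToyScanner` below is a hypothetical illustration of the described semantics only; it is not the class generated by flexc++ and omits everything else.

```cpp
#include <cstddef>
#include <string>

// Toy model of the matched text and of the characters waiting to be
// (re)scanned.
struct ToyScanner
{
    std::string matched;    // text matched by the current rule
    std::string pending;    // characters read next, front character first

    // Keep the first nChars characters of the match, rescan the rest.
    void accept(std::size_t nChars)
    {
        if (nChars < matched.size())
        {
            pending.insert(0, matched, nChars, std::string::npos);
            matched.resize(nChars);
        }
    }

    // Keep all but the last nChars characters of the match, rescan those.
    void redo(std::size_t nChars)
    {
        accept(nChars < matched.size() ? matched.size() - nChars : 0);
    }

    // Push txt back: its first character is the next one to be read.
    void push(std::string const &txt)
    {
        pending.insert(0, txt);
    }
};
```

Starting from the match `hello` with ` world` still to be read, `accept(2)` leaves `he` as the match and `llo world` as the text read next. A subsequent `redo(1)` reduces the match to `h`.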
ScannerBase
class. The pattern-matching algorithm
retrieves the next character from a class Input
, nested under
ScannerBase
. This class will usually provide all the required
functionality, but users of flexc++ may optionally provide their own Input
class.
In situations where the default Input
implementation doesn't suffice
simply `roll your own', implementing the following interface and using the
%option input-interface
and %option input-implementation
options in
the lexer
file to include, respectively, your own class Input
interface in the generated Scannerbase.h
file and Input
member
function implementations in the generated lex.cc
file.
When implementing your own class Input
, the following public interface
must at least be provided:
    class Input
    {
        public:
            Input();
            Input(std::istream *iStream);   // dynamically allocated iStream
            size_t get();                   // the next character
            size_t lineNr() const;
            size_t nPending() const;
            void setPending(size_t nPending);
            void reRead(size_t ch);         // push back 'ch' (if <= 0x100)
                                            // push back str from idx 'fmIdx'
            void reRead(std::string const &str, size_t fmIdx);
            void close();                   // delete dynamically allocated
    };
This interface may be augmented with additional members, but the shown interface is used by
ScannerBase
. Flexc++ places Input
in
ScannerBase
's private interface and all communication with Input
is
handled by ScannerBase
. Input
's members must perform the following
tasks:
Input()
: the default constructor performs no special tasks, it
ensures that an Input
object is in a valid state, in particular allowing
close
to do its job.
Input
's interface. The default implementation uses Input
's
default copy constructor so there was no need to add it explicitly to the
interface.
Input(std::istream *iStream)
: information is read from the
istream
which is passed to Input
. The istream to which iStream
points was
dynamically allocated, is open and is ready for reading. Stream switching is
not an act performed by Input
, but by ScannerBase
. Also the names of
streams currently being read (e.g., when using //include
directives in
specification files) are administered and maintained by
ScannerBase
. Although iStream
points to a dynamically allocated piece
of memory Input
should treat the pointer as plain old data (POD). No
copy constructor, overloaded assignment operator or destructor is required to
process the pointer. In the default implementation iStream
is assigned to
one of Input
's data members, and is simply copied when Input
's copy
constructor or assignment operator is called, and ignored by its default
destructor. Externally provided implementations may handle the pointer
comparably.
size_t get()
: this member must return the next character as an
unsigned char
. At end-of-file it must return the value (predefined by
ScannerBase
) AT_EOF
.
size_t lineNr() const
: the line number of the currently processed
line should be returned. By convention line numbers start at 1, so while processing
the first line lineNr
should return 1.
size_t nPending() const
: should return the number of pending
characters (i.e., the number of characters which were passed back to the
Input
object using its reRead
members which were not yet retrieved
again by its get
member).
void setPending(size_t nPending)
: should remove nPending
characters from the head of the Input
object's pending input queue. The
lexical scanner always passes the value received from nPending
to
setPending
, without calling get
in between.
void reRead(size_t ch)
: the character stored in ch
is pushed back
into the Input
object. The call should be ignored if ch
exceeds the
value 0xff
.
void reRead(std::string const &str, size_t fmIdx)
: the characters in
str
are pushed back into the Input
object in reverse order from
str
's final character down to (and including) the character at offset
fmIdx
.
void close()
: this member must delete the memory to which iStream
points, en passant closing the stream. It is called by
ScannerBase::popStream
at end-of-file.
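The member descriptions above can be combined into a minimal stand-alone sketch of a user-defined Input class. It is not flexc++'s default implementation. The value used for `AT_EOF` is an assumption (the real value is predefined by ScannerBase), the line counting scheme is a simple choice made for this sketch, and `fromText` is a hypothetical helper added only to make the class easy to try out.

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <istream>
#include <sstream>
#include <string>

class Input
{
    std::deque<unsigned char> d_pending;    // pushed-back characters
    std::istream *d_in = nullptr;           // treated as plain old data
    std::size_t d_lineNr = 1;               // first line is line 1

    public:
        // Assumption: stand-in for the AT_EOF value defined by ScannerBase.
        enum : std::size_t { AT_EOF = 0x100 };

        Input() = default;
        explicit Input(std::istream *iStream) : d_in(iStream) {}

        std::size_t get()                   // the next character or AT_EOF
        {
            std::size_t ch;
            if (!d_pending.empty())         // pending characters come first
            {
                ch = d_pending.front();
                d_pending.pop_front();
            }
            else
            {
                if (d_in == nullptr)
                    return AT_EOF;
                int next = d_in->get();
                if (next == std::istream::traits_type::eof())
                    return AT_EOF;
                ch = static_cast<unsigned char>(next);
            }
            if (ch == '\n')
                ++d_lineNr;
            return ch;
        }

        std::size_t lineNr() const { return d_lineNr; }
        std::size_t nPending() const { return d_pending.size(); }

        void setPending(std::size_t nPending)   // drop from the queue's head
        {
            std::size_t count = std::min(nPending, d_pending.size());
            d_pending.erase(d_pending.begin(),
                d_pending.begin() + static_cast<std::ptrdiff_t>(count));
        }

        void reRead(std::size_t ch)         // push back one character
        {
            if (ch > 0xff)                  // e.g., AT_EOF: ignored
                return;
            if (ch == '\n')
                --d_lineNr;
            d_pending.push_front(static_cast<unsigned char>(ch));
        }

        // push back str's characters from the last one down to fmIdx
        void reRead(std::string const &str, std::size_t fmIdx)
        {
            for (std::size_t idx = str.size(); idx-- > fmIdx; )
                reRead(static_cast<unsigned char>(str[idx]));
        }

        void close()                        // delete the allocated stream
        {
            delete d_in;
            d_in = nullptr;
        }
};

// Hypothetical helper: an Input reading from an in-memory text.
Input fromText(std::string const &text)
{
    return Input(new std::istringstream(text));
}
```

Because the characters are pushed back in reverse order, they are retrieved in their original order: after `reRead("xyz", 1)` the next calls to `get` return `y` and then `z`.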