flexc++

flexc++.1.07.00.tar.gz

2008-2013


flexc++(1)

flexc++.1.07.00.tar.gz flexc++ scanner generator

NAME

flexc++ - Generate a C++ scanner class and parsing function

SYNOPSIS

flexc++ [options] rules-file

DESCRIPTION

Flexc++(1) was designed after flex(1) and flex++(1). Like those two programs, flexc++ generates code that performs pattern matching on text and can execute actions when certain regular expressions are recognized.

Flexc++, contrary to flex and flex++, generates code that is explicitly intended for use by C++ programs. The well-known flex(1) program generates C source-code and flex++(1) merely offers a C++-like shell around the yylex function generated by flex(1) and hardly supports present-day ideas about C++ software development.

In contrast, flexc++ creates a C++ class offering a predefined member function lex, which matches input against regular expressions and can execute C++ code once a regular expression has been matched. The code generated by flexc++ is pure C++, allowing its users to apply all of the features offered by that language.

Below, the following sections may be consulted for specific details:

1. QUICK START

A bare-bones, no-frills scanner is generated as follows:

  • Run flexc++ on a rules file (flexc++ rules-file). This generates four files: Scanner.h, Scanner.ih, Scannerbase.h, and lex.cc

  • Edit Scanner.h, add the enum defining the token-symbols in (usually) the public section of the class Scanner. E.g.,
    
    class Scanner: public ScannerBase
    {
        public:
            enum Tokens
            {
                IDENTIFIER = 0x100,
                NUMBER
            };
        // ... (etc, as generated by flexc++)
    };
            
    

  • Create a file defining int main, e.g.:
    
    #include <iostream>
    #include <string>
    #include "Scanner.h"
    
    using namespace std;
    
    int main()
    {
        Scanner scanner;        // define a Scanner object
    
        while (int token = scanner.lex())   // get all tokens
        {
            string const &text = scanner.matched();
            switch (token)
            {
                case Scanner::IDENTIFIER:
                    cout << "identifier: " << text << '\n';
                break;
    
                case Scanner::NUMBER:
                    cout << "number: " << text << '\n';
                break;
    
                default:
                    cout << "char. token: `" << text << "'\n";
                break;
            }
        }
    }
            
    
  • Compile all .cc files:
    
        g++ --std=c++11 *.cc
            
    

  • To `tokenize' main.cc, execute:
    
        a.out < main.cc
            
    

    2. QUICK START: FLEXC++ and BISONC++

    To interface flexc++ to the bisonc++(1) parser generator proceed as follows:

    3. GENERATED FILES

    Flexc++ generates four files from a well-formed input file:

    4. OPTIONS

    Where available, single letter options are listed between parentheses following their associated long-option variants. Single letter options require arguments if their associated long options require arguments as well. Options affecting the class header or implementation header file are ignored if these files already exist. Options accepting a `filename' do not accept path names, i.e., they cannot contain directory separators (/); options accepting a `pathname' may contain directory separators.

    Some options may generate warnings. This happens when an option conflicts with the contents of a file which flexc++ cannot modify (e.g., a scanner class header file exists and doesn't define a namespace, but a --namespace option was provided). In those cases the option is ignored, and hand-editing may then be required to obtain the option's effect.

    5. INTERACTIVE SCANNERS

    An interactive scanner postpones scanning until an end-of-line character has been received, and then reads all the information that has been entered on the line so far. Flexc++ supports the %interactive directive, generating an interactive scanner. Here it is assumed that Scanner is the name of the scanner class generated by flexc++.

    Caveat: generating interactive and non-interactive scanners should not be mixed as their class organizations fundamentally differ, and several of the Scanner class's members are only available in the non-interactive scanner. As the Scanner.h file contains the Scanner class's interface, which is normally left untouched by flexc++, flexc++ cannot adapt the Scanner class when requested to change the interactivity of an existing Scanner class. Because of this, support for the --interactive option was discontinued at flexc++'s 1.01.00 release.

    The interactive scanner generated by flexc++ has the following characteristics:

    This implementation allows code calling Scanner::lex() to conclude, as usual, that the input is exhausted when lex returns 0.

    Here is an example of how such a scanner could be used:

    6. SPECIFICATION FILE(S)

    Flexc++ expects an input file containing directives and the regular expressions that should be recognized by objects of the scanner class generated by flexc++. In this man page the elements and organization of flexc++'s input file are described.

    Flexc++'s input file consists of two sections, separated from each other by a line merely containing two consecutive percent characters:

    
    %%
        
    
    The section before this separator contains directives; the section following this separator contains regular expressions and possibly actions to perform when these regular expressions are matched by the object of the scanner class generated by flexc++.

    White space is usually ignored, as is comment. Comment may be of the traditional C form (i.e., /*, followed by (possibly multi-line) comment text, followed by */), or it may be C++ end-of-line comment: two consecutive slashes (//) start the comment, which continues up to the next newline character.

    6.1. FILE SWITCHING

    Flexc++'s input file may be split into multiple files. This allows for the definition of logically separate elements of the specifications in different files. Include directives must be specified on a line of their own. To switch to another specification file the following stanza is used:

    
    //include file-location
            
    
    The //include directive starts in the line's first column. File locations can be absolute or relative to the location of the file containing the //include directive. White space characters following //include and before the end of the line are ignored. The file specification may be surrounded by double quotes, but these double quotes are not required and are ignored (removed) if present. All remaining characters are expected to define the name of the file where flexc++'s rules specifications continue. Once end of file of a sub-file has been reached, processing continues at the line beyond the //include directive of the previously scanned file. The end-of-file of the file that was initially specified when flexc++ was called indicates the end of flexc++'s rules specification.
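
    The way an //include line is interpreted can be illustrated by a small stand-alone function. This is only a sketch of the rules stated above, not flexc++'s own code: the function name includeTarget is hypothetical, and requiring at least one white space character between //include and the file location is an assumption.

```cpp
#include <string>

// Hypothetical helper, not part of flexc++: returns the file location named
// on an //include line, or an empty string if the line is not such a line.
std::string includeTarget(std::string const &line)
{
    std::string const directive = "//include";
    std::string const blanks = " \t\r\n";

    // the directive must start in the line's first column
    if (line.compare(0, directive.size(), directive) != 0)
        return "";

    // white space between //include and the file location is skipped
    std::size_t begin = line.find_first_not_of(blanks, directive.size());
    if (begin == std::string::npos || begin == directive.size())
        return "";          // no location, or no separating white space

    // white space before the end of the line is ignored
    std::size_t end = line.find_last_not_of(blanks);
    std::string location = line.substr(begin, end - begin + 1);

    // surrounding double quotes are optional and removed when present
    if (location.size() >= 2 && location.front() == '"'
                             && location.back() == '"')
        location = location.substr(1, location.size() - 2);

    return location;
}
```

    Resolving a relative location against the directory of the including file is left out of this sketch.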

    6.2. DIRECTIVES

    The first section of flexc++'s input file consists of directives. In addition it may associate regular expressions with symbolic names, allowing you to use these identifiers in the rules section. Each directive is defined on a line of its own. When available, directives are overridden by flexc++ command line options.

    Some directives require arguments, which are usually provided following separating (but optional) = characters. Arguments of directives are text, surrounded by double quotes (strings). If a string must itself contain a double quote or a backslash, then precede these characters by a backslash. The exceptions are the %s and %x directives, which are immediately followed by name lists, consisting of identifiers separated by blanks. Here is an example of the definition of a directive:

    
        %class-name = "MyScanner"
            
    

    Directives accepting a `filename' do not accept path names, i.e., they cannot contain directory separators (/); directives accepting a `pathname' may contain directory separators. A `pathname' using blank characters should be surrounded by double quotes.

    Some directives may generate warnings. This happens when a directive conflicts with the contents of a file which flexc++ cannot modify (e.g., a scanner class header file exists and doesn't define a namespace, but a %namespace directive was provided). In those cases the directive is ignored, and hand-editing may then be required to obtain the directive's effect.

    6.3. MINI SCANNERS

    Mini scanners come in two flavors: inclusive mini scanners and exclusive mini scanners. The rules that apply to an inclusive mini scanner are the mini scanner's own rules as well as the rules which apply to no mini scanners in particular (i.e., the rules that apply to the default (or INITIAL) mini scanner). Exclusive mini scanners only use the rules that were defined for them.

    To define an inclusive mini scanner use %s, followed by one or more identifiers specifying the name(s) of the mini-scanner(s). To define an exclusive mini scanner use %x, followed by one or more identifiers specifying the name(s) of the mini-scanner(s). The following example defines the names of two mini scanners: string and comment:

    
        %x string comment 
            
    
    Following this, rules defined in the context of the string mini scanner (see below) will only be used when that mini scanner is active.

    A flexc++ input file may contain multiple %s and %x specifications.
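
    The difference between inclusive and exclusive mini scanners comes down to which rules take part in matching. The sketch below models that selection only; the Rule struct and the function isActive are hypothetical names introduced for illustration and do not appear in the code generated by flexc++.

```cpp
#include <set>
#include <string>

// Hypothetical model of a rule: its regular expression and the mini
// scanners named between < and > in front of it (empty if none were named).
struct Rule
{
    std::string regex;
    std::set<std::string> scanners;
};

// True if `rule' takes part in matching while mini scanner `current' is
// active. `exclusive' tells whether `current' was declared using %x;
// INITIAL and %s mini scanners are passed as non-exclusive.
bool isActive(Rule const &rule, std::string const &current, bool exclusive)
{
    if (rule.scanners.count(current) != 0)
        return true;                    // one of the mini scanner's own rules

    // rules naming no mini scanner are shared with inclusive mini scanners
    return rule.scanners.empty() && !exclusive;
}
```

    With this model a rule without a mini scanner prefix is active in INITIAL and in every %s mini scanner, but not in a %x mini scanner such as string.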

    6.4. DEFINITIONS

    Definitions are of the form

    
    identifier  regular-expression
            
    
    Each definition must be entered on a line of its own. Definitions associate identifiers with regular expressions, allowing the use of {identifier} as synonym for its regular expression in the rules section of the flexc++ input file. Once defined, the identifiers representing regular expressions can also be used in subsequent definitions.

    Example:

    
    FIRST                   [A-Za-z_]
    NAME                    {FIRST}[-A-Za-z0-9_]*
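
    Name expansion can be pictured as plain text substitution. The following sketch is a simplified model, not flexc++'s implementation: the names Definitions, expand, define and exampleDefinitions are hypothetical, braces inside character classes or quoted strings are not treated specially, and wrapping each expansion in parentheses is an assumption reflecting that flexc++ treats an expanded name as a group.

```cpp
#include <map>
#include <string>

using Definitions = std::map<std::string, std::string>;

// Replaces every {identifier} that names a known definition by its regular
// expression, wrapped in parentheses. Other text, such as r{2,3}, is copied.
std::string expand(std::string const &regex, Definitions const &defs)
{
    std::string result;
    std::size_t idx = 0;

    while (idx < regex.size())
    {
        if (regex[idx] == '{')
        {
            std::size_t close = regex.find('}', idx);
            if (close != std::string::npos)
            {
                auto iter = defs.find(regex.substr(idx + 1, close - idx - 1));
                if (iter != defs.end())
                {
                    result += '(' + iter->second + ')';
                    idx = close + 1;
                    continue;
                }
            }
        }
        result += regex[idx++];         // not a known name: copy as is
    }
    return result;
}

// Stores a definition after expanding the names it uses itself.
void define(Definitions &defs, std::string const &name,
            std::string const &regex)
{
    defs[name] = expand(regex, defs);
}

// The two definitions FIRST and NAME from the example above.
Definitions exampleDefinitions()
{
    Definitions defs;
    define(defs, "FIRST", "[A-Za-z_]");
    define(defs, "NAME", "{FIRST}[-A-Za-z0-9_]*");
    return defs;
}
```

    Because define expands names at definition time, a name can only refer to definitions that precede it, matching the remark about subsequent definitions.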
            
    

    6.5. %% SEPARATOR

    Following directives and definitions a line merely containing two consecutive % characters is expected. Following this line the rules are defined. Rules consist of regular expressions which should be recognized, possibly followed by actions to be executed once a rule's regular expression has been matched.

    6.6. REGULAR EXPRESSIONS

    The regular expressions defined in flexc++'s rules files are matched against the information passed to the scanner's lex function.

    Regular expressions begin at the first non-blank character on a line. Comment is interpreted as comment as long as it isn't part of the regular expression. To define a regular expression starting with two slashes (at least) the first slash can be escaped or double quoted (e.g., "//".* defines C++ comment to end-of-line).

    Regular expressions end at the first blank character (to add a blank character, e.g., a space character, to a regular expression, prefix it by a backslash or put it in a double-quoted string).

    Actions may be associated with regular expressions. At a match the action that is associated with the regular expression is executed, after which scanning continues when the lexical scanning function (e.g., lex) is called again. Actions are not required, and regular expressions can be defined without any actions at all. If such action-less regular expressions are matched then the match is performed silently, after which processing continues.

    Flexc++ tries to match as many characters of the input file as possible (i.e., it uses `greedy matching'). Non-greedy matching is accomplished by a combination of a scanner and parser and/or by using the `lookahead' operator (/).
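
    Greedy matching over several rules can be imitated with std::regex, as in the sketch below. It is a stand-in for illustration and not the matching algorithm generated by flexc++: the function name longestMatch is hypothetical, std::regex uses ECMAScript syntax instead of flexc++'s pattern syntax, and letting the earlier rule win when two rules match equally long text is an assumption of this sketch.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Returns {rule index, match length} for the rule matching the longest
// prefix of `text', or {-1, 0} if no rule matches at the start of `text'.
// On equal lengths the earlier rule is kept (an assumption of this sketch).
std::pair<int, std::size_t> longestMatch(
                                std::vector<std::string> const &rules,
                                std::string const &text)
{
    std::pair<int, std::size_t> best(-1, 0);

    for (std::size_t idx = 0; idx != rules.size(); ++idx)
    {
        std::smatch match;
        std::regex const re(rules[idx]);

        // match_continuous: the match must begin at text's first character
        if (!std::regex_search(text, match, re,
                               std::regex_constants::match_continuous))
            continue;

        std::size_t length = static_cast<std::size_t>(match.length(0));
        if (length > best.second)       // strictly longer: new best rule
        {
            best.first = static_cast<int>(idx);
            best.second = length;
        }
    }
    return best;
}
```

    For the rules if and [a-z]+ the input iffy selects the second rule, because it matches four characters where the first rule matches only two.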

    The following regular expression `building blocks' are available. More complex regular expressions are created by combining them:

    x
    the character `x'

    .
    any character (byte) except newline

    [xyz]
    a character class; in this case, the pattern matches either an `x', a `y', or a `z'

    [abj-oZ]
    a character class containing a range; matches an `a', a `b', any letter from `j' through `o', or a `Z'

    [^A-Z]
    a negated character class, i.e., any character except for those in the class. In this example, any non-capital character.

    "[xyz]\"foo"
    text between double quotes matches the literal string: [xyz]"foo.

    \X
    if X is `a', `b', `f', `n', `r', `t', or `v', then the ANSI-C interpretation of `\X' is matched. Otherwise, a literal `X' is matched (this is used to escape operators such as `*').

    \0
    a NUL character (ASCII code 0).

    \123
    the character with octal value 123.

    \x2a
    the character with hexadecimal value 2a.

    (r)
    the regular expression `r'; parentheses are used to override precedence (see below)

    {name}
    the expansion of the `name' definition.

    r*
    zero or more regular expressions `r'. This also matches the empty string.

    r+
    one or more regular expressions `r'.

    r?
    zero or one regular expression `r'. This also matches the empty string.

    rs
    the regular expression `r' followed by the regular expression `s'; called concatenation

    r{m, n}
    regular expression `r' at least m, but at most n times (1 <= m <= n).

    r{m,}
    regular expression `r' m or more times (1 <= m).

    r{m}
    regular expression `r' exactly m times (1 <= m).

    r|s
    either regular expression `r' or regular expression `s'

    r/s
    regular expression `r' if it is followed by regular expression `s'. The text matched by `s' is included when determining whether this rule results in the longest match, but `s' is then returned to the input before the rule's action (if defined) is executed.

    ^r
    a regular expression `r' at the beginning of a line or file.

    r$
    a regular expression `r', occurring at the end of a line. This pattern is identical to `r/\n'.

    <s>r
    a regular expression `r' in start condition `s'

    <s1,s2,s3>r
    a regular expression `r' in start conditions s1, s2, or s3.

    <*>r
    a regular expression `r' in all start conditions.

    <<EOF>>
    an end-of-file.

    <s1,s2><<EOF>>
    an end-of-file when in start conditions s1 or s2

    Inside a character class all regular expression operators lose their special meanings, except for the escape character (\) and the character class operators -, ], and, at the beginning of the class, ^. To add a closing bracket to a character class use []. To add a closing bracket to a negated character class use [^]. Once a character class has started, all subsequent characters (and character ranges) are added to the set, until the final closing bracket (]) has been reached.

    The regular expressions listed above are grouped according to precedence, from highest precedence at the top to lowest at the bottom. From lowest to highest precedence, the operators are:

    The lex standard defines concatenation as having a higher precedence than the interval expression. This is different from many other regular expression engines, and flexc++ follows these latter engines, giving all `multiplication operators' equal priority.

    Name expansion has the same precedence as grouping (using parentheses to influence the precedence of the other operators in the regular expression). Since the name expansion is treated as a group in flexc++, it is not allowed to use the lookahead operator in a name definition (a named pattern, defined in the definition section).

    Character classes can also contain character class expressions. These are expressions enclosed inside [: and :] delimiters (which themselves must appear between the [ and ] of the character class. Other elements may occur inside the character class as well). The character class expressions are:

         
         [:alnum:] [:alpha:] [:blank:]
         [:cntrl:] [:digit:] [:graph:]
         [:lower:] [:print:] [:punct:]
         [:space:] [:upper:] [:xdigit:]
            
    

    Character class expressions designate a set of characters equivalent to the corresponding standard C isXXX function. For example, [:alnum:] designates those characters for which isalnum returns true, i.e., any alphabetic or numeric character. Consequently, the following character classes are all equivalent:

     
        [[:alnum:]]
        [[:alpha:][:digit:]]
        [[:alpha:][0-9]]
        [a-zA-Z0-9]
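
    The equivalence of these four classes can be checked against the C library itself. The function below is a sketch under two assumptions: the program runs in the default "C" locale, and the execution character set is ASCII. The name classesAgree is hypothetical.

```cpp
#include <cctype>

// True if, for all 7-bit characters, isalnum agrees with isalpha-or-isdigit
// and with the explicit ranges a-z, A-Z and 0-9.
bool classesAgree()
{
    for (int ch = 0; ch != 128; ++ch)
    {
        bool alnum = std::isalnum(ch) != 0;
        bool alphaOrDigit = std::isalpha(ch) != 0 || std::isdigit(ch) != 0;
        bool inRanges = (ch >= 'a' && ch <= 'z') ||
                        (ch >= 'A' && ch <= 'Z') ||
                        (ch >= '0' && ch <= '9');

        if (alnum != alphaOrDigit || alnum != inRanges)
            return false;
    }
    return true;
}
```

    In other locales isalpha may accept additional characters, in which case the explicit ranges are no longer equivalent to [:alnum:].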
            
    

    A negated character class such as the example [^A-Z] above will match a newline unless \n (or an equivalent escape sequence) is one of the characters explicitly present in the negated character class (e.g., [^A-Z\n]). This differs from the way many other regular expression tools treat negated character classes, but unfortunately the inconsistency is historically entrenched. Matching newlines means that a pattern like [^"]* can match the entire input unless there's another quote in the input.
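
    The same behavior can be observed with std::regex, which is used here merely as a stand-in since its ECMAScript syntax also lets a negated class match a newline. The function name matchesAll is hypothetical, and the sketch shows nothing about flexc++'s own implementation.

```cpp
#include <regex>
#include <string>

// True if the whole of `text' is matched by `pattern', interpreted by
// std::regex using its default (ECMAScript) syntax.
bool matchesAll(std::string const &pattern, std::string const &text)
{
    return std::regex_match(text, std::regex(pattern));
}
```

    Called as matchesAll("[^A-Z]", "\n") the function returns true, whereas matchesAll("[^A-Z\n]", "\n") returns false. Likewise, [^"]* matches all of a two-line text that does not contain a double quote, while adding the newline to the negated class limits the match to a single line.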

    Flexc++ allows negation of character class expressions by prepending ^ to the POSIX character class name.

                    
        [:^alnum:] [:^alpha:] [:^blank:]
        [:^cntrl:] [:^digit:] [:^graph:]
        [:^lower:] [:^print:] [:^punct:]
        [:^space:] [:^upper:] [:^xdigit:]
            
    

    The {-} operator computes the difference of two character classes. For example, [a-c]{-}[b-z] represents all the characters in the class [a-c] that are not in the class [b-z] (which in this case, is just the single character a). The {-} operator is left associative, so [abc]{-}[b]{-}[c] is the same as [a].

    The {+} operator computes the union of two character classes. For example, [a-z]{+}[0-9] is the same as [a-z0-9]. This operator is useful when preceded by the result of a difference operation, as in, [[:alpha:]]{-}[[:lower:]]{+}[q], which is equivalent to [A-Zq] in the C locale.
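
    Both operators are ordinary set operations, which the following sketch reproduces using std::set. The names CharSet, range, difference and unite are hypothetical; the sketch assumes a character set in which letters and digits are contiguous, such as ASCII.

```cpp
#include <set>

using CharSet = std::set<char>;

// The character class [first-last].
CharSet range(char first, char last)
{
    CharSet result;
    for (int ch = first; ch <= last; ++ch)
        result.insert(static_cast<char>(ch));
    return result;
}

// lhs{-}rhs: the characters of lhs that are not in rhs.
CharSet difference(CharSet lhs, CharSet const &rhs)
{
    for (char ch: rhs)
        lhs.erase(ch);
    return lhs;
}

// lhs{+}rhs: the characters that are in lhs, in rhs, or in both.
CharSet unite(CharSet lhs, CharSet const &rhs)
{
    lhs.insert(rhs.begin(), rhs.end());
    return lhs;
}
```

    Left associativity means that [abc]{-}[b]{-}[c] corresponds to difference(difference(abc, b), c). Likewise, [[:alpha:]]{-}[[:lower:]]{+}[q] corresponds to unite(difference(alpha, lower), q), which leaves the capital letters plus q.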

    A rule can have at most one instance of trailing context (the / operator or the $ operator). The start condition, ^, and <<EOF>> patterns can only occur at the beginning of a pattern, and cannot be surrounded by parentheses. The characters ^ and $ only have their special properties at, respectively, the beginning and end of regular expressions. In all other cases they are treated as normal characters.
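
    The effect of trailing context (r/s) can be approximated with std::regex by matching r and s as two consecutive groups. This is a sketch only: the names Trailing and matchTrailing are hypothetical, std::regex uses ECMAScript syntax, and its backtracking may divide the text between r and s differently from the scanner generated by flexc++ when r and s can match the same characters.

```cpp
#include <cstddef>
#include <regex>
#include <string>

struct Trailing
{
    bool matched;
    std::size_t consumed;   // length of the text matched by r alone
    std::size_t compared;   // length of r followed by s (longest-match test)
};

// Matches `r' followed by `s' at the start of `text'. Only r's text is
// consumed; the text matched by s remains available as input.
Trailing matchTrailing(std::string const &r, std::string const &s,
                       std::string const &text)
{
    std::smatch match;
    std::regex const re("(" + r + ")(" + s + ")");

    if (!std::regex_search(text, match, re,
                           std::regex_constants::match_continuous))
        return Trailing{false, 0, 0};

    return Trailing{true,
                    static_cast<std::size_t>(match.length(1)),
                    static_cast<std::size_t>(match.length(0))};
}
```

    For r = [0-9]+ and s = \. the input 12.5 results in two consumed characters, while three characters take part in the longest-match comparison. Passing "\n" for s models the r$ form.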

    6.7. SPECIFICATION EXAMPLE

    
    %option debug
    
    %x comment
    
    NAME    [[:alpha:]][_[:alnum:]]*
    
    %%
    
    "//".*          // ignore
    
    "/*"            begin(comment);
    
    <comment>.|\n   // ignore
    <comment>"*/"   begin(INITIAL);
    
    ^a              return 1;
    a               return 2;
    a$              return 3;
    {NAME}          return 4;
    
    .|\n            // ignore
            
    


    7. THE CLASS INTERFACE: SCANNER.H

    By default, flexc++ generates a file Scanner.h containing the initial interface of the scanner class performing the lexical scan according to the specifications given in flexc++'s input file. The name of the file that is generated can easily be changed using flexc++'s --class-header option. In this man-page we'll stick to using the default name.

    The file Scanner.h is generated only once, unless an explicit request is made to rewrite it (using flexc++'s --force-class-header option).

    The provided interface is very light-weight, primarily offering a link to the scanner's base class (see this manpage's sections 8.1 through 8.8).

    Many of the facilities offered by the scanner class are inherited from the ScannerBase base class. Additional facilities offered by the Scanner class are covered below.

    7.1. NAMING CONVENTION

    All symbols that are required by the generated scanner class end in two consecutive underscore characters (e.g., executeAction__). These names should not be redefined. As they are part of the Scanner and ScannerBase class their scope is immediately clear and confusion with identically named identifiers elsewhere is unlikely.

    Some member functions do not use the underscore convention. These are the scanner class's constructors, or names that are similar or equal to names that have historically been used (e.g., length). Also, some functions offer hooks into the implementation (like preCode). The names of the functions in this latter category don't end in underscores either.

    7.2. CONSTRUCTORS

    7.3. PUBLIC MEMBER FUNCTIONS

    7.4. PRIVATE MEMBER FUNCTIONS

    7.5. SCANNER CLASS HEADER EXAMPLE

    
    #ifndef Scanner_H_INCLUDED_
    #define Scanner_H_INCLUDED_
    
    // $insert baseclass_h
    #include "Scannerbase.h"
    
    
    class Scanner: public ScannerBase
    {
        public:
            explicit Scanner(std::istream &in = std::cin, 
                        std::ostream &out = std::cout);
            
            Scanner(std::string const &infile, std::string const &outfile);
    
            // $insert lexFunctionDecl
            int lex();
    
        private:
            int lex__();
            int executeAction__(size_t ruleNr);
    
            void preCode(); // re-implement this function for code to be
                            // exec'ed before the pattern matching starts
    };
    
    inline void Scanner::preCode() 
    {
        // optionally replace by your own code
    }
    
    inline Scanner::Scanner(std::istream &in, std::ostream &out)
    :
        ScannerBase(in, out)
    {}
    
    inline Scanner::Scanner(std::string const &infile, 
                                                std::string const &outfile)
    :
        ScannerBase(infile, outfile)
    {}
    
    // $insert inlineLexFunction
    inline int Scanner::lex()
    {
        return lex__();
    }
    
    #endif // Scanner_H_INCLUDED_
            
    

    8.1. THE SCANNER BASE CLASS

    By default, flexc++ generates a file Scannerbase.h containing the interface of the base class of the scanner class also generated by flexc++. The name of the file that is generated can easily be changed using flexc++'s --baseclass-header option. In this man-page we use the default name.

    The file Scannerbase.h is generated at each new flexc++ run. It contains no user-serviceable or extensible parts. Rewriting can be prevented by specifying flexc++'s --no-baseclass-header option.

    8.2. PUBLIC ENUMS AND -TYPES

    8.3. PROTECTED ENUMS AND -TYPES

    8.4. NO PUBLIC CONSTRUCTORS

    There are no public constructors. ScannerBase is a base class for the Scanner class generated by flexc++. ScannerBase only offers protected constructors.

    8.5. PUBLIC MEMBER FUNCTIONS

    8.6. PROTECTED CONSTRUCTORS

    8.7. PROTECTED MEMBER FUNCTIONS

    All member functions ending in two underscore characters are for internal use only and should not be called by user-defined members of the Scanner class.

    The following members, however, can safely be called by members of the generated Scanner class:

    8.8. PROTECTED DATA MEMBERS

    All protected data members are for internal use only, allowing lex__ to access them. All of them end in two underscore characters.

    8.9. FLEX++ TO FLEXC++ MEMBERS


    Flex++ (old)    Flexc++ (new)

    lineno()        lineNr()
    YYText()        matched()
    less()          accept()

    9.1. THE CLASS INPUT

    Flexc++ generates a file Scannerbase.h defining the scanner class's base class, by default named ScannerBase (which is the name used in this man-page). The base class ScannerBase contains a nested class Input whose interface looks like this:

    
    class Input
    {
        public:
            Input();
            Input(std::istream *iStream, size_t lineNr = 1);
            size_t get();
            size_t lineNr() const;          
            void reRead(size_t ch);
            void reRead(std::string const &str, size_t fmIdx);
            void close();
    };
            
    
    The members of this class are all required and offer a level in between the operations of ScannerBase and flexc++'s actual input file that's being processed.

    By default, flexc++ provides an implementation for all of Input's required members. Therefore, in most situations this section can safely be ignored.
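
    To give an impression of what a class offering these members might look like, here is a minimal sketch. It is not flexc++'s default implementation. The value returned at the end of the input (AT_END), the way line numbers are tracked, pushing characters back in front of the unread input, and a close member that merely detaches the stream are all assumptions; rereadDemo and its result struct are hypothetical helpers.

```cpp
#include <cstddef>
#include <deque>
#include <istream>
#include <sstream>
#include <string>

class Input
{
    std::deque<unsigned char> d_pending;    // characters to read again
    std::istream *d_in;                     // not owned by this sketch
    std::size_t d_lineNr;

    public:
        enum { AT_END = 0x100 };            // hypothetical end-of-input value

        Input(): d_in(nullptr), d_lineNr(1) {}
        Input(std::istream *iStream, std::size_t lineNr = 1)
        :
            d_in(iStream), d_lineNr(lineNr)
        {}

        std::size_t get()                   // next character or AT_END
        {
            std::size_t ch;
            if (!d_pending.empty())
            {
                ch = d_pending.front();
                d_pending.pop_front();
            }
            else
            {
                int next = d_in ? d_in->get()
                                : std::istream::traits_type::eof();
                if (next == std::istream::traits_type::eof())
                    return AT_END;
                ch = static_cast<unsigned char>(next);
            }
            if (ch == '\n')
                ++d_lineNr;
            return ch;
        }

        std::size_t lineNr() const { return d_lineNr; }

        void reRead(std::size_t ch)         // push back one character
        {
            if (ch >= AT_END)
                return;
            if (ch == '\n')
                --d_lineNr;
            d_pending.push_front(static_cast<unsigned char>(ch));
        }

        // push back str's characters from index fmIdx, keeping their order
        void reRead(std::string const &str, std::size_t fmIdx)
        {
            for (std::size_t idx = str.size(); idx-- > fmIdx; )
                reRead(static_cast<unsigned char>(str[idx]));
        }

        void close() { d_in = nullptr; d_pending.clear(); }
};

struct DemoResult
{
    std::size_t lineAfterRead;
    std::size_t lineAfterReRead;
    std::string rest;
};

// Reads "ab\n" from "ab\ncd", pushes "b\n" back, then reads to the end.
DemoResult rereadDemo()
{
    std::istringstream source("ab\ncd");
    Input input(&source);

    std::string first;
    for (int count = 0; count != 3; ++count)
        first += static_cast<char>(input.get());

    DemoResult result;
    result.lineAfterRead = input.lineNr();
    input.reRead(first, 1);
    result.lineAfterReRead = input.lineNr();

    for (std::size_t ch; (ch = input.get()) != Input::AT_END; )
        result.rest += static_cast<char>(ch);
    return result;
}
```

    The deque lets re-read characters be delivered before any unread characters of the stream, so the scanner sees them in their original order.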

    However, users may define and extend their own Input class and provide flexc++'s base class with that Input class. To do so flexc++'s rules file must contain the following two directives:

    
           %input-implementation = "sourcefile"
           %input-interface = "interface"
            
    
    Here, interface is the name of a file containing the class Input's interface. This interface is then inserted into ScannerBase's interface instead of the default class Input's interface. This interface must at least offer the above-mentioned members and constructors (their functions are described below). The class may contain additional members if required by the user-defined implementation. The implementation itself is expected in sourcefile. The contents of this file are inserted in the generated lex.cc file instead of Input's default implementation. The file sourcefile should probably not have a .cc extension to prevent its compilation by a program maintenance utility.

    When the lexical scanner generated by flexc++ switches streams using the //include directive (see section 6.1. FILE SWITCHING) the input stream that's currently processed is pushed on an Input stack maintained by ScannerBase, and processing continues at the file named at the //include directive. Once the latter file has been processed, the previously pushed stream is popped off the stack, and processing of the popped stream continues. This implies that Input objects must be `stack-able'. The required interface is designed to satisfy this requirement.
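
    The push and pop behavior described here can be sketched with a stack of in-memory streams. The sketch is not the code generated by flexc++: the names Files, flatten and exampleFiles are hypothetical, files are modeled as strings stored in a map, and unknown file names or files including each other in a cycle are not handled.

```cpp
#include <map>
#include <memory>
#include <sstream>
#include <stack>
#include <string>

using Files = std::map<std::string, std::string>;   // file name -> contents

// Returns the lines of `first' in which each //include line has been
// replaced by the lines of the file named on that line.
std::string flatten(std::string const &first, Files const &files)
{
    std::string const directive = "//include ";
    std::stack<std::unique_ptr<std::istringstream>> inputs;
    inputs.emplace(new std::istringstream(files.at(first)));

    std::string result;
    while (!inputs.empty())
    {
        std::string line;
        if (!std::getline(*inputs.top(), line))
        {
            inputs.pop();       // end of a sub-file: resume the previous one
            continue;
        }

        if (line.compare(0, directive.size(), directive) == 0)
            inputs.emplace(new std::istringstream(
                                files.at(line.substr(directive.size()))));
        else
            result += line + '\n';
    }
    return result;
}

// Two hypothetical files: main includes sub between its lines a and c.
Files exampleFiles()
{
    return Files{{"main", "a\n//include sub\nc\n"}, {"sub", "b\n"}};
}
```

    Calling flatten("main", exampleFiles()) delivers the lines a, b and c in that order: after the end of sub has been reached, reading resumes at the line following the //include line in main.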

    9.2. CONSTRUCTORS

    9.3. REQUIRED PUBLIC MEMBER FUNCTIONS

    FILES

    Flexc++'s default skeleton files are in /usr/share/flexc++.
    By default, flexc++ generates the following files:

    SEE ALSO

    bisonc++(1)

    BUGS

    ABOUT flexc++

    Flexc++ was originally started as a programming project by Jean-Paul van Oosten and Richard Berendsen in the 2007-2008 academic year. After graduating, Richard left the project and moved to Amsterdam. Jean-Paul remained in Groningen, and after on-and-off activities on the project, in close cooperation with Frank B. Brokken, Frank undertook a rewrite of the project's code around 2010. During the development of flexc++, the lookahead-operator handling continuously threatened the completion of the project. By now, the project has evolved to a level that we feel it's defensible to publish the program, although we still tend to consider the program in its experimental stage; it will remain that way until we decide to move its version from the 0.9x.xx series to the 1.xx.xx series.

    COPYRIGHT

    This is free software, distributed under the terms of the GNU General Public License (GPL).

    AUTHOR

    Frank B. Brokken (f.b.brokken@rug.nl),
    Jean-Paul van Oosten (j.p.van.oosten@rug.nl),
    Richard Berendsen (richardberendsen@xs4all.nl) (until 2010).