RE2C(1)                                                                RE2C(1)

NAME
       re2c - generate fast lexical analyzers for C/C++, Go and Rust

SYNOPSIS
       Note: examples are in C++ (but can be easily adapted to C).

          re2c    [ OPTIONS ] [ WARNINGS ] INPUT
          re2go   [ OPTIONS ] [ WARNINGS ] INPUT
          re2rust [ OPTIONS ] [ WARNINGS ] INPUT

       Input can be either a file or - for stdin.

INTRODUCTION
       re2c works as a preprocessor. It reads the input file (which is usually
       a program in the target language, but can be anything)  and  looks  for
       blocks  of  code enclosed in special-form comments. The text outside of
       these blocks is copied verbatim into the output file. The  contents  of
       the  blocks  are  processed  by re2c. It translates them to code in the
       target language and outputs the generated code in place of the block.

       Here is an example of a small program that checks whether a given
       string begins with a decimal number:

          // re2c $INPUT -o $OUTPUT -i --case-ranges
          #include <assert.h>

          bool lex(const char *s) {
              const char *YYCURSOR = s;
              /*!re2c
                  re2c:yyfill:enable = 0;
                  re2c:define:YYCTYPE = char;

                  number = [1-9][0-9]*;

                  number { return true; }
                  *      { return false; }
              */
          }

          int main() {
              assert(lex("1234"));
              return 0;
          }

       In  the output everything between /*!re2c and */ has been replaced with
       the generated code:

          /* Generated by re2c */
          // re2c $INPUT -o $OUTPUT -i --case-ranges
          #include <assert.h>

          bool lex(const char *s) {
              const char *YYCURSOR = s;

          {
              char yych;
              yych = *YYCURSOR;
              switch (yych) {
                  case '1' ... '9': goto yy2;
                  default: goto yy1;
              }
          yy1:
              ++YYCURSOR;
              { return false; }
          yy2:
              yych = *++YYCURSOR;
              switch (yych) {
                  case '0' ... '9': goto yy2;
                  default: goto yy3;
              }
          yy3:
              { return true; }
          }

          }

          int main() {
              assert(lex("1234"));
              return 0;
          }

SYNTAX
       A re2c program consists of a sequence of blocks intermixed with code in
       the target language. There are three main kinds of blocks:

          /*!re2c[:<name>] ... */
                 A  global  block contains definitions, configurations, direc-
                 tives and rules.  re2c compiles regular  expressions  associ-
                 ated  with  each  rule into a deterministic finite automaton,
                 encodes it in the form of conditional  jumps  in  the  target
                 language  and  replaces  the  block  with the generated code.
                 Names and configurations defined in a global block are  added
                 to  the global scope and become visible to subsequent blocks.
                 At the start of the program the global scope  is  initialized
                 with  command-line options.  The :<name> part is optional: if
                 specified, the name can be used to refer to the block in  an-
                 other part of the program.

          /*!local:re2c[:<name>] ... */
                 A  local block is like a global block, but the names and con-
                 figurations in it have local scope (they do not affect  other
                 blocks).

          /*!rules:re2c[:<name>] ... */
                 A rules block is like a local block, but it does not generate
                 any code and is meant to be reused in other blocks. This is a
                 way of sharing code (more details in the reusable blocks sec-
                 tion).

       There are also many auxiliary blocks; see section blocks and directives
       for  a  full  list  of them. A block may contain the following kinds of
       statements:

          <name> = <regular expression>;
                 A definition binds a name to a regular expression. Names  may
                 contain  alphanumeric  characters and underscore. The regular
                 expressions section gives an overview of re2c syntax for reg-
                 ular expressions. Once defined, the name can be used in other
                 regular expressions and in rules. Recursion in named  defini-
                 tions  is not allowed, and each name should be defined before
                 it is used. A  block  inherits  named  definitions  from  the
                 global  scope.   Redefining a name that exists in the current
                 scope is an error.

          <configuration> = <value>;
                 A configuration allows one to change re2c behavior  and  cus-
                 tomize  the generated code. For a full list of configurations
                 supported by re2c see the configurations  section.  Depending
                 on  a particular configuration, the value can be a keyword, a
                 nonnegative integer number or a one-line string which  should
                 be  enclosed in double or single quotes unless it consists of
                 alphanumeric characters. A block inherits configurations from
                 the  global scope and may redefine them or add new ones. Con-
                 figurations defined inside of a block affect the whole block,
                 even if they appear at the end of it.

          <regular expression> { <code> }
                 A  rule  binds  a  regular expression to a semantic action (a
                 block of code in the target language). If the regular expres-
                 sion  matches, the associated semantic action is executed. If
                 multiple rules match, the longest match takes precedence.  If
                 multiple  rules match the same string, the earliest one takes
                 precedence. There are two special rules: the default  rule  *
                 and the end of input rule $. The default rule should always be
                 defined; it has the lowest priority regardless of its place in
                 the block, and it matches any code unit (not necessarily a
                 valid character, see the encoding support section).
                 The  end of input rule should be defined if the corresponding
                 method for handling the end of input is used. If start condi-
                 tions are used, rules have more complex syntax.

          !<directive>;
                 A directive is one of the special predefined statements. Each
                 directive has a unique purpose. For example, the !use  direc-
                 tive  merges  a  rules  block  into  the current one (see the
                 reusable blocks section), and the !include  directive  allows
                 one to include an outer file (see the include files section).

PROGRAM INTERFACE
       The  generated  code interfaces with the outer program with the help of
       primitives -- symbolic names that can be defined  as  variables,  func-
       tions or macros in the target language (collectively referred to as the
       API).  The definition of primitives is left for the  user:  this  gives
       them both freedom in customizing the lexer and responsibility to under-
       stand how it works.  Not all primitives have to  be  defined  ---  only
       those used by a given program.  The manual provides definitions for the
       most popular use cases. For a full list of primitives and their meaning
       see the API primitives section.

       There  are  two  API  flavors that define the set of primitives used by
       re2c:

          Pointer API
                 This API is based on C pointer arithmetic.  It  was  histori-
                 cally  the  first,  and for a long time the only one. It con-
                 sists of pointer-like primitives YYCURSOR,  YYMARKER,  YYCTX-
                 MARKER,  YYLIMIT  (which  are normally defined as pointers of
                 type YYCTYPE*) and YYFILL. This API is enabled by default for
                 C, and it cannot be used with other backends that do not sup-
                 port pointer arithmetic.

          Generic API
                 This API is more flexible. It consists of generic operations
                 and does not assume any particular implementation. The
                 primitives are YYPEEK, YYSKIP, YYBACKUP, YYBACKUPCTX, YYSTAGP,
                 YYSTAGN, YYMTAGP, YYMTAGN, YYRESTORE, YYRESTORECTX,
                 YYRESTORETAG, YYSHIFT, YYSHIFTSTAG, YYSHIFTMTAG, YYLESSTHAN
                 and YYFILL.
                 For  the  C  backend generic API is enabled with --api custom
                 option or re2c:api = custom; configuration; for Go  and  Rust
                 it  is  enabled  by default. Generic API was added in version
                 0.14.

       There are two API styles that determine the form in  which  the  primi-
       tives should be defined:

          Free-form
                 Free-form  style is enabled with configuration re2c:api:style
                 = free-form;.  In this style interface primitives  should  be
                 defined  as  free-form pieces of code with interpolated vari-
                 ables of the form @@{var} or optionally just @@ if there is a
                 single  variable.   The  set of variables is specific to each
                 primitive.  Generic API can be defined in terms  of  pointers
                 cursor, limit, marker and ctxmarker as follows:

                     /*!re2c
                       re2c:define:YYPEEK       = "*cursor";
                       re2c:define:YYSKIP       = "++cursor;";
                       re2c:define:YYBACKUP     = "marker = cursor;";
                       re2c:define:YYRESTORE    = "cursor = marker;";
                       re2c:define:YYBACKUPCTX  = "ctxmarker = cursor;";
                       re2c:define:YYRESTORECTX = "cursor = ctxmarker;";
                       re2c:define:YYRESTORETAG = "cursor = @@{tag};";
                       re2c:define:YYLESSTHAN   = "limit - cursor < @@{len}";
                       re2c:define:YYSTAGP      = "@@{tag} = cursor;";
                       re2c:define:YYSTAGN      = "@@{tag} = NULL;";
                       re2c:define:YYSHIFT      = "cursor += @@{shift};";
                       re2c:define:YYSHIFTSTAG  = "@@{tag} += @@{shift};";
                     */

          Function-like
                 Function-like    style    is   enabled   with   configuration
                 re2c:api:style = functions;. In this style primitives  should
                 be defined as functions or macros with parentheses, accepting
                 the necessary arguments.  For  historical  reasons  this  API
                 style  is  the default for C/C++ backend.  Generic API can be
                 defined in terms of pointers cursor, limit, marker  and  ctx-
                 marker as follows:

                     #define  YYPEEK()                 *cursor
                     #define  YYSKIP()                 ++cursor
                     #define  YYBACKUP()               marker = cursor
                     #define  YYRESTORE()              cursor = marker
                     #define  YYBACKUPCTX()            ctxmarker = cursor
                     #define  YYRESTORECTX()           cursor = ctxmarker
                     #define  YYRESTORETAG(tag)        cursor = tag
                     #define  YYLESSTHAN(len)          limit - cursor < len
                     #define  YYSTAGP(tag)             tag = cursor
                     #define  YYSTAGN(tag)             tag = NULL
                     #define  YYSHIFT(shift)           cursor += shift
                     #define  YYSHIFTSTAG(tag, shift)  tag += shift
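
                 As an illustration, here is a minimal sketch of how such
                 macros are consumed. The function lex_number is hypothetical
                 and hand-written (it is not generated by re2c). It recognizes
                 [1-9][0-9]* at the start of a NUL-terminated string and
                 touches the input only through YYPEEK and YYSKIP. The extra
                 parentheses in the macro bodies are a choice made for this
                 sketch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Function-like generic API expressed through a local pointer `cursor`. */
#define YYPEEK() (*cursor)
#define YYSKIP() (++cursor)

/* Hand-written automaton for [1-9][0-9]*. The terminating NUL acts as a
 * stop symbol, so no end-of-input check is needed in this sketch. */
bool lex_number(const char *s) {
    const char *cursor = s;
    char yych;
    assert(s != NULL);
    yych = YYPEEK();
    if (yych < '1' || yych > '9') {
        return false;              /* the first symbol is not in [1-9] */
    }
    do {
        YYSKIP();                  /* consume the accepted digit */
        yych = YYPEEK();           /* look at the next symbol */
    } while (yych >= '0' && yych <= '9');
    return true;
}
```

                 Because all input access goes through the two macros, the
                 same function body would keep working if YYPEEK and YYSKIP
                 were redefined over a different input representation.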

       For the definition of YYFILL and instructions on how to customize or
       disable end-of-input checks, see the handling the end of input and
       buffer refilling sections.

OPTIONS
       Some  of  the  options  have  corresponding  configurations, others are
       global and cannot be changed after re2c starts reading the input  file.
       Debug  options  generally require building re2c in debug configuration.
       Internal options are useful for experimenting with the algorithms  used
       in re2c.

       -? --help -h
              Show help message.

       --api --input <default | custom>
              Specify the API used by the generated code to interface with
              user-defined code: default is the API based on pointer arithmetic
              (the default for C), and custom is the generic API (the default
              for Go and Rust).

       --bit-vectors -b
              Optimize conditional jumps using bit masks.  This option implies
              --nested-ifs.
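
              A rough sketch of the idea, assuming ASCII-compatible input:
              membership of a code unit in a character class is stored as a
              bit in a 256-entry table, so a chain of comparisons becomes one
              mask test. The table name yybm, the bit value 128 and the
              function skip_digits are assumptions of this sketch and are not
              guaranteed to match the generated code:

```c
#include <assert.h>
#include <stddef.h>

/* Bit 128 is set for every byte that belongs to the class [0-9];
 * all other entries are implicitly zero. */
const unsigned char yybm[256] = {
    ['0'] = 128, ['1'] = 128, ['2'] = 128, ['3'] = 128, ['4'] = 128,
    ['5'] = 128, ['6'] = 128, ['7'] = 128, ['8'] = 128, ['9'] = 128,
};

/* Returns the length of the leading run of digits in s. The loop
 * condition is a single table lookup and mask test per input byte. */
size_t skip_digits(const char *s) {
    const char *cursor = s;
    assert(s != NULL);
    while (yybm[(unsigned char)*cursor] & 128) {
        ++cursor;
    }
    return (size_t)(cursor - s);
}
```

              Other bits of the same table entry remain free, so one table
              could in principle serve several character classes at once.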

       --case-insensitive
              Treat  single-quoted  and double-quoted strings as case-insensi-
              tive.

       --case-inverted
              Invert the meaning of single-quoted and  double-quoted  strings:
              treat  single-quoted strings as case-sensitive and double-quoted
              strings as case-insensitive.

       --case-ranges
              Collapse consecutive cases in a switch statement into a range
              of the form low ... high. This syntax is a C/C++ language exten-
              sion that is supported by compilers like GCC, Clang and Tcc. The
              main advantage over using single cases is smaller generated code
              and faster generation time, although for some compilers like Tcc
              it  also  results  in  smaller binary size.  This option is sup-
              ported only for C.

       --computed-gotos -g
              Optimize conditional jumps using  non-standard  "computed  goto"
              extension (which must be supported by the compiler). re2c gener-
              ates jump tables only in complex cases with a lot of conditional
              branches.   Complexity   threshold   can   be   configured  with
              cgoto:threshold configuration. This  option  implies  --bit-vec-
              tors. It is supported only for C.

       --conditions --start-conditions -c
              Enable  support of Flex-like "conditions": multiple interrelated
              lexers within one block. This  is  an  alternative  to  manually
              specifying different re2c blocks connected with goto or function
              calls.

       --depfile FILE
              Write dependency information to FILE in the form of  a  Makefile
              rule  <output-file>  : <input-file> [include-file ...]. This al-
              lows one to track build dependencies  in  the  presence  of  in-
              clude:re2c  directives,  so that updating include files triggers
              regeneration of the output file.  This  option  depends  on  the
              --output option.

       --ebcdic --ecb -e
              Generate  a  lexer that reads input in EBCDIC encoding. re2c as-
              sumes that the character range is 0 -- 0xFF and  character  size
              is 1 byte.

       --empty-class <match-empty | match-none | error>
              Define  the  way  re2c  treats  empty  character  classes.  With
              match-empty (the default) empty class matches empty input (which
              is  illogical,  but backwards-compatible). With match-none empty
              class always fails to match.  With error empty  class  raises  a
              compilation error.

       --encoding-policy <fail | substitute | ignore>
              Define  the  way re2c treats Unicode surrogates.  With fail re2c
              aborts with an error when a surrogate is encountered.  With sub-
              stitute  re2c  silently  replaces surrogates with the error code
              point 0xFFFD. With ignore (the default) re2c  treats  surrogates
              as normal code points. The Unicode standard says that standalone
              surrogates are invalid, but real-world  libraries  and  programs
              behave in different ways.
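
              The three policies amount to a simple decision per code point.
              The sketch below only illustrates that decision; the type
              policy_t and the functions is_surrogate and apply_policy are
              hypothetical names and are not part of re2c:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { POLICY_FAIL, POLICY_SUBSTITUTE, POLICY_IGNORE } policy_t;

/* Unicode surrogates occupy the range U+D800 to U+DFFF. */
int is_surrogate(uint32_t cp) {
    return cp >= 0xD800 && cp <= 0xDFFF;
}

/* Applies the policy to one code point. Returns 0 and stores the
 * resulting code point in *out, or returns -1 when the policy is
 * "fail" and the code point is a surrogate. */
int apply_policy(uint32_t cp, policy_t policy, uint32_t *out) {
    assert(out != NULL);
    if (is_surrogate(cp)) {
        if (policy == POLICY_FAIL) {
            return -1;           /* abort with an error */
        }
        if (policy == POLICY_SUBSTITUTE) {
            cp = 0xFFFD;         /* replace with the error code point */
        }
        /* POLICY_IGNORE: keep the surrogate as a normal code point */
    }
    *out = cp;
    return 0;
}
```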

       --flex-syntax -F
              Partial  support for Flex syntax: in this mode named definitions
              don't need the equal sign and  the  terminating  semicolon,  and
              when used they must be surrounded with curly braces. Names with-
              out curly braces are treated as double-quoted strings.

       --header --type-header -t HEADER
              Generate a HEADER file. The contents of the file can  be  speci-
              fied  with  directives  header:re2c:on  and header:re2c:off.  If
              conditions are used the header will have a condition enum  auto-
              matically  appended  to  it  (unless there is an explicit condi-
              tions:re2c directive).

       -I PATH
              Add PATH to the list of locations which are used when  searching
              for include files. This option is useful in combination with in-
              clude:re2c directive. re2c looks for FILE in  the  directory  of
              the  parent  file and in the include locations specified with -I
              option.

       --input-encoding <ascii | utf8>
              Specify the way re2c parses  regular  expressions.   With  ascii
              (the  default) re2c handles input as ASCII-encoded: any sequence
              of code units is a sequence  of  standalone  1-byte  characters.
              With  utf8  re2c  handles  input  as UTF8-encoded and recognizes
              multibyte characters.

       --lang <c | go | rust>
              Specify the output language. Supported languages are C,  Go  and
              Rust.   The  default  is  C  for re2c, Go for re2go and Rust for
              re2rust.

       --location-format <gnu | msvc>
              Specify location format in messages.   With  gnu  locations  are
              printed as 'filename:line:column: ...'.  With msvc locations are
              printed as 'filename(line,column) ...'.  The default is gnu.

       --loop-switch
              Encode the DFA as a loop over a switch statement. Individual
              states are switch cases. The current state is stored in a
              variable yystate. Transitions between states update yystate to
              the case label of the destination state and continue to the head
              of the loop. This option is always enabled for Rust, as  it  has
              no  goto  statement and cannot use the goto/label approach which
              is the default for C and Go backends.
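
              For comparison with the goto/label code shown in the
              introduction, the same automaton for [1-9][0-9]* can be written
              by hand in the loop/switch form. This is a sketch, not actual
              re2c output; the function name lex_loop_switch and the state
              numbering are assumptions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

bool lex_loop_switch(const char *s) {
    const char *cursor = s;
    unsigned int yystate = 0;   /* current DFA state */
    assert(s != NULL);
    for (;;) {
        char yych;
        switch (yystate) {
        case 0:                 /* initial state: expect [1-9] */
            yych = *cursor;
            if (yych >= '1' && yych <= '9') {
                ++cursor;
                yystate = 2;
                continue;       /* back to the head of the loop */
            }
            yystate = 1;
            continue;
        case 1:                 /* default rule */
            return false;
        case 2:                 /* inside the number: expect [0-9]* */
            yych = *cursor;
            if (yych >= '0' && yych <= '9') {
                ++cursor;
                continue;       /* stay in state 2 */
            }
            yystate = 3;
            continue;
        default:                /* state 3: the number rule matched */
            return true;
        }
    }
}
```

              Each continue statement plays the role of a goto to the label
              of the destination state.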

       --nested-ifs -s
              Use nested if statements instead of switch statements in  condi-
              tional  jumps.  This usually results in more efficient code with
              non-optimizing compilers.
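
              As a hand-written illustration (not re2c output), a switch over
              three ASCII character ranges can be expressed as nested
              comparisons that narrow the range step by step. The function
              classify and its return codes are hypothetical:

```c
/* Returns 1 for [0-9], 2 for [A-Z], 3 for [a-z] and 0 otherwise.
 * Assumes an ASCII-compatible character set. */
int classify(unsigned char yych) {
    if (yych <= '9') {
        if (yych >= '0') return 1;   /* [0-9] */
        return 0;
    }
    if (yych <= 'Z') {
        if (yych >= 'A') return 2;   /* [A-Z] */
        return 0;
    }
    if (yych >= 'a') {
        if (yych <= 'z') return 3;   /* [a-z] */
    }
    return 0;
}
```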

       --no-debug-info -i
              Do not output line directives. This may be useful when the  gen-
              erated code is stored in a version control system (to avoid huge
              autogenerated diffs on small changes). This option is on by  de-
              fault for Rust, as it does not have line directives.

       --no-generation-date
              Suppress date output in the generated file.

       --no-version
              Suppress version output in the generated file.

       --no-unsafe
              Do  not generate unsafe wrapper over YYPEEK (this option is spe-
              cific to Rust). For  performance  reasons  YYPEEK  should  avoid
              bounds-checking,  as  the  lexer  already  performs end-of-input
              checks in a more efficient way.  The user may choose to  provide
              a safe YYPEEK definition, or a definition that is unsafe only in
              release builds, in which case the --no-unsafe  option  helps  to
              avoid warnings about redundant unsafe blocks.

       --output -o OUTPUT
              Specify the OUTPUT file.

       --posix-captures -P
              Enable submatch extraction with POSIX-style capturing groups.

       --reusable -r
              Deprecated since version 2.2 (reusable blocks are allowed by de-
              fault now).

       --skeleton -S
              Ignore user-defined interface code and generate a self-contained
              "skeleton"  program.  Additionally,  generate  input  files with
              strings derived from the regular grammar  and  compressed  match
              results  that  are used to verify "skeleton" behavior on all in-
              puts. This option is useful for finding  bugs  in  optimizations
              and code generation. This option is supported only for C.

       --storable-state -f
              Generate  a lexer which can store its inner state.  This is use-
              ful in push-model lexers which are stopped by an  outer  program
              when there is not enough input, and then resumed when more input
              becomes available. In this mode users should additionally define
              YYGETSTATE  and YYSETSTATE primitives, and variables yych, yyac-
              cept and state should be part of the stored lexer state.
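
              The push model can be pictured with a much simplified,
              hand-written sketch. It is not re2c output, and generated code
              is structured differently; the struct lexer_t, the function
              lexer_feed and the way YYGETSTATE and YYSETSTATE are defined
              here are assumptions made for illustration. The lexer counts
              decimal numbers in input that arrives in chunks and keeps its
              state between calls:

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int state;     /* stored lexer state: 0 = between numbers, 1 = inside */
    int numbers;   /* complete numbers recognized so far */
} lexer_t;

/* State accessors defined over the stored lexer state. */
#define YYGETSTATE()  (lx->state)
#define YYSETSTATE(s) (lx->state = (s))

/* Consumes one chunk of input and returns when the chunk is exhausted.
 * The next call resumes from the state saved in *lx, so a number split
 * across two chunks is still counted once. */
void lexer_feed(lexer_t *lx, const char *buf, size_t len) {
    assert(lx != NULL && buf != NULL);
    for (size_t i = 0; i < len; ++i) {
        int digit = buf[i] >= '0' && buf[i] <= '9';
        switch (YYGETSTATE()) {
        case 0:
            if (digit) YYSETSTATE(1);
            break;
        case 1:
            if (!digit) {
                ++lx->numbers;     /* the number ended at this symbol */
                YYSETSTATE(0);
            }
            break;
        }
    }
}
```

              Feeding the chunks "12", "34 5" and "6 " in turn yields two
              numbers (1234 and 56), even though both are split across chunk
              boundaries.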

       --tags -T
              Enable submatch extraction with tags.

       --ucs2 --wide-chars -w
              Generate a lexer that reads  UCS2-encoded  input.  re2c  assumes
              that  the character range is 0 -- 0xFFFF and character size is 2
              bytes.  This option implies --nested-ifs.

       --utf8 --utf-8 -8
              Generate a lexer that reads input in UTF-8  encoding.  re2c  as-
              sumes  that  the  character range is 0 -- 0x10FFFF and character
              size is 1 byte.

       --utf16 --utf-16 -x
              Generate a lexer that reads UTF16-encoded  input.  re2c  assumes
              that  the character range is 0 -- 0x10FFFF and character size is
              2 bytes.  This option implies --nested-ifs.

       --utf32 --unicode -u
              Generate a lexer that reads UTF32-encoded  input.  re2c  assumes
              that  the character range is 0 -- 0x10FFFF and character size is
              4 bytes.  This option implies --nested-ifs.

       --verbose
              Output a short message in case of success.

       --vernum -V
              Show version information in MMmmpp format (major, minor, patch).

       --version -v
              Show version information.

       --single-pass -1
              Deprecated. Does nothing (single pass is the default now).

       --debug-output -d
              Emit YYDEBUG invocations in the generated code. This  is  useful
              to trace lexer execution.

       --dump-adfa
              Debug option: output DFA after tunneling (in .dot format).

       --dump-cfg
              Debug  option:  output  control  flow graph of tag variables (in
              .dot format).

       --dump-closure-stats
              Debug option: output statistics on the number of states in  clo-
              sure.

       --dump-dfa-det
              Debug  option:  output DFA immediately after determinization (in
              .dot format).

       --dump-dfa-min
              Debug option: output DFA after minimization (in .dot format).

       --dump-dfa-tagopt
              Debug option: output DFA after tag optimizations (in  .dot  for-
              mat).

       --dump-dfa-tree
              Debug  option:  output DFA under construction with states repre-
              sented as tag history trees (in .dot format).

       --dump-dfa-raw
              Debug  option:  output  DFA  under  construction  with  expanded
              state-sets (in .dot format).

       --dump-interf
              Debug  option:  output  interference  table produced by liveness
              analysis of tag variables.

       --dump-nfa
              Debug option: output NFA (in .dot format).

       --emit-dot -D
              Instead of normal output generate lexer graph  in  .dot  format.
              The  output  can  be  converted  to  an  image  with the help of
              Graphviz (e.g. something like dot -Tpng -odfa.png dfa.dot).

       --dfa-minimization <moore | table>
              Internal option: DFA minimization algorithm used  by  re2c.  The
              moore option is the Moore algorithm (it is the default). The ta-
              ble option is the "table  filling"  algorithm.  Both  algorithms
              should produce the same DFA up to states relabeling; table fill-
              ing is simpler and much slower and serves as a reference  imple-
              mentation.

       --eager-skip
              Internal  option: make the generated lexer advance the input po-
              sition eagerly -- immediately after reading  the  input  symbol.
              This changes the default behavior when the input position is ad-
              vanced lazily -- after transition to the next state. This option
              is implied by --no-lookahead.

       --no-lookahead
              Internal  option:  use  TDFA(0) instead of TDFA(1).  This option
              has effect only with --tags or --posix-captures options.

       --no-optimize-tags
              Internal option: suppress optimization of tag variables  (useful
              for debugging).

       --posix-closure <gor1 | gtop>
              Internal  option:  specify  shortest-path algorithm used for the
              construction of epsilon-closure with POSIX disambiguation seman-
              tics:  gor1  (the default) stands for Goldberg-Radzik algorithm,
              and gtop stands for "global topological order" algorithm.

       --posix-prectable <complex | naive>
              Internal option: specify the algorithm  used  to  compute  POSIX
              precedence  table. The complex algorithm computes precedence ta-
              ble in one traversal of tag history tree and has quadratic  com-
              plexity  in  the  number  of TNFA states; it is the default. The
              naive algorithm has worst-case cubic complexity in the number of
              TNFA  states,  but  it  is  much simpler than complex and may be
              slightly faster in non-pathological cases.

       --stadfa
              Internal option: use staDFA algorithm for  submatch  extraction.
              The  main  difference with TDFA is that tag operations in staDFA
              are placed in states, not on transitions.

       --fixed-tags <none | toplevel | all>
              Internal option:  specify  whether  the  fixed-tag  optimization
              should  be  applied  to  all tags (all), none of them (none), or
              only those in toplevel concatenation (toplevel). The default  is
              all.   "Fixed"  tags  are  those that are located within a fixed
              distance to some other tag (called "base"). In such  cases  only
              the base tag needs to be tracked, and the value of the fixed tag
              can be computed as the value of the base tag plus a static  off-
              set.  For  tags  that  are under alternative or repetition it is
              also necessary to check if the base tag has a no-match value (in
              that case fixed tag should also be set to no-match, disregarding
              the offset). For tags in top-level concatenation  the  check  is
              not needed, because they always match.
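
              The computation of a fixed tag can be sketched as follows. The
              function fixed_tag is hypothetical, and NULL stands for the
              no-match value, which is an assumption that fits the pointer
              based definitions used elsewhere in this manual:

```c
#include <stddef.h>

/* Computes a fixed tag from its base tag and a static offset. If the
 * base tag did not match (NULL), the fixed tag is also set to no-match
 * and the offset is disregarded. For tags in top-level concatenation
 * the NULL check could be dropped, because such tags always match. */
const char *fixed_tag(const char *base, ptrdiff_t offset) {
    return base == NULL ? NULL : base + offset;
}
```

              For example, in an input such as "2024-05" matched by four
              digits, a dash and two digits, a tag at the start of the month
              is always five code units after a tag at the start of the year,
              so only one of the two needs to be tracked at run time.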

WARNINGS
       Warnings can be individually enabled, disabled and turned into errors.

       -W     Turn on all warnings.

       -Werror
              Turn warnings into errors. Note that this option  alone  doesn't
              turn  on  any warnings; it only affects those warnings that have
              been turned on so far or will be turned on later.

       -W<warning>
              Turn on warning.

       -Wno-<warning>
              Turn off warning.

       -Werror-<warning>
              Turn on warning and treat it as an error (this implies  -W<warn-
              ing>).

       -Wno-error-<warning>
              Don't  treat  this  particular warning as an error. This doesn't
              turn off the warning itself.

       -Wcondition-order
              Warn if the generated program makes implicit  assumptions  about
              condition numbering. One should use either the --header option
              or the conditions:re2c directive to generate a mapping of condi-
              tion  names  to numbers and then use the autogenerated condition
              names.

       -Wempty-character-class
              Warn if a regular expression contains an empty character  class.
              Trying  to  match  an  empty  character class makes no sense: it
              should always fail.  However, for backwards  compatibility  rea-
              sons  re2c  permits  empty  character classes and treats them as
              empty strings. Use the --empty-class option to  change  the  de-
              fault behavior.

       -Wmatch-empty-string
              Warn  if  a  rule is nullable (matches an empty string).  If the
              lexer runs in a loop and the empty match is  unintentional,  the
              lexer may unexpectedly hang in an infinite loop.

       -Wswapped-range
              Warn  if  the  lower  bound of a range is greater than its upper
              bound. The default  behavior  is  to  silently  swap  the  range
              bounds.

       -Wundefined-control-flow
              Warn  if  some input strings cause undefined control flow in the
              lexer (the faulty patterns are reported). This  is  a  dangerous
              and common mistake. It can be easily fixed by adding the default
              rule * which has the lowest priority, matches any code unit, and
              always consumes a single code unit.

       -Wunreachable-rules
              Warn about rules that are shadowed by other rules and will never
              match.

       -Wuseless-escape
              Warn if a symbol is escaped when it shouldn't be.   By  default,
              re2c  silently  ignores such escapes, but this may as well indi-
              cate a typo or an error in the escape sequence.

       -Wnondeterministic-tags
              Warn if a tag has n-th degree  of  nondeterminism,  where  n  is
              greater than 1.

       -Wsentinel-in-midrule
              Warn  if  the sentinel symbol occurs in the middle of a rule ---
              this may cause reads past the end of buffer, crashes  or  memory
              corruption in the generated lexer. This warning is only applica-
              ble if the sentinel method of checking for the end of  input  is
              used.   It  is set to an error if re2c:sentinel configuration is
              used.

BLOCKS AND DIRECTIVES
       Below is the list of re2c directives (syntactic  constructs  that  mark
       the  beginning  and  end of the code that should be processed by re2c).
       Named blocks were added in re2c version 2.2. They are exactly the  same
       as  unnamed  blocks,  except  that  the name can be used to reference a
       block in other parts of the program. More information on each directive
       can be found in the related sections.

       /*!re2c[:<name>] ... */
              A global re2c block with an optional name. The block may contain
              named definitions, configurations and rules in any order.  Named
              definitions  and configurations are defined in the global scope,
              so they are inherited by  subsequent  blocks.  The  code  for  a
              global block is generated at the point where the block is speci-
              fied.

       /*!local:re2c[:<name>] ... */
              A local re2c block with an optional name. Unlike global  blocks,
              definitions  and  configurations inside of a local block are not
              added into the global scope. In all other respects local  blocks
              are the same as global blocks.

       /*!rules:re2c[:<name>] ... */
              A  reusable  block  with an optional name. Rules blocks have the
              same structure as local or global blocks, but they do  not  pro-
              duce  any  code  and  they can be reused multiple times in other
              blocks  with  the  help  of  a  !use:<name>;  directive   or   a
              /*!use:re2c[:<name>] ... */ block. A rules block on its own does
              not add any definitions into the global scope. The code  for  it
              is generated at the point of use. Prior to re2c version 2.2,
              rules blocks required the -r --reusable option.

       /*!use:re2c[:<name>] ... */
              A use block that references a previously defined rules block. If
              the name is specified, re2c looks for a rules block with this
              name. Otherwise the most recent rules block is used (either a
              named or an unnamed one). A use block can add definitions,
              configurations and rules of its own, which are added to those of
              the referenced rules block. Prior to re2c version 2.2, use
              blocks required the -r --reusable option.

       !use:<name>;
              An in-block use directive that merges a previously defined rules
              block with the specified name into the current block. Named def-
              initions, configurations and rules of the referenced  block  are
              added  to  the current ones. Conflicts between overlapping rules
              and configurations are resolved in the usual way: the first rule
              takes  priority, and the latest configuration overrides the pre-
              ceding ones. One exception is the special rules *, $ and <!> for
              which  a block-local definition always takes priority. A use di-
              rective can be placed anywhere inside of a block,  and  multiple
              use directives are allowed.

       /*!max:re2c[:<name1>[:<name2>...]] ... */
              A  directive  that  generates YYMAXFILL definition.  An optional
              list of block names specifies which blocks  should  be  included
              when computing YYMAXFILL value (if the list is empty, all blocks
              are included).  By default the generated code is a macro-defini-
              tion  for C (#define YYMAXFILL <n>), or a global variable for Go
              (var YYMAXFILL int = <n>). It can be customized with an optional
              configuration  format  that  specifies  a  template string where
              @@{max} (or @@ for short) is replaced with the numeric value  of
              YYMAXFILL.

       /*!maxnmatch:re2c[:<name1>[:<name2>...]] ... */
              A  directive  that generates YYMAXNMATCH definition (it requires
              -P --posix-captures option).  An optional list  of  block  names
              specifies which blocks should be included when computing YYMAXN-
              MATCH value (if the list is empty, all blocks are included).  By
              default  the generated code is a macro-definition for C (#define
              YYMAXNMATCH <n>), or a global variable for Go  (var  YYMAXNMATCH
              int  = <n>). It can be customized with an optional configuration
              format that specifies a template string where @@{max} (or @@ for
              short) is replaced with the numeric value of YYMAXNMATCH.

       /*!stags:re2c[:<name1>[:<name2>...]] ... */,
       /*!mtags:re2c[:<name1>[:<name2>...]] ... */
              Directives that specify a template piece of code that is
              expanded for each s-tag/m-tag variable generated by re2c. An
              optional list of block names specifies which blocks should be
              included when computing the set of tag variables (if the list is
              empty, all blocks are included). There are two optional
              configurations: format and separator. Configuration format
              specifies a template string where @@{tag} (or @@ for short) is
              replaced with the name of each tag variable. Configuration
              separator specifies a piece of code used to join the generated
              format pieces for different tag variables.

       /*!getstate:re2c[:<name1>[:<name2>...]] ... */
              A  directive  that  generates  conditional dispatch on the lexer
              state (it requires --storable-state option).  An  optional  list
              of  block names specifies which blocks should be included in the
              state dispatch. The default transition goes to the  start  label
              of the first block on the list. If the list is empty, all blocks
              are included, and the default transition goes to the first block
              in the file that has a start label.  This directive is incompat-
              ible with the --loop-switch option  and  Rust,  as  it  requires
              cross-block  transitions  that  are unsupported without the goto
              statement.

       /*!conditions:re2c[:<name1>[:<name2>...]] ... */, /*!types:re2c... */
              A directive that generates condition  enumeration  (it  requires
              --conditions option).  An optional list of block names specifies
              which blocks should be included when computing the set of condi-
              tions  (if  the list is empty, all blocks are included).  By de-
              fault the generated code is an enumeration YYCONDTYPE. It can be
              customized  with  optional  configurations format and separator.
              Configuration format specifies a template string where @@{cond}
              (or @@ for short) is replaced with the name of each condition,
              and @@{num} is replaced with a numeric index of that condition.
              Configuration  separator  specifies a piece of code used to join
              the generated format pieces for different conditions.

       /*!include:re2c <file> */
              This directive allows one to include <file>,  which  must  be  a
              double-quoted  file path. The contents of the file are literally
              substituted in place of the directive, in the same way  as  #in-
              clude  works  in C/C++. This directive can be used together with
              the --depfile option to generate build  system  dependencies  on
              the included files.

       !include <file>;
              This  directive is the same as /*!include:re2c <file> */, except
              that it should be used inside of a re2c block.

       /*!header:re2c:on*/
              This directive marks the start of a header file. Everything
              after it and up to the following /*!header:re2c:off*/ directive
              is processed by re2c and written to the header file specified
              with the -t --type-header option.

       /*!header:re2c:off*/
              This directive marks the end of the header file started with
              /*!header:re2c:on*/.

       /*!ignore:re2c ... */
              A block whose contents are ignored and removed from the output
              file.

       %{ ... %}
              A  global  re2c block in the --flex-support mode. This is depre-
              cated and exists for backward compatibility.

API PRIMITIVES
       Here is a list of API primitives that may be used by the generated code
       in  order  to  interface  with the outer program.  Which primitives are
       needed depends on multiple factors, including the complexity of regular
       expressions,  input  representation, buffering, the use of various fea-
       tures and so on.  All the necessary primitives should be defined by the
       user  in  the form of macros, functions, variables, free-form pieces of
       code, or any other suitable form.  re2c does not (and cannot) check the
       definitions,  so if anything is missing or defined incorrectly the gen-
       erated code will not compile.

       YYCTYPE
              The type of the  input  characters  (code  units).   For  ASCII,
              EBCDIC and UTF-8 encodings it should be 1-byte unsigned integer.
              For UTF-16 or UCS-2 it should be 2-byte  unsigned  integer.  For
              UTF-32 it should be 4-byte unsigned integer.

       YYCURSOR
              A  pointer-like  l-value  that stores the current input position
              (usually a pointer of type YYCTYPE*). Initially YYCURSOR  should
              point to the first input character. It is advanced by the gener-
              ated code.  When a rule matches, YYCURSOR points to the position
              after  the  last matched character. It is used only in C pointer
              API.

       YYLIMIT
              A pointer-like r-value that stores the  end  of  input  position
              (usually  a  pointer of type YYCTYPE*). Initially YYLIMIT should
              point to the position after the last available input  character.
              It  is not changed by the generated code. The lexer compares YY-
              CURSOR to YYLIMIT in order to determine if there are enough  in-
              put characters left.  YYLIMIT is used only in C pointer API.

       YYMARKER
              A pointer-like l-value (usually a pointer of type YYCTYPE*) that
              stores the position of the latest matched rule. It  is  used  to
              restore the YYCURSOR position if the longer match fails and the
              lexer needs to roll back. Initialization is not needed. YYMARKER
              is used only in C pointer API.
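
              The interplay of YYCURSOR and YYMARKER described above can be
              sketched by hand. The function below is not re2c output: it is
              a hand-written stand-in for a lexer with two rules, "a" and
              "abc", over a NUL-terminated string. The function name lex, its
              signature and the return codes are assumptions made for this
              illustration.

```cpp
#include <cassert>
#include <cstddef>

typedef char YYCTYPE;

// Hand-written stand-in for a lexer with two rules: "a" (returns 1) and
// "abc" (returns 2). Returns 0 if neither rule matches. On return, *end
// holds the final YYCURSOR position.
static int lex(const YYCTYPE *s, const YYCTYPE **end) {
    assert(s != NULL && end != NULL);
    const YYCTYPE *YYCURSOR = s; // points to the first input character
    const YYCTYPE *YYMARKER = s; // assigned before use; set here only to avoid warnings

    if (*YYCURSOR != 'a') { *end = YYCURSOR; return 0; }
    ++YYCURSOR;
    YYMARKER = YYCURSOR;         // rule "a" has matched: remember the position

    // Try to extend the match to the longer rule "abc".
    if (*YYCURSOR == 'b') {
        ++YYCURSOR;
        if (*YYCURSOR == 'c') {
            ++YYCURSOR;
            *end = YYCURSOR;     // position after the last matched character
            return 2;
        }
    }

    YYCURSOR = YYMARKER;         // the longer match failed: roll back
    *end = YYCURSOR;
    return 1;
}
```

              On the input "abd" the sketch reads as far as 'd', fails to
              complete "abc", and rolls back so that only "a" is consumed.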

       YYCTXMARKER
              A  pointer-like l-value that stores the position of the trailing
              context (usually a pointer of type YYCTYPE*). No  initialization
              is  needed.  It is used only in C pointer API, and only with the
              lookahead operator /.

       YYFILL A generic API primitive with one argument  len.   YYFILL  should
              provide at least len more input characters or fail.  If re2c:eof
              is used, then len is always 1 and  YYFILL should  always  return
              to  the  calling  function; zero return value indicates success.
              If re2c:eof is not used, then YYFILL return value is ignored and
              it should not return on failure. The maximum value of len is YY-
              MAXFILL.  The definition of YYFILL can be  either  function-like
              or  free-form depending on the API style (see re2c:api:style and
              re2c:define:YYFILL:naked).
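
              A refill routine of the kind YYFILL is expected to wrap might
              look like the sketch below, which follows the re2c:eof
              convention (zero return value means success). Everything here
              is an assumption made for illustration: the Input struct, the
              tiny CAPACITY, the in-memory src string that stands in for a
              file, and the hand-written count_tokens loop that plays the
              role of the generated lexer. A real lexer would also shift
              YYMARKER and any tag variables together with the cursor.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

static const size_t CAPACITY = 4;  // deliberately tiny so that refills happen

struct Input {
    std::string src;         // stand-in for a file or a socket
    size_t src_pos;          // how much of src has been consumed
    char buf[CAPACITY + 1];  // one extra byte for the sentinel
    char *cur, *tok, *lim;   // roles: YYCURSOR, start of token, YYLIMIT
    bool eof;

    explicit Input(const std::string &s) : src(s), src_pos(0), eof(false) {
        // Start with an empty buffer so that the first fill shifts everything out.
        cur = tok = lim = buf + CAPACITY;
        *lim = 0;
    }
};

// Returns 0 on success, nonzero if no more input can be provided.
static int fill(Input &in) {
    if (in.eof) return 1;
    assert(in.buf <= in.tok && in.tok <= in.lim);
    const size_t shift = static_cast<size_t>(in.tok - in.buf);
    const size_t used = static_cast<size_t>(in.lim - in.tok);
    if (shift < 1) return 2;                 // the current token fills the whole buffer
    std::memmove(in.buf, in.tok, used);      // keep the unfinished token
    in.cur -= shift; in.tok -= shift; in.lim -= shift;
    const size_t n = std::min<size_t>(CAPACITY - used, in.src.size() - in.src_pos);
    std::memcpy(in.lim, in.src.data() + in.src_pos, n);
    in.src_pos += n;
    in.lim += n;
    *in.lim = 0;                             // sentinel after the last character
    in.eof = in.lim < in.buf + CAPACITY;     // a short read means end of input
    return 0;
}

// Hand-written stand-in for a generated lexer: every character is one token.
static size_t count_tokens(Input &in) {
    size_t n = 0;
    for (;;) {
        in.tok = in.cur;
        if (in.cur >= in.lim) {              // out of buffered input
            if (fill(in) == 0) continue;     // success: retry at the shifted position
            return n;                        // failure: end of input
        }
        ++in.cur;
        ++n;
    }
}
```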

       YYMAXFILL
              An integral constant equal to the maximum value of the  argument
              to YYFILL.  It can be generated with /*!max:re2c*/ directive.

       YYLESSTHAN
              A generic API primitive with one argument len.  It should be de-
              fined as an r-value of boolean type that equals true if and only
              if  there  are less than len input characters left.  The defini-
              tion can be either function-like or free-form depending  on  the
              API style (see re2c:api:style).

       YYPEEK A generic API primitive with no arguments.  It should be defined
              as an r-value of type YYCTYPE that is equal to the character  at
              the  current  input position. The definition can be either func-
              tion-like  or  free-form  depending  on  the  API   style   (see
              re2c:api:style).

       YYSKIP A  generic  API  primitive with no arguments.  YYSKIP should ad-
              vance the current input position by one character.  The  defini-
              tion  can  be either function-like or free-form depending on the
              API style (see re2c:api:style).

       YYBACKUP
              A generic API primitive with no arguments.  YYBACKUP should save
              the  current  input position, which is later restored with YYRE-
              STORE.   The  definition  should  be  either  function-like   or
              free-form depending on the API style (see re2c:api:style).

       YYRESTORE
              A generic API primitive with no arguments.  YYRESTORE should re-
              store the current input position to the value saved by YYBACKUP.
              The  definition  should be either function-like or free-form de-
              pending on the API style (see re2c:api:style).
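
              With the generic API the input does not have to be a pointer.
              The sketch below defines YYLESSTHAN, YYPEEK, YYSKIP, YYBACKUP
              and YYRESTORE as function-like macros over an index into a
              std::string. The Input struct, the variable name in that the
              macros rely on, and the hand-written lex function (a stand-in
              for generated code that matches an integer or a decimal
              fraction) are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

struct Input {
    std::string text;
    size_t cursor;  // current input position
    size_t marker;  // position saved by YYBACKUP
};

// Function-like definitions. They take no lexer context as arguments, so they
// refer to a variable named `in` that must be in scope at the point of use.
#define YYLESSTHAN(len) (in.text.size() - in.cursor < static_cast<size_t>(len))
#define YYPEEK()        (YYLESSTHAN(1) ? '\0' : in.text[in.cursor])
#define YYSKIP()        (++in.cursor)
#define YYBACKUP()      (in.marker = in.cursor)
#define YYRESTORE()     (in.cursor = in.marker)

static bool is_digit(char c) { return c >= '0' && c <= '9'; }

// Hand-written stand-in for two rules: digits (returns 1) and
// digits "." digits (returns 2). Returns 0 if neither rule matches.
static int lex(Input &in) {
    assert(in.cursor <= in.text.size());
    char yych = YYPEEK();
    if (!is_digit(yych)) return 0;
    do { YYSKIP(); yych = YYPEEK(); } while (is_digit(yych));

    YYBACKUP();                  // the integer rule has matched
    if (yych == '.') {
        YYSKIP();
        yych = YYPEEK();
        if (is_digit(yych)) {
            do { YYSKIP(); yych = YYPEEK(); } while (is_digit(yych));
            return 2;
        }
    }
    YYRESTORE();                 // no fraction after all: roll back
    return 1;
}
```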

       YYBACKUPCTX
              A generic API primitive with zero arguments.  YYBACKUPCTX should
              save  the current input position as the position of the trailing
              context, which is later restored by YYRESTORECTX.   The  defini-
              tion  should  be  either function-like or free-form depending on
              the API style (see re2c:api:style).

       YYRESTORECTX
              A generic API primitive with no arguments.  YYRESTORECTX  should
              restore  the  trailing  context position saved with YYBACKUPCTX.
              The definition should be either function-like or  free-form  de-
              pending on the API style (see re2c:api:style).

       YYRESTORETAG
              A  generic  API  primitive  with one argument tag.  YYRESTORETAG
              should restore the trailing context position  to  the  value  of
              tag.  The definition should be either function-like or free-form
              depending on the API style (see re2c:api:style).

       YYSTAGP
              A generic API primitive with one argument tag, where tag can  be
              a  pointer or an offset (see submatch extraction section for de-
              tails).  YYSTAGP should set tag to the current  input  position.
              The  definition  should be either function-like or free-form de-
              pending on the API style (see re2c:api:style).

       YYSTAGN
              A generic API primitive with one argument tag, where tag can  be
              a pointer or an offset (see submatch extraction section for
              details). YYSTAGN should set tag to a value that represents a
              non-existent input position. The definition should be either
              function-like or free-form  depending  on  the  API  style  (see
              re2c:api:style).
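
              One possible pair of definitions for pointer tags is sketched
              below, with NULL chosen to represent a non-existent position.
              The lex function is a hand-written stand-in for generated
              code, not re2c output: it matches lowercase letters optionally
              followed by "=" and digits, and a tag v marks where the digits
              start. The names lex, v_out and end are assumptions made for
              illustration.

```cpp
#include <cassert>
#include <cstddef>

// Possible function-like definitions for pointer tags. They refer to a
// variable named YYCURSOR that must be in scope at the point of use.
#define YYSTAGP(tag) ((tag) = YYCURSOR)
#define YYSTAGN(tag) ((tag) = NULL)

static bool is_digit(char c) { return c >= '0' && c <= '9'; }
static bool is_lower(char c) { return c >= 'a' && c <= 'z'; }

// Matches letters, optionally followed by "=" and digits. On success *v_out
// points to the first digit, or is NULL if the optional part is absent.
static bool lex(const char *s, const char **v_out, const char **end) {
    assert(s != NULL && v_out != NULL && end != NULL);
    const char *YYCURSOR = s;
    const char *v;

    if (!is_lower(*YYCURSOR)) return false;
    while (is_lower(*YYCURSOR)) ++YYCURSOR;

    if (YYCURSOR[0] == '=' && is_digit(YYCURSOR[1])) {
        ++YYCURSOR;
        YYSTAGP(v);              // the tag is set to the current position
        while (is_digit(*YYCURSOR)) ++YYCURSOR;
    } else {
        YYSTAGN(v);              // the tagged part did not take part in the match
    }

    *v_out = v;
    *end = YYCURSOR;
    return true;
}
```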

       YYMTAGP
              A  generic  API primitive with one argument tag.  YYMTAGP should
              append the current position to the submatch history of tag  (see
              the  submatch  extraction  section for details.)  The definition
              should be either function-like or free-form depending on the API
              style (see re2c:api:style).

       YYMTAGN
              A  generic  API primitive with one argument tag.  YYMTAGN should
              append a value that represents a non-existent input position
              to the submatch history of tag (see the submatch extraction
              section for details). The definition can be either
              function-like or free-form depending on the API style (see
              re2c:api:style).
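
              A submatch history can be stored in several ways. The sketch
              below keeps histories in a shared vector where each node
              refers to its predecessor, so that a tag variable is just an
              index of the latest node. All names (MtagNode, MtagTrie,
              add_mtag, unfold, MTAG_ROOT, NO_POSITION) are assumptions made
              for illustration, and number_starts is a hand-written
              stand-in for generated code that records the start offset of
              every number in a comma-separated list.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

static const int MTAG_ROOT = -1;     // index that denotes an empty history
static const long NO_POSITION = -1;  // value that denotes a non-existent position

struct MtagNode { int pred; long pos; };
typedef std::vector<MtagNode> MtagTrie;

// Appends pos to the history of tag. YYMTAGP(tag) could call this with the
// current input offset, and YYMTAGN(tag) with NO_POSITION.
static void add_mtag(MtagTrie &trie, int &tag, long pos) {
    MtagNode node = {tag, pos};
    tag = static_cast<int>(trie.size());
    trie.push_back(node);
}

// Collects the history of tag, oldest entry first.
static std::vector<long> unfold(const MtagTrie &trie, int tag) {
    std::vector<long> out;
    while (tag != MTAG_ROOT) {
        assert(tag >= 0 && static_cast<size_t>(tag) < trie.size());
        out.push_back(trie[static_cast<size_t>(tag)].pos);
        tag = trie[static_cast<size_t>(tag)].pred;
    }
    std::reverse(out.begin(), out.end());
    return out;
}

static bool is_digit(char c) { return c >= '0' && c <= '9'; }

// Hand-written stand-in for a repeated group of digits followed by an
// optional comma, with an m-tag at the start of each group.
static std::vector<long> number_starts(const char *s) {
    MtagTrie trie;
    int t = MTAG_ROOT;
    const char *YYCURSOR = s;
    while (is_digit(*YYCURSOR)) {
        add_mtag(trie, t, static_cast<long>(YYCURSOR - s));  // role of YYMTAGP(t)
        while (is_digit(*YYCURSOR)) ++YYCURSOR;
        if (*YYCURSOR == ',') ++YYCURSOR;
    }
    return unfold(trie, t);
}
```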

       YYSHIFT
              A generic API primitive with one argument shift.  YYSHIFT should
              shift  the current input position by shift characters (the shift
              value may be negative).  The  definition  can  be  either  func-
              tion-like   or   free-form  depending  on  the  API  style  (see
              re2c:api:style).

       YYSHIFTSTAG
              A generic  API primitive with  two  arguments,  tag  and  shift.
              YYSHIFTSTAG  should  shift  tag  by  shift characters (the shift
              value may be negative).  The  definition  can  be  either  func-
              tion-like   or   free-form  depending  on  the  API  style  (see
              re2c:api:style).

       YYSHIFTMTAG
              A generic API primitive  with  two  arguments,  tag  and  shift.
              YYSHIFTMTAG  should shift the latest value in the history of tag
              by shift characters (the shift value may be negative).  The def-
              inition should be either function-like or free-form depending on
              the API style (see re2c:api:style).

       YYMAXNMATCH
              An integral constant equal to the maximal number of  POSIX  cap-
              turing   groups  in  a  rule.  It  is  generated  with  /*!maxn-
              match:re2c*/ directive.

       YYCONDTYPE
              The type of the condition enum.  It should be  generated  either
              with  the  /*!types:re2c*/ directive or the -t --type-header op-
              tion.

       YYGETCONDITION
              An API primitive with zero arguments.  It should be  defined  as
              an  r-value of type YYCONDTYPE that is equal to the current con-
              dition identifier. The definition can be either function-like or
              free-form  depending  on  the  API style (see re2c:api:style and
              re2c:define:YYGETCONDITION:naked).

       YYSETCONDITION
              An API primitive with one argument cond.  The meaning of  YYSET-
              CONDITION  is  to  set the current condition identifier to cond.
              The definition should be either function-like or  free-form  de-
              pending on the API style (see re2c:api:style and re2c:define:YY-
              SETCONDITION@cond).
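
              The role of YYGETCONDITION and YYSETCONDITION can be sketched
              with a hand-written dispatch on the current condition. The
              code below is not re2c output. The two conditions (init and
              comment, spelled with the default yyc prefix), the variable
              name cond that the macros rely on, and the function
              count_code_chars are assumptions made for illustration. The
              function counts the characters that lie outside of comments
              running from '#' to the end of the line.

```cpp
#include <cassert>
#include <cstddef>

// Normally generated by re2c; written by hand for this sketch.
enum YYCONDTYPE { yycinit, yyccomment };

// Function-like definitions that refer to a variable named `cond`.
#define YYGETCONDITION()  (cond)
#define YYSETCONDITION(c) (cond = (c))

static int count_code_chars(const char *s) {
    assert(s != NULL);
    YYCONDTYPE cond = yycinit;
    int n = 0;
    for (const char *YYCURSOR = s; *YYCURSOR != '\0'; ++YYCURSOR) {
        switch (YYGETCONDITION()) {  // dispatch on the current condition
        case yycinit:
            if (*YYCURSOR == '#') YYSETCONDITION(yyccomment);
            else ++n;
            break;
        case yyccomment:
            if (*YYCURSOR == '\n') YYSETCONDITION(yycinit);
            break;
        }
    }
    return n;
}
```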

       YYGETSTATE
              An API primitive with zero arguments.  It should be  defined  as
              an  r-value  of  integer type that is equal to the current lexer
              state. Should be initialized to -1. The definition can be either
              function-like  or  free-form  depending  on  the  API style (see
              re2c:api:style and re2c:define:YYGETSTATE:naked).

       YYSETSTATE
              An API primitive with one argument state.  The meaning of YYSET-
              STATE  is  to set the current lexer state to state.  The defini-
              tion should be either function-like or  free-form  depending  on
              the   API   style  (see  re2c:api:style  and  re2c:define:YYSET-
              STATE@state).

       YYDEBUG
              A debug API primitive with two arguments. It can be used to  de-
              bug  the generated code (with -d --debug-output option). YYDEBUG
              should return no value and accept two arguments: state (either a
              DFA state index or -1) and symbol (the current input symbol).

       yych   An l-value of type YYCTYPE that stores the current input charac-
              ter.  User definition is necessary only with -f --storable-state
              option.

       yyaccept
              An  l-value  of unsigned integral type that stores the number of
              the latest matched rule.  User definition is necessary only with
              -f --storable-state option.

       yynmatch
              An  l-value  of unsigned integral type that stores the number of
              POSIX capturing groups in the matched rule.  Used only  with  -P
              --posix-captures option.

       yypmatch
              An array of l-values that are used to hold the tag values corre-
              sponding to the capturing parentheses in the matching rule.  Ar-
              ray  length must be at least yynmatch * 2 (usually YYMAXNMATCH *
              2 is a good choice).  Used only with -P --posix-captures option.

CONFIGURATIONS
       re2c:api, re2c:flags:input
              Same as the --api option.

       re2c:api:sigil
              Specify the marker ("sigil") that is used  for  argument  place-
              holders  in the API primitives. The default is @@. A placeholder
              starts with sigil followed by the argument name in curly braces.
              For  example,  if sigil is set to $, then placeholders will have
              the form ${name}. Single-argument APIs may use  shorthand  nota-
              tion  without  the name in braces. This option can be overridden
              by options for individual API primitives, e.g.   re2c:define:YY-
              FILL@len for YYFILL.
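
              The placeholder notation can be illustrated with a small
              substitution routine. This is a sketch of the idea only, not
              the way re2c implements it: the function expand and its
              parameters are assumptions made for illustration. It replaces
              both the full form (sigil followed by the argument name in
              curly braces) and the shorthand form (the bare sigil) with the
              given value.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Replaces every occurrence of sigil{name}, and every bare sigil, in tmpl
// with value. The full form is checked first so that it wins over shorthand.
static std::string expand(const std::string &tmpl, const std::string &sigil,
                          const std::string &name, const std::string &value) {
    assert(!sigil.empty());
    const std::string full = sigil + "{" + name + "}";
    std::string out;
    size_t i = 0;
    while (i < tmpl.size()) {
        if (tmpl.compare(i, full.size(), full) == 0) {
            out += value;
            i += full.size();
        } else if (tmpl.compare(i, sigil.size(), sigil) == 0) {
            out += value;
            i += sigil.size();
        } else {
            out += tmpl[i++];
        }
    }
    return out;
}
```

              For example, with the default sigil the template
              "if (fill(@@{len}) != 0) return -1;" and the value "3" expand
              to "if (fill(3) != 0) return -1;".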

       re2c:api:style
              Specify  API  style.  Possible values are functions (the default
              for C) and free-form (the default for Go and  Rust).   In  func-
              tions  style  API primitives are generated with an argument list
              in parentheses following the name of the  primitive.  The  argu-
              ments  are  provided  only for autogenerated parameters (such as
              the number of characters passed to YYFILL), but not for the gen-
              eral lexer context, so the primitives behave more like macros in
              C/C++ or closures in Go and Rust.  In free-form style API primi-
              tives  do  not  have  a  fixed  form:  they should be defined as
              strings containing free-form pieces of  code  with  interpolated
              variables  of  the  form @@{var} or @@ (they correspond to argu-
              ments in function-like style).  This configuration may be  over-
              ridden  for  individual API primitives, see for example re2c:de-
              fine:YYFILL:naked configuration for YYFILL.

       re2c:bit-vectors, re2c:flags:bit-vectors, re2c:flags:b
              Same as the --bit-vectors  option,  but  can  be  configured  on
              per-block basis.

       re2c:case-insensitive, re2c:flags:case-insensitive
              Same  as the --case-insensitive option, but can be configured on
              per-block basis.

       re2c:case-inverted, re2c:flags:case-inverted
              Same as the --case-inverted option, but  can  be  configured  on
              per-block basis.

       re2c:case-ranges, re2c:flags:case-ranges
              Same  as  the  --case-ranges  option,  but  can be configured on
              per-block basis.

       re2c:computed-gotos, re2c:flags:computed-gotos, re2c:flags:g
              Same as the --computed-gotos option, but can  be  configured  on
              per-block basis.

       re2c:computed-gotos:threshold, re2c:cgoto:threshold
              If  computed goto is used, this configuration specifies the com-
              plexity threshold that triggers the generation  of  jump  tables
              instead  of  nested if statements and bitmaps. The default value
              is 9.

       re2c:cond:goto
              Specifies a piece of code used for  the  autogenerated  shortcut
              rules :=> in conditions. The default is goto @@;.  The @@ place-
              holder is substituted with condition  name  (see  configurations
              re2c:api:sigil and re2c:cond:goto@cond).

       re2c:cond:goto@cond
              Specifies   the   sigil   used   for  argument  substitution  in
              re2c:cond:goto definition. The default value is  @@.   Overrides
              the more generic re2c:api:sigil configuration.

       re2c:cond:divider
              Defines the divider for condition blocks. The default value is
              /*  ***********************************  */.   Placeholders  are
              substituted with condition name (see re2c:api:sigil and
              re2c:cond:divider@cond).

       re2c:cond:divider@cond
              Specifies  the  sigil  used   for   argument   substitution   in
              re2c:cond:divider  definition. The default is @@.  Overrides the
              more generic re2c:api:sigil configuration.

       re2c:cond:prefix, re2c:condprefix
              Specifies the prefix used for condition labels.  The default  is
              yyc_.

       re2c:cond:enumprefix, re2c:condenumprefix
              Specifies  the  prefix  used for condition identifiers.  The de-
              fault is yyc.

       re2c:debug-output, re2c:flags:debug-output, re2c:flags:d
              Same as the --debug-output option,  but  can  be  configured  on
              per-block basis.

       re2c:define:YYBACKUP
              Defines  generic  API primitive YYBACKUP (see the API primitives
              section).

       re2c:define:YYBACKUPCTX
              Defines generic API primitive YYBACKUPCTX (see  the  API  primi-
              tives section).

       re2c:define:YYCONDTYPE
              Defines YYCONDTYPE (see the API primitives section).

       re2c:define:YYCTYPE
              Defines YYCTYPE (see the API primitives section).

       re2c:define:YYCTXMARKER
              Defines  API  primitive YYCTXMARKER (see the API primitives sec-
              tion).

       re2c:define:YYCURSOR
              Defines API primitive YYCURSOR (see the API primitives section).

       re2c:define:YYDEBUG
              Defines API primitive YYDEBUG (see the API primitives section).

       re2c:define:YYFILL
              Defines API primitive YYFILL (see the API primitives section).

       re2c:define:YYFILL@len
              Specifies the sigil used for  argument  substitution  in  YYFILL
              definition.   Defaults   to  @@.   Overrides  the  more  generic
              re2c:api:sigil configuration.

       re2c:define:YYFILL:naked
              Overrides the more generic re2c:api:style configuration for  YY-
              FILL.  Zero value corresponds to free-form API style.

       re2c:define:YYGETCONDITION
              Defines  API  primitive  YYGETCONDITION  (see the API primitives
              section).

       re2c:define:YYGETCONDITION:naked
              Overrides the  more  generic  re2c:api:style  configuration  for
              YYGETCONDITION. Zero value corresponds to free-form API style.

       re2c:define:YYGETSTATE
              Defines  API  primitive  YYGETSTATE (see the API primitives sec-
              tion).

       re2c:define:YYGETSTATE:naked
              Overrides the  more  generic  re2c:api:style  configuration  for
              YYGETSTATE. Zero value corresponds to free-form API style.

       re2c:define:YYLESSTHAN
              Defines generic API primitive YYLESSTHAN (see the API primitives
              section).

       re2c:define:YYLIMIT
              Defines API primitive YYLIMIT (see the API primitives section).

       re2c:define:YYMARKER
              Defines API primitive YYMARKER (see the API primitives section).

       re2c:define:YYMTAGN
              Defines generic API primitive YYMTAGN (see  the  API  primitives
              section).

       re2c:define:YYMTAGP
              Defines  generic  API  primitive YYMTAGP (see the API primitives
              section).

       re2c:define:YYPEEK
              Defines generic API primitive YYPEEK  (see  the  API  primitives
              section).

       re2c:define:YYRESTORE
              Defines  generic API primitive YYRESTORE (see the API primitives
              section).

       re2c:define:YYRESTORECTX
              Defines generic API primitive YYRESTORECTX (see the  API  primi-
              tives section).

       re2c:define:YYRESTORETAG
              Defines  generic  API primitive YYRESTORETAG (see the API primi-
              tives section).

       re2c:define:YYSETCONDITION
              Defines API primitive YYSETCONDITION  (see  the  API  primitives
              section).

       re2c:define:YYSETCONDITION@cond
              Specifies  the sigil used for argument substitution in YYSETCON-
              DITION definition. The default value is @@.  Overrides the  more
              generic re2c:api:sigil configuration.

       re2c:define:YYSETCONDITION:naked
              Overrides  the more generic re2c:api:style configuration for YY-
              SETCONDITION. Zero value corresponds to free-form API style.

       re2c:define:YYSETSTATE
              Defines API primitive YYSETSTATE (see the  API  primitives  sec-
              tion).

       re2c:define:YYSETSTATE@state
              Specifies the sigil used for argument substitution in YYSETSTATE
              definition. The default value is @@.  Overrides the more generic
              re2c:api:sigil configuration.

       re2c:define:YYSETSTATE:naked
              Overrides  the more generic re2c:api:style configuration for YY-
              SETSTATE. Zero value corresponds to free-form API style.

       re2c:define:YYSKIP
              Defines generic API primitive YYSKIP  (see  the  API  primitives
              section).

       re2c:define:YYSHIFT
              Defines  generic  API  primitive YYSHIFT (see the API primitives
              section).

       re2c:define:YYSHIFTMTAG
              Defines generic API primitive YYSHIFTMTAG (see  the  API  primi-
              tives section).

       re2c:define:YYSHIFTSTAG
              Defines  generic  API  primitive YYSHIFTSTAG (see the API primi-
              tives section).

       re2c:define:YYSTAGN
              Defines generic API primitive YYSTAGN (see  the  API  primitives
              section).

       re2c:define:YYSTAGP
              Defines  generic  API  primitive YYSTAGP (see the API primitives
              section).

       re2c:empty-class, re2c:flags:empty-class
              Same as the --empty-class  option,  but  can  be  configured  on
              per-block basis.

       re2c:encoding:ebcdic, re2c:flags:ecb, re2c:flags:e
              Same  as the --ebcdic option, but can be configured on per-block
              basis.

       re2c:encoding:ucs2, re2c:flags:wide-chars, re2c:flags:w
              Same as the --ucs2 option, but can be  configured  on  per-block
              basis.

       re2c:encoding:utf8, re2c:flags:utf-8, re2c:flags:8
              Same  as  the  --utf8 option, but can be configured on per-block
              basis.

       re2c:encoding:utf16, re2c:flags:utf-16, re2c:flags:x
              Same as the --utf16 option, but can be configured  on  per-block
              basis.

       re2c:encoding:utf32, re2c:flags:unicode, re2c:flags:u
              Same  as  the --utf32 option, but can be configured on per-block
              basis.

       re2c:encoding-policy, re2c:flags:encoding-policy
              Same as the --encoding-policy option, but can be  configured  on
              per-block basis.

       re2c:eof
              Specifies the sentinel symbol used with the end-of-input rule $.
              The default value is -1 ($ rule is  not  used).  Other  possible
              values  include  all  valid code units. Only decimal numbers are
              recognized.

       re2c:header, re2c:flags:type-header, re2c:flags:t
              Specifies the name of the generated header file relative to  the
              directory of the output file. Same as the --header option except
              that the file path is relative.

       re2c:indent:string
              Specifies the string used for indentation. The default is a sin-
              gle  tab character "\t". Indent string should contain whitespace
              characters only.  To disable indentation entirely, set this con-
              figuration to an empty string.

       re2c:indent:top
              Specifies  the minimum amount of indentation to use. The default
              value is zero. The value should be a non-negative  integer  num-
              ber.

       re2c:label:prefix, re2c:labelprefix
              Specifies  the  prefix used for DFA state labels. The default is
              yy.

       re2c:label:start, re2c:startlabel
              Controls the generation of a  block  start  label.  The  default
              value  is  zero,  which  means that the start label is generated
              only if it is used. An integer value greater  than  zero  forces
              the generation of start label even if it is unused by the lexer.
              A string value also forces start label generation and  sets  the
              label  name  to the specified string. This configuration applies
              only to the current block (it is reset to default for  the  next
              block).

       re2c:label:yyFillLabel
              Specifies  the prefix of YYFILL labels used with re2c:eof and in
              storable state mode.

       re2c:label:yyloop
              Specifies the name of the label marking the start of  the  lexer
              loop with --loop-switch option. The default is yyloop.

       re2c:label:yyNext
              Specifies the name of the optional label that follows YYGETSTATE
              switch in storable state mode (enabled  with  re2c:state:nextla-
              bel). The default is yyNext.

       re2c:lookahead, re2c:flags:lookahead
              Same as inverted --no-lookahead option, but can be configured on
              per-block basis.

       re2c:nested-ifs, re2c:flags:nested-ifs, re2c:flags:s
              Same as the  --nested-ifs  option,  but  can  be  configured  on
              per-block basis.

       re2c:posix-captures, re2c:flags:posix-captures, re2c:flags:P
              Same  as  the  --posix-captures option, but can be configured on
              per-block basis.

       re2c:tags, re2c:flags:tags, re2c:flags:T
              Same as the --tags option, but can be  configured  on  per-block
              basis.

       re2c:tags:expression
              Specifies  the  expression  used  for tag variables.  By default
              re2c generates expressions of the form yyt<N>. This might be in-
              convenient,  for  example if tag variables are defined as fields
              in a struct. All occurrences of @@{tag} or @@ are replaced  with
              the actual tag name. For example, re2c:tags:expression = "s.@@";
              results in expressions of the form  s.yyt<N>  in  the  generated
              code.  See also re2c:api:sigil configuration.

       re2c:tags:prefix
              Specifies the prefix for tag variable names. The default is yyt.

       re2c:sentinel
              Specifies  the  sentinel symbol used for the end-of-input checks
              (when bounds checks are disabled with  re2c:yyfill:enable  =  0;
              and  re2c:eof  is  not  set). This configuration does not affect
              code generation: its purpose is to verify that the  sentinel  is
              not  allowed  in the middle of a rule, and ensure that the lexer
              won't read past the end of buffer. The default value is  -1  (in
              that  case  re2c assumes that the sentinel is zero, which is the
              most common case). Only decimal numbers are recognized.

       re2c:state:abort
              If set to a positive integer value, changes the default case  in
              the  YYGETSTATE switch: the default case aborts the program, and
              an explicit -1 case contains the transition to the start of  the
              block.

       re2c:state:nextlabel
              Controls if the YYGETSTATE switch is followed by a yyNext  label
              (the default value is zero, which corresponds to no label).  Al-
              ternatively one can use re2c:label:start to generate a  specific
              start  label, or an explicit getstate:re2c directive to generate
              the YYGETSTATE switch separately from the lexer block.

       re2c:unsafe, re2c:flags:unsafe
              Same as  the  --no-unsafe  option,  but  can  be  configured  on
              per-block  basis.   If set to zero, it suppresses the generation
              of unsafe wrappers around YYPEEK. The default is non-zero (wrap-
              pers are generated).  This configuration is specific to Rust.

       re2c:variable:yyaccept
              Specifies  the name of the yyaccept variable (see the API primi-
              tives section).

       re2c:variable:yybm
              Specifies the name of the yybm variable (used for bitmaps).

       re2c:variable:yybm:hex, re2c:yybm:hex
              If set to nonzero, bitmaps for the --bit-vectors option are gen-
              erated  in  hexadecimal format. The default is zero (bitmaps are
              in decimal format).

       re2c:variable:yych
              Specifies the name of the yych variable (see the API  primitives
              section).

       re2c:variable:yych:emit, re2c:yych:emit
              If  set  to zero, yych definition is not generated.  The default
              is non-zero.

       re2c:variable:yych:conversion, re2c:yych:conversion
              If set to non-zero, re2c automatically generates a conversion to
              YYCTYPE every time yych is read. The default is zero (no conver-
              sion).

       re2c:variable:yyctable
              Specifies the name of the yyctable variable (the jump table gen-
              erated for YYGETCONDITION switch with --computed-gotos option).

       re2c:variable:yytarget
              Specifies the name of the yytarget variable.

       re2c:variable:yystable
              Deprecated.

       re2c:variable:yystate
              Specifies  the  name  of  the  yystate  variable  (used with the
              --loop-switch option to store the current DFA state).

       re2c:yyfill:check
              If set to zero, suppresses the generation  of  pre-YYFILL  check
              for the number of input characters (the YYLESSTHAN definition in
              generic API and the YYLIMIT-based comparison in C pointer  API).
              The default is non-zero (generate the check).

       re2c:yyfill:enable
              If  set  to  zero, suppresses the generation of YYFILL (together
              with the check). This should be used when the whole  input  fits
              into  one  piece  of memory (there is no need for buffering) and
              the end-of-input checks do not rely on the YYFILL  checks  (e.g.
              if  a sentinel character is used).  Use warnings (-W option) and
              re2c:sentinel configuration to verify that the  generated  lexer
              cannot read past the end of input.  The default is non-zero (YY-
              FILL is enabled).

       re2c:yyfill:parameter
              If set to zero, suppresses the generation of parameter passed to
              YYFILL.   The parameter is the minimum number of characters that
              must be supplied.  Defaults to non-zero (the parameter is gener-
              ated).   This  configuration  can  be  overridden  with re2c:de-
              fine:YYFILL:naked or re2c:api:style.

REGULAR EXPRESSIONS
       re2c uses the following syntax for regular expressions:

       • "foo" case-sensitive string literal

       • 'foo' case-insensitive string literal

       • [a-xyz], [^a-xyz] character class (possibly negated)

       • . any character except newline

       • R \ S difference of character classes R and S

       • R* zero or more occurrences of R

       • R+ one or more occurrences of R

       • R? optional R

       • R{n} repetition of R exactly n times

       • R{n,} repetition of R at least n times

       • R{n,m} repetition of R from n to m times

       • (R) just R; parentheses  are  used  to  override  precedence  or  for
         POSIX-style submatch

       • R S concatenation: R followed by S

       • R | S alternative: R or S

       • R / S lookahead: R followed by S, but S is not consumed

       • name the regular expression defined as name (or literal string "name"
         in Flex compatibility mode)

       • {name} the regular expression defined as name in  Flex  compatibility
         mode

       • @stag  an s-tag: saves the last input position at which @stag matches
         in a variable named stag

       • #mtag an m-tag: saves all input positions at which #mtag matches in a
         variable named mtag

       Character  classes and string literals may contain the following escape
       sequences: \a, \b, \f, \n, \r, \t, \v, \\, octal escapes \ooo and hexa-
       decimal escapes \xhh, \uhhhh and \Uhhhhhhhh.

HANDLING THE END OF INPUT
       One  of the main problems for the lexer is to know when to stop.  There
       are a few terminating conditions:

       • the lexer may match some rule (including default rule *) and come  to
         a final state

       • the lexer may fail to match any rule and come to a default state

       • the lexer may reach the end of input

       The  first  two  conditions  terminate the lexer in a "natural" way: it
       comes to a state with no outgoing transitions, and the  matching  auto-
       matically  stops.  The  third condition, end of input, is different: it
       may happen in any state, and the lexer should be  able  to  handle  it.
       Checking  for the end of input interrupts the normal lexer workflow and
       adds conditional branches to the generated  program,  therefore  it  is
       necessary  to  minimize  the number of such checks. re2c supports a few
       different methods for handling the end of input. Which one to  use  de-
       pends on the complexity of regular expressions, the need for buffering,
       performance considerations and other factors. Here is a list  of  meth-
       ods:

       • Sentinel.   This  method  eliminates  the  need  for the end of input
         checks altogether. It is simple and efficient,  but  limited  to  the
         case  when there is a natural "sentinel" character that can never oc-
         cur in valid input. This character may still occur in invalid  input,
         but  it should not be allowed by the regular expressions, except per-
         haps as the last character of a rule. The sentinel is appended at the
         end  of  input and serves as a stop signal: when the lexer reads this
         character, it is either a syntax error or the end of input.  In  both
         cases  the  lexer  should stop. This method is used if YYFILL is dis-
         abled with re2c:yyfill:enable = 0; and re2c:eof has the default value
         -1.

       • Sentinel  with  bounds checks.  This method is generic: it allows one
         to handle any input without restrictions on the regular  expressions.
         The idea is to reduce the number of end of input checks by performing
         them only on certain characters. Similar to  the  "sentinel"  method,
         one  of  the characters is chosen as a "sentinel" and appended at the
         end of input. However, there is no restriction on where the  sentinel
         may  occur  (in  fact,  any  character can be chosen for a sentinel).
         When the lexer reads  this  character,  it  additionally  performs  a
         bounds  check.   If  the current position is within bounds, the lexer
         resumes matching and handles the sentinel  as  a  regular  character.
         Otherwise it invokes YYFILL (unless it is disabled). If more input is
         supplied, the lexer will rematch the last character and  continue  as
         if  the  sentinel  wasn't there. Otherwise it must be the real end of
         input, and the lexer stops. This method is  used  when  re2c:eof  has
         non-negative value (it should be set to the numeric value of the sen-
         tinel). YYFILL is optional.

       • Bounds checks with padding.  This method is generic, and  it  may  be
         faster  than the "sentinel with bounds checks" method, but it is also
         more complex. The idea is to partition DFA states into strongly  con-
         nected  components  (SCCs)  and  generate  a single check per SCC for
         enough characters to cover the longest non-looping path in this  SCC.
         This  reduces the number of checks, but there is a problem with short
         lexemes at the end of input, as the check requires enough  characters
         to  cover  the longest lexeme. This can be fixed by padding the input
         with a few fake characters that do not form a valid lexeme suffix (so
         that  the  lexer  cannot match them). The length of padding should be
         YYMAXFILL, generated with /*!max:re2c*/. If there is not  enough  in-
         put,  the  lexer  invokes YYFILL which should supply at least the re-
         quired number of characters or not return.  This method  is  used  if
         YYFILL  is enabled and re2c:eof is -1 (this is the default configura-
         tion).

       • Custom checks.  Generic API allows one to override  basic  operations
         like  reading  a  character,  which  makes it possible to include the
         end-of-input checks as part of them.  This  approach  is  error-prone
         and  should  be  used  with  caution.  To use a custom method, enable
         generic API with --api custom or re2c:api = custom; and  disable  de-
         fault bounds checks with re2c:yyfill:enable = 0; or re2c:yyfill:check
         = 0;.

       The following subsections contain an example of each method.

   Sentinel
       This example uses a sentinel character to handle the end of input.  The
       program  counts  space-separated words in a null-terminated string. The
       sentinel is null: it is the last character of each input string, and it
       is  not  allowed in the middle of a lexeme by any of the rules (in par-
       ticular, it is not included in character ranges where  it  is  easy  to
       overlook).  If  a null occurs in the middle of a string, it is a syntax
       error and the lexer will match default rule *, but it won't  read  past
       the  end  of  input  or  crash  (use  -Wsentinel-in-midrule warning and
       re2c:sentinel configuration to  verify  this).  Configuration  re2c:yy-
       fill:enable  = 0; suppresses the generation of bounds checks and YYFILL
       invocations.

          // re2c $INPUT -o $OUTPUT
          #include <assert.h>

          // Expect a null-terminated string.
          static int lex(const char *YYCURSOR) {
              int count = 0;

              for (;;) {
              /*!re2c
                  re2c:define:YYCTYPE = char;
                  re2c:yyfill:enable = 0;

                  *      { return -1; }
                  [\x00] { return count; }
                  [a-z]+ { ++count; continue; }
                  [ ]+   { continue; }
              */
              }
          }

          int main() {
              assert(lex("") == 0);
              assert(lex("one two three") == 3);
              assert(lex("f0ur") == -1);
              return 0;
          }
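
       For comparison, the same idea can be sketched by hand, without re2c.
       The function below (count_words is a hypothetical name, and the code is
       an illustration rather than what re2c generates) never compares the
       cursor against a limit: every loop stops because the terminating null
       is outside the character sets it accepts.

```cpp
#include <cassert>

// Hypothetical hand-written counterpart of the lexer above (not re2c output).
// Returns the number of words, or -1 on any character other than a-z, space
// or the terminating null. Assumes an ASCII-compatible character set.
static int count_words(const char *s) {
    int count = 0;
    for (;;) {
        if (*s == '\0') {
            return count;                        // sentinel: end of input
        } else if (*s >= 'a' && *s <= 'z') {
            // The sentinel is outside [a-z], so this loop stops at it.
            while (*s >= 'a' && *s <= 'z') ++s;
            ++count;
        } else if (*s == ' ') {
            while (*s == ' ') ++s;               // the sentinel is not a space either
        } else {
            return -1;                           // corresponds to the default rule *
        }
    }
}
```

       Like the re2c version, it returns 0 for "", 3 for "one two three" and
       -1 for "f0ur".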

   Sentinel with bounds checks
       This example uses sentinel with bounds checks to handle the end of  in-
       put  (this  method  was  added  in  version  1.2).  The  program counts
       space-separated single-quoted strings. The sentinel character is  null,
       which is specified with re2c:eof = 0; configuration. As in the sentinel
       method, null is the last character of each input string, but it is  al-
       lowed in the middle of a rule (for example, 'aaa\0aa'\0 is valid input,
       but 'aaa\0 is a syntax error).  Bounds checks  are  generated  in  each
       state  that  matches  an  input  character,  but they are scoped to the
       branch that handles null. Bounds checks are of the form YYLIMIT <=  YY-
       CURSOR  or  YYLESSTHAN(1)  with  generic API. If the check condition is
       true, the lexer has reached the end of input and should  stop  (YYFILL
       is disabled with re2c:yyfill:enable = 0; as the input  fits  into  one
       buffer, see the YYFILL with sentinel section for an example that  uses
       YYFILL).  Reaching the end of input opens three possibilities: if  the
       lexer is in the initial state it will match the  end-of-input  rule  $,
       otherwise it may fall back to a previously matched rule (including the
       default rule *) or go to a default state, causing
       -Wundefined-control-flow.

          // re2c $INPUT -o $OUTPUT
          #include <assert.h>

          // Expect a null-terminated string.
          static int lex(const char *str, unsigned int len) {
              const char *YYCURSOR = str, *YYLIMIT = str + len, *YYMARKER;
              int count = 0;

              for (;;) {
              /*!re2c
                  re2c:define:YYCTYPE = char;
                  re2c:yyfill:enable = 0;
                  re2c:eof = 0;

                  str = ['] ([^'\\] | [\\][^])* ['];

                  *    { return -1; }
                  $    { return count; }
                  str  { ++count; continue; }
                  [ ]+ { continue; }
              */
              }
          }

          #define TEST(s, r) assert(lex(s, sizeof(s) - 1) == r)
          int main() {
              TEST("", 0);
              TEST("'qu\0tes' 'are' 'fine: \\'' ", 3);
              TEST("'unterminated\\'", -1);
              return 0;
          }
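
       The key property of this method is that the bounds check runs only when
       the character just read equals the sentinel. A hand-written sketch of
       that check follows. is_quoted is a hypothetical helper, not re2c
       output, and it assumes that s[len] holds the null sentinel.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not re2c output): returns true if [s, s + len) holds
// exactly one complete single-quoted string. Assumes s[len] == '\0'.
static bool is_quoted(const char *s, std::size_t len) {
    const char *cur = s, *lim = s + len;
    if (*cur != '\'') return false;           // also rejects empty input
    ++cur;
    for (;;) {
        char c = *cur;
        // Bounds check only when the character looks like the sentinel.
        if (c == '\0' && lim <= cur) return false;   // real end of input
        ++cur;
        if (c == '\'') return cur == lim;     // closing quote must end the input
        if (c == '\\') {
            // Escaped character: the same sentinel-triggered bounds check.
            if (*cur == '\0' && lim <= cur) return false;
            ++cur;
        }
    }
}
```

       A null inside the quotes is accepted as an ordinary character, while a
       null at the limit position ends the scan.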

   Bounds checks with padding
       This example uses bounds checks with padding to handle the end of input
       (this method is enabled by default). The program counts space-separated
       single-quoted  strings. There is a padding of YYMAXFILL null characters
       appended at the end of input, where YYMAXFILL  value  is  autogenerated
       with /*!max:re2c*/. It is not necessary to use null for padding --- any
       characters can be used as long as they do not form a valid lexeme  suf-
       fix  (in this example padding should not contain single quotes, as they
       may be mistaken for a suffix of a single-quoted  string).  There  is  a
       "stop"  rule that matches the first padding character (null) and termi-
       nates the lexer (note that it checks if null is  at  the  beginning  of
       padding,  otherwise  it is a syntax error). Bounds checks are generated
       only in some states that are determined by the strongly connected  com-
       ponents  of  the  underlying automaton. Checks have the form (YYLIMIT -
       YYCURSOR) < n or YYLESSTHAN(n) with generic API, where n is the minimum
       number  of characters that are needed for the lexer to proceed (it also
       means that the next bounds check will occur in at most  n  characters).
       If  the check condition is true, the lexer has reached the end of input
       and will invoke YYFILL(n) that should either supply at  least  n  input
       characters  or not return. In this example YYFILL always fails and ter-
       minates the lexer with an error (which is fine because the  input  fits
       into  one  buffer).  See the YYFILL with padding section for an example
       that refills the input buffer with YYFILL.

          // re2c $INPUT -o $OUTPUT
          #include <assert.h>
          #include <stdlib.h>
          #include <string.h>

          /*!max:re2c*/

          static int lex(const char *str, unsigned int len) {
              // Make a copy of the string with YYMAXFILL zeroes at the end.
              char *buf = (char*) malloc(len + YYMAXFILL);
              memcpy(buf, str, len);
              memset(buf + len, 0, YYMAXFILL);

              const char *YYCURSOR = buf, *YYLIMIT = buf + len + YYMAXFILL;
              int count = 0;

          loop:
              /*!re2c
                  re2c:api:style = free-form;
                  re2c:define:YYCTYPE = char;
                  re2c:define:YYFILL  = "goto fail;";

                  str = ['] ([^'\\] | [\\][^])* ['];

                  [\x00] {
                      // Check that it is the sentinel, not some unexpected null.
                      if (YYCURSOR - 1 == buf + len) goto exit; else goto fail;
                  }
                  str  { ++count; goto loop; }
                  [ ]+ { goto loop; }
                  *    { goto fail; }
              */

          fail:
              count = -1;

          exit:
              free(buf);
              return count;
          }

          #define TEST(s, r) assert(lex(s, sizeof(s) - 1) == r)
          int main() {
              TEST("", 0);
              TEST("'qu\0tes' 'are' 'fine: \\'' ", 3);
              TEST("'unterminated\\'", -1);
              TEST("'unexpected \0 null\\'", -1);
              return 0;
          }
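
       The effect of padding on the bounds check can be shown in isolation.
       In the sketch below MAXFILL is an assumed constant standing in for the
       generated YYMAXFILL, and pad_input and need_fill are hypothetical
       helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Assumption: stands in for the YYMAXFILL constant generated by /*!max:re2c*/.
static const std::size_t MAXFILL = 4;

// Hypothetical helper: copy the input and append MAXFILL null characters.
static std::vector<char> pad_input(const char *str, std::size_t len) {
    std::vector<char> buf(len + MAXFILL, '\0');   // the padding is already zeroed
    if (len > 0) std::memcpy(buf.data(), str, len);
    return buf;
}

// The check done before reading n characters: true means YYFILL(n) is invoked.
static bool need_fill(const char *cur, const char *lim, std::size_t n) {
    return static_cast<std::size_t>(lim - cur) < n;
}
```

       With the padding in place, a check for MAXFILL characters still passes
       at the first padding character, so a short lexeme at the very end of
       the input can be matched before YYFILL is ever invoked.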

   Custom checks
       This example uses  a  custom  end-of-input  handling  method  based  on
       generic API.  The program counts space-separated single-quoted strings.
       It is the same as the sentinel with bounds checks example, except  that
       the input is not null-terminated (this method can be used if padding is
       not an option, not even a single character). To make up for the absence
       of a sentinel character at the end of input, YYPEEK is redefined to
       perform a bounds check before it reads the next input character.  This
       is inefficient because the check is done on every read. If the current
       position is within bounds, YYPEEK returns the real character, otherwise
       it returns a fake sentinel character.

          // re2c $INPUT -o $OUTPUT
          #include <assert.h>
          #include <stdlib.h>
          #include <string.h>

          static int lex(const char *str, unsigned int len) {
              // For the sake of example create a string without terminating null.
              char *buf = (char*) malloc(len);
              memcpy(buf, str, len);

              const char *cur = buf, *lim = buf + len, *mar;
              int count = 0;

              for (;;) {
              /*!re2c
                  re2c:yyfill:enable = 0;
                  re2c:eof = 0;
                  re2c:api = custom;
                  re2c:api:style = free-form;
                  re2c:define:YYCTYPE = char;
                  re2c:define:YYLESSTHAN = "cur >= lim";
                  re2c:define:YYPEEK = "cur < lim ? *cur : 0";  // fake null
                  re2c:define:YYSKIP = "++cur;";
                  re2c:define:YYBACKUP = "mar = cur;";
                  re2c:define:YYRESTORE = "cur = mar;";

                  str = ['] ([^'\\] | [\\][^])* ['];

                  *    { count = -1; break; }
                  $    { break; }
                  str  { ++count; continue; }
                  [ ]+ { continue; }
              */
              }

              free(buf);
              return count;
          }

          #define TEST(s, r) assert(lex(s, sizeof(s) - 1) == r)
          int main() {
              TEST("", 0);
              TEST("'qu\0tes' 'are' 'fine: \\'' ", 3);
              TEST("'unterminated\\'", -1);
              return 0;
          }
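
       The two redefined primitives can also be read as ordinary functions.
       The sketch below uses the hypothetical names peek and at_end for the
       YYPEEK and YYLESSTHAN definitions in the example. It shows why both
       are needed: peek alone cannot tell a real null from the fake one.

```cpp
#include <cassert>

// Hypothetical function form of the YYPEEK definition above.
static char peek(const char *cur, const char *lim) {
    return cur < lim ? *cur : '\0';   // past the end: return a fake sentinel
}

// Hypothetical function form of the YYLESSTHAN definition above.
static bool at_end(const char *cur, const char *lim) {
    return cur >= lim;
}
```

       When peek returns null, at_end decides whether it came from the input
       or marks the end of it. Note that peek never dereferences the limit
       position, so the buffer does not need a terminating character.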

BUFFER REFILLING
       The need for buffering arises when the input cannot be mapped in memory
       all at once: either it is too large, or it comes in a streaming fashion
       (like  reading  from a socket). The usual technique in such cases is to
       allocate a fixed-sized memory buffer and process input in  chunks  that
       fit  into  the buffer. When the current chunk is processed, it is moved
       out and new data is moved in. In practice it is somewhat more  complex,
       because  lexer state consists not of a single input position, but a set
       of interrelated positions:

       • cursor: the next input character to be read (YYCURSOR  in  C  pointer
         API or YYSKIP/YYPEEK in generic API)

       • limit: the position after the last available input character (YYLIMIT
         in C pointer API, implicitly handled by YYLESSTHAN in generic API)

       • marker: the position of the most recent match, if  any  (YYMARKER  in
         C pointer API or YYBACKUP/YYRESTORE in generic API)

       • token:  the  start of the current lexeme (implicit in re2c API, as it
         is not needed for the normal lexer operation and can be  defined  and
         updated by the user)

       • context  marker: the position of the trailing context (YYCTXMARKER in
         C pointer API or YYBACKUPCTX/YYRESTORECTX in generic API)

       • tag variables: submatch positions (defined with  /*!stags:re2c*/  and
         /*!mtags:re2c*/  directives  and  YYSTAGP/YYSTAGN/YYMTAGP/YYMTAGN  in
         generic API)

       Not all these are used in every case, but if used, they must be updated
       by  YYFILL.  All  active positions are contained in the segment between
       token and cursor, therefore everything between buffer start  and  token
       can  be  discarded,  the  segment  from token and up to limit should be
       moved to the beginning of buffer, and the free space at the end of buf-
       fer  should be filled with new data.  In order to avoid frequent YYFILL
       calls it is best to fill in as many input characters as possible  (even
       though fewer characters might suffice to resume the lexer). The details
       of YYFILL implementation are slightly different depending on which  EOF
       handling  method is used: the case of EOF rule is somewhat simpler than
       the case of bounds-checking with padding. Also note that if the -f
       (--storable-state) option is used, YYFILL has slightly different seman-
       tics (described in the section about storable state).
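
       The shifting step described above can be sketched on its own, with no
       input source involved. Positions and shift_buffer are hypothetical
       names; the fields follow the list of positions given earlier, with the
       context marker and tag variables left out for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical lexer state; field names follow the list of positions above.
struct Positions {
    char *buf;   // start of the buffer
    char *tok;   // start of the current lexeme
    char *mar;   // marker
    char *cur;   // cursor
    char *lim;   // limit
};

// Discard everything before the token and move [tok, lim) to the buffer start.
// Returns the number of bytes freed at the end of the buffer.
static std::size_t shift_buffer(Positions &p) {
    const std::size_t shift = static_cast<std::size_t>(p.tok - p.buf);
    const std::size_t used = static_cast<std::size_t>(p.lim - p.tok);
    std::memmove(p.buf, p.tok, used);   // regions may overlap, hence memmove
    // Every active position moves back by the same amount.
    p.tok -= shift;
    p.mar -= shift;
    p.cur -= shift;
    p.lim -= shift;
    return shift;
}
```

       After the call the freed space starts at the limit position, which is
       where new input should be written.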

   YYFILL with sentinel
       If EOF rule is used, YYFILL is a function-like primitive  that  accepts
       no  arguments and returns a value which is checked against zero. YYFILL
       invocation is triggered by condition YYLIMIT <= YYCURSOR in  C  pointer
       API and YYLESSTHAN() in generic API. A non-zero return value means that
       YYFILL has failed. A successful YYFILL call must supply  at  least  one
       character  and adjust input positions accordingly. Limit must always be
       set to one after the last input position in buffer, and  the  character
       at the limit position must be the sentinel symbol specified by re2c:eof
       configuration. The pictures below show the relative locations of  input
       positions  in  buffer  before and after YYFILL call (sentinel symbol is
       marked with #, and the second picture shows the case when there is  not
       enough input to fill the whole buffer).

                         <-- shift -->
                       >-A------------B---------C-------------D#-----------E->
                       buffer       token    marker         limit,
                                                            cursor
          >-A------------B---------C-------------D------------E#->
                       buffer,  marker        cursor        limit
                       token

                         <-- shift -->
                       >-A------------B---------C-------------D#--E (EOF)
                       buffer       token    marker         limit,
                                                            cursor
          >-A------------B---------C-------------D---E#........
                       buffer,  marker       cursor limit
                       token
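
       The contract described above can be exercised without file I/O. The
       sketch below is a variant that takes its data from a memory region
       instead of a file. Input, init, fill and CAP are illustrative names
       and values chosen for this sketch, not part of re2c.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

static const std::size_t CAP = 8;   // assumed buffer capacity, without sentinel

// Hypothetical in-memory input.
struct Input {
    const char *src;      // unread data
    std::size_t left;     // number of unread bytes in src
    char buf[CAP + 1];    // +1 for the sentinel
    char *tok, *cur, *lim;
    bool eof;
};

static void init(Input &in, const char *src, std::size_t len) {
    in.src = src;
    in.left = len;
    in.tok = in.cur = in.lim = in.buf + CAP;  // empty buffer: all at the end
    in.lim[0] = '\0';                         // sentinel triggers the first refill
    in.eof = false;
}

// Returns zero on success and non-zero on failure.
static int fill(Input &in) {
    if (in.eof) return 1;
    const std::size_t shift = static_cast<std::size_t>(in.tok - in.buf);
    const std::size_t used = static_cast<std::size_t>(in.lim - in.tok);
    if (shift < 1) return 2;                  // the lexeme fills the whole buffer
    std::memmove(in.buf, in.tok, used);
    in.tok -= shift;
    in.cur -= shift;
    in.lim -= shift;
    std::size_t n = CAP - used;               // free space at the end of buffer
    if (n > in.left) n = in.left;
    std::memcpy(in.lim, in.src, n);
    in.src += n;
    in.left -= n;
    in.lim += n;
    in.lim[0] = '\0';                         // the sentinel always sits at the limit
    in.eof = in.lim < in.buf + CAP;           // short read: no more data
    return 0;
}
```

       As in the second picture, a short read leaves the limit before the end
       of the buffer, and the next call reports failure.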

       Here is an example of a program that reads the input file input into  a
       4096-byte  buffer (4095 bytes of data plus one byte for the sentinel)
       and uses the EOF rule.

          // re2c $INPUT -o $OUTPUT
          #include <assert.h>
          #include <stdio.h>
          #include <string.h>

          #define BUFSIZE 4095

          struct Input {
              FILE *file;
              char buf[BUFSIZE + 1], *lim, *cur, *mar, *tok; // +1 for sentinel
              bool eof;
          };

          static int fill(Input &in) {
              if (in.eof) return 1;

              const size_t shift = in.tok - in.buf;
              const size_t used = in.lim - in.tok;

              // Error: lexeme too long. In real life could reallocate a larger buffer.
              if (shift < 1) return 2;

              // Shift buffer contents (discard everything up to the current token).
              memmove(in.buf, in.tok, used);
              in.lim -= shift;
              in.cur -= shift;
              in.mar -= shift;
              in.tok -= shift;

              // Fill free space at the end of buffer with new data from file.
              in.lim += fread(in.lim, 1, BUFSIZE - used, in.file);
              in.lim[0] = 0;
              in.eof = in.lim < in.buf + BUFSIZE;
              return 0;
          }

          static int lex(Input &in) {
              int count = 0;
              for (;;) {
                  in.tok = in.cur;
              /*!re2c
                  re2c:api:style = free-form;
                  re2c:define:YYCTYPE  = char;
                  re2c:define:YYCURSOR = in.cur;
                  re2c:define:YYMARKER = in.mar;
                  re2c:define:YYLIMIT  = in.lim;
                  re2c:define:YYFILL   = "fill(in) == 0";
                  re2c:eof = 0;

                  str = ['] ([^'\\] | [\\][^])* ['];

                  *    { return -1; }
                  $    { return count; }
                  str  { ++count; continue; }
                  [ ]+ { continue; }
              */
              }
          }

          int main() {
              const char *fname = "input";
              const char content[] = "'qu\0tes' 'are' 'fine: \\'' ";

              // Prepare input file: a few times the size of the buffer, containing
              // strings with zeroes and escaped quotes.
              FILE *f = fopen(fname, "w");
              for (int i = 0; i < BUFSIZE; ++i) {
                  fwrite(content, 1, sizeof(content) - 1, f);
              }
              fclose(f);
              int count = 3 * BUFSIZE; // number of quoted strings written to file

              // Initialize lexer state: all pointers are at the end of buffer.
              Input in;
              in.file = fopen(fname, "r");
              in.cur = in.mar = in.tok = in.lim = in.buf + BUFSIZE;
              in.eof = 0;
              // Sentinel (at YYLIMIT pointer) is set to zero, which triggers YYFILL.
              in.lim[0] = 0;

              // Run the lexer.
              assert(lex(in) == count);

              // Cleanup: remove input file.
              fclose(in.file);
              remove(fname);
              return 0;
          }

   YYFILL with padding
       In the default case (when EOF rule is  not  used)  YYFILL  is  a  func-
       tion-like  primitive that accepts a single argument and does not return
       any value.  YYFILL invocation is triggered by condition (YYLIMIT -  YY-
       CURSOR)  < n in C pointer API and YYLESSTHAN(n) in generic API. The ar-
       gument passed to YYFILL is the minimal number of characters  that  must
       be  supplied. If it fails to do so, YYFILL must not return to the lexer
       (for that reason it is best implemented as a macro  that  returns  from
       the calling function on failure).  In case of a successful YYFILL invo-
       cation the limit position must be set either to one after the last  in-
       put position in buffer, or to the end of YYMAXFILL padding (in case YY-
       FILL has successfully read at least n characters,  but  not  enough  to
       fill the entire buffer). The pictures below show the relative locations
       of input positions in buffer before and after YYFILL invocation (YYMAX-
       FILL padding on the second picture is marked with # symbols).

                         <-- shift -->                 <-- need -->
                       >-A------------B---------C-----D-------E---F--------G->
                       buffer       token    marker cursor  limit

          >-A------------B---------C-----D-------E---F--------G->
                       buffer,  marker cursor               limit
                       token

                         <-- shift -->                 <-- need -->
                       >-A------------B---------C-----D-------E-F        (EOF)
                       buffer       token    marker cursor  limit

          >-A------------B---------C-----D-------E-F###############
                       buffer,  marker cursor                   limit
                       token                        <- YYMAXFILL ->
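
       The second picture corresponds to a short read: the limit moves past
       the data that was read and then past the padding. A minimal sketch of
       that step follows. MAXFILL and CAP are assumed values (MAXFILL stands
       in for YYMAXFILL) and advance_limit is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

static const std::size_t MAXFILL = 3;   // assumption: stands in for YYMAXFILL
static const std::size_t CAP = 8;       // assumed capacity, without padding

// Hypothetical helper: `got` bytes were just read at `lim` when `want` bytes
// were requested. Returns the new limit and appends padding at end of input.
// The caller's buffer must have room for MAXFILL bytes after its CAP data bytes.
static char *advance_limit(char *lim, std::size_t got, std::size_t want,
                           bool &eof) {
    lim += got;
    if (got < want) {                   // short read: this is the end of input
        eof = true;
        std::memset(lim, 0, MAXFILL);   // zero padding after the last real byte
        lim += MAXFILL;                 // the limit now points past the padding
    }
    return lim;
}
```

       A full read leaves the limit right after the data. Only a short read
       adds the padding, so the padding is written once, at the end of input.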

       Here is an example of a program that reads the input file input through
       a 4096-byte buffer (of which YYMAXFILL bytes are reserved for padding)
       and uses bounds checking with padding.

          // re2c $INPUT -o $OUTPUT
          #include <assert.h>
          #include <stdio.h>
          #include <string.h>

          /*!max:re2c*/
          #define BUFSIZE (4096 - YYMAXFILL)

          struct Input {
              FILE *file;
              char buf[BUFSIZE + YYMAXFILL], *lim, *cur, *tok;
              bool eof;
          };

          static int fill(Input &in, size_t need) {
              if (in.eof) return 1;

              const size_t shift = in.tok - in.buf;
              const size_t used = in.lim - in.tok;

              // Error: lexeme too long. In real life could reallocate a larger buffer.
              if (shift < need) return 2;

              // Shift buffer contents (discard everything up to the current token).
              memmove(in.buf, in.tok, used);
              in.lim -= shift;
              in.cur -= shift;
              in.tok -= shift;

              // Fill free space at the end of buffer with new data from file.
              in.lim += fread(in.lim, 1, BUFSIZE - used, in.file);

              // If read less than expected, this is end of input => add zero padding
              // so that the lexer can access characters at the end of buffer.
              if (in.lim < in.buf + BUFSIZE) {
                  in.eof = true;
                  memset(in.lim, 0, YYMAXFILL);
                  in.lim += YYMAXFILL;
              }

              return 0;
          }

          static int lex(Input &in) {
              int count = 0;
              for (;;) {
                  in.tok = in.cur;
              /*!re2c
                  re2c:api:style = free-form;
                  re2c:define:YYCTYPE  = char;
                  re2c:define:YYCURSOR = in.cur;
                  re2c:define:YYLIMIT  = in.lim;
                  re2c:define:YYFILL   = "if (fill(in, @@) != 0) return -1;";

                  str = ['] ([^'\\] | [\\][^])* ['];

                  [\x00] {
                      // Check that it is the sentinel, not some unexpected null.
                      return in.tok == in.lim - YYMAXFILL ? count : -1;
                  }
                  str  { ++count; continue; }
                  [ ]+ { continue; }
                  *    { return -1; }
              */
              }
          }

          int main() {
              const char *fname = "input";
              const char content[] = "'qu\0tes' 'are' 'fine: \\'' ";

              // Prepare input file: a few times the size of the buffer, containing
              // strings with zeroes and escaped quotes.
              FILE *f = fopen(fname, "w");
              for (int i = 0; i < BUFSIZE; ++i) {
                  fwrite(content, 1, sizeof(content) - 1, f);
              }
              fclose(f);
              int count = 3 * BUFSIZE; // number of quoted strings written to file

              // Initialize lexer state: all pointers are at the end of buffer.
              // This immediately triggers YYFILL, as the check `in.cur < in.lim` fails.
              Input in;
              in.file = fopen(fname, "r");
              in.cur = in.tok = in.lim = in.buf + BUFSIZE;
              in.eof = 0;

              // Run the lexer.
              assert(lex(in) == count);

              // Cleanup: remove input file.
              fclose(in.file);
              remove(fname);
              return 0;
          }

MULTIPLE BLOCKS
       Sometimes it is necessary to have multiple interrelated lexers (for ex-
       ample,  if there is a high-level state machine that transitions between
       lexer modes). This can be implemented  using  multiple  connected  re2c
       blocks. Another option is to use start conditions.

       The implementation of connections between blocks depends on the target
       language. In languages that have a goto statement (such as C/C++ and
       Go) all blocks can be placed in one function, each prefixed with a
       label; a transition from one block to another is then a simple goto.
       In languages that do not have goto (such as Rust) it is necessary
       either to use a loop with a switch on a state variable, similar to the
       yystate loop/switch generated by re2c, or to wrap each block in a
       function and use function calls.
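
       The loop/switch alternative can be sketched by hand in C++. This is
       only an illustration of the control flow, not re2c output: the Mode
       enum, PARSE_ERROR and parse_u32_loop are hypothetical names, each case
       stands in for one block, and only two bases are handled.

```cpp
#include <cassert>
#include <cstdint>

static const uint64_t PARSE_ERROR = UINT64_MAX;

// Hypothetical lexer modes; each one stands in for a separate block.
enum class Mode { Init, Bin, Dec };

// Instead of jumping with goto, a block assigns the next mode and the
// enclosing loop dispatches on it.
static uint64_t parse_u32_loop(const char *s) {
    Mode mode = Mode::Init;
    uint64_t u = 0;
    for (;;) {
        switch (mode) {
        case Mode::Init: // determine the base and pick the next mode
            if (s[0] == '0' && s[1] == 'b' && (s[2] == '0' || s[2] == '1')) {
                s += 2;
                mode = Mode::Bin;
            } else if (*s >= '1' && *s <= '9') {
                mode = Mode::Dec;
            } else {
                return PARSE_ERROR;
            }
            break;
        case Mode::Bin:
            if (*s == '\0') return u;
            if (*s != '0' && *s != '1') return PARSE_ERROR;
            u = u * 2 + static_cast<uint64_t>(*s++ - '0');
            if (u > UINT32_MAX) return PARSE_ERROR;
            break;
        case Mode::Dec:
            if (*s == '\0') return u;
            if (*s < '0' || *s > '9') return PARSE_ERROR;
            u = u * 10 + static_cast<uint64_t>(*s++ - '0');
            if (u > UINT32_MAX) return PARSE_ERROR;
            break;
        }
    }
}
```

       Each case ends by either returning or assigning the next mode, which
       plays the role of goto in the label-based version.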

       The example below uses multiple blocks to parse binary, octal,  decimal
       and hexadecimal numbers. Each base has its own block. The initial block
       determines base and dispatches to other blocks.  Common  configurations
       are  defined  in a separate block at the beginning of the program; they
       are inherited by the other blocks.

          // re2c $INPUT -o $OUTPUT -i
          #include <stdint.h>
          #include <limits.h>
          #include <assert.h>

          static const uint64_t ERROR = UINT64_MAX;

          template<int BASE> static void add(uint64_t &u, char d) {
              u = u * BASE + d;
              if (u > UINT32_MAX) u = ERROR;
          }

          static uint64_t parse_u32(const char *s) {
              const char *YYCURSOR = s, *YYMARKER;
              uint64_t u = 0;

              /*!re2c
                  re2c:yyfill:enable = 0;
                  re2c:define:YYCTYPE = char;

                  end = "\x00";

                  '0b' / [01]        { goto bin; }
                  "0"                { goto oct; }
                  "" / [1-9]         { goto dec; }
                  '0x' / [0-9a-fA-F] { goto hex; }
                  *                  { return ERROR; }
              */
          bin:
              /*!re2c
                  end   { return u; }
                  [01]  { add<2>(u, YYCURSOR[-1] - '0'); goto bin; }
                  *     { return ERROR; }
              */
          oct:
              /*!re2c
                  end   { return u; }
                  [0-7] { add<8>(u, YYCURSOR[-1] - '0'); goto oct; }
                  *     { return ERROR; }
              */
          dec:
              /*!re2c
                  end   { return u; }
                  [0-9] { add<10>(u, YYCURSOR[-1] - '0'); goto dec; }
                  *     { return ERROR; }
              */
          hex:
              /*!re2c
                  end   { return u; }
                  [0-9] { add<16>(u, YYCURSOR[-1] - '0');      goto hex; }
                  [a-f] { add<16>(u, YYCURSOR[-1] - 'a' + 10); goto hex; }
                  [A-F] { add<16>(u, YYCURSOR[-1] - 'A' + 10); goto hex; }
                  *     { return ERROR; }
              */
          }

          int main() {
              assert(parse_u32("") == ERROR);
              assert(parse_u32("1234567890") == 1234567890);
              assert(parse_u32("0b1101") == 13);
              assert(parse_u32("0x7Fe") == 2046);
              assert(parse_u32("0644") == 420);
              assert(parse_u32("9999999999") == ERROR);
              return 0;
          }

START CONDITIONS
       Start conditions are enabled with --start-conditions option. They  pro-
       vide  a  way  to  encode multiple interrelated automata within the same
       re2c block.

       Each condition corresponds to a single automaton and has a unique  name
       specified by the user and a unique internal number defined by re2c. The
       numbers are used to switch between conditions: the generated code  uses
       YYGETCONDITION  and YYSETCONDITION primitives to get the current condi-
       tion or set it to the given number. Use /*!conditions:re2c*/  directive
       or  the --header option to generate numeric condition identifiers. Con-
       figuration re2c:cond:enumprefix specifies the generated identifier pre-
       fix.

       In condition mode every rule must be prefixed with a list of comma-sep-
       arated condition names in angle brackets, or a wildcard <*>  to  denote
       all conditions. The rule syntax is extended as follows:

          < cond-list > regexp action
                 A  rule  that  is merged to every condition on the cond-list.
                 It matches regexp and executes the associated action.

          < cond-list > regexp => cond action
                 A rule that is merged to every condition  on  the  cond-list.
                 It matches regexp, sets the current condition to cond and ex-
                 ecutes the associated action.

          < cond-list > regexp :=> cond
                 A rule that is merged to every condition  on  the  cond-list.
                 It  matches regexp and immediately transitions to cond (there
                 is no semantic action).

          <! cond-list > action
                 The action is prepended to semantic actions of all rules  for
                 every  condition  on the cond-list. This may be used to dedu-
                 plicate common code.

          < > action
                 A rule that is merged to a special entry condition with  num-
                 ber  zero  and name "0". It matches empty string and executes
                 the action.

          < > => cond action
                 A rule that is merged to a special entry condition with  num-
                 ber zero and name "0". It matches empty string, sets the cur-
                 rent condition to cond and executes the action.

          < > :=> cond
                 A rule that is merged to a special entry condition with  num-
                 ber  zero  and  name "0". It matches empty string and immedi-
                 ately transitions to cond.

       The code re2c generates for conditions depends  on  whether  re2c  uses
       goto/label approach or loop/switch approach to encode the automata.

       In languages that have a goto statement (such as C/C++ and Go)
       conditions are naturally implemented as blocks of code prefixed with
       labels of the form yyc_<cond>, where cond is a condition name (the
       label prefix can be changed with re2c:cond:prefix). Transitions
       between conditions are implemented using goto and condition labels.
       Before all conditions re2c generates an initial switch on
       YYGETCONDITION that jumps to the start state of the current condition.
       The shortcut rules :=> bypass the initial switch and jump directly to
       the specified condition (re2c:cond:goto can be used to change the
       default behavior). The rules with semantic actions do not
       automatically jump to the next condition; this should be done by the
       user-defined action code.

       In  languages that do not have goto (such as Rust) re2c reuses the yys-
       tate variable to store condition numbers. Each condition gets a numeric
       identifier equal to the number of its start state, and a switch between
       conditions is no different than a switch between DFA states of a single
       condition.  There  is  no need for a separate initial condition switch.
       (Since the same approach is used to implement storable  states,  YYGET-
       CONDITION/YYSETCONDITION are redundant if both storable states and con-
       ditions are used).

       The program below uses start conditions to parse binary, octal, decimal
       and hexadecimal numbers. There is a single block where each base has
       its own condition, and the initial condition is connected to all of
       them. The user-defined variable c stores the current condition number;
       it is initialized to yycinit, the number of the initial condition
       generated with /*!conditions:re2c*/.

          // re2c $INPUT -o $OUTPUT -ci
          #include <stdint.h>
          #include <limits.h>
          #include <assert.h>

          static const uint64_t ERROR = UINT64_MAX;
          /*!conditions:re2c*/

          template<int BASE> static void add(uint64_t &u, char d) {
              u = u * BASE + d;
              if (u > UINT32_MAX) u = ERROR;
          }

          static uint64_t parse_u32(const char *s) {
              const char *YYCURSOR = s, *YYMARKER;
              int c = yycinit;
              uint64_t u = 0;

              /*!re2c
                  re2c:api:style = free-form;
                  re2c:define:YYCTYPE        = char;
                  re2c:define:YYGETCONDITION = "c";
                  re2c:define:YYSETCONDITION = "c = @@;";
                  re2c:yyfill:enable = 0;

                  <*> * { return ERROR; }

                  <init> '0b' / [01]        :=> bin
                  <init> "0"                :=> oct
                  <init> "" / [1-9]         :=> dec
                  <init> '0x' / [0-9a-fA-F] :=> hex

                  <bin, oct, dec, hex> "\x00" { return u; }

                  <bin> [01]  { add<2>(u,  YYCURSOR[-1] - '0');      goto yyc_bin; }
                  <oct> [0-7] { add<8>(u,  YYCURSOR[-1] - '0');      goto yyc_oct; }
                  <dec> [0-9] { add<10>(u, YYCURSOR[-1] - '0');      goto yyc_dec; }
                  <hex> [0-9] { add<16>(u, YYCURSOR[-1] - '0');      goto yyc_hex; }
                  <hex> [a-f] { add<16>(u, YYCURSOR[-1] - 'a' + 10); goto yyc_hex; }
                  <hex> [A-F] { add<16>(u, YYCURSOR[-1] - 'A' + 10); goto yyc_hex; }
              */
          }

          int main() {
              assert(parse_u32("") == ERROR);
              assert(parse_u32("1234567890") == 1234567890);
              assert(parse_u32("0b1101") == 13);
              assert(parse_u32("0x7Fe") == 2046);
              assert(parse_u32("0644") == 420);
              assert(parse_u32("9999999999") == ERROR);
              return 0;
          }

STORABLE STATE
       With  --storable-state option re2c generates a lexer that can store its
       current state, return to the caller, and later  resume  operations  ex-
       actly  where  it  left  off. The default mode of operation in re2c is a
       "pull" model, in which the lexer "pulls" more input whenever  it  needs
       it.  This may be unacceptable in cases when the input becomes available
       piece by piece (for example, if the lexer is invoked by the parser,  or
       if the lexer program communicates via a socket protocol with some other
       program that must wait for a reply from the lexer before  it  transmits
       the  next message). Storable state feature is intended exactly for such
       cases: it allows one to generate lexers that work in  a  "push"  model.
       When the lexer needs more input, it stores its state and returns to the
       caller. Later, when more input becomes available,  the  caller  resumes
       the  lexer  exactly where it stopped. There are a few changes necessary
       compared to the "pull" model:

       • Define YYGETSTATE() and YYSETSTATE(state) primitives.

       • Define yych, yyaccept (if used) and state variables as a part of per-
         sistent lexer state. The state variable should be initialized to -1.

       • YYFILL should return to the outer program instead of trying to supply
         more input. Return code should indicate that lexer needs more input.

       • The outer program should recognize situations when lexer  needs  more
         input and respond appropriately.

       • Optionally use the /*!getstate:re2c*/ directive to generate the
         YYGETSTATE switch detached from the main lexer. This only works for
         languages that have goto (not in --loop-switch mode).

       • Use re2c:eof and the sentinel with bounds checks method to handle the
         end of input. Padding-based method may not work because it is unclear
         when to append padding: the current end of input may not be the ulti-
         mate end of input, and appending padding too early may cut off a par-
         tially  read  greedy  lexeme.  Furthermore, due to high-level program
         logic getting more input may depend on processing the lexeme  at  the
         end  of buffer (which already is blocked due to the end-of-input con-
         dition).
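
       The idea of storing the state and resuming later can be shown with a
       small hand-written sketch. It is not what re2c generates (re2c keeps
       the input in a buffer and resumes at the point where YYFILL was
       invoked); here the hypothetical PushLexer struct holds a two-state
       automaton for packets of the form [a-z]+[;], and the hypothetical feed
       function consumes whatever input is available and then returns to the
       caller.

```cpp
#include <cassert>
#include <stddef.h>

enum PushStatus { NEED_MORE, MALFORMED };

// Hypothetical persistent lexer state; it survives between calls to feed().
struct PushLexer {
    int state;        // 0: at the start of a packet, 1: inside [a-z]+
    unsigned packets; // number of complete packets seen so far
    PushLexer() : state(0), packets(0) {}
};

// Consume the available input, then return to the caller. A packet that is
// split between two calls is continued from the stored state.
static PushStatus feed(PushLexer &lx, const char *data, size_t len) {
    for (size_t i = 0; i < len; ++i) {
        const char c = data[i];
        const bool letter = c >= 'a' && c <= 'z';
        switch (lx.state) {
        case 0:
            if (!letter) return MALFORMED;
            lx.state = 1;
            break;
        case 1:
            if (c == ';') { ++lx.packets; lx.state = 0; }
            else if (!letter) return MALFORMED;
            break;
        }
    }
    return NEED_MORE;
}
```

       The caller owns the loop: it calls feed whenever a new piece of input
       arrives, and the packet counter keeps its value across calls because
       it lives in the stored state rather than in local variables.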

       Here is an example of a "push" model lexer that simulates reading pack-
       ets from a socket. The lexer loops until it encounters the end of input
       and returns to the calling function. The calling function provides more
       input  by  "sending"  the  next packet and resumes lexing. This process
       stops when all the packets have been sent, or when there is an error.

          // re2c $INPUT -o $OUTPUT -f
          #include <assert.h>
          #include <stdio.h>
          #include <string.h>

          #define DEBUG 0
          #define LOG(...) if (DEBUG) fprintf(stderr, __VA_ARGS__);

          // Use a small buffer to cover the case when a lexeme doesn't fit.
          // In real world use a larger buffer.
          #define BUFSIZE 10

          struct State {
              FILE *file;
              char buf[BUFSIZE + 1], *lim, *cur, *mar, *tok;
              int state;
          };

          typedef enum {END, READY, WAITING, BAD_PACKET, BIG_PACKET} Status;

          static Status fill(State &st) {
              const size_t shift = st.tok - st.buf;
              const size_t used = st.lim - st.tok;
              const size_t free = BUFSIZE - used;

              // Error: no space. In real life can reallocate a larger buffer.
              if (free < 1) return BIG_PACKET;

              // Shift buffer contents (discard already processed data).
              memmove(st.buf, st.tok, used);
              st.lim -= shift;
              st.cur -= shift;
              st.mar -= shift;
              st.tok -= shift;

              // Fill free space at the end of buffer with new data.
              const size_t read = fread(st.lim, 1, free, st.file);
              st.lim += read;
              st.lim[0] = 0; // append sentinel symbol

              return READY;
          }

          static Status lex(State &st, unsigned int *recv) {
              char yych;
              /*!getstate:re2c*/

              for (;;) {
                  st.tok = st.cur;
              /*!re2c
                  re2c:api:style = free-form;
                  re2c:define:YYCTYPE    = "char";
                  re2c:define:YYCURSOR   = "st.cur";
                  re2c:define:YYMARKER   = "st.mar";
                  re2c:define:YYLIMIT    = "st.lim";
                  re2c:define:YYGETSTATE = "st.state";
                  re2c:define:YYSETSTATE = "st.state = @@;";
                  re2c:define:YYFILL     = "return WAITING;";
                  re2c:eof = 0;

                  packet = [a-z]+[;];

                  *      { return BAD_PACKET; }
                  $      { return END; }
                  packet { *recv = *recv + 1; continue; }
              */
              }
          }

          void test(const char **packets, Status expect) {
              // Create a "socket" (open the same file for reading and writing).
              const char *fname = "pipe";
              FILE *fw = fopen(fname, "w");
              FILE *fr = fopen(fname, "r");
              setvbuf(fw, NULL, _IONBF, 0);
              setvbuf(fr, NULL, _IONBF, 0);

              // Initialize lexer state: `state` value is -1, all pointers are at the end
              // of buffer.
              State st;
              st.file = fr;
              st.cur = st.mar = st.tok = st.lim = st.buf + BUFSIZE;
              // Sentinel (at YYLIMIT pointer) is set to zero, which triggers YYFILL.
              st.lim[0] = 0;
              st.state = -1;

              // Main loop. The buffer contains incomplete data which appears packet by
              // packet. When the lexer needs more input it saves its internal state and
              // returns to the caller which should provide more input and resume lexing.
              Status status;
              unsigned int send = 0, recv = 0;
              for (;;) {
                  status = lex(st, &recv);
                  if (status == END) {
                      LOG("done: got %u packets\n", recv);
                      break;
                  } else if (status == WAITING) {
                      LOG("waiting...\n");
                      if (*packets) {
                          LOG("sent packet %u\n", send);
                          fprintf(fw, "%s", *packets++);
                          ++send;
                      }
                      status = fill(st);
                      LOG("queue: '%s'\n", st.buf);
                      if (status == BIG_PACKET) {
                          LOG("error: packet too big\n");
                          break;
                      }
                      assert(status == READY);
                  } else {
                      assert(status == BAD_PACKET);
                      LOG("error: ill-formed packet\n");
                      break;
                  }
              }

              // Check results.
              assert(status == expect);
              if (status == END) assert(recv == send);

              // Cleanup: remove input file.
              fclose(fw);
              fclose(fr);
              remove(fname);
          }

          int main() {
              const char *packets1[] = {0};
              const char *packets2[] = {"zero;", "one;", "two;", "three;", "four;", 0};
              const char *packets3[] = {"zer0;", 0};
              const char *packets4[] = {"looooooooooong;", 0};

              test(packets1, END);
              test(packets2, END);
              test(packets3, BAD_PACKET);
              test(packets4, BIG_PACKET);

              return 0;
          }

REUSABLE BLOCKS
       Reusable blocks are re2c blocks that can be reused any number of  times
       and   combined   with   other   re2c  blocks.  They  are  defined  with
       /*!rules:re2c[:<name>] ... */ (the <name> is optional). A  rules  block
       can  be used in two contexts: either in a use block, or in a use direc-
       tive inside of another block. The code for a rules block  is  generated
       at every point of use.

       Use  blocks are defined with /*!use:re2c[:<name>] ... */. The <name> is
       optional; if not specified, the associated rules block is the most  re-
       cent  one (whether named or unnamed). A use block can add named defini-
       tions, configurations and rules of its own.  An important use case  for
       use  blocks is a lexer that supports multiple input encodings: the same
       rules block is reused multiple times with encoding-specific  configura-
       tions (see the example below).

       In-block  use  directive !use:<name>; can be used from inside of a re2c
       block. It merges the referenced block <name> into the current  one.  If
       some of the merged rules and configurations overlap with the previously
       defined ones, conflicts are resolved in the  usual  way:  the  earliest
       rule takes priority, and latest configuration overrides preceding ones.
       An exception is made for the special rules *, $ and (in condition mode) <!>,
       for  which  a  block-local definition overrides any inherited ones. Use
       directive allows one to combine different re2c blocks together  in  one
       block (see the example below).

       Named blocks and in-block use directive were added in re2c version 2.2.
       Since that version reusable blocks are allowed by default  (no  special
       option  is  needed).  Before version 2.2 reuse mode was enabled with -r
       --reusable option. Before version 1.2  reusable  blocks  could  not  be
       mixed with normal blocks.

   Example of a !use directive
          // re2c $INPUT -o $OUTPUT
          #include <assert.h>

          // This example shows how to combine reusable re2c blocks: two blocks
          // ('colors' and 'fish') are merged into one. The 'salmon' rule occurs
          // in both blocks; the 'fish' block takes priority because it is used
          // earlier. Default rule * occurs in all three blocks; the local (not
          // inherited) definition takes priority.

          enum What { COLOR, FISH, DUNNO };

          /*!rules:re2c:colors
              *                            { assert(false); }
              "red" | "salmon" | "magenta" { return COLOR; }
          */

          /*!rules:re2c:fish
              *                            { assert(false); }
              "haddock" | "salmon" | "eel" { return FISH; }
          */

          static What lex(const char *s) {
              const char *YYCURSOR = s, *YYMARKER;
              /*!re2c
                  re2c:yyfill:enable = 0;
                  re2c:define:YYCTYPE = char;

                  !use:fish;
                  !use:colors;
                  * { return DUNNO; }  // overrides inherited '*' rules
              */
          }

          int main() {
              assert(lex("salmon") == FISH);
              assert(lex("what?") == DUNNO);
              return 0;
          }

   Example of a /*!use:re2c ... */ block
          // re2c $INPUT -o $OUTPUT --input-encoding utf8
          #include <assert.h>
          #include <stdint.h>

          // This example supports multiple input encodings: UTF-8 and UTF-32.
          // Both lexers are generated from the same rules block, and the use
          // blocks add only encoding-specific configurations.
          /*!rules:re2c
              re2c:yyfill:enable = 0;

              "∀x ∃y" { return 0; }
              *       { return 1; }
          */

          static int lex_utf8(const uint8_t *s) {
              const uint8_t *YYCURSOR = s, *YYMARKER;
              /*!use:re2c
                  re2c:define:YYCTYPE = uint8_t;
                  re2c:encoding:utf8 = 1;
              */
          }

          static int lex_utf32(const uint32_t *s) {
              const uint32_t *YYCURSOR = s, *YYMARKER;
              /*!use:re2c
                  re2c:define:YYCTYPE = uint32_t;
                  re2c:encoding:utf32 = 1;
              */
          }

          int main() {
              static const uint8_t s8[] = // UTF-8
                  { 0xe2, 0x88, 0x80, 0x78, 0x20, 0xe2, 0x88, 0x83, 0x79 };

              static const uint32_t s32[] = // UTF-32
                  { 0x00002200, 0x00000078, 0x00000020, 0x00002203, 0x00000079 };

              assert(lex_utf8(s8) == 0);
              assert(lex_utf32(s32) == 0);
              return 0;
          }

SUBMATCH EXTRACTION
       re2c has two options for submatch extraction.

       The first option is -T --tags. With this option one can use standalone
       tags of the form @stag and #mtag, where stag and mtag are arbitrary
       user-defined names. Tags can be used anywhere inside of a regular
       expression; semantically they are just position markers. Tags of the
       form @stag are called s-tags: they denote a single submatch value (the
       last input position where this tag matched). Tags of the form #mtag
       are called m-tags: they denote multiple submatch values (the whole
       history of repetitions of this tag). All tags should be defined by the
       user as variables with the corresponding names. With standalone tags
       re2c uses leftmost greedy disambiguation: submatch positions correspond
       to the leftmost matching path through the regular expression.

       The  second  option  is -P --posix-captures: it enables POSIX-compliant
       capturing groups. In this mode parentheses in regular  expressions  de-
       note  the  beginning and the end of capturing groups; the whole regular
       expression is group number zero. The number of groups for the  matching
       rule  is stored in a variable yynmatch, and submatch results are stored
       in yypmatch array. Both yynmatch and yypmatch should be defined by  the
       user,  and yypmatch size must be at least [yynmatch * 2]. re2c provides
       a directive /*!maxnmatch:re2c*/ that defines  YYMAXNMATCH:  a  constant
       equal  to the maximal value of yynmatch among all rules. Note that re2c
       implements POSIX-compliant disambiguation: each  subexpression  matches
       as  long  as possible, and subexpressions that start earlier in regular
       expression have priority over those starting  later.  Capturing  groups
       are  translated  into  s-tags under the hood, therefore we use the word
       "tag" to describe them as well.

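       As a sketch of how the results are laid out, assuming the C pointer
       API: group k occupies the slots yypmatch[2 * k] (start) and
       yypmatch[2 * k + 1] (end), and both slots hold the default value NULL
       if the group did not take part in the match. The helper name
       group_text is hypothetical; it only reads an array that the lexer has
       already filled.

```cpp
#include <cassert>
#include <stddef.h>
#include <string>

// Hypothetical helper: text of capturing group k, or an empty string if k is
// out of range or the group did not take part in the match.
static std::string group_text(const char *const *yypmatch, size_t yynmatch,
                              size_t k) {
    if (k >= yynmatch) return std::string();
    const char *start = yypmatch[2 * k];
    const char *end = yypmatch[2 * k + 1];
    if (start == NULL || end == NULL) return std::string();
    return std::string(start, end);
}
```

       For the input "1.2" and a hypothetical rule with three groups, of
       which the last one is optional and did not match, yynmatch would be 4
       and the array would hold pointers to offsets 0, 3, 0, 1, 2, 3 followed
       by two NULL values; group_text(yypmatch, 4, 2) then returns "2".
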
       With both -P --posix-captures and -T --tags options re2c uses the
       efficient submatch extraction algorithm described in the Tagged
       Deterministic Finite Automata with Lookahead paper. The overhead on
       submatch extraction in the generated lexer grows with the number of
       tags; if this number is moderate, the overhead is barely noticeable.
       In the lexer tags are implemented using a number of tag variables
       generated by re2c. There is no one-to-one correspondence between tag
       variables and tags: a single variable may be reused for different
       tags, and one tag may require multiple variables to hold all its
       ambiguous values. Eventually ambiguity is resolved, and only one final
       variable per tag survives. When a rule matches, all its tags are set
       to the values of the corresponding tag variables. The exact number of
       tag variables is unknown to the user; this number is determined by
       re2c. However, tag variables should be defined by the user as a part
       of the lexer state and updated by YYFILL, therefore re2c provides
       directives /*!stags:re2c*/ and /*!mtags:re2c*/ that can be used to
       declare, initialize and manipulate tag variables. These directives
       have two optional configurations: format = "@@"; (specifies the
       template where @@ is substituted with the name of each tag variable),
       and separator = ""; (specifies the piece of code used to join the
       generated pieces for different tag variables).

       S-tags support the following operations:

       • save input position to an s-tag: t = YYCURSOR with C pointer API or a
         user-defined operation YYSTAGP(t) with generic API

       • save default value to an s-tag: t = NULL with  C  pointer  API  or  a
         user-defined operation YYSTAGN(t) with generic API

       • copy one s-tag to another: t1 = t2

       M-tags support the following operations:

       • append  input  position  to  an  m-tag: a user-defined operation YYM-
         TAGP(t) with both default and generic API

       • append default value to an m-tag: a user-defined operation YYMTAGN(t)
         with both default and generic API

       • copy one m-tag to another: t1 = t2

       S-tags  can  be  implemented  as  scalar  values (pointers or offsets).
       M-tags need a more complex representation, as they need to store a  se-
       quence  of tag values. The most naive and inefficient representation of
       an m-tag is a list (array, vector) of tag values; a more efficient rep-
       resentation  is to store all m-tags in a prefix-tree represented as ar-
       ray of nodes (v, p), where v is tag value and p is a pointer to  parent
       node.
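
       The prefix-tree representation can be sketched as follows. This is a
       minimal illustration under stated assumptions rather than a required
       layout: tag values are stored as input offsets, the parent "pointer"
       p is an index into the node array, and the names MtagNode, MTAG_ROOT,
       mtag_append and mtag_history are hypothetical. An m-tag variable is
       then a single int (the index of the last node of its history), so
       copying one m-tag to another is a plain assignment.

```cpp
#include <algorithm>
#include <cassert>
#include <stddef.h>
#include <vector>

// One trie node (v, p): v is the tag value (an input offset in this sketch),
// p is the index of the parent node.
struct MtagNode { long v; int p; };

// Hypothetical marker for an empty history.
static const int MTAG_ROOT = -1;

typedef std::vector<MtagNode> MtagTrie;

// Append value v to the history that m-tag variable t refers to. Only one
// node is added; the older part of the history stays shared with copies of t.
static void mtag_append(MtagTrie &trie, int &t, long v) {
    const MtagNode n = { v, t };
    trie.push_back(n);
    t = static_cast<int>(trie.size()) - 1;
}

// Unfold the history of t into a vector, oldest value first.
static std::vector<long> mtag_history(const MtagTrie &trie, int t) {
    std::vector<long> out;
    for (; t != MTAG_ROOT; t = trie[static_cast<size_t>(t)].p) {
        out.push_back(trie[static_cast<size_t>(t)].v);
    }
    std::reverse(out.begin(), out.end());
    return out;
}
```

       In a lexer, an operation such as YYMTAGP(t) could be defined in terms
       of mtag_append with the current input offset. Two m-tag variables that
       share a common prefix of their histories share the corresponding
       nodes, which is what makes this cheaper than one vector per variable.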

       Here  is  a  simple  example of using s-tags to parse semantic versions
       consisting of three numeric components: major, minor, patch (the latter
       is optional).  See below for a more complex example that uses YYFILL.

          // re2c $INPUT -o $OUTPUT
          #include <assert.h>
          #include <stddef.h>

          struct SemVer { int major, minor, patch; };

          static int s2n(const char *s, const char *e) { // pre-parsed string to number
              int n = 0;
              for (; s < e; ++s) n = n * 10 + (*s - '0');
              return n;
          }

          static bool lex(const char *str, SemVer &ver) {
              const char *YYCURSOR = str, *YYMARKER;

              // User-defined tag variables that are available in semantic action.
              const char *t1, *t2, *t3, *t4, *t5;

              // Autogenerated tag variables used by the lexer to track tag values.
              /*!stags:re2c format = 'const char *@@;\n'; */

              /*!re2c
                  re2c:yyfill:enable = 0;
                  re2c:define:YYCTYPE = char;
                  re2c:tags = 1;

                  num = [0-9]+;

                  @t1 num @t2 "." @t3 num @t4 ("." @t5 num)? [\x00] {
                      ver.major = s2n(t1, t2);
                      ver.minor = s2n(t3, t4);
                      ver.patch = t5 != NULL ? s2n(t5, YYCURSOR - 1) : 0;
                      return true;
                  }
                  * { return false; }
              */
          }

          int main() {
              SemVer v;
              assert(lex("23.34", v) && v.major == 23 && v.minor == 34 && v.patch == 0);
              assert(lex("1.2.999", v) && v.major == 1 && v.minor == 2 && v.patch == 999);
              assert(!lex("1.a", v));
              return 0;
          }

       Here is a more complex example of using s-tags with YYFILL to parse a
       file with newline-separated semantic versions. Tag variables are part
       of the lexer state, and they are adjusted in YYFILL like other input
       positions. This adjustment is necessary for s-tags, because they
       point into the buffer and their values are invalidated when the buffer
       contents are shifted. It may not be necessary in a custom
       implementation where tag variables store offsets relative to the start
       of the input string rather than the buffer, which may be the case with
       m-tags.

          // re2c $INPUT -o $OUTPUT --tags
          #include <assert.h>
          #include <stddef.h>
          #include <stdio.h>
          #include <string.h>
          #include <vector>

          #define BUFSIZE 4095

          struct Input {
              FILE *file;
              char buf[BUFSIZE + 1], *lim, *cur, *mar, *tok;
              // Tag variables must be part of the lexer state passed to YYFILL.
              // They don't correspond to tags and should be autogenerated by re2c.
              /*!stags:re2c format = 'const char *@@;'; */
              bool eof;
          };

          struct SemVer { int major, minor, patch; };

          static bool operator==(const SemVer &x, const SemVer &y) {
              return x.major == y.major && x.minor == y.minor && x.patch == y.patch;
          }

          static int s2n(const char *s, const char *e) { // pre-parsed string to number
              int n = 0;
              for (; s < e; ++s) n = n * 10 + (*s - '0');
              return n;
          }

          static int fill(Input &in) {
              if (in.eof) return 1;

              const size_t shift = in.tok - in.buf;
              const size_t used = in.lim - in.tok;

              // Error: lexeme too long. In real life could reallocate a larger buffer.
              if (shift < 1) return 2;

              // Shift buffer contents (discard everything up to the current token).
              memmove(in.buf, in.tok, used);
              in.lim -= shift;
              in.cur -= shift;
              in.mar -= shift;
              in.tok -= shift;
              // Tag variables need to be shifted like other input positions. The check
              // for non-NULL is only needed if some tags are nested inside of alternative
              // or repetition, so that they can have NULL value.
              /*!stags:re2c format = "if (in.@@) in.@@ -= shift;\n"; */

              // Fill free space at the end of buffer with new data from file.
              in.lim += fread(in.lim, 1, BUFSIZE - used, in.file);
              in.lim[0] = 0;
              in.eof = in.lim < in.buf + BUFSIZE;
              return 0;
          }

          static bool lex(Input &in, std::vector<SemVer> &vers) {
              // User-defined local variables that store final tag values.
              // They are different from tag variables autogenerated with `stags:re2c`,
              // as they are set at the end of match and used only in semantic actions.
              const char *t1, *t2, *t3, *t4;
              for (;;) {
                  in.tok = in.cur;
              /*!re2c
                  re2c:eof = 0;
                  re2c:api:style = free-form;
                  re2c:define:YYCTYPE  = char;
                  re2c:define:YYCURSOR = in.cur;
                  re2c:define:YYMARKER = in.mar;
                  re2c:define:YYLIMIT  = in.lim;
                  re2c:define:YYFILL   = "fill(in) == 0";
                  re2c:tags:expression = "in.@@";

                  num = [0-9]+;

                  num @t1 "." @t2 num @t3 ("." @t4 num)? [\n] {
                      int major = s2n(in.tok, t1);
                      int minor = s2n(t2, t3);
                      int patch = t4 != NULL ? s2n(t4, in.cur - 1) : 0;
                      SemVer ver = {major, minor, patch};
                      vers.push_back(ver);
                      continue;
                  }
                  $ { return true; }
                  * { return false; }
              */}
          }

          int main() {
              const char *fname = "input";
              const SemVer semver = {1, 22, 333};
              std::vector<SemVer> expect(BUFSIZE, semver), actual;

              // Prepare input file (make sure it exceeds buffer size).
              FILE *f = fopen(fname, "w");
              for (int i = 0; i < BUFSIZE; ++i) fprintf(f, "1.22.333\n");
              fclose(f);

              // Reopen input file for reading.
              f = fopen(fname, "r");

              // Initialize lexer state: all pointers are at the end of buffer.
              Input in;
              in.file = f;
              in.cur = in.mar = in.tok = in.lim = in.buf + BUFSIZE;
              /*!stags:re2c format = "in.@@ = in.lim;\n"; */
              in.eof = false;
              // Sentinel (at YYLIMIT pointer) is set to zero, which triggers YYFILL.
              *in.lim = 0;

              // Run the lexer and check results.
              assert(lex(in, actual) && expect == actual);

              // Cleanup: remove input file.
              fclose(f);
              remove(fname);
              return 0;
          }
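
       The tag adjustment performed in fill can be seen in isolation in the
       sketch below. It is not generated by re2c; the State struct and the
       shift_buffer function are made up for illustration. A tag stored as a
       pointer into the buffer has to move together with the buffer contents,
       while a tag stored as an offset from the start of the whole input
       needs no adjustment.

```cpp
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Hypothetical miniature lexer state (names are illustrative, not re2c API).
struct State {
    char buf[16];      // buffer holding a window of the input
    char *tok;         // start of the current lexeme (pointer into buf)
    char *lim;         // end of the data in buf
    char *tag;         // an s-tag stored as a pointer into buf (may be NULL)
    size_t tag_offset; // the same tag stored as an offset from input start
};

// Discard everything before `tok`, similar to the buffer shift in `fill`.
// Returns the number of discarded bytes.
static size_t shift_buffer(State &st) {
    const size_t shift = (size_t)(st.tok - st.buf);
    memmove(st.buf, st.tok, (size_t)(st.lim - st.tok));
    st.lim -= shift;
    st.tok -= shift;
    // A pointer tag must follow the data it points to (NULL means "no value").
    if (st.tag) st.tag -= shift;
    // st.tag_offset is relative to the whole input, so it is left untouched.
    return shift;
}
```

       The price of offset-based tags in such a design is that the lexer has
       to remember how much input has already been discarded in order to turn
       an offset back into a buffer position.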

       Here is an example of using POSIX capturing groups  to  parse  semantic
       versions.

          // re2c $INPUT -o $OUTPUT
          #include <assert.h>
          #include <stddef.h>

          // Maximum number of capturing groups among all rules.
          /*!maxnmatch:re2c*/

          struct SemVer { int major, minor, patch; };

          static int s2n(const char *s, const char *e) { // pre-parsed string to number
              int n = 0;
              for (; s < e; ++s) n = n * 10 + (*s - '0');
              return n;
          }

          static bool lex(const char *str, SemVer &ver) {
              const char *YYCURSOR = str, *YYMARKER;

              // Allocate memory for capturing parentheses (twice the number of groups).
              const char *yypmatch[YYMAXNMATCH * 2];
              size_t yynmatch;

              // Autogenerated tag variables used by the lexer to track tag values.
              /*!stags:re2c format = 'const char *@@;\n'; */

              /*!re2c
                  re2c:yyfill:enable = 0;
                  re2c:define:YYCTYPE = char;
                  re2c:posix-captures = 1;

                  num = [0-9]+;

                  (num) "." (num) ("." num)? [\x00] {
                      // `yynmatch` is the number of capturing groups
                      assert(yynmatch == 4);
                      // Even `yypmatch` values are for opening parentheses, odd values
                      // are for closing parentheses, the first group is the whole match.
                      ver.major = s2n(yypmatch[2], yypmatch[3]);
                      ver.minor = s2n(yypmatch[4], yypmatch[5]);
                      ver.patch = yypmatch[6] ? s2n(yypmatch[6] + 1, yypmatch[7]) : 0;
                      return true;
                  }
                  * { return false; }
              */
          }

          int main() {
              SemVer v;
              assert(lex("23.34", v) && v.major == 23 && v.minor == 34 && v.patch == 0);
              assert(lex("1.2.999", v) && v.major == 1 && v.minor == 2 && v.patch == 999);
              assert(!lex("1.a", v));
              return 0;
          }
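
       The layout of yypmatch can be illustrated without running re2c. In the
       sketch below the helper group is hypothetical (it is not part of the
       re2c API), and it assumes the layout described in the comments above:
       group i occupies the half-open range from yypmatch[2 * i] to
       yypmatch[2 * i + 1], and a group that did not participate in the match
       has NULL bounds.

```cpp
#include <assert.h>
#include <stddef.h>
#include <string>

// Hypothetical helper: text of capturing group `i`, or an empty string if the
// group did not participate in the match (its bounds are NULL).
static std::string group(const char *const *yypmatch, size_t yynmatch, size_t i) {
    assert(i < yynmatch);
    const char *open = yypmatch[2 * i];
    const char *close = yypmatch[2 * i + 1];
    if (open == NULL || close == NULL) return std::string();
    return std::string(open, close);
}
```

       For the rule above and the input "1.2.999" one would expect groups 1,
       2 and 3 to cover "1", "2" and ".999" respectively, which is why the
       semantic action skips one character before converting the patch
       component.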

       Here  is  an example of using m-tags to parse a version with a variable
       number of components. Tag variables are stored in a trie.

          // re2c $INPUT -o $OUTPUT
          #include <assert.h>
          #include <stddef.h>
          #include <vector>

          static const int MTAG_ROOT = -1;

          // An m-tag tree is a way to store histories with an O(1) copy operation.
          // Histories naturally form a tree, as they have common start and fork at some
          // point. The tree is stored as an array of pairs (tag value, link to parent).
          // An m-tag is represented with a single link in the tree (array index).
          struct Mtag {
              const char *elem; // tag value
              int pred; // index of the predecessor node or root
          };
          typedef std::vector<Mtag> MtagTrie;

          typedef std::vector<int> Ver; // unbounded number of version components

          static int s2n(const char *s, const char *e) { // pre-parsed string to number
              int n = 0;
              for (; s < e; ++s) n = n * 10 + (*s - '0');
              return n;
          }

          // Append a single value to an m-tag history.
          static void add_mtag(MtagTrie &trie, int &mtag, const char *value) {
              Mtag m = {value, mtag};
              mtag = (int)trie.size();
              trie.push_back(m);
          }

          // Recursively unwind tag histories and collect version components.
          static void unfold(const MtagTrie &trie, int x, int y, Ver &ver) {
              // Reached the root of the m-tag tree, stop recursion.
              if (x == MTAG_ROOT && y == MTAG_ROOT) return;

              // Tag histories must have equal length (check before indexing).
              assert(x != MTAG_ROOT && y != MTAG_ROOT);

              // Unwind history further.
              unfold(trie, trie[x].pred, trie[y].pred, ver);

              // Get tag values.
              const char *ex = trie[x].elem, *ey = trie[y].elem;

              if (ex != NULL && ey != NULL) {
                  // Both tags are valid pointers, extract component.
                  ver.push_back(s2n(ex, ey));
              } else {
                  // Both tags are NULL (this corresponds to zero repetitions).
                  assert(ex == NULL && ey == NULL);
              }
          }

          static bool parse(const char *str, Ver &ver) {
              const char *YYCURSOR = str, *YYMARKER;
              MtagTrie mt;

              // User-defined tag variables that are available in semantic action.
              const char *t1, *t2;
              int t3, t4;

              // Autogenerated tag variables used by the lexer to track tag values.
              /*!stags:re2c format = 'const char *@@ = NULL;'; */
              /*!mtags:re2c format = 'int @@ = MTAG_ROOT;'; */

              /*!re2c
                  re2c:api:style = free-form;
                  re2c:define:YYCTYPE = char;
                  re2c:define:YYSTAGP = "@@ = YYCURSOR;";
                  re2c:define:YYSTAGN = "@@ = NULL;";
                  re2c:define:YYMTAGP = "add_mtag(mt, @@, YYCURSOR);";
                  re2c:define:YYMTAGN = "add_mtag(mt, @@, NULL);";
                  re2c:yyfill:enable = 0;
                  re2c:tags = 1;

                  num = [0-9]+;

                  @t1 num @t2 ("." #t3 num #t4)* [\x00] {
                      ver.clear();
                      ver.push_back(s2n(t1, t2));
                      unfold(mt, t3, t4, ver);
                      return true;
                  }
                  * { return false; }
              */
          }

          int main() {
              Ver v;
              assert(parse("1", v) && v == Ver({1}));
              assert(parse("1.2.3.4.5.6.7", v) && v == Ver({1, 2, 3, 4, 5, 6, 7}));
              assert(!parse("1.2.", v));
              return 0;
          }

ENCODING SUPPORT
       It is necessary to understand the difference between  code  points  and
       code  units.  A  code point is a numeric identifier of a symbol. A code
       unit is the smallest unit of storage in the encoded text. A single code
       point may be represented with one or more code units. In a fixed-length
       encoding all code points are represented with the same number  of  code
       units.  In  a  variable-length  encoding code points may be represented
       with a different number of code units.  Note that the  "any"  rule  [^]
       matches any code point, but not necessarily any code unit (the only way
       to match any code unit regardless of the encoding is the  default  rule
       *).  The generated lexer works with a stream of code units: yych stores
       a code unit, and YYCTYPE is the code unit type. Regular expressions, on
       the  other  hand, are specified in terms of code points. When re2c com-
       piles regular expressions to automata it translates code points to code
       units.  This  is generally not a simple mapping: in variable-length en-
       codings a single code point range may get translated to a complex  code
       unit graph.  The following encodings are supported:

       • ASCII  (enabled  by default). It is a fixed-length encoding with code
         space [0-255] and 1-byte code points and code units.

       • EBCDIC (enabled with  --ebcdic  or  re2c:encoding:ebcdic).  It  is  a
         fixed-length  encoding with code space [0-255] and 1-byte code points
         and code units.

       • UCS2  (enabled  with  --ucs2  or   re2c:encoding:ucs2).   It   is   a
         fixed-length  encoding  with  code  space  [0-0xFFFF] and 2-byte code
         points and code units.

       • UTF8 (enabled with --utf8 or re2c:encoding:utf8). It is a
         variable-length Unicode encoding. Code unit size is 1 byte. Code
         points are represented with 1 to 4 code units.

       • UTF16 (enabled with --utf16 or re2c:encoding:utf16). It is a
         variable-length Unicode encoding. Code unit size is 2 bytes. Code
         points are represented with 1 to 2 code units.

       • UTF32  (enabled  with  --utf32  or  re2c:encoding:utf32).  It  is   a
         fixed-length Unicode encoding with code space [0-0x10FFFF] and 4-byte
         code points and code units.
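
       The relation between code points and code units can be made concrete
       with a small sketch. The functions below are illustrative helpers and
       not part of re2c: they compute how many code units a Unicode code
       point occupies in UTF8 and UTF16, and encode a code point as a
       sequence of UTF8 code units. Surrogates are not treated specially
       here.

```cpp
#include <assert.h>
#include <stdint.h>
#include <vector>

// Number of 1-byte code units needed for code point `c` in UTF8.
static int utf8_units(uint32_t c) {
    assert(c <= 0x10FFFF);
    return c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3 : 4;
}

// Number of 2-byte code units needed for code point `c` in UTF16.
static int utf16_units(uint32_t c) {
    assert(c <= 0x10FFFF);
    return c < 0x10000 ? 1 : 2;
}

// Encode code point `c` as a sequence of UTF8 code units.
static std::vector<uint8_t> utf8_encode(uint32_t c) {
    std::vector<uint8_t> u;
    switch (utf8_units(c)) {
    case 1:
        u.push_back((uint8_t)c);
        break;
    case 2:
        u.push_back((uint8_t)(0xC0 | (c >> 6)));
        u.push_back((uint8_t)(0x80 | (c & 0x3F)));
        break;
    case 3:
        u.push_back((uint8_t)(0xE0 | (c >> 12)));
        u.push_back((uint8_t)(0x80 | ((c >> 6) & 0x3F)));
        u.push_back((uint8_t)(0x80 | (c & 0x3F)));
        break;
    default:
        u.push_back((uint8_t)(0xF0 | (c >> 18)));
        u.push_back((uint8_t)(0x80 | ((c >> 12) & 0x3F)));
        u.push_back((uint8_t)(0x80 | ((c >> 6) & 0x3F)));
        u.push_back((uint8_t)(0x80 | (c & 0x3F)));
        break;
    }
    return u;
}
```

       A single code point range in a regular expression may therefore
       contain code points of different encoded lengths, which is one reason
       why the translation to code units is not a simple mapping.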

       Include file include/unicode_categories.re  provides  re2c  definitions
       for the standard Unicode categories.

       Option  --input-encoding  specifies  source file encoding, which can be
       used to enable Unicode literals in  regular  expressions.  For  example
       --input-encoding  utf8  tells  re2c that the source file is in UTF8 (it
       differs from --utf8 which sets input text  encoding).  Option  --encod-
       ing-policy  specifies  the  way  re2c  handles Unicode surrogates (code
       points in range [0xD800-0xDFFF]).

       Below is an example of a lexer for UTF8 encoded Unicode identifiers.

          // re2c $INPUT -o $OUTPUT -8 --case-ranges -i
          #include <assert.h>
          #include <stdint.h>

          /*!include:re2c "unicode_categories.re" */

          static int lex(const char *s) {
              const char *YYCURSOR = s, *YYMARKER;
              /*!re2c
                  re2c:define:YYCTYPE = 'unsigned char';
                  re2c:yyfill:enable = 0;

                  // Simplified "Unicode Identifier and Pattern Syntax"
                  // (see https://unicode.org/reports/tr31)
                  id_start    = L | Nl | [$_];
                  id_continue = id_start | Mn | Mc | Nd | Pc | [\u200D\u05F3];
                  identifier  = id_start id_continue*;

                  identifier { return 0; }
                  *          { return 1; }
              */
          }

          int main() {
              assert(lex("_Ыдентификатор") == 0);
              return 0;
          }

INCLUDE FILES
       re2c allows one to include other files using the directive
       /*!include:re2c FILE */ or !include FILE ;, where FILE is a path to
       the file to be included. The first form should be used outside of re2c
       blocks, and the second form allows one to include a file in the middle
       of a re2c block. re2c looks for included files in the directory of the
       including file and in include locations, which can be specified with
       the -I option.

       Include directives in re2c work in the same way as C/C++ #include: the
       contents of FILE are copy-pasted verbatim in place of the directive.
       Include files may have further includes of their own. Use the
       --depfile option to track build dependencies of the output file on
       include files.

       re2c provides some predefined include files that can be found in the
       include/ subdirectory of the project. These files contain definitions
       that can be useful to other projects (such as Unicode categories) and
       form something like a standard library for re2c. Below is an example
       of using the include directive.

   Include file 1 (definitions.h)
          typedef enum { OK, FAIL } Result;

          /*!re2c
              number = [1-9][0-9]*;
          */

   Include file 2 (extra_rules.re.inc)
          // floating-point numbers
          frac  = [0-9]* "." [0-9]+ | [0-9]+ ".";
          exp   = 'e' [+-]? [0-9]+;
          float = frac exp? | [0-9]+ exp;

          float { return OK; }

   Input file
          // re2c $INPUT -o $OUTPUT -i
          #include <assert.h>
          /*!include:re2c "definitions.h" */

          Result lex(const char *s) {
              const char *YYCURSOR = s, *YYMARKER;
              /*!re2c
                  re2c:define:YYCTYPE = char;
                  re2c:yyfill:enable = 0;

                  *      { return FAIL; }
                  number { return OK; }
                  !include "extra_rules.re.inc";
              */
          }

          int main() {
              assert(lex("123") == OK);
              assert(lex("123.4567") == OK);
              return 0;
          }

HEADER FILES
       re2c allows one to generate a header file from the input .re file
       using option -t, --type-header or configuration re2c:flags:type-header
       and directives /*!header:re2c:on*/ and /*!header:re2c:off*/. The first
       directive marks the beginning of the header file, and the second
       directive marks the end of it. Everything between these directives is
       processed by re2c, and the generated code is written to the file
       specified by the -t, --type-header option (or stdout if this option
       was not used). An autogenerated header file may be needed in cases
       when re2c is used to generate definitions of constants, variables and
       structs that must be visible from other translation units.

       Here is an example of generating a header file that contains a
       definition of the lexer state with tag variables (the number of
       variables depends on the regular grammar and is unknown to the
       programmer).

   Input file
          // re2c $INPUT -o $OUTPUT -i --header lexer/state.h
          #include <assert.h>
          #include <stddef.h>
          #include "lexer/state.h" // the header is generated by re2c

          /*!header:re2c:on*/
          struct LexerState {
              const char *str, *cur;
              /*!stags:re2c format = "const char *@@;"; */
          };
          /*!header:re2c:off*/

          long lex(LexerState& st) {
              const char *t;
              /*!re2c
                  re2c:header = "lexer/state.h";
                  re2c:yyfill:enable = 0;
                  re2c:define:YYCTYPE = char;
                  re2c:define:YYCURSOR = "st.cur";
                  re2c:tags = 1;
                  re2c:tags:expression = "st.@@";

                  [a]* @t [b]* { return t - st.str; }
              */
          }

          int main() {
              const char *s = "ab";
              LexerState st = { s, s /*!stags:re2c format = ", NULL"; */ };
              assert(lex(st) == 1);
              return 0;
          }

   Header file
          /* Generated by re2c */

          struct LexerState {
              const char *str, *cur;
              const char *yyt1;
          };

SKELETON PROGRAMS
       With the -S, --skeleton option, re2c ignores all non-re2c code and gen-
       erates a self-contained C program that can be further compiled and exe-
       cuted. The program consists of lexer code and input data. For each con-
       structed DFA (block or condition) re2c generates a standalone lexer and
       two files: an .input file with strings derived from the DFA and a .keys
       file with expected match results. The program runs each  lexer  on  the
       corresponding  .input  file and compares results with the expectations.
       Skeleton programs are very useful for a number of reasons:

       • They can check correctness of various re2c optimizations (the data is
         generated  early  in the process, before any DFA transformations have
         taken place).

       • Generating a set of input data with good coverage may be  useful  for
         both testing and benchmarking.

       • Generating self-contained executable programs allows one to get mini-
         mized test cases (the original code may be large or have a lot of de-
         pendencies).

       The difficulty with generating input data is that for all but the most
       trivial cases the number of possible input strings is too large (even
       if the string length is limited). re2c solves this difficulty by
       generating sufficiently many strings to cover almost all DFA
       transitions. It uses the following algorithm. First, it constructs a
       skeleton of the DFA. For encodings with 1-byte code unit size (such as
       ASCII, UTF-8 and EBCDIC) the skeleton is just an exact copy of the
       original DFA. For encodings with multibyte code units the skeleton is
       a copy of the DFA with certain transitions omitted: namely, re2c takes
       at most 256 code units for each disjoint continuous range that
       corresponds to a DFA transition. The chosen values are evenly
       distributed and include range bounds. Instead of trying to cover all
       possible paths in the skeleton (which is infeasible), re2c generates
       sufficiently many paths to cover all skeleton transitions, and thus
       trigger the corresponding conditional jumps in the lexer. The
       algorithm implementation is limited by ~1Gb of transitions and
       consumes a constant amount of memory (re2c writes data to file as soon
       as it is generated).
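
       The selection of code units for a single transition can be sketched as
       follows. This only illustrates the idea of "at most 256 evenly
       distributed values that include the range bounds": the function name
       sample_range and the exact rounding are assumptions, and the actual
       re2c implementation may choose different intermediate values.

```cpp
#include <assert.h>
#include <stdint.h>
#include <vector>

// Hypothetical helper: pick at most `max_count` values from the inclusive
// range [lo, hi], evenly spread and always including both bounds.
static std::vector<uint32_t> sample_range(uint32_t lo, uint32_t hi,
                                          uint32_t max_count = 256) {
    assert(lo <= hi && max_count >= 2);
    std::vector<uint32_t> values;
    const uint64_t width = (uint64_t)hi - lo; // distance from lo to hi
    if (width < max_count) {
        // Small range: take every value.
        for (uint64_t i = 0; i <= width; ++i) {
            values.push_back(lo + (uint32_t)i);
        }
    } else {
        // Large range: max_count points at (almost) equal distances,
        // the first one is lo and the last one is hi.
        const uint64_t last = max_count - 1;
        for (uint64_t i = 0; i <= last; ++i) {
            values.push_back(lo + (uint32_t)(width * i / last));
        }
    }
    return values;
}
```

       With this sketch a transition on the full UCS2 range [0-0xFFFF] would
       be represented by 256 code units, while a transition on [a-z] would
       keep all 26 of them.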

VISUALIZATION AND DEBUG
       With  the  -D, --emit-dot option, re2c does not generate code. Instead,
       it dumps the generated DFA in DOT format.  One can convert this dump to
       an  image of the DFA using Graphviz or another library.  Note that this
       option shows the final DFA after it has gone through a number of  opti-
       mizations  and transformations. Earlier stages can be dumped with vari-
       ous debug options, such as --dump-nfa,  --dump-dfa-raw  etc.  (see  the
       full list of options).

SEE ALSO
       You can find more information about re2c at the official website:
       http://re2c.org. Similar programs are flex(1), lex(1) and quex
       (http://quex.sourceforge.net).

AUTHORS
       re2c was originally written by Peter Bumbulis in 1993. Since then it
       has been developed and maintained by multiple volunteers; most notably,
       Brian Young, Marcus Boerger, Dan Nuffer and Ulya Trofimovich.

                                                                       RE2C(1)
