Lexical conventions - The Objective Caml language

The Objective Caml language

6.1 Lexical conventions

Blanks

The following characters are considered as blanks: space, newline, horizontal tabulation, carriage return, line feed and form feed. Blanks are ignored, but they separate adjacent identifiers, literals and keywords that would otherwise be confused as one single identifier, literal or keyword.

Comments

Comments are introduced by the two characters (*, with no intervening blanks, and terminated by the characters *), with no intervening blanks. Comments are treated as blank characters.

Comments do not occur inside string or character literals. Nested comments are handled correctly.

Identifiers

ident ::= (letter |_){letter |0. . .9|_|’}

letter ::= A. . .Z|a. . .z

Identifiers are sequences of letters, digits,_(the underscore character), and’(the single quote), starting with a letter or an underscore. Letters contain at least the 52 lowercase and uppercase letters from the ASCII set. The current implementation also recognizes as letters all accented characters from the ISO 8859-1 (“ISO Latin 1”) set. All characters in an identifier are meaningful.

The current implementation accepts identifiers up to 16000000 characters in length.

Integer literals

integer-literal ::= [-] (0. . .9){0. . .9|_}

| [-] (0x|0X) (0. . .9|A. . .F|a. . .f){0. . .9|A. . .F|a. . .f|_}

| [-] (0o|0O) (0. . .7){0. . .7|_}

| [-] (0b|0B) (0. . .1){0. . .1|_}

An integer literal is a sequence of one or more digits, optionally preceded by a minus sign. By default, integer literals are in decimal (radix 10). The following prefixes select a different radix:

Prefix Radix

0x,0X hexadecimal (radix 16) 0o,0O octal (radix 8)

0b,0B binary (radix 2)

(The initial0is the digit zero; theOfor octal is the letter O.) The interpretation of integer literals that fall outside the range of representable integer values is undefined.

For convenience and readability, underscore characters (_) are accepted (and ignored) within integer literals.

Floating-point literals

float-literal ::= [-] (0. . .9){0. . .9|_}[.{0. . .9|_}] [(e|E) [+|-] (0. . .9){0. . .9|_}]

Floating-point decimals consist in an integer part, a decimal part and an exponent part. The integer part is a sequence of one or more digits, optionally preceded by a minus sign. The decimal part is a decimal point followed by zero, one or more digits. The exponent part is the character e or E followed by an optional + or - sign, followed by one or more digits. The decimal part or the exponent part can be omitted, but not both to avoid ambiguity with integer literals. The interpretation of floating-point literals that fall outside the range of representable floating-point values is undefined.

For convenience and readability, underscore characters (_) are accepted (and ignored) within floating-point literals.

Character literals

char-literal ::= ’regular-char ’

| ’escape-sequence’ escape-sequence ::= \(\|"|’|n|t|b|r)

| \(0. . .9) (0. . .9) (0. . .9)

| \x(0. . .9|A. . .F|a. . .f) (0. . .9|A. . .F|a. . .f)

Character literals are delimited by ’ (single quote) characters. The two single quotes enclose either one character different from’and \, or one of the escape sequences below:

Sequence Character denoted

\\ backslash (\)

\" double quote (")

\’ single quote (’)

\n linefeed (LF)

\r carriage return (CR)

\t horizontal tabulation (TAB)

\b backspace (BS)

\space space (SPC)

\ddd the character with ASCII code dddin decimal

\xhh the character with ASCII code hhin hexadecimal String literals

string-literal ::= "{string-character}"

string-character ::= regular-char-str

| escape-sequence

String literals are delimited by " (double quote) characters. The two double quotes enclose a sequence of either characters different from"and\, or escape sequences from the table given above for character literals.

To allow splitting long string literals across lines, the sequence \newline blanks (a\at end-of-line followed by any number of blanks at the beginning of the next end-of-line) is ignored inside string literals.

The current implementation places practically no restrictions on the length of string literals.

Naming labels

To avoid ambiguities, naming labels in expressions cannot just be defined syntactically as the sequence of the three tokens ~,ident and :, and have to be defined at the lexical level.

label-name ::= (a. . .z|_){letter |0. . .9|_|’}

label ::= ~label-name: optlabel ::= ?label-name:

Naming labels come in two flavours: labelfor normal arguments andoptlabelfor optional ones.

They are simply distinguished by their first character, either~or?.

Despite label andoptlabelbeing lexical entities in expressions, their expansions~label-name: and ? label-name : will be used in grammars, for the sake of readability. Note also that inside type expressions, this expansion can be taken literally,i.e. there are really 3 tokens, with optional spaces beween them.

Prefix and infix symbols

infix-symbol ::= (=|<|>|@|^|||&|+|-|*|/|$|%){operator-char}

prefix-symbol ::= (!|?|~){operator-char}

operator-char ::= !|$|%|&|*|+|-|.|/|:|<|=|>|?|@|^|||~

Sequences of “operator characters”, such as <=> or !!, are read as a single token from the infix-symbol orprefix-symbol class. These symbols are parsed as prefix and infix operators inside expressions, but otherwise behave much as identifiers.

Keywords

The identifiers below are reserved as keywords, and cannot be employed otherwise:

and as assert asr begin class

constraint do done downto else end

exception external false for fun function

functor if in include inherit initializer

land lazy let lor lsl lsr

lxor match method mod module mutable

new object of open or private

rec sig struct then to true

try type val virtual when while

with

The following character sequences are also keywords:

!= # & && ’ ( ) * + ,

--. -> . .. : :: := :> ; ;; <

<- = > >] >} ? ?? [ [< [> [|

] _ ‘ { {< | |] } ~

Note that the following identifiers are keywords of the Camlp4 extensions and should be avoided for compatibility reasons.

parser << <: >> $ $$ $:

Ambiguities

Lexical ambiguities are resolved according to the “longest match” rule: when a character sequence can be decomposed into two tokens in several different ways, the decomposition retained is the one with the longest first token.

Line number directives

linenum-directive ::= #{0. . .9}⁺

| #{0. . .9}⁺"{string-character}"

Preprocessors that generate Caml source code can insert line number directives in their output so that error messages produced by the compiler contain line numbers and file names referring to the source file before preprocessing, instead of after preprocessing. A line number directive is composed of a # (sharp sign), followed by a positive integer (the source line number), optionally followed by a character string (the source file name). Line number directives are treated as blank characters during lexical analysis.

6.2 Values

This section describes the kinds of values that are manipulated by Objective Caml programs.

6.2.1 Base values Integer numbers

Integer values are integer numbers from −2³⁰ to 2³⁰−1, that is−1073741824 to 1073741823. The implementation may support a wider range of integer values: on 64-bit platforms, the current implementation supports integers ranging from−2⁶² to 2⁶²−1.

Floating-point numbers

Floating-point values are numbers in floating-point representation. The current implementation uses double-precision floating-point numbers conforming to the IEEE 754 standard, with 53 bits of mantissa and an exponent ranging from −1022 to 1023.

Characters

Character values are represented as 8-bit integers between 0 and 255. Character codes between 0 and 127 are interpreted following the ASCII standard. The current implementation interprets character codes between 128 and 255 following the ISO 8859-1 standard.

Character strings

String values are finite sequences of characters. The current implementation supports strings con-taining up to 2²⁴−5 characters (16777211 characters); on 64-bit platforms, the limit is 2⁵⁷−9.

6.2.2 Tuples

Tuples of values are written (v1, . . . , vn), standing for the n-tuple of values v1 to vn. The current implementation supports tuple of up to 2²²−1 elements (4194303 elements).

6.2.3 Records

Record values are labeled tuples of values. The record value written{f ield₁ =v₁;. . .;f ield_n=v_n} associates the value v_i to the record field f ield_i, for i = 1. . . n. The current implementation supports records with up to 2²²−1 fields (4194303 fields).

6.2.4 Arrays

Arrays are finite, variable-sized sequences of values of the same type. The current implementation supports arrays containing up to 2²² −1 elements (4194303 elements) unless the elements are floating-point numbers (2097151 elements in this case); on 64-bit platforms, the limit is 2⁵⁴−1 for all arrays.

6.2.5 Variant values

Variant values are either a constant constructor, or a pair of a non-constant constructor and a value. The former case is writtenconstr; the latter case is writtenconstr(v), wherev is said to be the argument of the non-constant constructor constr.

The following constants are treated like built-in constant constructors:

Constant Constructor false the boolean false true the boolean true () the “unit” value [] the empty list

The current implementation limits each variant type to have at most 246 non-constant con-structors.

6.2.6 Polymorphic variants

Polymorphic variants are an alternate form of variant values, not belonging explicitly to a predefined variant type, and following specific typing rules. They can be either constant, written‘tag-name, or non-constant, written ‘tag-name(v).

6.2.7 Functions

Functional values are mappings from values to values.

6.2.8 Objects

Objects are composed of a hidden internal state which is a record of instance variables, and a set of methods for accessing and modifying these variables. The structure of an object is described by the toplevel class that created it.

6.3 Names

Identifiers are used to give names to several classes of language objects and refer to these objects by name later:

• value names (syntactic classvalue-name),

• value constructors and exception constructors (classconstr-name),

• labels (label-name),

• variant tags (tag-name),

• type constructors (typeconstr-name),

• record fields (field-name),

• class names (class-name),

• method names (method-name),

• instance variable names (inst-var-name),

• module names (module-name),

• module type names (modtype-name).

These eleven name spaces are distinguished both by the context and by the capitalization of the identifier: whether the first letter of the identifier is in lowercase (written lowercase-ident below) or in uppercase (written capitalized-ident). Underscore is considered a lowercase letter for this purpose.

Naming objects

value-name ::= lowercase-ident

| (operator-name ) operator-name ::= prefix-symbol |infix-op

infix-op ::= infix-symbol

| *|=|or|&|:=

| mod|land|lor|lxor|lsl|lsr|asr constr-name ::= capitalized-ident

label-name ::= lowercase-ident tag-name ::= capitalized-ident typeconstr-name ::= lowercase-ident

field-name ::= lowercase-ident module-name ::= capitalized-ident modtype-name ::= ident

class-name ::= lowercase-ident inst-var-name ::= lowercase-ident method-name ::= lowercase-ident

As shown above, prefix and infix symbols as well as some keywords can be used as value names, provided they are written between parentheses. The capitalization rules are summarized in the table below.

Name space Case of first letter

Values lowercase

Constructors uppercase

Labels lowercase

Variant tags uppercase Exceptions uppercase Type constructors lowercase Record fields lowercase

Classes lowercase

Instance variables lowercase

Methods lowercase

Modules uppercase

Module types any

Note on variant tags: the current implementation accepts lowercase variant tags in addition to uppercase variant tags, but we suggest you avoid lowercase variant tags for portability and compatibility with future OCaml versions.

Referring to named objects

value-path ::= value-name

| module-path.value-name constr ::= constr-name

| module-path.constr-name typeconstr ::= typeconstr-name

| extended-module-path.typeconstr-name field ::= field-name

| module-path.field-name module-path ::= module-name

| module-path.module-name extended-module-path ::= module-name

| extended-module-path.module-name

| extended-module-path(extended-module-path) modtype-path ::= modtype-name

| extended-module-path.modtype-name class-path ::= class-name

| module-path.class-name

A named object can be referred to either by its name (following the usual static scoping rules for names) or by an access path prefix . name, where prefix designates a module and name is the name of an object defined in that module. The first component of the path, prefix, is either a simple module name or an access path name₁ .name₂. . ., in case the defining module is itself nested inside other modules. For referring to type constructors or module types, the prefix can also contain simple functor applications (as in the syntactic classextended-module-pathabove), in case the defining module is the result of a functor application.

Label names, tag names, method names and instance variable names need not be qualified: the former three are global labels, while the latter are local to a class.

Dans le document The Objective Caml system release 3.12 (Page 93-102)