Information Technology Environment — Ember

Introduction

This section documents the Ember computing environment: a centralized append-only information store, a computer operating system integrated with that information store, and related specifications. This is a work-in-progress draft, and everything here is subject to change and is not presently suited for implementation.

Overview

The computing environment will consist of the following components:

Development principles

Prerequisites for code to be added to the repository

How issues should be prioritised

Ordered from highest priority to lowest priority

  1. Security vulnerabilities
  2. Functional regressions
  3. Incorrect results
  4. Crashes and similar critical usability issues
  5. Slow code with a significant impact on usability
  6. Aesthetic regressions
  7. Minor usability issues
  8. Slow code with a moderate impact on usability
  9. Missing features

Data formats

Ember Language

Objective

Develop a machine-readable language that can be source-to-source translated into other languages. It should be practically useable as a shell as well as for running stored programs. Possible target languages to investigate include Rakudo Perl 6, NQP, C--, C, Qt, and JavaScript.

Language profiles

Ember Language programs may optionally declare a non-default language profile to use: Core, Basic, and Dangerous (the default is "Standard"). Core and Basic both restrict the program to a subset of the language. The Basic language interpreter is written using the Core subset of the language, and provides useful shortcuts to use in the development of the interpreter for the Standard profile. The Default language interpreter is written using the Basic subset of the language. The Dangerous profile allows using language features which are probably a bad idea to use, but may be needed in some cases.

Dcs

The core unit of the Ember Language is the Dc (Document Component). The defined Dcs are listed in DcData.csv. An Ember Language document is a list of Dcs, and a file is considered structurally valid if it can be interpreted as such. A Dc can have a syntactical pattern that it can require if it is to be meaningful. For example, a marker to begin a section of a document might be required to have a matching end marker. A document is only syntactically valid if the usage of each Dc contained within it conforms to the Dc's defined syntax, even if the document is otherwise structurally valid.

Reading DcData.csv

DcData.csv contains nine columns, each of which gives some information about a given Dc.

From left to right, the columns are: ID, Name, Combining class, Bidirectional class, Simple case mapping, Type, Script, Details, and Description.

The "ID" column specifies the number used to refer to a given Dc. Once an ID has been specified in a stable version, its meaning will not change in future versions.

The "Name" column specifies an informative name for the Dc. The names may change in future versions if the current names seem suboptimal. They should not be relied on as unique or stable identifiers. If a name is prefixed with "!", then that Dc is deprecated. Names should be unique within any given version of DcData.csv, although errors in it could compromise that (it is not currently checked by a computer).

"Combining class" column: See below.

"Bidirectional class" column: See below.

"Simple case mapping" column: This column contains the ID of the uppercase form of characters with the "Ll" type, and the ID of the lowercase form of characters with the "Lu" type.

"Type" column: See below.

The "Script" column indicates the script or other set to which the character belongs. Values needing further explanation include "Semantic", "DCE", "DCE sheets", "Noncharacters", "DCE versions", "Encapsulation", "EL Syntax", "EL Routines", and "EL Types".

The "Details" column contains various additional information about characters, as a comma-separated list.

The remaining list entries are aliases (alternate names for the characters, for ease of look-up).

The "Description" column contains additional comments regarding the Dc.

Three columns' contents are directly inherited from the Unicode Standard: Combining class (inherits Unicode's "Canonical_Combining_Class property"), Bidirectional class (inherits Unicode's "Bidi_Class" property), and Type (inherits Unicode's "General_Category" property). The "Simple case mapping" and "Script" columns should also be inherited from Unicode in some manner, but are not at present. For characters not included in Unicode, a reasonable value is chosen in the pattern of the values used by Unicode. If there are discrepancies between this value and Unicode's value for a given character that is in both sets, this should be reported as an error in the Ember Language standard. Unicode's values should take precedence.

"Type" column values also extend the Unicode Standard's possible values with the "!Cx" category, denoting characters that do not fit neatly into Unicode's existing categories.

Notes on specific Dcs
Dcs 241–245: Mode indicators

Inclusion of the mode indicators in documents is optional. The selected mode expresses information about the document's expected execution environment. These modes are shortcuts that set up the environment in advance so that the document does not need to contain specific code to set up these contexts. This lets the resulting documents more concise and readable.

Dcs 246–255: Source formatting control

Dcs 246 through 255 control the formatting of the EM format version of a document.

Document formats

There are six file formats defined by this specification. Four of them (EMD, EM, EMS, and DEMS) are general-use formats, while the fifth (EMR) is a special-purpose subset of the EMB format. EMS and DEMS are intended as an intermediate, more-readable format between EMD and EMB, and are not intended for information interchange (they are much larger than the other formats for a given document, in general).

To allow backward compatibility, once a completed version of this standard has been released, the meaning of any given Dc will not change. That will ensure that existing documents retain their meaning when interpreted using a newer version of the specification. While they are semantically stable, they are not necessarily presentation-stable (a Dc representing "A" in one version may look different from one version to the next, but it won't change to represent a "B"). Implementations should be able to render a document exactly as determined by earlier versions of the specification as well. A syntax should be provided to indicate the version of the specification a given Dc, region of Dcs, or document should be displayed using (exactly, not just semantically), although Dcs have not been created for this purpose yet.

There is a one-to-one correspondence between EMD, EM, EMS, and DEMS files (for any given document in one of those formats, there is only one way to represent it in the other formats), but not for EMR files (because EMR files can only represent a subset of Ember Language documents). That means that documents can be losslessly round-trip-converted between those four formats. However, when converting from an EM file, if it does not have a version specified, its behavior may change due to changes in the mapping between source code and Dc IDs. Source form should be able to represent syntactically invalid documents unambiguously. Whether structurally invalid source-form documents should be able to be represented as structurally valid Dc sequences is debatable.

EMD, EM, and EMS files are subsets of ASCII text files, with lines delimited by 0x0A (line feed). Bytes 0x00 through 0x09, 0x0B through 0x1F, and 0x7F through 0xFF (all ranges inclusive) are disallowed. Files must end with 0x0A. This may later be changed to use UTF-8.

At the end of each format's summary (except for EMR), a simple "Hello, World!" document is given in the format.

Ember Language documents (EMD), .emd

Ember Language documents are a list of Dcs. The Dcs mappable to the permitted ASCII characters are represented by those ASCII characters, with the exception of 0x40 "@" (Dc 1). All other Dcs are represented by "@" followed by the integer Dc ID followed by a space, such that, for instance, "@" would be represented as "@1 ".

Hello, World!
Ember Language source files (EM), .em

Ember Language source files are a programming language–inspired representation of Ember Language documents. It is the most readable of the formats, but also the most technically complex.

dc:
    Hello, World!

or more idiomatically (but not the exact equivalent of the others in terms of the Dcs used),

say 'Hello, World!'

which would be

256 258 260 262 # . . . .
264 263 57 86 # . . H e
93 93 96 30 # l l o ,
18 72 96 99 # . W o r
93 85 19 261 # l d ! .
259 # .

in Dcs, or even more simply the say could be omitted since literals are printed by default: 'Hello, World!'.

Ember Language sequence files (EMS), .ems

A list of Dc numbers. Four Dcs are given per line, separated by spaces.

57 86 93 93
96 30 18 72
96 99 93 85
19
Documented Ember Language sequence files (DEMS), .dems

A variant of the EMS format for easier reading: after each line, the printable ASCII equivalent of each Dc is given following 0x202320, each separated from the next by a space. If there is no printable ASCII equivalent, or the character is a space, "." is used instead.

57 86 93 93 # H e l l
96 30 18 72 # o , . W
96 99 93 85 # o r l d
19 # !
Source-Documented Ember Language sequence files (SEMS), .sems

A variant of the EMS and EM formats for easier reading: the EM source version is given in a comment in the style of the DEMS format, but the number of Dcs on each line is determined by the source lines to which they correspond.

# dc:
57 86 93 93 96 96 30 18 72 96 99 93 85 19 #     Hello, World!
Ember Record Documents (EMR), .emr

This is a special format in the "Structured" mode used for structured record storage in the Ember cloud. It is not yet defined, but will most likely be a subset of one of the other formats.

Structures in the Ember Language

The Ember Language uses the following main types of entity to represent information:

Type
Types are templates describing the structure of objects. They are known as prototypes or classes in most programming languages, depending on whether objects described by them inherit changes to the types made after the object was created. (Objects can be used as types by casting.) Type names begin with a capital letter when in source form.
Object
An object is an entity that conforms to a given type (an instance of that type). The most general type is object, and there is no need for an object to conform to any other type.
Block
A block is a group of statements. The routine block indicates a separation between blocks, although in most cases block division is indicated implicitly by a : at the end of a line (in which case the block indentation level increases), or by the decrease in whitespace marking a decrease in block indentation level.
Project
A project is a single document, and if relevant, any other documents maintained as part of that document.
Module
A module is one or more Library-mode documents that have a package name for addressing the things they provide.
List
A list is an ordered list of objects. An inline list is begun by [ and terminated by ]. Example: listName=["a" [5 6] $b]; say $listName[0] stores three Objects in a list named listName and prints a. Objects can be given custom identifiers for addressing them by separating them using : , in which case they should be additionally separated by , . Example: listName=[foo: "a", bar: [5, 6], baz: b]; say $listName[foo].
String
A string is a list of Dcs. (Because all Dcs can be used in strings, any data type can be cast to a string.) An inline list is begun by " and terminated by a second ". In source form, " and \ must be escaped using \. Example: stringName="Hello, \"World\"!" stores the string Hello, "World"! in a string named stringName. Strings can be concatenated by placing them beside each other: a="foo"; b="baz"; say $a"bar"$b prints foobarbaz. String literals can be specified without the quotation marks when they are used as parameters to a routine, because then they won't be confused for routine invocations. They need to have quotation marks if they look like numbers, though. I *think* context can be used to ensure that they aren't mistaken for types, but I'm not sure about that.
Routine
A routine is a set of instructions for a computer to follow as part of the process of interpreting a document. Similar concepts are known as functions or subroutines in most programming languages, or as methods when used within objects. Routines have an associated structure that indicates what parameters may be passed to it; they will be accessible through a list named !par, and if named, then also through their names. Specifying a return type for a routine is optional. If none is specified, it will be treated as "void", meaning no return type is expected. It is denoted by () followed by a block of statements, with its structure, if desired, within the parentheses. Example: String foo(String, String qux?, *){say $!par[1]$!par[2]$qux}; foo("bar" "baz"); foo("bar" 6 "qux") # qux is 6, param 0 is bar, param 1 is 6, and param 2 is qux; foo(qux=6 "bar") # qux is still 6, but now parameter 0 is 6 and 1 is bar represents an unnamed (positional-only) string parameter, an optional string parameter named "qux", and an unknown number (zero or more) of additional parameters of any type, and prints
bazbaz
6qux6
bar6
. Because literals are printed by default when at the beginning of a statement, the "invoke" routine must be used to invoke a routine in some cases, such as when referencing a routine by a name stored in a variable or constant: a="I! Am! An! Awkward! Identifier!"; $a(){say "blob"}; invoke $a. An alternative syntax for routine invocation, omitting parentheses, can be used if desired: foo(String, String qux?){}; foo qux=6 bar
Operator
An operator is a short notation or syntax pattern for some common routines (e.g., Number a + Number b in place of add(Number a, Number b), or if true; then say 'Hello, World!'; else die in place of if(true, (){say 'Hello, World!'}, (){die})).
Identifier
An identifier is a name for an object. They are indicated by $, except for identifiers for routines, which do not have the $ prefix. Before the $ a type signature is often present. Except for routines, ! following the $ indicates a language-defined identifier, and must be escaped if used as the first character of a custom identifier. For routines, the pattern is inverted: ! before an identifier indicates that a custom routine is being referenced (it must be included in calls to invoke, as well). ( following the $ indicates a special value, not a normal identifier, using Bash's syntax: $(say "foo"; say "bar") represents the output of the code between the ( and the ), and $(<foo) represents the contents of the file foo.
Structure
A structure is the definition of what the structure is that an entity can have, similar to type definitions or type signatures in some programming languages. A type can contain named Structures without any values for defining an interface.
Statement
A statement is a logical line of a document. It can be an invocation of a routine, or a declaration of an entity's structure or value.