
Turkish Demitasse (Turk/2 or T2), A Much Stronger Brew Than Java

T.Pittman

This paper describes the essential features of the programming language "Turkish Demitasse" (Turk/2 or T2), initially a dialect of Java, which, like the beverage it is named for, is stronger than its namesake. Strength in a programming language measures either of two things:

The power of the language to accomplish a range of programming tasks -- T2 is designed to be a systems programming language that needs no other language alongside it for any systems programming task, including its own compiler, its library code, and indeed the whole operating system.

The restrictions imposed by the data type system of the language, which also protects the programmer from errors that the compiler can thus catch at compile time.

One of the explicit design considerations in T2 is the realization that it is not the programmer's job to optimize code in ways a well-designed compiler can do better (however, this goal is not completely achieved in the current implementation). It is common knowledge that C was intended as a high-level assembler for the PDP-11 computer, and the programmer was given a lot of control over such optimizations as register allocation, array indexing, and common subexpression elimination. These low-level code tricks are of course increasingly inappropriate for target architectures differing from the original PDP-11, so much so that hand-coded C often performs worse than machine-generated C code derived from better languages like Modula-2 using the same C compiler.

The specification of the Java programming language went a long way toward eliminating the type weaknesses in C/C++, but it left too many problems and introduced a few new ones. Here I propose a much stronger typed language which also eliminates (almost all of) the rest of the problems. The result is mostly a subset of Java with some (mostly syntactic) enhancements for readability, plus a few significant extensions as noted below. Apart from the systems-programming extensions, a T2 program can be mechanically translated to a Java program that is semantically equivalent, and with a little more effort, into a semantically equivalent C++ program. Except as noted in the following sections, all of the Java syntax and semantics has been preserved in T2 for maximum code and skill portability.

One of the aims of T2 is to implement a systems programming language such that its compiler and all library code -- indeed a whole operating system -- can be efficiently written in T2 without falling back on other languages. To this end we relax the strong type system in controlled ways by the introduction of the package "Dangerous" containing low-level types and functions, which the compiler knows about and can generate efficient in-line code for. These improvements have been largely inspired by the small and powerful language Modula-2, which attained all these advantages in a somewhat more readable Pascal-like syntax; the "2" in the name of T2 is a nod of acknowledgement for our dear departed friend.

Deletions

The most significant type failure that Java inherited from C is that any type is assignment-compatible with type void. In other words, you can often with impunity discard expression results. Unless you confuse statements and expressions (as C does) there is no valid reason for discarding expression values; values should either be assigned to destination variables (or parameters), or used in some larger expressions, or else used directly in a control structure like if or switch. Although it is not absolutely preventable, we want to discourage the practice of letting expressions have side effects. Side effects make the program less readable and therefore less reliable. T2 does not disallow side effects, but it disallows the most egregious of them by giving assignment statements -- indeed all statements -- an implicit type void. This syntactically prevents the common programming error (which in Java is perfectly legal but has unexpected results when a and b are type boolean):

if (a=b) DoSomething();

We incidentally also disallow the confusing multiple assignment construct,

a=b=c;

but the same result is easily achieved without the confusing loss of readability by making two assignment statements:

b=c;
a=b;

Notice that this restriction does not reduce what can be programmed, it only reduces what can be programmed in an unreadable and dangerous manner. The restricted language is also still syntactically correct Java, and generally compiles to exactly the same machine code.
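Both hazards described in this section can be reproduced in plain Java. The following is only a minimal sketch, not code from the T2 distribution; the class and method names are hypothetical and chosen for illustration. It shows an assignment accepted as an if condition and an expression result that is silently discarded.

```java
// Illustrative sketch only; all names here are hypothetical.
class DiscardedValues {
    // Intended as a comparison, but "=" assigns b to a and then tests a.
    // Java accepts this only because both operands are boolean.
    static boolean assignInCondition(boolean a, boolean b) {
        if (a = b) {
            return true;
        }
        return false;
    }

    // The comparison the programmer presumably meant.
    static boolean compareInCondition(boolean a, boolean b) {
        if (a == b) {
            return true;
        }
        return false;
    }

    // The value returned by trim() is discarded without complaint,
    // so s comes back unchanged.
    static String discardedResult(String s) {
        s.trim();
        return s;
    }

    public static void main(String[] args) {
        System.out.println(assignInCondition(false, true));
        System.out.println(compareInCondition(false, true));
        System.out.println("[" + discardedResult("  hi  ") + "]");
    }
}
```

Under the T2 rule that statements have type void, the first and third methods would presumably be rejected at compile time, while the second is unaffected.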

The other significant restriction to the language introduced by T2 is the requirement that non-empty switch cases explicitly be coded with break (or some other form of goto, such as return or continue or throw) at the end of each case, so that it is no longer possible for one case to fall into the next one. You can achieve the same effect by duplicating code, and a smart compiler can optimize the duplications back out. It is not the programmer's responsibility to do that job, and it is too dangerous to allow it. A much better solution would be the case statement of Pascal or Modula-2 (which needs no such gotos), but the design decision here is to preserve syntactic compatibility with Java as much as possible. It is sufficient for the compiler to reject ill-formed switch cases. We except empty cases, which are effectively only multiple labels on a single case.
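As a rough illustration in plain Java (a sketch with hypothetical names, not taken from the T2 compiler), the first method below relies on fall-through, while the second obtains the same result in the style the restriction requires: the shared code is duplicated, every non-empty case ends in break, and empty cases serve only as extra labels.

```java
// Illustrative sketch only; all names here are hypothetical.
class SwitchFallThrough {
    // Legal Java: case 1 has no break, so it falls into case 2.
    static int withFallThrough(int k) {
        int n = 0;
        switch (k) {
            case 1: n = n + 1;
            case 2: n = n + 10; break;
            default: n = -1; break;
        }
        return n;
    }

    // Same results, written so that no non-empty case falls through.
    static int withExplicitBreaks(int k) {
        int n = 0;
        switch (k) {
            case 1: n = n + 1; n = n + 10; break; // duplicated code
            case 2: n = n + 10; break;
            case 3:                               // empty case: just another label
            case 4: n = 100; break;
            default: n = -1; break;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(withFallThrough(1));
        System.out.println(withExplicitBreaks(1));
    }
}
```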

We removed statement labels; break and continue apply only to the immediately enclosing structure (loop or switch), without the option of exiting multiple structures at once. The extra capability was judged not useful enough to justify the additional syntactic burden. A smaller language is easier to read and maintain.
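The paper does not prescribe how a multi-level exit should be rewritten. One possibility, sketched here in plain Java with hypothetical names, is to move the nested loops into their own function so that return leaves both loops at once.

```java
// Illustrative sketch only; all names here are hypothetical.
class NestedExit {
    // Java style: a labeled break leaves both loops at once.
    static int firstRowWithLabel(int[][] grid, int target) {
        int found = -1;
        search:
        for (int r = 0; r < grid.length; r = r + 1) {
            for (int c = 0; c < grid[r].length; c = c + 1) {
                if (grid[r][c] == target) {
                    found = r;
                    break search;
                }
            }
        }
        return found;
    }

    // Label-free style: return exits the whole search.
    static int firstRowWithReturn(int[][] grid, int target) {
        for (int r = 0; r < grid.length; r = r + 1) {
            for (int c = 0; c < grid[r].length; c = c + 1) {
                if (grid[r][c] == target) {
                    return r;
                }
            }
        }
        return -1;
    }
}
```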

In the spirit of a smaller language -- although that is not our primary goal, it contributes to robustness -- we also removed the operator+assignment combinations. If programmers properly eschew side effects in their expressions, then repeating the destination variable on the right side of the assignment has no effect on performance (because the compiler can do the necessary common-subexpression elimination automatically) and enhances readability. In the same spirit we also removed the prefix "++" and "--" operators, and restricted the postfix operators to their own statements. An extra assignment statement is cheap code, and programmers should not be writing code that depends on such increments or decrements occurring in the middle of expressions. It is apparently these mid-expression increments and decrements that are the primary purpose for the prefix and postfix incrementation operators, and it is precisely those mid-expression increments and decrements that yield surprising (and therefore error-prone) results [see Note 1]. Nevertheless, for those programmers too lazy to type the variable name twice, we left the postfix "++" and "--" operators in as shortcuts for the T2 spelled-out versions as single statements. This could easily be extended to all of the C/Java composite assignment operators at no cost to execution time, while preserving the inherent readability of the full spelling, but the current compiler does not support them.
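The kind of surprise meant here can be shown in a few lines of plain Java. This is a sketch with hypothetical names; the last method spells out the same computation as the first with the increment in its own statement, roughly as T2 would require.

```java
// Illustrative sketch only; all names here are hypothetical.
class IncrementSurprises {
    // The left operand yields 5, i then becomes 6, and the right operand yields 6.
    static int midExpression() {
        int i = 5;
        int j = i++ + i;
        return j;
    }

    // The old value 5 is assigned back over the incremented value,
    // so the increment is lost.
    static int selfAssign() {
        int i = 5;
        i = i++;
        return i;
    }

    // The same computation as midExpression, one step per statement.
    static int spelledOut() {
        int i = 5;
        int j = i;
        i = i + 1;
        j = j + i;
        return j;
    }
}
```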

Finally, we have tightened case sensitivity in identifiers. T2 identifiers are essentially not case sensitive, in that we disallow the declaration in the same scope of identifiers differing only by capitalization, but we also warn the programmer who changes the capitalization on an identifier. This prevents the C/Java-habituated programmer from declaring a new identifier that hides another with different capitalization in an outer scope, then attempting to use them both in the same context. Programs meeting this restriction are still perfectly good Java, with the additional advantage of better readability. [The current T2 compiler is supposed to enforce this restriction, but a TAG compiler bug prevents it from working properly. See Using the Turk/2 Compiler.]

All names must be declared before they are used. Mutually recursive functions in the same compilation unit can be declared by function prototypes. Variable declarations are required to be before the first executable code, except that class declarations can be either among the variable declarations at the front of the package, or embedded within the (global) function definitions of that package -- notably after the functions called from class methods are declared.

Additions

With a few exceptions, most of the additions to Java that went into T2 are in the nature of "syntactic sugar" which can be relatively painlessly converted back to pure Java in a mechanical manner.

The most important addition is the promotion of arrays to be honest types. Arrays in C are second-class citizens; Java made significant improvements to this, but they are still not honest types as evidenced by the option to put empty array brackets on each identifier in a variable declaration independently. In T2 all types, including arrays, can be named and used in a consistent manner. Array types are not and cannot be syntactically distinguished from other types by placing a part of the distinguishing syntax in some other place. This disallows the Java syntax,

int x, y[], z;

To get the same effect the T2 programmer must make a separate declaration:

int x, z;
int[] y;

Because T2 is intended for systems programming, we further enhance the array type to permit statically declared arrays in the obvious manner:

int[10] x, y;

Assignment to x as a whole must be from a compatible array value, namely an array initializer containing exactly ten integers, or else from another array of the same type, such as y. The compiler is then free to allocate x statically or in the local stack frame. This T2 line is semantically equivalent to the following pure Java code:

int[] x = new int[10]; int[] y = new int[10];

except that T2 disallows any subsequent explicit reallocation by new or assignment to an array of different length. Dynamic arrays as in Java are still available in T2, with run-time subscript bounds checking and the ability to resize them from time to time, along with the corresponding performance hit. Unlike C and Java, because T2 arrays are an honest data type, not a sugar-coated pointer, assigning a static array to another (similarly typed) static array copies the whole array (however I don't think the T2 compiler supports such assignments yet).
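The difference between the two assignment semantics can be sketched in plain Java. The names are hypothetical, and System.arraycopy merely stands in for whatever code a T2 compiler would generate for a whole-array assignment.

```java
// Illustrative sketch only; all names here are hypothetical.
class ArrayAssignment {
    // Java: assignment copies the reference, so x and y share one array.
    static int aliasAfterAssign() {
        int[] x = new int[10];
        int[] y = new int[10];
        x = y;
        y[0] = 42;
        return x[0]; // the change made through y is visible through x
    }

    // Emulating the copy semantics described for T2 static arrays:
    // the elements are copied, so x and y stay independent.
    static int copyAfterAssign() {
        int[] x = new int[10];
        int[] y = new int[10];
        System.arraycopy(y, 0, x, 0, y.length);
        y[0] = 42;
        return x[0]; // x is unaffected by the later change to y
    }
}
```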

Because T2 is strongly typed, we still require array subscripts to be range-checked, but a smart compiler can do that fairly inexpensively with static arrays (as above) and with a couple improvements to the language to enable the compiler more easily to infer in-bounds limits on subscripts without checking every access. One is the addition of a subrange type, which we require to be named:

type decimal = 0..9;

Then any value assigned to a variable of type decimal is range-checked upon assignment, and thereafter can be used without further checking as a subscript to access an array dimensioned int[10]. To convert this usage into pure Java involves nothing more than deleting the type statement and substituting int (or byte) for all occurrences of "decimal" in its scope. A slightly more ambitious use of subrange types permits the type name to be used as the dimension inside the brackets of an array type declaration; converting this back to pure Java resubstitutes there the range's upper bound +1. Note that T2 considers the built-in types byte and short to be appropriate subranges of integer instead of separate types. Thus adding two maximal bytes in T2 does not overflow until you try to assign it back into a byte subrange variable or parameter. Java has this same behavior without the explicit type safety, resulting in some surprises. The 2012 T2 compiler does not yet support subrange types.
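Since subranges are not yet implemented, the following plain Java is only a sketch of the intended behavior, with hypothetical names throughout. A checking function stands in for assignment to a variable of type decimal, and two small methods show the byte arithmetic mentioned above.

```java
// Illustrative sketch only; all names here are hypothetical.
class SubrangeSketch {
    // Stand-in for assignment to "type decimal = 0..9": check once, here.
    static int toDecimal(int value) {
        if (value < 0 || value > 9) {
            throw new IllegalArgumentException("value out of range 0..9: " + value);
        }
        return value;
    }

    // After the check, d is known to be a valid subscript for an int[10].
    // (Java still performs its own bounds check on the access; the point of
    // the T2 subrange is that the compiler could omit it.)
    static int lookup(int[] tenEntries, int rawIndex) {
        int d = toDecimal(rawIndex);
        return tenEntries[d];
    }

    // Two maximal bytes are added as int, so nothing overflows yet.
    static int byteSum() {
        byte a = 127;
        byte b = 127;
        return a + b;
    }

    // Narrowing back to byte wraps around silently in Java.
    static byte narrowedByteSum() {
        byte a = 127;
        byte b = 127;
        return (byte) (a + b);
    }
}
```

In T2 the narrowing in the last method would presumably be caught by the range check on assignment rather than wrapping.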

The char data type has been promoted to its own type, incompatible with integers. You must use explicit type cast functions (see below) to do arithmetic on characters or to convert numbers to characters.

Named enumerations are a particularly useful strong type that C never really had and Java originally didn't even attempt (an enum construct was added only in Java 5). We do this with the same type keyword:

type color = {red, orange, yellow, green, blue, violet};

The 2012 T2 compiler does not yet support enumeration types, but when it does, converting this to the weaker-typed pure Java is as simple as creating constant declarations for each identifier in the list, and replacing the type name with int as we also did for subranges:

final int red=0, orange=1, yellow=2, green=3, blue=4, violet=5;
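To show what "weaker-typed" means in practice, here is a small plain Java sketch built on those constants; the class and method names are hypothetical. Because a color is now only an int, arithmetic that makes no sense for colors compiles without complaint.

```java
// Illustrative sketch only; the class and method names are hypothetical.
class ColorConstants {
    // The translation described above: one int constant per identifier.
    static final int red = 0, orange = 1, yellow = 2, green = 3, blue = 4, violet = 5;

    static String nameOf(int color) {
        switch (color) {
            case red: return "red";
            case orange: return "orange";
            case yellow: return "yellow";
            case green: return "green";
            case blue: return "blue";
            case violet: return "violet";
            default: return "not a color";
        }
    }

    // Accepted by the compiler, although the result is not a color at all.
    static int nonsense() {
        return violet + 7;
    }
}
```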

The same new type keyword can be used to rename another type, or to give a name to an array type:

type bigArray = int[1000];

Record types already exist in Java; the keyword is merely spelled class. You can create methodless classes (that is, record types) in Java by declaring them final static class name, but that does not permit them to be subclassed for extensions. I extended the declaration syntax to allow class name static, which may be subclassed without permitting methods at any level. The record thus declared need not allocate space for a method dispatch table, but it also cannot be dynamically down-cast to a subclass, which cannot be known at runtime. This is one of those efficiency things that the T2 programmer can do to inform the compiler to generate faster, smaller code, but should not need to.

Operator precedence in T2 has been modified slightly from the C/Java convention to be more natural in view of the stronger type checking. The T2 bitwise logical operators cannot be used on boolean operands, so their precedence was exchanged with the relational operators. Thus a>b&c means a>(b&c) where b and c must be integer, while a>b&&c means (a>b)&&c where c must be boolean. In pure Java the relational operators bind more tightly than the bitwise operators, and the bitwise operators can be applied to boolean operands with results equivalent to the corresponding boolean operators, so a>b&c is parsed as (a>b)&c and produces a compile-time error in Java if b and c are integer. Conversion between correct pure Java and a correct T2 program using unparenthesized expressions like this must insert parentheses.
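A small plain Java sketch, with hypothetical names, of the two readings of a>b&c: the first method writes the T2 meaning with the parentheses Java requires, and the second shows the unparenthesized Java meaning, which compiles only when c is boolean.

```java
// Illustrative sketch only; all names here are hypothetical.
class PrecedenceSketch {
    // The T2 reading of a>b&c; Java needs the parentheses.
    static boolean t2Reading(int a, int b, int c) {
        return a > (b & c);
    }

    // The Java reading of a>b&c, which groups as (a>b)&c.
    static boolean javaReading(int a, int b, boolean c) {
        return a > b & c;
    }
}
```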

Similarly, the T2 "+" operator cannot be used for string concatenation as in Java because a+3 gives surprising results depending on whether a is string or integer; the new operator "#" is used in T2 for string concatenation. The "#" operator has a precedence just lower than the bitwise operators (which operate on integers only), so that automatic promotion to string does not come between them and the arithmetic operators. As with the bitwise operators, conversion to pure Java may need to insert parentheses when substituting "+" for "#".
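The surprise with "+" is easy to reproduce in plain Java. In this sketch (hypothetical names), the same three operands give different strings depending on where the string sits, because "+" groups left to right and switches from addition to concatenation at the first string operand.

```java
// Illustrative sketch only; all names here are hypothetical.
class ConcatSurprises {
    // 1 + 2 is added first, then concatenated.
    static String numbersFirst() {
        return 1 + 2 + "x";
    }

    // "x" + 1 is concatenated first, then 2 is appended.
    static String stringFirst() {
        return "x" + 1 + 2;
    }

    // Parentheses restore the arithmetic.
    static String parenthesized() {
        return "x" + (1 + 2);
    }
}
```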

The other feature that facilitates strong array bounds checking at minimal cost is a properly structured for loop. Unlike C and Java, where the for loop is so flexible as to discourage any compiler optimization, we permit only one control variable, which must be assigned an initial value, checked for termination, and updated, and which the programmer is further forbidden to alter within the loop. Thus the T2 compiler can easily prove (termination and) the range of the control variable, and so eliminate unnecessary range checking code within the body of the loop; this optimization is only partly successful in the 2012 compiler. To draw attention to the restriction, the T2 syntax is somewhat different from Java:

for i=start,stop,step do doSomething();

This form is just syntactic sugar for the pure Java form, which we now disallow in T2:

for (int i=start; i<=stop; i=i+step) doSomething();

Like its Pascal inspiration, the T2 for-loop works as easily for decremented loops by setting the (optional) step value negative, which also inverts the termination test. The new syntax also eliminates the common programming error of an inappropriate termination condition (from confusion over whether to use <= or just <). Programmers familiar only with the C/Java for loop notation may initially find this Pascal/Ada syntax obscure, but the reverse is also true; we believe the proposed notational difference enhances the recognition of different semantics and does not significantly interfere with usability; in fact the reduced redundancy of the T2 syntax is easier to write and somewhat easier to read. We have for the most part eschewed changes to the language that entail the proliferation of new reserved words.
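The loop semantics described here can be sketched in plain Java as a helper that collects the values the control variable takes. The names are hypothetical, and rejecting a zero step is an assumption of this sketch rather than something the paper specifies.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; all names here are hypothetical.
class T2ForSketch {
    // Values visited by "for i=start,stop,step do ...".
    static List<Integer> visited(int start, int stop, int step) {
        if (step == 0) {
            // Assumption: a zero step is treated as an error.
            throw new IllegalArgumentException("step must be nonzero");
        }
        List<Integer> result = new ArrayList<>();
        if (step > 0) {
            for (int i = start; i <= stop; i = i + step) {
                result.add(i);
            }
        } else {
            // A negative step inverts the termination test.
            for (int i = start; i >= stop; i = i + step) {
                result.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(visited(1, 5, 2));
        System.out.println(visited(5, 1, -2));
    }
}
```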

Object-Oriented Programming Systems (OOPS) are fine and dandy for writing elegant software that runs slowly, but performance is a major concern in system software, and T2 is aimed squarely at that market. Like Java, the T2 compiler is expected to statically bind methods to final classes (but the 2012 compiler does not yet do so), so that static method calls in these cases perform no differently from ordinary function calls in a non-OOPS language. We further offer a syntactic sugaring that fields and methods may be declared outside a class scope as a reminder that this is system software, not OOPS code nor appropriate to the OOPS paradigm; they might be collected together into an implicit "final static Global__Stuff" class when converting T2 to pure Java.

Java offers two different syntactic forms for calling a function/method. One is used for static methods, where all parameters are explicit in the parameter list the way we have always called functions in pre-OOPS languages; the other is used for normal class methods, where there is always one parameter hidden in the call, sometimes explicitly in the "objectName." prefix and sometimes only implicitly the current (this) object. This becomes somewhat confusing when trying to call utility functions such as the String class methods. There is no inherent primary string object when comparing two strings for equality or ordering, yet the programmer is forced to think in terms of sending a "compareTo" message to only one of them; similarly, sending a "concat" message to a string object does not modify nor otherwise distinguish that string object, it only treats both operands as parameters to form a new string which is returned. Notice that the Math library routines are more sensibly declared static and are passed all parameters explicitly. In T2 we favor library routines using this latter, more orthogonal function-calling protocol, and to that end we also provide implicit wrappers for the more useful String class methods to make them easier to read. The T2 compiler does not actually generate nor call extra wrapper code, but both kinds of invocations compile to the same machine code call -- except that the 2012 compiler does not support String as a class at all; it is just a native string type. Converting a T2 program to pure Java can either supply these wrappers explicitly in a library class, or else restructure (and rename) the method call to match the standard String class methods. Note that T2 is not different from Java in available procedure calling syntax, only that the additional T2 library routines favor the more readable non-OOPS form where that makes sense.
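When converting to pure Java, such a library class of explicit wrappers might look roughly like the following. This is only a sketch: the class name and function names are hypothetical, since the T2 library routines are not listed in this paper, and only the standard String methods they delegate to are real.

```java
// Illustrative sketch only; the wrapper names are hypothetical.
class StringFunctions {
    // Both strings are ordinary parameters; neither is "the" object.
    static int compare(String a, String b) {
        return a.compareTo(b);
    }

    static boolean equal(String a, String b) {
        return a.equals(b);
    }

    // Forms a new string; neither operand is modified.
    static String concat(String a, String b) {
        return a.concat(b);
    }
}
```

A call such as compare(first, second) then reads the same way as a call to a Math routine, with all operands visible in the parameter list.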

T2 removes the Java type-casting syntax inherited from C, and replaces it with explicit system functions. The advantage of this is that the dangerous casts -- and we do need them -- can now be flagged by forcing them to be imported from package Dangerous. Most important among these casts are address conversions, needed for pointer arithmetic, which must be done to write a memory allocator (and other such system functions) in T2. The compiler is of course fully aware of the library functions exported by package Dangerous, so they can be coded inline -- indeed the inline code for a true typecast (as distinguished from the conversions improperly called typecasts in C and Java) is no code at all! Obviously system code depending on such typecasts cannot function correctly as pure Java, but the alternative conversion to C++ remains viable. The reasonably safe casts and coercions (for example, int to float) can be pervasive functions and/or remain implicit as in Java.
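The distinction between a conversion and a true typecast can be illustrated in plain Java, which happens to expose a bit-for-bit reinterpretation of a float through the standard method Float.floatToRawIntBits. This sketch is only an analogy with hypothetical class and method names; it does not show the functions of package Dangerous, whose names are not given in this paper.

```java
// Illustrative sketch only; the class and method names are hypothetical.
class CastVersusConversion {
    // A conversion: the value is recomputed in the new representation.
    static int converted(float f) {
        return (int) f;
    }

    // A true typecast in the sense used above: the same 32 bits,
    // merely regarded as an int.
    static int reinterpreted(float f) {
        return Float.floatToRawIntBits(f);
    }

    public static void main(String[] args) {
        System.out.println(converted(1.0f));
        System.out.println(Integer.toHexString(reinterpreted(1.0f)));
    }
}
```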

For T2 to be as easy to use as Basic (a reasonable goal), strings must be well-ordered and implicitly unique (though perhaps not really), so "str1<str2" actually compares the characters as expected. We also add chunk expression functions that work more or less like HyperCard. All of these must be translated to appropriate function or method calls for C++ or Java conversion.

A common programming error in C and Java is mismatched braces. Most editors offer a brace-matching tool to help find these, but my editor currently lacks such a convenience. Instead the compiler optionally accepts and checks a name on each closing brace which identifies what kind of block it is:

void myFunct() {
  while (true) {
    if (sometest) {DoSomething();}~if
    else {break;}~else}~while}~myFunct

Class and function blocks are named respectively by the class or function name; blocks controlled by keywords if, while, repeat, for, switch, try, catch, and finally are named by those keywords. If the block name is incorrect, the compiler stops with an error immediately, which usually identifies the mismatch much more precisely than C and Java compilers are able to.

T2 does not support method overloading, that is, providing different parameter signatures for the same method name. The most valuable use for overloading is to provide initial values for class variables when an object is constructed; the 2012 T2 compiler offers instead a "poor man's constructor" which looks like an overloaded class constructor, but only copies the given arguments to the class variables (in order, as provided) before invoking the actual class constructor, if any.
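In plain Java terms the effect might be pictured as follows. This is a sketch under assumptions: the class, its fields, and the construct() method are hypothetical, and construct() merely stands in for the "actual class constructor" so that it can run after the arguments have been copied.

```java
// Illustrative sketch only; all names here are hypothetical.
class RectSketch {
    int width;
    int height;
    int area;

    // Stand-in for the actual class constructor in the T2 description.
    private void construct() {
        area = width * height;
    }

    // Emulates the "poor man's constructor": copy the arguments to the
    // class variables in declaration order, then run the actual constructor.
    RectSketch(int width, int height) {
        this.width = width;
        this.height = height;
        construct();
    }
}
```

Because the copying happens first, the constructor body already sees the supplied values, as in new RectSketch(3, 4).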

The compilation unit in T2 is a package, which usually encompasses multiple class, variable, and function declarations, and is terminated by a single dot "." at the end. If the keyword "package" is omitted, then there must be a "main()" function somewhere in the package. Otherwise the keyword package names the package, which is compiled and added to the library, from where it can be imported into subsequent compilations, thus:

package fust;
// some declarations...
void DoSomething() {}
. // end of fust
package more;
import fust;
void SomethingElse() {DoSomething();}
. // end of more

Multiple packages can be in a single file, with a single pseudo-package declaration line at the front, identifying in quotes which package is to be compiled:

package "more"

Empty quotes cause the compiler to compile the whole file after that line (which is usually the main program, up to its own terminal dot). The 2012 compiler is smart enough to read the imports on the main program (with certain restrictions, see details in Using the Turk/2 Compiler) and compile all those packages (in that order) first, before compiling the main program. It's not exactly a makefile, but it works similarly.

New T2 Library Routines

(TBA)

Formal T2 Syntax

Example T2 Code

(TBA)

Notes

1. Out of a class of 14 senior computer science majors, three (more than 20%) were unable to use the ++ operator correctly and/or recognize correct usage in a test environment. Two of them attempted to pass an incremented parameter value (i+1) in a recursive method call by writing i++; the other criticized as an error the syntactically correct for(i=0;i<n;++i).

Rev. 2012 April 27