REAL signature
signature REAL
structure Real :> REAL
where type real = real
structure LargeReal :> REAL
structure Real<N> :> REAL (* OPTIONAL *)
The REAL signature specifies structures that implement floating-point numbers. The semantics of floating-point numbers should follow the IEEE standard 754-1985 [CITE] and the ANSI/IEEE standard 854-1987 [CITE]. In addition, implementations of the REAL signature are required to use non-trapping semantics. Additional aspects of the design of the REAL and MATH signatures were guided by the Floating-Point C Extensions [CITE] developed by the X3J11 ANSI committee and the lecture notes [CITE] by W. Kahan on the IEEE standard 754.
Although there can be many representations for NaN values, the Library models them as a single value and currently provides no explicit way to distinguish among them, ignoring the sign bit. Thus, in the descriptions below and in the Math structure, we just refer to the NaN value.
type real
structure Math : MATH
where type real = real
val radix : int
val precision : int
val maxFinite : real
val minPos : real
val minNormalPos : real
val posInf : real
val negInf : real
val + : real * real -> real
val - : real * real -> real
val * : real * real -> real
val / : real * real -> real
val rem : real * real -> real
val *+ : real * real * real -> real
val *- : real * real * real -> real
val ~ : real -> real
val abs : real -> real
val min : real * real -> real
val max : real * real -> real
val sign : real -> int
val signBit : real -> bool
val sameSign : real * real -> bool
val copySign : real * real -> real
val compare : real * real -> order
val compareReal : real * real -> IEEEReal.real_order
val < : real * real -> bool
val <= : real * real -> bool
val > : real * real -> bool
val >= : real * real -> bool
val == : real * real -> bool
val != : real * real -> bool
val ?= : real * real -> bool
val unordered : real * real -> bool
val isFinite : real -> bool
val isNan : real -> bool
val isNormal : real -> bool
val class : real -> IEEEReal.float_class
val toManExp : real -> {man : real, exp : int}
val fromManExp : {man : real, exp : int} -> real
val split : real -> {whole : real, frac : real}
val realMod : real -> real
val nextAfter : real * real -> real
val checkFloat : real -> real
val realFloor : real -> real
val realCeil : real -> real
val realTrunc : real -> real
val realRound : real -> real
val floor : real -> int
val ceil : real -> int
val trunc : real -> int
val round : real -> int
val toInt : IEEEReal.rounding_mode -> real -> int
val toLargeInt : IEEEReal.rounding_mode
-> real -> LargeInt.int
val fromInt : int -> real
val fromLargeInt : LargeInt.int -> real
val toLarge : real -> LargeReal.real
val fromLarge : IEEEReal.rounding_mode
-> LargeReal.real -> real
val fmt : StringCvt.realfmt -> real -> string
val toString : real -> string
val scan : (char, 'a) StringCvt.reader
-> (real, 'a) StringCvt.reader
val fromString : string -> real option
val toDecimal : real -> IEEEReal.decimal_approx
val fromDecimal : IEEEReal.decimal_approx -> real option
type real
real is not an equality type.
val radix : int
val precision : int
The number of digits, each between 0 and radix-1, in the mantissa. Note that the precision includes the implicit (or hidden) bit used in the IEEE representation (e.g., the value of Real64.precision is 53).
val maxFinite : real
val minPos : real
val minNormalPos : real
val posInf : real
val negInf : real
r1 + r2
r1 - r2
r1 * r2
r1 / r2
These denote the sum, difference, product, and quotient of r1 and r2. For the quotient, 0 / 0 = NaN and +-infinity / +-infinity = NaN. Dividing a finite, non-zero number by a zero, or an infinity by a finite number, produces an infinity with the correct sign. (Note that zeros are signed.) A finite number divided by an infinity is 0 with the correct sign.
rem (x, y)
returns the remainder x - n*y, where n = trunc (x / y). The result has the same sign as x and has absolute value less than the absolute value of y.
If x is an infinity or y is 0, rem returns NaN. If y is an infinity, rem returns x.
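The sign rule and the special cases for rem can be checked with a short sketch. The binding names below are illustrative only and are not part of the Library.

```sml
(* The result of rem takes the sign of its first argument. *)
val r1 = Real.rem (5.5, 2.0)     (* n = 2,  result 1.5  *)
val r2 = Real.rem (~5.5, 2.0)    (* n = ~2, result ~1.5 *)
val r3 = Real.rem (5.5, ~2.0)    (* n = ~2, result 1.5  *)

(* Special cases: an infinite x or a zero y gives NaN;
   an infinite y returns x unchanged. *)
val nanFromInf  = Real.isNan (Real.rem (Real.posInf, 2.0))
val nanFromZero = Real.isNan (Real.rem (5.5, 0.0))
val xUnchanged  = Real.== (Real.rem (5.5, Real.posInf), 5.5)
```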
*+ (a, b, c)
*- (a, b, c)
These return a*b + c and a*b - c, respectively. Their behaviors on infinities follow from the behaviors derived from addition, subtraction, and multiplication.
The precise semantics of these operations depend on the language implementation and the underlying hardware. Specifically, certain architectures provide these operations as a single instruction, possibly using a single rounding operation. Thus, the use of these operations may be faster than performing the individual arithmetic operations sequentially, but may also cause different rounding behavior.
~ r
returns the negation of r. In particular, ~ (+-infinity) = -+infinity.
abs r
returns the absolute value of r. In particular, abs(+-0.0) = +0.0, abs(+-infinity) = +infinity, and abs(+-NaN) = +NaN.
val min : real * real -> real
val max : real * real -> real
sign r
returns ~1 if r is negative, 0 if r is zero, or 1 if r is positive. It raises Domain on NaN.
signBit r
returns true if and only if the sign of r (infinities, zeros, and NaN included) is negative.
sameSign (r1, r2)
returns true if and only if signBit r1 equals signBit r2.
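A brief sketch of how signed zeros interact with these functions may help. It assumes an implementation with IEEE signed zeros; the binding names are illustrative.

```sml
(* ~0.0 compares equal to 0.0 but has its sign bit set. *)
val negZero = ~0.0
val zerosEqual = Real.== (0.0, negZero)               (* true: signs on zeros are ignored *)
val negZeroBit = Real.signBit negZero                 (* true *)
val posZeroBit = Real.signBit 0.0                     (* false *)
val differSign = not (Real.sameSign (0.0, negZero))   (* true *)

(* The sign of a zero becomes observable through division. *)
val divGivesNegInf = Real.== (1.0 / negZero, Real.negInf)   (* true *)
```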
copySign (x, y)
val compare : real * real -> order
val compareReal : real * real -> IEEEReal.real_order
compare returns LESS, EQUAL, or GREATER according to whether its first argument is less than, equal to, or greater than the second. It raises IEEEReal.Unordered on unordered arguments.
The function compareReal behaves similarly except that the values it returns have the extended type IEEEReal.real_order and it returns IEEEReal.UNORDERED on unordered arguments.
Implementation note:
Implementations should try to optimize use of compare, since it is necessary for catching NaNs.
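The difference between the two comparison functions can be sketched as follows. The helper names describe and compareOpt are hypothetical and are introduced only for illustration.

```sml
(* compareReal never raises, so the unordered case is an ordinary
   pattern-match branch. *)
fun describe (x : real, y : real) : string =
  case Real.compareReal (x, y) of
      IEEEReal.LESS      => "less"
    | IEEEReal.EQUAL     => "equal"
    | IEEEReal.GREATER   => "greater"
    | IEEEReal.UNORDERED => "unordered"

(* compare raises IEEEReal.Unordered instead; wrap it to get an option. *)
fun compareOpt (x : real, y : real) : order option =
  SOME (Real.compare (x, y)) handle IEEEReal.Unordered => NONE

val nan = 0.0 / 0.0
val d1 = describe (1.0, 2.0)      (* "less" *)
val d2 = describe (nan, 2.0)      (* "unordered" *)
val c1 = compareOpt (2.0, 1.0)    (* SOME GREATER *)
val c2 = compareOpt (nan, nan)    (* NONE *)
```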
val < : real * real -> bool
val <= : real * real -> bool
val > : real * real -> bool
val >= : real * real -> bool
These return true if the corresponding relation holds between the two reals.
Note that these operators return false on unordered arguments, i.e., if either argument is NaN, so that the usual reversal of comparison under negation does not hold, e.g., a < b is not the same as not (a >= b).
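The failure of the usual negation rule can be seen in a minimal sketch, where nan is an illustrative binding produced by the non-trapping division 0.0 / 0.0:

```sml
val nan = 0.0 / 0.0

val lt    = nan < 1.0          (* false: the arguments are unordered *)
val ge    = nan >= 1.0         (* false: the arguments are unordered *)
val notGe = not (nan >= 1.0)   (* true, even though nan < 1.0 is false *)
```

Code that must treat NaN explicitly can test Real.isNan or Real.unordered before applying a relational operator.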
== (x, y)
!= (x, y)
The first returns true if and only if neither y nor x is NaN, and y and x are equal, ignoring signs on zeros. This is equivalent to the IEEE = operator.
The second function != is equivalent to not o op == and the IEEE ?<> operator.
val ?= : real * real -> bool
returns true if either argument is NaN or if the arguments are bitwise equal, ignoring signs on zeros. It is equivalent to the IEEE ?= operator.
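Because real is not an equality type, these functions are the way to test equality. The following sketch contrasts the three on NaN and on signed zeros; the binding names are illustrative.

```sml
val nan = 0.0 / 0.0

val eqNan  = Real.== (nan, nan)    (* false: NaN is never equal to anything *)
val neNan  = Real.!= (nan, nan)    (* true:  the negation of ==              *)
val ueNan  = Real.?= (nan, nan)    (* true:  unordered counts as "maybe equal" *)
val eqZero = Real.== (0.0, ~0.0)   (* true:  signs on zeros are ignored      *)
val ueOrd  = Real.?= (1.0, 2.0)    (* false: ordered and different           *)
```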
unordered (x, y)
returns true if x and y are unordered, i.e., at least one of x and y is NaN.
isFinite x
returns true if x is neither NaN nor an infinity.
isNan x
returns true if x is NaN.
isNormal x
returns true if x is normal, i.e., neither zero, subnormal, infinite nor NaN.
class x
returns the IEEEReal.float_class to which x belongs.
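The classification predicates can be combined into a small summary function. This is only a sketch: the function name kind and its result strings are hypothetical, and the subnormal case assumes an implementation in which minPos is smaller than minNormalPos.

```sml
(* Summarize a real using only the predicates isNan, isFinite and isNormal. *)
fun kind (x : real) : string =
  if Real.isNan x then "nan"
  else if not (Real.isFinite x) then "infinite"
  else if Real.isNormal x then "normal"
  else if Real.== (x, 0.0) then "zero"
  else "subnormal"

val k1 = kind 1.0                 (* "normal" *)
val k2 = kind ~0.0                (* "zero" *)
val k3 = kind Real.negInf         (* "infinite" *)
val k4 = kind (0.0 / 0.0)         (* "nan" *)
val k5 = kind Real.minNormalPos   (* "normal" *)
val k6 = kind Real.minPos         (* "subnormal" when subnormals are supported *)
```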
toManExp r
returns {man, exp}, where man and exp are the mantissa and exponent of r, respectively. Specifically, we have the relation
r = man * radix^exp
where 1.0 <= man * radix < radix. This function is comparable to frexp in the C library.
If r is +-0, man is +-0 and exp is +0. If r is +-infinity, man is +-infinity and exp is unspecified. If r is NaN, man is NaN and exp is unspecified.
fromManExp {man, exp}
returns man * radix^exp. This function is comparable to ldexp in the C library. Note that, even if man is a non-zero, finite real value, the result of fromManExp can be zero or infinity because of underflows and overflows.
If man is +-0, the result is +-0. If man is +-infinity, the result is +-infinity. If man is NaN, the result is NaN.
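A round trip through toManExp and fromManExp can be sketched as follows. The concrete values in the comments assume radix = 2, which the signature does not require.

```sml
(* Decompose 8.0; with radix 2 this is 0.5 * 2^4. *)
val {man, exp} = Real.toManExp 8.0

(* Recombining the parts gives back the original value. *)
val back = Real.fromManExp {man = man, exp = exp}

(* fromManExp also scales by a power of the radix: 3.0 * 2^2 when radix = 2. *)
val scaled = Real.fromManExp {man = 3.0, exp = 2}
```

When Real.radix is 2, back is 8.0 and scaled is 12.0.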
split r
realMod r
split returns {whole, frac}, where frac and whole are the fractional and integral parts of r, respectively. Specifically, whole is integral, |frac| < 1.0, whole and frac have the same sign as r, and r = whole + frac. This function is comparable to modf in the C library.
If r is +-infinity, whole is +-infinity and frac is +-0. If r is NaN, both whole and frac are NaN.
realMod is equivalent to #frac o split.
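For a negative argument, both parts carry the sign of the argument, as this short sketch shows (binding names are illustrative):

```sml
val {whole, frac} = Real.split ~3.75
(* whole is ~3.0 and frac is ~0.75. *)

val sumsBack = Real.== (whole + frac, ~3.75)   (* true *)
val fracOnly = Real.realMod 3.75               (* 0.75 *)
```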
nextAfter (r, t)
returns the next representable real after r in the direction of t. Thus, if t is less than r, nextAfter returns the largest representable floating-point number less than r. If r = t then it returns r. If either argument is NaN, this returns NaN. If r is +-infinity, it returns +-infinity.
checkFloat x
raises Overflow if x is an infinity, and raises Div if x is NaN. Otherwise, it returns its argument.
This can be used to synthesize trapping arithmetic from the non-trapping operations given here. Note, however, that infinities can be converted to NaNs by some operations, so that if accurate exceptions are required, checks must be done after each operation.
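One way to synthesize a trapping operation is to wrap each arithmetic step in checkFloat. The sketch below does this for division; trapDiv and tryDiv are hypothetical helper names.

```sml
(* Division that raises instead of returning an infinity or NaN. *)
fun trapDiv (a : real, b : real) : real = Real.checkFloat (a / b)

(* Report the outcome of a trapping division as a string. *)
fun tryDiv (a, b) =
  Real.toString (trapDiv (a, b))
  handle Overflow => "Overflow"
       | Div      => "Div"

val ok  = tryDiv (1.0, 4.0)   (* "0.25" *)
val ovf = tryDiv (1.0, 0.0)   (* "Overflow": 1.0 / 0.0 is +infinity *)
val dv  = tryDiv (0.0, 0.0)   (* "Div": 0.0 / 0.0 is NaN *)
```

The check is applied immediately after the division, so an infinity cannot be turned into a NaN by a later operation before it is detected.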
realFloor r
realCeil r
realTrunc r
realRound r
realFloor produces floor(r), the largest integer not larger than r. realCeil produces ceil(r), the smallest integer not less than r. realTrunc rounds r towards zero, and realRound rounds to the integer-valued real value that is nearest to r. If r is NaN or an infinity, these functions return r.
floor r
ceil r
trunc r
round r
floor produces floor(r), the largest int not larger than r. ceil produces ceil(r), the smallest int not less than r. trunc rounds r towards zero. round yields the integer nearest to r. In the case of a tie, it rounds to the nearest even integer. They raise Overflow if the resulting value cannot be represented as an int, for example, on infinity. They raise Domain on NaN arguments.
These are respectively equivalent to:
toInt IEEEReal.TO_NEGINF r
toInt IEEEReal.TO_POSINF r
toInt IEEEReal.TO_ZERO r
toInt IEEEReal.TO_NEAREST r
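The four conversions differ most visibly on negative values and on ties. A brief sketch, with illustrative binding names:

```sml
val f  = Real.floor ~2.5   (* ~3 *)
val c  = Real.ceil ~2.5    (* ~2 *)
val t  = Real.trunc ~2.5   (* ~2 *)
val r1 = Real.round 2.5    (* 2: a tie goes to the even integer *)
val r2 = Real.round 3.5    (* 4 *)

(* round agrees with toInt under the TO_NEAREST rounding mode. *)
val viaToInt = Real.toInt IEEEReal.TO_NEAREST 2.5   (* 2 *)

(* NaN raises Domain; an infinity raises Overflow. *)
val nanRaises = (Real.floor (0.0 / 0.0); false) handle Domain => true
val infRaises = (Real.floor Real.posInf; false) handle Overflow => true
```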
toInt mode x
toLargeInt mode x
These convert x to an integer using the rounding mode given by mode. They raise Overflow if the result is not representable, in particular, if x is an infinity. They raise Domain if the input real is NaN.
fromInt i
fromLargeInt i
These convert the integer i to a real value. If the absolute value of i is larger than maxFinite, then the appropriate infinity is returned. If i cannot be exactly represented as a real value, then the current rounding mode is used to determine the resulting value. The top-level function real is an alias for Real.fromInt.
toLarge r
fromLarge r
These convert between values of type real and type LargeReal.real. If r is too small or too large to be represented as a real, fromLarge will convert it to a zero or an infinity.
fmt spec r
toString r
These convert real values into strings. The conversion performed by fmt is parameterized by spec, which has the following forms and interpretations.
SCI arg
Scientific notation:
[~]?[0-9].[0-9]+?E[0-9]+
where there is always one digit before the decimal point, nonzero if the number is nonzero. arg specifies the number of digits to appear after the decimal point, with 6 the default if arg is NONE. If arg is SOME(0), no fractional digits and no decimal point are printed.
FIX arg
Fixed-point notation:
[~]?[0-9]+.[0-9]+?
arg specifies the number of digits to appear after the decimal point, with 6 the default if arg is NONE. If arg is SOME(0), no fractional digits and no decimal point are printed.
GEN arg
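Adaptive notation: the notation used is either scientific or fixed-point, depending on the value converted. arg specifies the maximum number of significant digits used, with 12 the default if arg is NONE.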
EXACT
Exact decimal notation: refer to IEEEReal.toString for a complete description of this format.
In all cases, positive and negative infinities are converted to "inf" and "~inf", respectively, and NaN values are converted to the string "nan".
Refer to StringCvt.realfmt for more details concerning these formats, especially the adaptive format GEN.
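fmt raises Size if spec is an invalid precision, i.e., if spec is SCI (SOME i) with i < 0, FIX (SOME i) with i < 0, or GEN (SOME i) with i < 1. The exception should be raised when fmt spec is evaluated.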
The fmt function allows the user precise control as to the form of the resulting string. Note, therefore, that it is possible for fmt to produce a result that is not a valid SML string representation of a real value.
The value returned by toString is equivalent to:
(fmt (StringCvt.GEN NONE) r)
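A few formatting calls, sketched with illustrative binding names. The values were chosen so that the expected strings do not depend on how an implementation pads exponents.

```sml
val fix2 = Real.fmt (StringCvt.FIX (SOME 2)) 3.14159   (* "3.14" *)
val neg  = Real.fmt (StringCvt.FIX (SOME 1)) ~1.5      (* "~1.5": SML writes negation as ~ *)
val gen  = Real.toString 0.25                          (* "0.25" *)
val inf  = Real.toString Real.posInf                   (* "inf" *)

(* A negative precision is invalid and raises Size. *)
val badPrecision =
  (Real.fmt (StringCvt.FIX (SOME ~1)) 1.0; false) handle Size => true
```

The string "~1.5" is a valid SML real literal, while "inf" is not, which is one case where the result of fmt cannot be read back as SML source.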
scan getc strm
fromString s
These scan a real value from a character source. The first version reads from strm using reader getc, ignoring initial whitespace. It returns SOME(r,rest) if successful, where r is the scanned real value and rest is the unused portion of the character stream strm. Values of too large a magnitude are represented as infinities; values of too small a magnitude are represented as zeros.
The second version returns SOME(r) if a real value can be scanned from a prefix of s, ignoring any initial whitespace; otherwise, it returns NONE. This function is equivalent to StringCvt.scanString scan.
The functions accept real numbers with the following format:
[+~-]?([0-9]+.[0-9]+? | .[0-9]+)(e | E)[+~-]?[0-9]+?
It also accepts the following string representations of non-finite values:
[+~-]?(inf | infinity | nan)
where the alphabetic characters are case-insensitive.
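Prefix scanning can be illustrated with a short sketch. Because real is not an equality type, the hypothetical helper sameOpt compares option results with Real.==.

```sml
(* Compare two real options without using polymorphic equality. *)
fun sameOpt (SOME a, SOME b) = Real.== (a, b)
  | sameOpt (NONE, NONE)     = true
  | sameOpt _                = false

val a = Real.fromString "  3.5abc"   (* SOME 3.5: whitespace skipped, "abc" unused *)
val b = Real.fromString "~1.25E2"    (* SOME ~125.0 *)
val c = Real.fromString "abc"        (* NONE: no real can be scanned *)

(* scanString drives scan over a string. *)
val d = StringCvt.scanString Real.scan "2.5"   (* SOME 2.5 *)
```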
toDecimal r
fromDecimal d
These convert between real values and decimal approximations. Decimal approximations are to be converted using the IEEEReal.TO_NEAREST rounding mode. toDecimal should produce only as many digits as are necessary for fromDecimal to convert back to the same number. In particular, for any normal or subnormal real value r, we have the bit-wise equality:
fromDecimal (toDecimal r) = r.
For toDecimal, when r is neither normal nor subnormal, the exp field is set to 0 and the digits field is the empty list. In all cases, the sign and class fields capture the sign and class of r.
For fromDecimal, if class is ZERO or INF, the resulting real is the appropriate signed zero or infinity. If class is NAN, a signed NaN is generated. If class is NORMAL or SUBNORMAL, the sign, digits and exp fields are used to produce a real number whose value is
s * 0.d(1)d(2)...d(n) * 10^exp
where digits = [d(1), d(2), ..., d(n)] and where s is -1 if sign is true and 1 otherwise. Note that the conversion itself should ignore the class field, so that the resulting value might have class NORMAL, SUBNORMAL, ZERO, or INF. For example, if digits is empty or a list of all 0's, the result should be a signed zero. More generally, very large or small magnitudes are converted to infinities or zeros.
If the argument to fromDecimal does not have a valid format, i.e., if the digits field contains integers outside the range [0,9], it returns NONE.
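The decimal record can be built by hand, as in the following sketch. It assumes an implementation whose IEEEReal.decimal_approx uses the field names class, sign, digits and exp and whose fromDecimal returns an option, as in this signature; older implementations may differ. The binding names are illustrative.

```sml
(* digits [1,2,5] with exp 1 denotes 0.125 * 10^1 = 1.25 *)
val d = {class = IEEEReal.NORMAL, sign = false, digits = [1, 2, 5], exp = 1}
val r = Real.fromDecimal d

(* A digit outside the range [0,9] makes the argument invalid. *)
val bad = Real.fromDecimal {class = IEEEReal.NORMAL, sign = false,
                            digits = [1, 10], exp = 1}

(* Round trip through the decimal approximation. *)
val back = Real.fromDecimal (Real.toDecimal 0.1)

(* real is not an equality type, so results are inspected by pattern matching. *)
val rIsOneAndQuarter = (case r of SOME x => Real.== (x, 1.25) | NONE => false)
val badIsNone        = (case bad of NONE => true | SOME _ => false)
val roundTrips       = (case back of SOME x => Real.== (x, 0.1) | NONE => false)
```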
Implementation note:
Algorithms for accurately and efficiently converting between binary and decimal real representations are readily available, e.g., see the technical report by Gay [CITE].
See also: IEEEReal, MATH, StringCvt
If LargeReal is not the same as Real, then there must be a structure Real<N> equal to LargeReal.
The sign of a zero is ignored in all comparisons.
Unless specified otherwise, any operation involving NaN will return NaN.
Note that, if x is real, ~x is equivalent to ~(x), that is, it is identical to x but with its sign bit flipped. In particular, the literal ~0.0 is just 0.0 with its sign bit set. On the other hand, this might not be the same as 0.0-0.0, in which rounding modes come into play.
Except for the *+ and *- functions, arithmetic should be done in the exact precision specified by the precision value. In particular, arithmetic must not be done in some extended precision and then rounded.
The relation between the comparison predicates defined here and those defined by IEEE, ANSI C, and FORTRAN is specified in the following table.
| SML | IEEE | C | FORTRAN |
|---|---|---|---|
| == | = | == | .EQ. |
| != | ?<> | != | .NE. |
| < | < | < | .LT. |
| <= | <= | <= | .LE. |
| > | > | > | .GT. |
| >= | >= | >= | .GE. |
| ?= | ?= | !islessgreater | .UE. |
| not o ?= | <> | islessgreater | .LG. |
| unordered | ? | isunordered | unordered |
| not o unordered | <=> | !isunordered | .LEG. |
| not o op < | ?>= | ! < | .UGE. |
| not o op <= | ?> | ! <= | .UG. |
| not o op > | ?<= | ! > | .ULE. |
| not o op >= | ?< | ! >= | .UL. |
Implementation note:
Implementations may choose to provide a debugging mode, in which NaNs and infinities are detected when they are generated.
Rationale:
The specification of the default signature and structure for non-integer arithmetic, particularly concerning exceptional conditions, was the source of much debate, given the desire of supporting efficient floating-point modules. If we permit implementations to differ on whether or not, for example, to raise Div on division by zero, the user really would not have a standard to program against. Portable code would require adopting the more conservative position of explicitly handling exceptions. A second alternative was to specify that functions in the Real structure must raise exceptions, but that implementations so desiring could provide additional structures matching REAL with explicit floating-point semantics. This was rejected because it meant that the default real type would not be the same as a defined floating-point real type. This would give a second-class status to the latter, while providing the default real with worse performance and involving additional implementation complexity for little benefit.
Deciding if real should be an equality type, and if so, what should equality mean, was also problematic. IEEE specifies that the sign of zeros be ignored in comparisons, and that equality evaluate to false if either argument is NaN. These constraints are disturbing to the SML programmer. The former implies that 0 = ~0 is true while r/0 = r/~0 is false. The latter implies such anomalies as r = r is false, or that, for a ref cell rr, we could have rr = rr but not have !rr = !rr. We accepted the unsigned comparison of zeros, but felt that the reflexive property of equality, structural equality, and the equivalence of <> and not o = ought to be preserved. Additional complications led to the decision to not have real be an equality type.
The type, signature, and structure identifiers real, REAL, and Real, although misnomers in light of the floating-point-specific nature of the modules, were retained for historical reasons.
Generated April 12, 2004
Last Modified May 25, 2000
Comments to John Reppy.
This document may be distributed freely over the internet as long as the copyright notice and license terms below are prominently displayed within every machine-readable copy.
Copyright © 2004 AT&T and Lucent Technologies. All rights reserved.
Permission is granted for internet users to make one paper copy for their own personal use. Further hardcopy reproduction is strictly prohibited. Permission to distribute the HTML document electronically on any medium other than the internet must be requested from the copyright holders by contacting the editors.
Printed versions of the SML Basis Manual are available from Cambridge University Press. To order, please visit www.cup.org (North America) or www.cup.cam.ac.uk (outside North America).