From: Jason Eisner
Date: Mon, 6 Jan 2014 21:27:04 +0000 (-0500)
Subject: notes from nwf/jason meeting on numbers and strings
X-Git-Url: https://hydra-www.ietfng.org/gitweb/?a=commitdiff_plain;h=6700a0a493d253564c33d63c183434f685800a11;p=dyna2

notes from nwf/jason meeting on numbers and strings
---

diff --git a/docs/sphinx/spec/index.rst b/docs/sphinx/spec/index.rst
index 5e01047..6892ae4 100644
--- a/docs/sphinx/spec/index.rst
+++ b/docs/sphinx/spec/index.rst
@@ -142,9 +142,148 @@ Booleans
 Numbers
 -------
+.. todo:: One option is to have disjoint numeric types that vary by
+          precision. There seem to be the following classes:
+          * exact expressions as in Mathematica (bignums, rationals
+            infinity, algebraic expressions like 1+sqrt(3) or at least
+            the ones without variables, complex numbers with exact
+            coefficients, perhaps unlimited-precision FP)
+          * machine integers of various widths, with modular
+            arithmetic (wraparound); these are not really subtypes
+            of one another because upcasting changes what addition
+            means
+          * machine floating-point of various precisions
+
+          We also have union types like `number` and (to exclude
+          complex numbers from comparisons) `real`.
+          The numeric library is able to add or compare ints and
+          floats with each other: it probably covers numbers via a
+          union of several type signatures. However, not all modes
+          are supported. So even though `*` has (int,float)->float,
+          we can only run that in mode +,+,- or +,-,+ but not -,+,+
+          where we need to use (float,float)->float. And even though
+          `between` has (real,real,real)->bool, the mode +,-,+
+          requires the second arg to already be restricted to int.
+          So the user can write `between(1,3.5,10)` but must do
+          `between(1,:int X,10)` rather than `between(1,X,10) (unless
+          the :int restriction shows up elsewhere in the rule).
+          Perhaps we should offer the user a convenience predicate
+          `ibetween`.
+
+          Problems with this approach:
+          * distinguishing the literals (3 versus 3. versus 3L)
+          * names for explicit casts
+          * operator overloading and/or coercion (borrow from other
+            languages)
+          * queries for f(3) and f(3.0) could give different results
+          * f(3.0) := ... can't override f(3) := ...
+          * f(3.0) += doesn't combine with f(3) += ...; could be
+            awkward if argument to f comes from a subexpression or
+            another query
+          * inverse floating-point modes may have to be nondet if they
+            are supported at all: there are many values of Epsilon
+            such that 3+Epsilon==3 (similarly, inf+Epsilon==inf)
+          * double-counting in aggregation, e.g., += f(X) for X > 3.
+          * Options:
+            * Reject aggregation where the type contains multiple
+              numeric branches. But this is a pain because now people
+              can't aggregate innocuously over terms when they are not
+              planning to store both 3 and 3.0.
+            * At runtime, give an aggregation error if some of the
+              binding sets that we aggregate over appear to be equal,
+              i.e., X=3,Y="foo" and X=3.0,Y="foo". (We can
+              eliminate this check if at compile-time we know from X's
+              type that it can't take on different "versions" of the
+              same value, e.g., its int range and float range are
+              disjoint. We can also speed up this check if at runtime
+              we track information about the numeric types that are
+              actually used and do more type inference based on those
+              restricted types.)
+            * Don't permit the user to define explicit values at
+              compile-time or runtime for both f(3) and f(3.0)
+              (perhaps f needs to keep track of which numeric subclass
+              it is actually using for its args and values even if it
+              is defined over terms), unless user declares that they
+              really want this for f/1. We probably shouldn't
+              allow the queries f(3) and f(3.0) to find each other's
+              values because then we'll get double-counting.
+
+          Alternative approach: We only have one big `number` type
+          with subtypes that are implemented by various efficient
+          classes, which may overlap. This ensures that f(3.0) and
+          f(3) always mean the same thing and are not double-counted.
+          But what happens when a class runs into its bounds or
+          precision limits? The classy procedures that implement `+`
+          may give different results from one another, depending on
+          the particular classes of their arguments.
+
+          (A good rule of thumb is that the most precise argument
+          class dictates the amount of precision in the operation and
+          the result class, and literals have a sufficient and indeed
+          generous level of precision, so 12345678901234567890 will
+          get a class that is big enough to represent it and so that
+          adding 255+1 is adding integers rather than bytes. However,
+          we need to decide whether bigint+float returns bigint or
+          float or something even bigger, and certain operations may
+          map finite to possibly-infinite or int to float.)
+
+          Since the class of the input item makes a difference, we
+          think of this nondeterministic behavior as akin to the
+          "don't care" aggregator, where there are several definitions
+          of `+` and which one to use is up to the system. If the
+          user wants to control which `+` it is (e.g., unsigned char,
+          or strict math that is careful to never lose precision),
+          then they can give a more specifically named operator, or
+          rebind the `+` symbol to that operator for convenience, or
+          give a pragma, or something.
+
+          A problem with this don't-care approach is that a single
+          item can be computed in different ways (plan order, forward
+          vs. backward chaining, query mode, aggregation order, the
+          subtraction trick, etc.) that get different answers owing
+          to different class choices or aggregated numerical errors.
+          So we may have to pin a particular answer: remember that
+          don't-cares are as bad for efficiency as cycles or
+          randomness. So maybe we'd like to ensure at least that the
+          precision of a token of `+` is resolved in a consistent way
+          across all plans (which might require restricting the number
+          of plans).
+
+          Also, it should be easy for the user to figure out what the
+          semantics of the program is and to control it. That is, can
+          the user figure out which classes are being used other than
+          by annotating the operator? Note that `f(X) = X+1` might
+          give different numerical answers for `f(3000000000000)`
+          depending on the class used to represent the argument.
+          Consider also `f(X) = X+1 for a(X), b(X)`, where the choice
+          of `+` procedure might be affected by the classes of ground
+          answers returned by subgoals `a` and `b` as well as the
+          class of ground argument passed in by `f` (if any). If
+          these classes are not all the same, then what is the class
+          passed into `+`? Can it be determined somehow at compile
+          time, in a consistent way? Isn't it a problem if forward
+          and backward chaining give different results? (Perhaps we
+          can at least say that if the query `f(X)` comes back with
+          the result `f(3)=4` then `f(3)` would have come back with
+          the same answer `4`, and that if it comes back with
+          `f(3.0)=4.0` then `f(3.0)` would have come back with the
+          same answer `4.0`.)
+
 Strings
 -------
+.. todo:: (related to numbers) when are two Unicode strings considered
+          equal? Is a string a sequence of code points, or is it a
+          normalized version? Do we have different types for string,
+          string_NFD, string_NFC, string_NFKD, string_NFKC? Can they
+          be equal to one another? Is there any coercion? What are
+          the canonicalization operators called? Do we have
+          canonicalizing equality operators? How do we protect the
+          user? Should we regard a normalized string in any of these
+          schemes as a *set* of unnormalized strings (probably the
+          enumeration mode is not supported), so that we can ask about
+          meet as well as equality?
+
 Escape codes
 ^^^^^^^^^^^^
 .. todo:: including \' and \" -- borrow from Python
@@ -402,6 +541,8 @@ Some discussion of current approach is in :doc:`/manual/pragmas`.
 Lambdas
 =======
+.. todo:: with keyword args. should be sufficient for teaching NLP.
+
+
 Terms as dynabases
 ==================
 .. todo:: querying and destructuring terms
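
Three behaviours raised in the Numbers and Strings notes of this patch are easy to check concretely: modular integer addition means something different at each width, floating-point addition absorbs small addends (so an inverse mode for `+` cannot be deterministic), and two Unicode strings can differ as code-point sequences yet agree after normalization. What follows is only a minimal sketch, written in Python rather than Dyna. The helper names `add_wrapping`, `absorbs` and `equal_normalized` are hypothetical and are not part of the Dyna spec or any Dyna library.

```python
import math
import unicodedata


def add_wrapping(a: int, b: int, bits: int) -> int:
    """Unsigned addition modulo 2**bits, as a machine integer of that width would do it.

    Hypothetical helper for illustration only.
    """
    return (a + b) % (1 << bits)


def absorbs(x: float, eps: float) -> bool:
    """True if x + eps rounds back to x in double precision (hypothetical helper)."""
    return x + eps == x


def equal_normalized(s: str, t: str, form: str) -> bool:
    """Compare two strings after applying the same Unicode normalization form.

    `form` is one of "NFC", "NFD", "NFKC", "NFKD" (hypothetical helper).
    """
    return unicodedata.normalize(form, s) == unicodedata.normalize(form, t)


if __name__ == "__main__":
    # Upcasting changes what addition means: 255+1 as a byte versus as a 16-bit integer.
    print(add_wrapping(255, 1, 8), add_wrapping(255, 1, 16))

    # Several distinct Epsilon values satisfy 3+Epsilon==3, and likewise for inf.
    print(absorbs(3.0, 1e-17), absorbs(3.0, 1e-18), absorbs(math.inf, 1e300))

    # "e with acute" as one code point versus "e" plus a combining acute accent.
    composed, decomposed = "\u00e9", "e\u0301"
    print(composed == decomposed, equal_normalized(composed, decomposed, "NFC"))
```

Because more than one Epsilon is absorbed by 3.0, a query that solves `3+Epsilon==3` for Epsilon has no unique answer, which is the situation the notes have in mind when they say inverse floating-point modes may have to be nondeterministic if they are supported at all. The string comparison likewise shows why the choice between code-point equality and canonicalizing equality has to be made explicitly.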