notes from nwf/jason meeting on numbers and strings

author Jason Eisner <jason@cs.jhu.edu>

Mon, 6 Jan 2014 21:27:04 +0000 (16:27 -0500)

committer Jason Eisner <jason@cs.jhu.edu>

Mon, 6 Jan 2014 21:27:04 +0000 (16:27 -0500)
author Jason Eisner <jason@cs.jhu.edu>
Mon, 6 Jan 2014 21:27:04 +0000 (16:27 -0500)
committer Jason Eisner <jason@cs.jhu.edu>
Mon, 6 Jan 2014 21:27:04 +0000 (16:27 -0500)
diff --git a/docs/sphinx/spec/index.rst b/docs/sphinx/spec/index.rst

index 5e01047230494106cb5bc3e29406e08254fa8b97..6892ae47edf97b1b0a190e3a294d2015d891bb90 100644 (file)
--- a/docs/sphinx/spec/index.rst
+++ b/docs/sphinx/spec/index.rst
@@ -142,9 +142,148 @@ Booleans
  Numbers
  -------
  
+.. todo:: One option is to have disjoint numeric types that vary by
+         precision.  There seem to be the following classes: 
+         * exact expressions as in Mathematica (bignums, rationals
+           infinity, algebraic expressions like 1+sqrt(3) or at least
+           the ones without variables, complex numbers with exact
+           coefficients, perhaps unlimited-precision FP)
+         * machine integers of various widths, with modular
+           arithmetic (wraparound); these are not really subtypes
+           of one another because upcasting changes what addition
+           means
+         * machine floating-point of various precisions
+
+          We also have union types like `number` and (to exclude
+         complex numbers from comparisons) `real`.
+         The numeric library is able to add or compare ints and
+         floats with each other: it probably covers numbers via a
+         union of several type signatures.  However, not all modes
+         are supported.  So even though `*` has (int,float)->float,
+         we can only run that in mode +,+,- or +,-,+ but not -,+,+
+         where we need to use (float,float)->float.  And even though
+         `between` has (real,real,real)->bool, the mode +,-,+ 
+         requires the second arg to already be restricted to int.
+         So the user can write `between(1,3.5,10)` but must do
+         `between(1,:int X,10)` rather than `between(1,X,10) (unless
+         the :int restriction shows up elsewhere in the rule).
+         Perhaps we should offer the user a convenience predicate
+         `ibetween`.
+
+         Problems with this approach:
+         * distinguishing the literals (3 versus 3. versus 3L)
+         * names for explicit casts
+         * operator overloading and/or coercion (borrow from other
+           languages)
+         * queries for f(3) and f(3.0) could give different results
+         * f(3.0) := ... can't override f(3) := ... 
+         * f(3.0) +=  doesn't combine with f(3) += ...; could be
+           awkward if argument to f comes from a subexpression or
+           another query
+         * inverse floating-point modes may have to be nondet if they
+           are supported at all: there are many values of Epsilon
+           such that 3+Epsilon==3 (similarly, inf+Epsilon==inf)
+         * double-counting in aggregation, e.g., += f(X) for X > 3.
+         * Options:
+           * Reject aggregation where the type contains multiple
+             numeric branches.  But this is a pain because now people
+             can't aggregate innocuously over terms when they are not
+             planning to store both 3 and 3.0.
+           * At runtime, give an aggregation error if some of the
+             binding sets that we aggregate over appear to be equal,
+             i.e., X=3,Y="foo" and X=3.0,Y="foo".  (We can
+             eliminate this check if at compile-time we know from X's
+             type that it can't take on different "versions" of the
+             same value, e.g., its int range and float range are
+             disjoint.  We can also speed up this check if at runtime
+             we track information about the numeric types that are
+             actually used and do more type inference based on those
+             restricted types.)
+           * Don't permit the user to define explicit values at
+             compile-time or runtime for both f(3) and f(3.0)
+             (perhaps f needs to keep track of which numeric subclass
+             it is actually using for its args and values even if it
+             is defined over terms), unless user declares that they
+             really want this for f/1.  We probably shouldn't
+             allow the queries f(3) and f(3.0) to find each other's
+             values because then we'll get double-counting.
+        
+         Alternative approach: We only have one big `number` type
+         with subtypes that are implemented by various efficient
+         classes, which may overlap.  This ensures that f(3.0) and
+         f(3) always mean the same thing and are not double-counted.
+         But what happens when a class runs into its bounds or
+         precision limits?  The classy procedures that implement `+`
+         may give different results from one another, depending on
+         the particular classes of their arguments.  
+
+         (A good rule of thumb is that the most precise argument
+         class dictates the amount of precision in the operation and
+         the result class, and literals have a sufficient and indeed
+         generous level of precision, so 12345678901234567890 will
+         get a class that is big enough to represent it and so that
+         adding 255+1 is adding integers rather than bytes.  However,
+         we need to decide whether bigint+float returns bigint or
+         float or something even bigger, and certain operations may
+         map finite to possibly-infinite or int to float.)
+
+         Since the class of the input item makes a difference, we
+         think of this nondeterministic behavior as akin to the
+         "don't care" aggregator, where there are several definitions
+         of `+` and which one to use is up to the system.  If the
+         user wants to control which `+` it is (e.g., unsigned char,
+         or strict math that is careful to never lose precision),
+         then they can give a more specifically named operator, or
+         rebind the `+` symbol to that operator for convenience, or
+         give a pragma, or something.
+
+         A problem with this don't-care approach is that a single
+         item can be computed in different ways (plan order, forward
+         vs. backward chaining, query mode, aggregation order, the
+         subtraction trick, etc.)  that get different answers owing
+         to different class choices or aggregated numerical errors.
+         So we may have to pin a particular answer: remember that
+         don't-cares are as bad for efficiency as cycles or
+         randomness.  So maybe we'd like to ensure at least that the
+         precision of a token of `+` is resolved in a consistent way
+         across all plans (which might require restricting the number
+         of plans).
+
+         Also, it should be easy for the user to figure out what the
+         semantics of the program is and to control it.  That is, can
+         the user figure out which classes are being used other than
+         by annotating the operator?  Note that `f(X) = X+1` might
+         give different numerical answers for `f(3000000000000)`
+         depending on the class used to represent the argument.
+         Consider also `f(X) = X+1 for a(X), b(X)`, where the choice
+         of `+` procedure might be affected by the classes of ground
+         answers returned by subgoals `a` and `b` as well as the
+         class of ground argument passed in by `f` (if any).  If
+         these classes are not all the same, then what is the class
+         passed into `+`?  Can it be determined somehow at compile
+         time, in a consistent way?  Isn't it a problem if forward
+         and backward chaining give different results?  (Perhaps we
+         can at least say that if the query `f(X)` comes back with
+         the result `f(3)=4` then `f(3)` would have come back with
+         the same answer `4`, and that if it comes back with
+         `f(3.0)=4.0` then `f(3.0)` would have come back with the
+         same answer `4.0`.)
+
  Strings
  -------
  
+.. todo:: (related to numbers) when are two Unicode strings considered
+         equal?  Is a string a sequence of code points, or is it a
+         normalized version?  Do we have different types for string,
+         string_NFD, string_NFC, string_NFKD, string_NFKC?  Can they
+         be equal to one another?  Is there any coercion?  What are
+         the canonicalization operators called?  Do we have
+         canonicalizing equality operators?  How do we protect the
+         user?  Should we regard a normalized string in any of these
+         schemes as a *set* of unnormalized strings (probably the
+         enumeration mode is not supported), so that we can ask about
+         meet as well as equality?
+
  Escape codes
  ^^^^^^^^^^^^
  .. todo:: including \' and \" -- borrow from Python
@@ -402,6 +541,8 @@ Some discussion of current approach is in :doc:`/manual/pragmas`.
  Lambdas
  =======
  
+.. todo:: with keyword args.  should be sufficient for teaching NLP.
+
  Terms as dynabases
  ==================
  .. todo:: querying and destructuring terms
author	Jason Eisner <jason@cs.jhu.edu>
	Mon, 6 Jan 2014 21:27:04 +0000 (16:27 -0500)
committer	Jason Eisner <jason@cs.jhu.edu>
	Mon, 6 Jan 2014 21:27:04 +0000 (16:27 -0500)