This is a library to shim BuckleScript's string-handling when using native-OCaml string-manipulation libraries.
In particular, a JavaScript `String` is indexed with `String::charCodeAt`; specifically, this function returns the Nth UTF-16 code unit of the string.
Meanwhile, over in OCaml land, the type `string` (and the functions in the module `String`) is, semantically, a dumb byte array. That is, when you ask the OCaml compiler for `a_string.[0]`, you don't get the first character of the string, or even a Unicode-aware codepoint or grapheme; instead, you get the first byte of (what OCaml believes to be) a series of opaque bytes.
Unfortunately, BuckleScript compiles the latter syntax (`a_string.[0]`) into the former semantic (`a_string.charCodeAt(0)`); this only makes sense within the very limited range of the ASCII-compatible bytes; that is, between 0 and 127.
Let's experiment with the following small program. It'll take an input string on the command-line, extract the first ... character? byte? and then tell us about it.
    (* str_test.ml *)
    let first_char_info s =
      let c = s.[0] in
      "Code: " ^ string_of_int (Char.code c) |> print_endline;
      "String: " ^ String.make 1 c |> print_endline

    (* Change the "1" to a "2" to execute this with Node.js. Annoyingly. *)
    let () = first_char_info Sys.argv.(1)
    $ bsc str_test.ml
    $ node str_test.js hello
    Code: 104
    String: h
    $ ocaml str_test.ml hello
    Code: 104
    String: h
Let's try the same thing with a non-ASCII, international string:
    $ node str_test.js جمل
    Code: 1580
    String: ج
    $ ocaml str_test.ml جمل
    Code: 216
    String: ?
Ruh-roh. The problem here comes from this series of exchanges:
1. `s` in the above program comes in as a UTF-8 encoded string; that's what the shell is passing along to the program in its command-line arguments.
2. Node.js understands and expects this, and converts the incoming value into its internal format, UCS-2; this means that `s.charCodeAt(0)` is going to be the first UCS code-point of that input string as encoded in UCS-2. That is to say, "ج", integer value 1580.
3. An OCaml program, unaware that it's being compiled via BuckleScript, expects `string` values arising from UTF-8 input (like `s`) to be addressed bytewise; that is, it would expect `s.[0]` to yield "\xD8" (216) and `s.[1]` to yield "\xAC" (172), the two bytes of the UTF-8 encoding of the codepoint ‘ج’.
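The mismatch described above can be reproduced from the JavaScript side alone. The following is a minimal sketch, assuming nothing about this library: `firstCodePointUtf8Bytes` is a hypothetical helper written for illustration, and it computes by hand the bytes that a native OCaml program would see for the first character.

```typescript
// Hypothetical helper for illustration only; not part of this library.
// Returns the UTF-8 bytes of the first code point of a JavaScript string.
function firstCodePointUtf8Bytes(s: string): number[] {
  let cp = s.charCodeAt(0);
  // Combine a surrogate pair into one code point, if one is present.
  if (cp >= 0xd800 && cp <= 0xdbff && s.length > 1) {
    const lo = s.charCodeAt(1);
    if (lo >= 0xdc00 && lo <= 0xdfff) {
      cp = 0x10000 + ((cp - 0xd800) << 10) + (lo - 0xdc00);
    }
  }
  if (cp < 0x80) return [cp];
  if (cp < 0x800) return [0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)];
  if (cp < 0x10000) {
    return [0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f)];
  }
  return [
    0xf0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3f),
    0x80 | ((cp >> 6) & 0x3f),
    0x80 | (cp & 0x3f),
  ];
}

const sample = "\u062C\u0645\u0644"; // "جمل"
const firstCodeUnit = sample.charCodeAt(0); // what compiled BuckleScript reads for s.[0]
const firstBytes = firstCodePointUtf8Bytes(sample); // what native OCaml reads for s.[0], s.[1]

console.log(firstCodeUnit); // 1580
console.log(firstBytes.join(" ")); // 216 172
```

The two results agree only when the code point is below 128, which is why ASCII-only input hides the problem.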
This library re-encodes JavaScript strings as `String`s whose code-units are UTF-8 bytes; these can be passed with impunity to UTF-8-handling OCaml functions, which will now function as expected.
Note: This package is not necessary for code written specifically for BuckleScript; just be aware of the BuckleScript-specific semantics of the `string` type.
Install `ocaml-string-convert` with npm:
npm install --save ocaml-string-convert
Takes an ordinary JavaScript `String`; that is, one with Unicode characters outside the ASCII range (but still within the BMP!) encoded as single, 16-bit code-units, and higher-plane characters encoded as UTF-16-style surrogate pairs.
Example, as a UCS-2 sequence of 16-bit code-units:

    102 111 111 183 98 97 114
Returns a `String` (that is still technically encoded as UCS-2, mind you!) containing a series of UTF-8 bytes, each interpreted as a UCS-2 codepoint.
Example, as a UCS-2 sequence of 16-bit code-units:

    102 111 111 194 183 98 97 114
See that, in this example, the non-ASCII character U+00B7 “MIDDLE DOT”, which is one code-unit (literally 183, or `\xB7`) in the first sequence, becomes two code-units in the second: `\xC2\xB7`, C2-B7 being the UTF-8 encoding of U+00B7.
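To make the transformation concrete, here is a minimal sketch of such a re-encoding. The names `toUtf8CodeUnits` and `codeUnits` are hypothetical and are not this library's exported API; the code only illustrates the idea of storing one UTF-8 byte per 16-bit code-unit.

```typescript
// Hypothetical name; NOT this library's API. Re-encodes a JavaScript string
// so that each resulting code-unit holds one byte of its UTF-8 encoding.
function toUtf8CodeUnits(input: string): string {
  let out = "";
  for (let i = 0; i < input.length; i++) {
    let cp = input.charCodeAt(i);
    // Merge a surrogate pair into a single higher-plane code point.
    if (cp >= 0xd800 && cp <= 0xdbff && i + 1 < input.length) {
      const lo = input.charCodeAt(i + 1);
      if (lo >= 0xdc00 && lo <= 0xdfff) {
        cp = 0x10000 + ((cp - 0xd800) << 10) + (lo - 0xdc00);
        i++;
      }
    }
    if (cp < 0x80) {
      out += String.fromCharCode(cp);
    } else if (cp < 0x800) {
      out += String.fromCharCode(0xc0 | (cp >> 6), 0x80 | (cp & 0x3f));
    } else if (cp < 0x10000) {
      out += String.fromCharCode(
        0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
    } else {
      out += String.fromCharCode(
        0xf0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3f),
        0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
    }
  }
  return out;
}

// Lists the 16-bit code-units of a string as numbers.
function codeUnits(str: string): number[] {
  const units: number[] = [];
  for (let i = 0; i < str.length; i++) units.push(str.charCodeAt(i));
  return units;
}

const plainInput = "foo\u00B7bar";
const doubleEncoded = toUtf8CodeUnits(plainInput);
console.log(codeUnits(plainInput).join(" "));    // 102 111 111 183 98 97 114
console.log(codeUnits(doubleEncoded).join(" ")); // 102 111 111 194 183 98 97 114
```

Note that the sketch passes an unpaired surrogate through as a three-byte sequence rather than rejecting it; a real implementation may choose differently.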
The inverse operation to the above.
Given a double-encoded (effectively, mis-encoded) BuckleScript ‘string’ that's been manipulated as if it's a UTF-8 byte buffer, this recovers the ordinary JavaScript `String` it represents.
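The inverse direction can be sketched in the same spirit. `fromUtf8CodeUnits` is a hypothetical name, not this library's API, and the sketch assumes well-formed input: it checks lead and continuation bytes but does not reject overlong or out-of-range sequences.

```typescript
// Hypothetical name; NOT this library's API. Treats each code-unit of the
// input as one UTF-8 byte and decodes the sequence into an ordinary string.
function fromUtf8CodeUnits(input: string): string {
  let out = "";
  let i = 0;
  while (i < input.length) {
    const b0 = input.charCodeAt(i);
    let cp: number;
    let extra: number; // number of continuation bytes that follow b0
    if (b0 < 0x80) { cp = b0; extra = 0; }
    else if (b0 >= 0xc0 && b0 < 0xe0) { cp = b0 & 0x1f; extra = 1; }
    else if (b0 >= 0xe0 && b0 < 0xf0) { cp = b0 & 0x0f; extra = 2; }
    else if (b0 >= 0xf0 && b0 < 0xf8) { cp = b0 & 0x07; extra = 3; }
    else throw new Error("invalid UTF-8 lead byte at index " + i);

    if (i + extra >= input.length) {
      throw new Error("truncated UTF-8 sequence at index " + i);
    }
    for (let k = 1; k <= extra; k++) {
      const b = input.charCodeAt(i + k);
      if (b < 0x80 || b > 0xbf) {
        throw new Error("invalid continuation byte at index " + (i + k));
      }
      cp = (cp << 6) | (b & 0x3f);
    }
    i += extra + 1;

    if (cp >= 0x10000) {
      // Higher-plane code points become a UTF-16 surrogate pair.
      const v = cp - 0x10000;
      out += String.fromCharCode(0xd800 + (v >> 10), 0xdc00 + (v & 0x3ff));
    } else {
      out += String.fromCharCode(cp);
    }
  }
  return out;
}

const mangled = String.fromCharCode(102, 111, 111, 194, 183, 98, 97, 114);
const recovered = fromUtf8CodeUnits(mangled);
console.log(recovered === "foo\u00B7bar"); // true
```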
A Note on Types
Given that readers of this are almost guaranteed to write OCaml, it will probably surprise nobody that I prefer the ability to use nominal types. This is not, however, standard TypeScript practice.
This library's TypeScript interface (which I hope I'm exporting correctly, by the way; I'm rather new to publishing a TypeScript-enabled library!) mints a new type for this, `string_as_utf_8_buffer`. Idiomatic usage would be to tag every stringish return-value from a BuckleScript module with this type:
// ... manipulation ...
(As you can see, I also like to follow a different naming-convention for values I know to contain opaque values produced by the BuckleScript runtime.)
You can, of course, dispense with my convention at your earliest convenience, if you can't stand the (hopefully helpful?) type-errors that this produces; I do not, of course, suggest that you do so:
// ... manipulation ...
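For readers who haven't met nominal typing in TypeScript before, the general "branding" technique behind this kind of type can be sketched as follows. Every name here (`Utf8BufferString`, `tagAsUtf8Buffer`, `byteLength`) is hypothetical; this is not the library's actual definition of `string_as_utf_8_buffer`, only an illustration of how a structurally-typed language can be made to distinguish two kinds of string.

```typescript
// Hypothetical names throughout; a sketch of the general branding pattern,
// not this library's actual type definitions.
type Utf8BufferString = string & { readonly __brand: "utf8-buffer" };

// The single place a cast is needed: the boundary where a value comes back
// from compiled BuckleScript code and is known to hold UTF-8 bytes.
function tagAsUtf8Buffer(raw: string): Utf8BufferString {
  return raw as Utf8BufferString;
}

// Accepts only tagged values; a plain `string` is rejected at compile time.
function byteLength(value: Utf8BufferString): number {
  return value.length; // one code-unit per UTF-8 byte
}

const taggedValue = tagAsUtf8Buffer(
  String.fromCharCode(102, 111, 111, 194, 183, 98, 97, 114));
console.log(byteLength(taggedValue)); // 8

// byteLength("foo\u00B7bar");
// ^ would not compile: 'string' is not assignable to 'Utf8BufferString'.
```

The brand exists only in the type system; at runtime the tagged value is an ordinary string, so the pattern costs nothing beyond the cast.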