String literal in Move

Hello! Are you going to add support for UTF-8 (or maybe ASCII) string literals in Move? From what I’ve seen in the standard library, only x"..." is supported. Will there be another way to create a new vector<u8>?

A string type for the source language has been on our wishlist, but we haven’t gotten to it yet. Specifically, we would like to add something like Rust-style byte string literals. We would be more than happy to accept this as a contribution if you are willing, or know someone who would be! https://github.com/libra/libra/pull/3131 might be a helpful guide.

Thanks for the response! I think I know the right guy for this job. :smirk:

Could you provide some details on how this is supposed to work? Let’s say I use a Rust-like byte literal, b'sam': what value would the vector store? Would it be ASCII-encoded (one symbol = 1 byte)? Or is there room for experiments with UTF?

I suggest doing this incrementally and starting with something simple. We’ve been planning for a byte string literal syntax similar to Rust (b"..."). We can split this into 2 steps:

  1. ASCII strings with no escapes. Change the find_token function in language/move-lang/src/parser/lexer.rs to recognize both b"..." and x"..." as ByteStringValue tokens. Then, change the parse_byte_string function in language/move-lang/src/parser/syntax.rs to check for the "b" prefix (instead of asserting that it is "x") and read the string of characters as a byte vector. I think that’s all, but of course, you should also add some tests in language/move-lang/tests/move_check/parser/.
  2. Add support for escape sequences: "\n", "\t", "\r", "\0", "\\", "\"" and byte escapes ("\x52"). The lexer will need to check for escaped quote characters when scanning to find the end of a string token, and the parser will need to process the different escape sequences when converting to byte values. This shouldn’t be too hard.
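
To make the two steps more concrete, here is a rough sketch in Rust of the two pieces of logic involved. This is not the actual move-lang code: find_closing_quote and decode_byte_string are hypothetical helper names, and the real find_token and parse_byte_string functions would wire something like this into the existing token and error types.

```rust
/// Hypothetical lexer helper (not part of move-lang): given the text right
/// after the opening quote, return the byte offset of the closing quote,
/// skipping over any escaped character such as \" along the way.
fn find_closing_quote(text: &str) -> Option<usize> {
    let bytes = text.as_bytes();
    let mut i = 0;
    while i < bytes.len() {
        match bytes[i] {
            b'\\' => i += 2, // skip the backslash and the character it escapes
            b'"' => return Some(i),
            _ => i += 1,
        }
    }
    None // unterminated string
}

/// Hypothetical parser helper (not part of move-lang): convert the text
/// between the quotes of a b"..." literal into its byte values.
fn decode_byte_string(body: &str) -> Result<Vec<u8>, String> {
    let mut out = Vec::with_capacity(body.len());
    let mut chars = body.chars();
    while let Some(c) = chars.next() {
        if !c.is_ascii() {
            // Assumption: the source is ASCII-only, so anything else is an error.
            return Err(format!("non-ASCII character {:?}", c));
        }
        if c != '\\' {
            out.push(c as u8); // step 1: one ASCII character = one byte
            continue;
        }
        // Step 2: process the escape sequence that follows the backslash.
        let byte = match chars.next() {
            Some('n') => b'\n',
            Some('t') => b'\t',
            Some('r') => b'\r',
            Some('0') => 0,
            Some('\\') => b'\\',
            Some('"') => b'"',
            Some('x') => {
                // Byte escape: exactly two hex digits, e.g. \x52.
                let hi = chars.next().and_then(|h| h.to_digit(16));
                let lo = chars.next().and_then(|l| l.to_digit(16));
                match (hi, lo) {
                    (Some(h), Some(l)) => (h * 16 + l) as u8,
                    _ => return Err("invalid \\x escape".to_string()),
                }
            }
            Some(other) => return Err(format!("unknown escape \\{}", other)),
            None => return Err("dangling backslash".to_string()),
        };
        out.push(byte);
    }
    Ok(out)
}
```

With this sketch, decode_byte_string("sam") yields the three bytes 0x73, 0x61, 0x6d, which is the same value you would write today as x"73616d".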

As far as UTF-8 goes, the Move input character encoding is ASCII right now, and we don’t have plans to change that anytime soon. Of course, you can still encode arbitrary Unicode characters in a UTF-8 string by specifying the raw byte values. We could also add support for 24-bit Unicode code point escapes ("\u{abc123}"), where the compiler would know how to encode those to UTF-8, but we’ve been hesitant to explicitly support Unicode, since it adds a lot of complexity. Even without any string operations, there are open questions, such as whether to normalize strings when there are multiple ways of encoding the same character.
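
As an illustration of the normalization question, here is a small Rust sketch of what encoding code point escapes to UTF-8 could look like. The utf8_bytes helper is a hypothetical name used only for this example; it relies on the standard char::from_u32 and char::encode_utf8 functions.

```rust
/// Hypothetical helper: encode a sequence of Unicode code points as UTF-8.
/// Returns None if any value is not a valid Unicode scalar value
/// (a surrogate, or anything above 0x10FFFF).
fn utf8_bytes(code_points: &[u32]) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    for &cp in code_points {
        let c = char::from_u32(cp)?; // rejects invalid code points
        let mut buf = [0u8; 4]; // a UTF-8 sequence is at most 4 bytes long
        out.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
    }
    Some(out)
}
```

The character "é" can be written as the single code point U+00E9, in which case utf8_bytes(&[0xE9]) gives the bytes 0xC3 0xA9, or as "e" followed by the combining accent U+0301, in which case utf8_bytes(&[0x65, 0x301]) gives 0x65 0xCC 0x81. Both render the same, but the byte vectors differ, so comparing them byte by byte reports them as unequal unless something normalizes them first. Note also that not every 24-bit value is a valid code point: anything above 0x10FFFF is rejected here.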
