発端

ある日, 私は文字列の先頭を切り出すために次のような関数を書いていました.

https://github.com/susisu/loquat-core/blob/dcd37f885b5f5b1b0dfdc0e15680375a16239318/lib/utils.js#L97-L113

function unconsString(str, unicode) {
    if (unicode) {
        const cp = str.codePointAt(0);
        if (cp === undefined) {
            return { empty: true };
        }
        else {
            const char = String.fromCodePoint(cp);
            return { empty: false, head: char, tail: str.substr(char.length) };
        }
    }
    else {
        return str === ""
            ? { empty: true }
            : { empty: false, head: str[0], tail: str.substr(1) };
    }
}

unicode = false のときは別に特に変わったことはしていないのですが, unicode = true のときは String.prototype.codePointAt を使って code point 単位で文字コードを取り出し, それを String.fromCodePoint を使って文字列に変換しています.

要するに unicode = true のときは, サロゲートペア (絵文字とか) をペアのまま取り出してくれるというやつです.

const str = "🍣🍤"; // === "\uD83C\uDF63\uD83C\uDF64";
console.log(unconsString(str, false));
// -> { empty: false, head: "\uD83C", tail: "\uDF63🍤" }
console.log(unconsString(str, true));
// -> { empty: false, head: "🍣", tail: "🍤" }

ところで unconsString 関数のパフォーマンスを調べてみましょう (Node.js v7.3.0, 雑に2回実行して平均とってます).

ASCII, `unicode = false`

const N = 10000000;

console.time("uncons");
let str = "abcd";
for (let i = 0; i < N; i++) {
    unconsString(str, false);
}
console.timeEnd("uncons");

400 ms.

ASCII, `unicode = true`

const N = 10000000;

console.time("uncons");
let str = "abcd";
for (let i = 0; i < N; i++) {
    unconsString(str, true);
}
console.timeEnd("uncons");

1620 ms. 既に雲行きが怪しいとかそういうレベルではない.

サロゲートペア, `unicode = false`

const N = 10000000;

console.time("uncons");
let str = "🍣🍤";
for (let i = 0; i < N; i++) {
    unconsString(str, false);
}
console.timeEnd("uncons");

790ms. ASCII と比べて遅いのは内部表現の都合だろうか?

サロゲートペア, `unicode = true`

const N = 10000000;

console.time("uncons");
let str = "🍣🍤";
for (let i = 0; i < N; i++) {
    unconsString(str, true);
}
console.timeEnd("uncons");

4300 ms. もうダメだ.

何が遅いのか

unconsString 関数内では特に複雑なこともしていないので, 遅いことが疑われるのは String.prototype.codePointAt か String.fromCodePoint のどちらかしかないわけで, 結局タイトルの通り, String.fromCodePoint が犯人だったというわけです.

書き直す

そんなこんなで以下のように書き直しました.

https://github.com/susisu/loquat-core/blob/6fbcffecb21b415547d20e728b1ef259f5975571/lib/utils.js#L97-L123

function unconsString(str, unicode) {
    const len = str.length;
    if (unicode) {
        if (len === 0) {
            return { empty: true };
        }
        else if (len === 1) {
            return { empty: false, head: str[0], tail: str.substr(1) };
        }
        else {
            const first = str.charCodeAt(0);
            if (first < 0xD800 || 0xDBFF < first) {
                return { empty: false, head: str[0], tail: str.substr(1) };
            }
            const second = str.charCodeAt(1);
            if (second < 0xDC00 || 0xDFFF < second) {
                return { empty: false, head: str[0], tail: str.substr(1) };
            }
            return { empty: false, head: String.fromCharCode(first, second), tail: str.substr(2) };
        }
    }
    else {
        return len === 0
            ? { empty: true }
            : { empty: false, head: str[0], tail: str.substr(1) };
    }
}

String.fromCodePoint の代わりに String.fromCharCode を使っています. その過程で code point を code unit に分割する処理が必要だったりして, 結局 String.prototype.codePointAt の中身も仕様書を見ながら再実装したりしています (色々省いてはいますが).

実際これでパフォーマンスが改善したのか, 再度計測して比べてみましょう.

ASCII, `unicode = true`

const N = 10000000;

console.time("uncons");
let str = "abcd";
for (let i = 0; i < N; i++) {
    unconsString(str, true);
}
console.timeEnd("uncons");

1620 ms → 460 ms. unicode = false の場合とほぼ変わらない速度が出るようになりました.

サロゲートペア, `unicode = true`

const N = 10000000;

console.time("uncons");
let str = "🍣🍤";
for (let i = 0; i < N; i++) {
    unconsString(str, true);
}
console.timeEnd("uncons");

4300 ms → 680 ms. むしろ unicode = false より速くなった.

まとめ

というわけで無事 String.fromCodePoint を避けることで高速化に成功しました. パフォーマンスが気になる場合は代わりに旧来通りの String.fromCharCode を使いましょう. あまりちゃんと調べていませんが, Firefox でも効果があるっぽいです.

(追記: なぜ遅いのかと V8 の中身をざっと見た感じでは String.fromCharCode がかなり最適化されたような実装になっているのに対して, String.fromCodePoint はほぼ仕様書見たままのナイーブな実装になっている雰囲気だった. 雰囲気しかわからないので正しくないかもしれない.)

unconsString 関数はパーサライブラリの中で使っていて, 文字列分割は頻繁に行われるため, 全体のパフォーマンスにそこそこ影響を及ぼします. 適当に JSON パーサで調べたら unicode = true の場合に 20 〜 30 % くらい高速化しました. Cookie Cliker なら 2000 時間放置後の ascension くらいの威力ですね (人に依るのでは?).

Object.create(null)

String.fromCodePoint が遅かった

発端

ASCII, `unicode = false`

ASCII, `unicode = true`

サロゲートペア, `unicode = false`

サロゲートペア, `unicode = true`

何が遅いのか

書き直す

ASCII, `unicode = true`

サロゲートペア, `unicode = true`

まとめ

発端

ASCII, unicode = false

ASCII, unicode = true

サロゲートペア, unicode = false

サロゲートペア, unicode = true

何が遅いのか

書き直す

ASCII, unicode = true

サロゲートペア, unicode = true

まとめ

ASCII, `unicode = false`

ASCII, `unicode = true`

サロゲートペア, `unicode = false`

サロゲートペア, `unicode = true`

ASCII, `unicode = true`

サロゲートペア, `unicode = true`