TysonAndre
Repos
163
Followers
103

Phan is a static analyzer for PHP. Phan prefers to avoid false-positives and attempts to prove incorrectness rather than correctness.

5319
341

Uniqush is a free and open source software system which provides a unified push service for server side notification to apps on mobile devices.

1436
193

Igbinary is a drop in replacement for the standard php serializer.

717
58

Runkit that (mostly) works in PHP7! (method and function manipulation)

175
27

Phan - PHP Static Analysis for VS Code

22
3

A fast, light-weight proxy for memcached and redis

C
11571
1925

Events

pull request opened
Add script to compare results against json_decode()

Closes #68

The only current difference where json_decode throws when simdjson_decode allows a value is that json_decode properly rejects invalid low surrogates after high surrogates in surrogate pairs

Created at 6 hours ago

Add script to compare results against json_decode()

Closes #68

The only current difference where json_decode throws when simdjson_decode allows a value is that json_decode properly rejects invalid low surrogates after high surrogates in surrogate pairs

Created at 6 hours ago
create branch
TysonAndre create branch test_parsing
Created at 6 hours ago
issue comment
Set up test scripts to compare simdjson_decode and json_decode handling of json edge cases

The only difference I found from php json_decode() was https://github.com/simdjson/simdjson/issues/1894 for PHP json_decode rejecting invalid low surrogate pairs when simdjson_decode didn't

https://unicodebook.readthedocs.io/unicode_encodings.html#utf-16-surrogate-pairs

php > var_export(simdjson_decode('"\uD888\u1234"'));
'㘴'
php > var_export(json_encode(simdjson_decode('"\uD888\u1234"')));
'"\\u3634"'
Created at 6 hours ago
opened issue
simdjson does not check that the value after high surrogate is in the range of a low surrogate?

Describe the bug simdjon's dom parser (and other parsers) will accept the string "\uD888\u1234", despite "\u1234" not being a low surrogate pair in the range U+DC00—U+DFFF

Related to https://unicodebook.readthedocs.io/unicode_encodings.html#utf-16-surrogate-pairs

U+D800—U+DBFF (1,024 code points): high surrogates
U+DC00—U+DFFF (1,024 code points): low surrogates
// simdjson/src/generic/stage2/stringparsing.h
simdjson_inline bool handle_unicode_codepoint(const uint8_t **src_ptr,
                                            uint8_t **dst_ptr) {
  // jsoncharutils::hex_to_u32_nocheck fills high 16 bits of the return value with 1s if the
  // conversion isn't valid; we defer the check for this to inside the
  // multilingual plane check
  uint32_t code_point = jsoncharutils::hex_to_u32_nocheck(*src_ptr + 2);
  *src_ptr += 6;

  // If we found a high surrogate, we must
  // check for low surrogate for characters
  // outside the Basic
  // Multilingual Plane.
  if (code_point >= 0xd800 && code_point < 0xdc00) {
    if (((*src_ptr)[0] != '\\') || (*src_ptr)[1] != 'u') {
      return false;
    }
    uint32_t code_point_2 = jsoncharutils::hex_to_u32_nocheck(*src_ptr + 2);

    // if the first code point is invalid we will get here, as we will go past
    // the check for being outside the Basic Multilingual plane. If we don't
    // find a \u immediately afterwards we fail out anyhow, but if we do,
    // this check catches both the case of the first code point being invalid
    // or the second code point being invalid.
    if ((code_point | code_point_2) >> 16) {
      return false;
    }

    code_point =
        (((code_point - 0xd800) << 10) | (code_point_2 - 0xdc00)) + 0x10000;
    *src_ptr += 6;
  } else if (code_point >= 0xdc00 && code_point <= 0xdfff) {
      // If we encounter a low surrogate (not preceded by a high surrogate)
      // then we have an error.
      return false;
  }

Looking at the method:

  1. if ((code_point | code_point_2) >> 16) { seems to be in the wrong position. if (code_point >= 0xd800 && code_point < 0xdc00) {. But even if that's meant to be the code_point after the reassignment, 0x10FFFF >> 16 would still be true, so that also seems wrong.
  2. Maybe the check should be if ((code_point_2 - 0xdc00) >> 16) so it would fail both for invalid hex (all high 16 bits set) and for values smaller than 0xdc00?
  3. the check can still overflow - i.e. it isn't checking for code_point_2 <= 0xdfff - i.e. it should reject \xD888\xE0000?
  4. Maybe it's meant to check if (((code_point - 0xd800) | (code_point_2 - 0xdc00)) >> 10) { return false; } - though I don't know if that'd even be more efficient (e.g. check that they're both in the range of 1024 possible values, fail if at least one isn't)

E.g. if you execute that step by step

  1. code_point = 0xd888 = 55432
  2. code_point_2 = 0x1234 = 4660
  3. code_point is reassigned to (((code_point - 0xd800) << 10) | (code_point_2 - 0xdc00)) + 0x10000; and becomes 13876 (which is between 0...0x10FFFF) and the function succeeds with an invalid low surrogate

To Reproduce Call simdjson handle_unicode_codepoint with \uD888\u1234 or simdjson decode in a binding with "\uD888\u1234" (testing with dom bindings)

If we cannot reproduce the issue, then we cannot address it. Note that a stack trace from your own program is not enough. A sample of your source code is insufficient: please provide a complete test for us to reproduce the issue. Please reduce the issue: use as small and as simple an example of the bug as possible.

It should be possible to trigger the bug by using solely simdjson with our default build setup. If you can only observe the bug within some specific context, with some other software, please reduce the issue first.

simjson release

https://github.com/simdjson/simdjson/tree/v2.2.2/singleheader

Indicate whether you are willing or able to provide a bug fix as a pull request

If you plan to contribute to simdjson, please read our

  • CONTRIBUTING guide: https://github.com/simdjson/simdjson/blob/master/CONTRIBUTING.md and our
  • HACKING guide: https://github.com/simdjson/simdjson/blob/master/HACKING.md
Created at 6 hours ago
opened issue
Set up test scripts to compare simdjson_decode and json_decode handling of json edge cases

See https://github.com/TkTech/pysimdjson/tree/master/jsonexamples/test_parsing - y_* files are valid json, n_* files are invalid json, and i_*.json are implementation defined (I think)

Ideally, if simdjson_decode returns a valid result instead of throwing, then json_decode should return the exact same result (e.g. for parsing floats)

  • simdjson_decode will throw in some cases where json_decode won't. See the README.
Created at 18 hours ago
issue comment
Ignore pathologically large depths longer than strlen

I expect 2.0.4 to be released tomorrow at the earliest, if it's approved by then

Created at 18 hours ago

Release simdjson 2.0.4

Created at 18 hours ago
pull request opened
Ignore pathologically large depths longer than strlen

Closes #66

For example, for simdjson_decode('{}', true, 1000000000), we only need a depth that's at least 2 to get identical results.

PHP programmers may assume that a large depth is a safe way to avoid depth errors because it was safe for json_decode

To avoid allocating too much virtual memory and wasting page table entries or potentially fatal out of memory errors, reduce depth to a safe value behaving the same way if the requested depth is larger than the allocated depth, larger than the string length, and exceeds 100000. Do this in a way that reduces reallocations.

Created at 18 hours ago
create branch
TysonAndre create branch depth-ignore
Created at 18 hours ago
issue comment
apc_fetch returns copy of cached values, could not another behaviour be added ?

What would be relative simple to implement is a mechanism that stores something into apcu forever and which can then be used without copying. But I don't know how to make this work with storage that is supposed to be freed again, while also being robust to abnormal conditions (such as crashed worker processes).

Yeah, if that were to be done, having a separate memory area with separate memory limits for data with no expiry that couldn't be deleted would seem more appropriate, but would still have the issues you mentioned. Though at that point, it might make sense to instead

What kind of abnormal conditions/crashes have been seen with opcache? Worker processes that fail to exit while holding a lock?

Possible workarounds:

  1. Integrate into opcache directly as opcache_store_immutable_value(string $key, array|string|float|int|bool|null $value): int and opcache_get_immutable_value(string $key, bool &$success = false): array|string|float|int|bool|null $value - This seems like it'd have somewhat lower memory usage for arrays without serialized objects because it could reuse interned strings more effectively

    Though I don't think a majority of people would have this use case, it'd compete with memory used for compiled files, and the hackish workaround of file_put_contents(fileNameForKey, var_export(..., true)) exists and works with opcache's file cache as well (as mentioned in https://web.archive.org/web/20170608185145/https://blog.graphiq.com/500x-faster-caching-than-redis-memcache-apc-in-php-hhvm-dcd26e8447ad?gi=1c30f8eb7302 )

  2. Create a new PECL extension entirely

https://www.npopov.com/2021/10/13/How-opcache-works.html - your post cleared up what opcache is doing - I'd assume apcu is doing something similar.


I'm not actively working on alternatives right now, I'd just decided to research how apcu_fetch worked for string data, and whether it'd be constant time or take additional time/memory.

It turns out it'd take additional time/memory

Created at 1 day ago
opened issue
Ignore pathologically large depths such as `simdjson_decode('{}', true, 1000000000)`

When the depth is both larger than the string length (guaranteeing depth limit won't be hit) and larger than a cutoff such as 100000, ignore it and use a smaller byte length value such as

  • the existing parser depth, if that is >= the string length
  • 100000, if that's sufficient
  • strlen * 2, if that's sufficient (adding a buffer to make reallocations less frequent if the application calls this again with the same depth but varying string sizes

The fact that memory is allocated for depth is different from json_decode, so users might not expect the high memory usage when attempting to avoid depth errors

Created at 1 day ago
issue comment
Add `simdjson_free_memory(): int`
simdjson_warn_unused
inline error_code dom_parser_implementation::allocate(size_t capacity, size_t max_depth) noexcept {
  if (this->max_depth() != max_depth) {
    error_code err = set_max_depth(max_depth);
    if (err) { return err; }
  }
  if (_capacity != capacity) {
    error_code err = set_capacity(capacity);
    if (err) { return err; }
  }
  return SUCCESS;
}

Most of the memory can be freed by calling simdjson_decode('""') as a side effect of how the C library is implemented - it will always change the depth/capacity

Created at 1 day ago
opened issue
Patch C simdjson library to use emalloc_safe/efree for depth/byte buffers
  • This would allow performance monitoring tools to properly indicate how much memory simdjson uses during a request
  • This would respect memory_limit settings in a conventional way and properly cause a php fatal error for invalid requests, e.g. an application receiving too much json

https://github.com/simdjson/simdjson/issues/1017 does not allow customizing the allocator yet

Created at 1 day ago
opened issue
Add `simdjson_free_memory(): int`

This may be useful if applications parse megabytes of json on startup, but don't use simdjson for large requests later on.

This would free the buffers used by the underlying C simdjson library

Created at 1 day ago
issue comment
Add SO_REUSEPORT support to socket

change looks great - i need to test this before merging.

This seems to work locally - testing since I haven't used SO_REUSEPORT elsewhere. I can start two nutcracker instances for memcached with reuseport: true and

  • stats ports are shared and go to one of the two instances
  • New connections are shared and go to one of the two instances
  • Connections stay connected to the same instance
Created at 1 day ago
pull request opened
Exclude other files from .dockerignore

And use a newer php 8.2 release candidate to test

Created at 1 day ago
create branch
TysonAndre create branch dockerignore-update
Created at 1 day ago

Document differences from json_decode()

And add tests of the current behavior of decoding json number syntax

Document exception handling for simdjson_decode

Add documentation of provided global functions

Created at 1 day ago

Don't export symbols by default on Unix/Linux

By default in gcc, all compiled symbols are exported. This compilation flag changes that behavior, and only exports symbols that are explicitly exported, such as macros. (Like Windows already does)

With debugging symbols(-g): simdjson.so 5213920->5189440 bytes Without debugging symbols(without -g): simdjson.so 242776->224488

Update the benchmarks

Re-run benchmarks in 8.2.0-dev with a fresh installation.

Use stringifying a php file as a benchmark. (e.g. https://microsoft.github.io/language-server-protocol/specification uses json rpc and sends PHP file contents as part of the JSON)

Merge pull request #58 from TysonAndre/reduce-size

Don't export symbols by default on Linux/Unix

Merge pull request #59 from TysonAndre/update-benchmarks

Update the benchmarks

Document differences from json_decode()

And add tests of the current behavior of decoding json number syntax

Document exception handling for simdjson_decode

Add documentation of provided global functions

Set both precision settings in decode_integer_test

Ensure consistent test output across installation ini configurations and php versions.

Created at 1 day ago
pull request opened
Fix nit on if/else branch doing the same thing in phpinfo
Created at 1 day ago
create branch
TysonAndre create branch table-nit
Created at 1 day ago
opened issue
Make error handling stricter
  • Throw RuntimeException in simdjson_key_exists/simdjson_key_count for invalid json
  • Throw RuntimeException instead of returning null and emitting a notice for invalid int $depth values

Mark this as a major release for anything relying on the old behavior

Created at 2 days ago
issue comment
Phan has conflicting view on DateTime::getTimestamp return type
  1. Part of that is that prior to php 8.0, if getTimestamp was called with an invalid parameter list, it'd return false
  2. I thought phan was removing phpdoc types that weren't compatible with the real types already when loading method definitions. I'll look into that later. PhanImpossibleConditionInGlobalScope uses the real type set, PhanPossiblyFalseTypeArgument uses the phpdoc type set
  3. FunctionSignatureMap should probably have deltas in src/Phan/Language/Internal/FunctionSignatureMap_php80_delta.php but doesn't
php > $x = new DateTime();
php > var_export($x->getTimestamp(123));
Warning: DateTime::getTimestamp() expects exactly 0 parameters, 1 given in php shell code on line 1
false
Created at 2 days ago

Add documentation of provided global functions

Created at 2 days ago