Efficient, Immutable, Thread-Safe Collection classes for Ruby
@pakutoma Do you already have a test case like this for JIS and ISO-2022-JP?
"\x1B(J\x0E\x0F\x1B(B"
i.e. trying to use Shift In/Shift Out when not in ASCII mode.
@pakutoma Thanks for the good research.
Suggestions:
I'm seeing if I can find any other interesting cases using libfuzzer.
@pakutoma Thanks for the analysis. Even though it seems the behavior of your new code is correct (which is great!), I would like to suggest we add this string as a test case, since it expresses some interesting aspects of the code's behavior.
I just fuzzed a bit more and discovered some more interesting test cases.
https://3v4l.org/8jXpD3
var_dump(mb_check_encoding("\x1b(I\x1b(H", 'UTF-7'));
// this was false in PHP 8.0, but true in PHP 8.1+
// I don't know what the reason for the change was
https://3v4l.org/j8lhB
var_dump(mb_check_encoding("\x1b(I\x1b(H", 'ISO-2022-JP'));
// This was also false in PHP 8.0, but true in PHP 8.1+
@pakutoma Thank you very much for your nice contributions!
I am still fuzzing your previous PR to check for any problems. When that is done and the PR is merged, I will move on to this one.
(It's not like I am running the fuzzer continuously day and night; it's more like I run the fuzzer for 30 seconds, it finds a crash, I check the reason and find that my assertions need to be adjusted a bit, run it for another 30 seconds, it finds another crash... this process is time-consuming and I have limited time available, but it is moving ahead bit by bit.
From my experience, I will be extremely surprised if the fuzzer does not find anything which needs to be adjusted in the previous PR. For my own PRs, when it is possible to apply fuzzing effectively, it almost always finds things which I would never have thought of.)
@pakutoma I was just doing some more fuzzing of this PR, and discovered an interesting test case: https://3v4l.org/EU6fQ
var_dump(mb_check_encoding("+++++++++\x22", 'UTF-7'));
// true in PHP 8.1 and 8.2, false in PHP 8.0
I haven't analyzed yet to see why this was treated as invalid by mb_check_encoding
in PHP 8.0. I know why it is treated as valid in PHP 8.1 and 8.2... because my implementation of mb_check_encoding
used the text encoding conversion filters, and the conversion filters do not emit an error marker if a Base64 section is terminated by a plain ASCII character.
Do you have any thoughts? It sure looks like bad UTF-7 to me.
It looks like this is the way to proceed, what do I need to do? I am thinking that I may need to create a similar PR for the PHP-8.1 branch and move the additions to NEWS and UPGRADING there.
Yes, a similar PR needs to be created for PHP-8.1.
We need NEWS and UPGRADING entries both in this PR and in the PHP-8.1 PR. The NEWS in this PR should highlight what is relevant for people upgrading to a new version of PHP 8.2; the NEWS in the other PR should highlight what is relevant for people upgrading to a new version of PHP 8.1. (I guess the information in NEWS and UPGRADING might be similar for both PRs.)
This should be merged soon.
I am not sure either, but I don't think RFC 2152, which is the basis for the UTF-7 implementation, has anything to say about terminating the Modified Base64 sequence. RFC 2152 describes the termination of a Modified Base64 sequence as follows...
OK, very well. It looks like your current implementation is in harmony with the RFC then.
This is in contrast to RFC 2060, which is the basis for the UTF7-IMAP implementation, which states.
All names start in US-ASCII, and MUST end in US-ASCII (that is, a name that ends with a Unicode 16-bit octet MUST end with a "-").
Right. It also looks like your current implementation is in harmony with the RFC for UTF7-IMAP.
@pakutoma I am just fuzzing this PR.
One thing that has come up... the new check function for UTF-7 does not reject strings with an unterminated Base64 section (i.e. we have +
to start a Base64-encoded section, but there is no -
to go back to ASCII mode).
It's been a long time since I read the RFC defining UTF-7, but I doubt whether this is actually valid for UTF-7 or not...
Slight performance lost for mb_detect_encoding
on good ISO-2022-JP strings:
old|mb_detect_encoding|JIS, long - strict|1.9221076965332|5000
new|mb_detect_encoding|JIS, long - strict|1.9659136931101|5000
Script: https://gist.github.com/alexdowad/7d23f9bd183a575a9b424b69f59c8073
The performance loss is understandable, since we weren't using the 'check' function before; now we call the 'check' function, and if it returns true, we still have to decode the string to estimate likelihood for each candidate encoding.
I think it's small enough not to worry about.
Little script I made to benchmark mb_check_encoding
: https://gist.github.com/alexdowad/b7952a8d733ba9255daa527d15004928
Results:
new|mb_check_encoding|JIS, short (0-5 chars)|0.00029759407043457|10000
new|mb_check_encoding|JIS, medium (~100 chars)|0.0019969940185547|10000
new|mb_check_encoding|JIS, long (~10000 chars)|0.1247300306956|10000
old|mb_check_encoding|JIS, short (0-5 chars)|0.00040802955627441|10000
old|mb_check_encoding|JIS, medium (~100 chars)|0.0023880004882812|10000
old|mb_check_encoding|JIS, long (~10000 chars)|0.21371102333069|10000
This is a very nice speedup for mb_check_encoding
.
@Girgias I think the issues which @pakutoma has raised require a rethink. I am not going to be merging this PR any time soon.
@pakutoma Thanks very much for your comments.
Based on what you have said, I don't think we can "fix" the existing behavior of mb_strcut
, as much as it doesn't make sense.
This is because ISO-2022-JP does not support JIS X 0212. https://www.rfc-editor.org/rfc/rfc1468
The variant of ISO-2022-JP that supports JIS X 0212 is called ISO-2022-JP-1. https://www.rfc-editor.org/rfc/rfc2237
OK, fair enough. :+1:
To fix #10853, a new flag should be added if the UTF-8 validity was checked or not, otherwise invalid UTF-8 string will need to be check on each call.
:+1:
It looks like availability of bits in the object header should not be a problem... type_info
, where the GC flags are kept, is 32 bits wide, and it looks like there are currently only 9 flags for zend_string
objects.
This looks great to me. Thanks.
The table-driven state machine for UTF-8 validity checking is very interesting. I'm interested to know how it compares when benchmarked against the 'fallback' UTF-8 validity checking implementation in mbstring (which was borrowed from PCRE).
Just checked the new tests cases for ISO-2022-JP and 'JIS'.
One question only: Why does mb_check_encoding
return true for the "JIS X 0212 character" test case with JIS, but false with ISO-2022-JP?
I have just checked the new test cases for UTF-7... all of the EXPECT section looks good to me.
Just tried rebasing on PHP-8.1. There are a lot of nuisance merge conflicts, especially in the definitions of the mbfl_encoding
structs, because they already have one added member in 8.2 and now you have added another member again.
Personally since this is so close to being ready, I am wondering if it may be OK to go ahead and merge into 8.2/master, then go back and fix up 8.1 in a separate PR.
I do think we can improve the NEWS and UPGRADING messages a bit. But at this stage, rebasing on PHP-8.1 may be more trouble than it's worth.
@Girgias Any concerns about this plan?
@pakutoma I just fetched the latest version of this PR and read through the diff. I think you have addressed just about all the feedback from myself and @Girgias.
If I am the one to merge this PR, I may just wrap your commit log message to ~80 columns.
Thanks for the comment.
I'm confused by the NEWS entry, it mentions fixing behavioural differences between PHP 8.0 and 8.1 while only targeting PHP 8.2.
This change will be put into PHP 8.2 and then backported to PHP 8.1. So perhaps NEWS and UPGRADING should be written when backporting to PHP8.1.
@alexdowad What do you think about this?
Hmm. Normally if changes are intended to target PHP 8.1, they should be merged into 8.1 first, then merged down into 8.2 and master.
I am trying to remember... was it me who advised you to target PHP 8.2 first and then fix 8.1 later? If so, I must have been thinking that we would make changes to the conversion filters. Actually... I think I remember now. Originally I thought we would use the unsigned int *state
parameter to detect ISO-2022-JP strings which end in the wrong state.
That part of the code is completely different between PHP 8.1 and 8.2, so if you had done 8.1 first, it would not have been possible to merge it down into 8.2 in any straightforward way. That was why I suggested fixing 8.2/master first.
But now that we have decided to add this new check
function pointer instead, I think this code will apply to both 8.1 and 8.2 without any problems. So it is better you target 8.1. I think there should only be very few merge conflicts when merging down to 8.2 and master.
The documentation for mb_strcut reads:
mb_strcut() extracts a substring from a string similarly to mb_substr(), but operates on bytes instead of characters. If the cut position happens to be between two bytes of a multi-byte character, the cut is performed starting from the first byte of that character. This is also the difference to the substr() function, which would simply cut the string between the bytes and thus result in a malformed byte sequence.
(Ref: https://www.php.net/manual/en/function.mb-strcut.php)
However, from the very beginning, this function has never behaved exactly as described in the documentation. To be more precise, for some of mbstring's supported text encodings, it would behave as described, but for many others, it would not.
For the starting cut, the function has always behaved as described. The problem was with the ending cut. Instead of cutting out bytes up to the ending cut point if it was a character boundary, or the immediately preceding character boundary otherwise, the function would do this:
It would start cutting out bytes from the starting cut point (or immediately preceding character boundary), running these bytes through the mbstring decoding/encoding filters for the text encoding in question, and collect the output of the encoding filter, until the number of bytes of output equalled the requested cut length.
That was fine for text encodings which only have one way to represent any sequence of characters, and where no extra bytes need to be used for escape sequences.
For encodings with more than one way to represent the same characters, it didn't work. Take JIS7/8 for example. In this text encoding, a single byte from 0xA1-DF encodes a katakana character. But the same katakana can also be encoded using two bytes. For example, take the following (contrived) JIS string (in hexadecimal notation, with UTF-8 below):
61 62 63 A7 A8 A9 AA AB 61 62 63
a b c ァ ィ ゥ ェ ォ a b c
Say we want to extract the 3 kana "ァィゥ". The kana start at an offset of 3 bytes from the beginning of the string, so this should work:
mb_strcut("abc\xA7\xA8\xA9\xAA\xABabc", 3, 3, 'JIS');
However, in PHP 8.2, this returns an empty string! After this patch, however, it returns:
1b2442 0027 0028 0029 1b2842
(Escape sequence for JIS X 0208) ァ ィ ゥ (Escape sequence for ASCII)
One might think that mb_strcut should cut out raw bytes from the input string, and simply adjust the cut points to character boundaries, rather than adding escape sequences or converting JISX-0201-encoded kana to JISX-0208-encoded kana. However, from the beginning, mb_strcut has always decoded and re-encoded its input, so this is the backwards- compatible behavior. This patch maintains that behavior, but ensures that the range of characters extracted actually correspnds to the requested byte range.
For encodings which use escape sequences, mb_strcut could also fail in a similar way. When cutting bytes out of the middle of such a string, it is often necessary to prefix them with an escape sequence so their meaning is not changed. But this leading escape sequence would take up some of the limited number of bytes which the function wanted to output, meaning that the portion actually cut out was systemically too short. This affected text encodings like UTF-7, HZ, and all the ISO-2022 text encodings.
For reviewers, please note that I did not blindly adjust the expected results of test cases to match the output of my code. I looked at each failing test case in hex notation and considered whether the new result makes sense or not.
@NathanFreeman might be interested in this, since he provided the XFAIL test cases which are now passing.
@cmb69 @Girgias @kamil-tekiela @youkidearitai @pakutoma