Steven Levithan

1.2K posts

@slevithan

Creator → Regex+, Oniguruma-To-ES, https://t.co/HwKEPCL2wV, https://t.co/GBaROcUNIY
Coauthor → Regular Expressions Cookbook, High Performance JavaScript

Joined May 2008
194 Following, 818 Followers
Steven Levithan@slevithan·
@fabiospampinato I've just added regexp-simple-parser. GrepGrep looks awesome, but there are a ton of grep-like tools so I've kept the relevant section on that ultra slim.
Fabio Spampinato@fabiospampinato·
New package: regexp-simple-parser, a "simple" and modern RegExp parser. Unless there is some bug in it, it should support every feature and every quirk of regexes. It's ~3x smaller and ~7x slower than regjsparser. github.com/fabiospampinat…
Steven Levithan@slevithan·
I shouldn't be too opinionated about this stuff. :) I'm assuming you created an excellent design, and that you had at least slightly different principles and goals. And to be clear, it's obviously not that bad if all I'm advocating is to change the property name from kind/subtype to format (or similar) in such cases. The main reason I'm suggesting this is so it's clear you can change or ignore any format value when working with an AST, which you can't do for a single-purpose kind/subtype without changing the meaning or breaking things.
Fabio Spampinato@fabiospampinato·
> I'm definitely opposed to using "kind" (or subtype) for distinguishing different ways to format the same thing (ex: q, \x71, \161), which regjsparser does.
Yeeeeah I thought about that, for whatever reason I went with something closer to regjsparser... a bit inconsistent on my part. I think at some point I was like "I'm going to output the same thing so that I can use their test suite", but that never quite happened. In my defense it's a v0.1 🤣 some of these things might change.
I would kinda like to waste an afternoon trying to golf it as much as possible for no reason, having a single "value" type would be natural in that kind of context, probably other things like that would emerge if I did that.
Steven Levithan@slevithan·
Thanks for the details. :D JS regex AST design is a space I looked at when creating TS lib oniguruma-parser github.com/slevithan/onig… earlier this year. It was my first time creating a true parser, so the implementation might be nontraditional and no doubt has things to improve. But I like to think it includes some good stuff. It parses Oniguruma regexes though, so there are a fair number of syntax/behavior differences and different considerations compared to JS RegExp.
If I recall, I wasn't a big fan of regjsparser's AST design and names. That said, I'm sure I took some bits of inspiration from it, as well as from the other three AST structures in the AST explorer I linked to. All of them had some things I thought should be better and/or simpler, more consistent, etc. If I recall, eslint-community/regexpp's AST design was fairly nice, but then made some strange decisions for new features from flag v.
I tried to be quite thoughtful about the AST design I used in oniguruma-parser. I don't have a UI for testing the parser, but if it's of interest, you can open the dev console at slevithan.github.io/oniguruma-pars… and enter ex:
printAst(toOnigurumaAst(String.raw`[\w&&[^\d_]&&\p{a_hex}]?(?(?!)|(?i:a\b)||)`))
> I like that mine uses "type" and "subtype", while regjsparser's uses "type" and "kind", which I find is much more confusing hierarchy-wise.
Good call-out. I'll think about adopting the name "subtype" as well. I'm not sure its implications are perfect in all cases, though. I'm definitely opposed to using "kind" (or subtype) for distinguishing different ways to format the same thing (ex: q, \x71, \161), which regjsparser does.
> I don't have a dedicated "value identifier" node type, for example for the name of named groups, it's just a string.
+1 to this.
> Both \uXXXX and \u{XXXX} emit the same node type for mine but not for regjsparser.
I think you made the right choice. Or rather, regjsparser shouldn't distinguish these via `kind`, since `kind` is also used for critical distinctions that are not about formatting. Distinguishing between \u0000, \u{0}, etc. is not a concern for most AST uses (though it's relevant for code generators, minifiers, etc.). See github.com/slevithan/onig… where I discuss distinguishing (and modifying) these kinds of things via a `format` prop (which would only be taken as a suggestion by the code generator, if applying the requested format would cause contextual problems).
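To make the `format` idea concrete, here is a minimal sketch. The node shape, the `format` values, and the `generate` function are hypothetical names chosen for illustration; they are not the actual AST or API of oniguruma-parser, regjsparser, or regexp-simple-parser.

```javascript
// Hypothetical node shape: three spellings of "q" share the same type and value.
const nodes = [
  { type: "Character", value: 0x71, format: "literal" }, // q
  { type: "Character", value: 0x71, format: "hex" },     // \x71
  { type: "Character", value: 0x71, format: "octal" },   // \161
];

// Hypothetical generator: `format` is only a spelling suggestion.
// (A real generator would also escape syntax characters for the literal case.)
function generate(node) {
  switch (node.format) {
    case "hex":
      return "\\x" + node.value.toString(16).padStart(2, "0");
    case "octal":
      return "\\" + node.value.toString(8);
    default:
      return String.fromCodePoint(node.value);
  }
}

const sources = nodes.map(generate);

// Dropping `format` changes the spelling, not what the node matches.
const withoutFormat = nodes.map(({ format, ...rest }) => generate(rest));
```

Because `format` carries no meaning, a consumer of the tree can overwrite or ignore it freely, which would not be safe for a `kind`/`subtype` value that also encodes semantic distinctions.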
Fabio Spampinato@fabiospampinato·
Uhmmm I'm not sure 🤔 I've only compared against regjsparser. I'd say my AST is light on the details, you can't reconstruct the input regex exactly from it (i.e. same source string). The types externally should be ~correct (internally there are some issues I'm not sure it's worth fully addressing, it's a bit complicated), while in regjsparser for example I think \q{} is a "classStrings" or something like that, but there's no mention of that in the .d.ts file 🤷‍♂️ Also instead of a dedicated node for that I'm just outputting a "disjunction" node. In some sense the types externally are also ~incorrect for my thing compared to regjsparser because they filter down the types of Nodes that could be found traversing the AST depending on which features are enabled, and I don't do that. Mine is quirky when parsing stuff like /|||/, which is considered an empty capturing block even though it isn't, but they would match the same things. I like that mine uses "type" and "subtype", while regjsparser's uses "type" and "kind", which I find is much more confusing hierarchy-wise. I don't have a dedicated "value identifier" node type, for example for the name of named groups, it's just a string. Both \uXXXX and \u{XXXX} emit the same node type for mine but not for regjsparser. There are probably other differences, nothing too interesting maybe.
Steven Levithan@slevithan·
@fabiospampinato I mentioned earlier that browsers have changed their handling of octals over the years. You shouldn't take my word on it, except that that's definitely how browsers behaved back in 2010 when I wrote the post I linked to. :)
Fabio Spampinato@fabiospampinato·
@slevithan Oh, I have implemented this incorrectly then 🤔 I'm quite sure I looked into it though, I need to double-check.
Fabio Spampinato@fabiospampinato·
TIL there's a very bizarre "control letter escape sequence" supported by JS regexes. If you write /\ca/ it matches "\u0001", if you write /\cA/ it also matches "\u0001", then /\cb/ matches "\u0002" and so on. Presumably this made higher than 0% sense at some point in history?
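A small sketch of the rule behind `\cX` in JS regexes: the matched character is the letter's code point modulo 32, which is the same mapping as Ctrl+letter on a terminal. The helper name `controlChar` is made up for this example.

```javascript
// \cX matches the control character whose code is X's char code mod 32,
// so upper and lower case letters map to the same control character.
const controlChar = (letter) => String.fromCharCode(letter.charCodeAt(0) % 32);

const lower = /\ca/.test("\u0001"); // "a" is 97, and 97 % 32 is 1
const upper = /\cA/.test("\u0001"); // "A" is 65, and 65 % 32 is 1
const tab = /\cI/.test("\t");       // "I" is 73, and 73 % 32 is 9 (tab)
const second = /\cb/.test(controlChar("b"));
```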
Steven Levithan@slevithan·
Don't get me started. But yeah, so long as you're treating the \1 in that position as a backreference (and not as an octal or identity escape), the fact that backreferences to nonparticipating (and not-yet-participating) groups match the empty string (instead of the standard regex behavior of failing to match and therefore triggering backtracking) is canonical JS regex handling from the beginning. JS is alone in this terrible behavior, apart from flavors that have explicitly taken prior art from JS (e.g. .NET's non-default ECMAScript mode).
IMO this behavior is the second biggest mistake in JS's regex design, after `/g`. It's unintuitive, nonportable, and breaks all kinds of useful things you could otherwise do with backreferences when pushing regex boundaries. In fact, ~15 years ago I convinced Brendan Eich of this - he was open to changing the spec and invited me to help work on bolder RegExp changes for ES6. But I couldn't at the time, and I think the opportunity for changing this has probably passed.
> it's called a backreference
I'd refer to it as a forward reference in that position. Forward references are not the only way that a backreference might refer to a nonparticipating group. It's the same thing conceptually as `(a)|\1` or `(.)??\1` (but NOT the same as `(.??)\1`).
Forward references can sometimes be useful, if they're within a quantified group like `(\2x|(.))+`. But they're never useful in JS because they will always match the empty string. That's because of another JS-specific regex design mistake: backrefs in quantified groups like this are reset on each repetition. So that last regex is equivalent to `(x|(.))+`. 😢
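The behaviors described above can be observed directly in a JS engine. This is only an illustrative sketch; the variable names are arbitrary, and the inputs were picked to keep the results easy to follow.

```javascript
// Forward reference: group 1 has not participated yet when \1 is tried,
// so in JS \1 matches the empty string and the overall match is "a".
const forward = /\1(a)/.exec("a");

// Same idea for a group in an alternative that was not taken.
const skipped = /(a)|\1b/.exec("b");

// Captures inside a quantified group are reset at the start of each repetition:
// the second repetition matches "b", so group 1 ends up undefined.
const reset = /(?:(a)|b)+/.exec("ab");

// Because of that reset, \2 below always sees a nonparticipating group,
// so both patterns produce the same match and the same captures in JS.
const withRef = /^(\2x|(.))+$/.exec("ax");
const withoutRef = /^(x|(.))+$/.exec("ax");
```

In flavors where a backreference to a nonparticipating group fails instead, the first two patterns would not match these inputs.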
Fabio Spampinato@fabiospampinato·
> - /\1(a)/
This is insane, apparently that matches 'a' in V8 🥲 I'm going to say, even though I'm a Back to the Future fan, this doesn't make sense to me: it's called a backreference, and if it's matching a capturing group from the future, and actually not even matching that and instead matching nothing, that doesn't make any sense to me.
Steven Levithan@slevithan·
Agree that we could do without it. \x00 is good enough. IMO JS made a mistake by keeping \0 while deprecating octals, especially since doing so came with an extra rule about what characters are allowed after \0. That made interpolation and escaping regex special characters more complicated. PS: Some regex flavors additionally allow bare \x (not followed by a hex digit or `{`) as an alternative to \x00. While other flavors treat bare \x as either an error or an identity escape. Identity escapes are another big regex syntax mistake. Glad that JS made their rules stricter with \u and \v.
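A sketch of why the rule about what may follow `\0` complicates interpolation. The `compiles` helper is a made-up name for this example, and the legacy (non-`u`) results assume an engine that implements the Annex B regex extensions, as browsers and Node.js do.

```javascript
// Helper (hypothetical name): does this pattern compile with these flags?
function compiles(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch {
    return false;
  }
}

const nul = /\0/.test("\u0000");   // \0 matches NUL
const hex = /\x00/.test("\u0000"); // so does \x00

// In unicode mode, \0 must not be followed by a decimal digit.
const strictOk = compiles("\\01", "u");

// In legacy mode, \01 silently becomes an octal escape for U+0001,
// so naively gluing "\\0" and "1" together changes the meaning.
const glued = new RegExp("\\0" + "1");
const matchesNulThenOne = glued.test("\u0000" + "1");
const matchesU0001 = glued.test("\u0001");
```

Writing `\x00` instead avoids the problem, since a following digit cannot extend a two-digit hex escape.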
Fabio Spampinato@fabiospampinato·
> And all this mess for a regex feature (octals) that no one ever uses, except sometimes \0.
To be honest I didn't even know the null escape was a thing before implementing this parser 🤣 On some level it's cool that the 0 digit escape has been given a meaning too, for completeness, and it can't conflict with referencing the 0-th capturing group which doesn't make sense, but at the same time it's an extra feature, I guess rarely used, that we could just do without.
Steven Levithan@slevithan·
Not if the leading digit is a zero. If it is, then up to three digits after the leading zero can be considered to be part of the octal. But this exception only applied outside of character classes. At least, that was the behavior in browsers at the time. (I wrote lots of tooling around this back then.)
Fabio Spampinato@fabiospampinato·
@slevithan
> - /(a)\0001[\0001]/: The \0001 outside the character class is an octal;
Is it? I thought it was limited to 3 digits 🤔🤔🤔
Steven Levithan@slevithan·
Behavior details for the overlap between octals, backreferences, and identity escapes (\8\9) vary significantly across regex flavors and can be extremely complex and unintuitive. Browsers also used to have inconsistent handling for octals, and changed over the years to follow Annex B. This old post of mine blog.stevenlevithan.com/archives/fixin… from 2010 includes a section that documented the behavior back then:
- /a\1/: \1 is an octal.
- /(a)\1/: \1 is a backreference.
- /(a)[\1]/: \1 is an octal.
- /(a)\1\2/: \1 is a backreference; \2 is an octal.
- /(a)\01\001[\01\001]/: All occurrences of \01 and \001 are octals. However, according to the ES3+ specs, 0-9 following \0 should cause a SyntaxError.
- /(a)\0001[\0001]/: The \0001 outside the character class is an octal; but inside, the octal ends at the third zero (i.e., the character class matches character index zero or "1"). This regex is therefore equivalent to /(a)\x01[\x00\x31]/; although, as mentioned above, adherence to ES3 would change the meaning.
- /(a)\00001[\00001]/: Outside the character class, the octal ends at the fourth zero and is followed by a literal "1". Inside, the octal ends at the third zero and is followed by a literal "01".
- /\1(a)/: Given that, in JavaScript, backreferences to capturing groups that have not (yet) participated match the empty string, does this regex match "a" (i.e., \1 is treated as a backreference since a corresponding capturing group appears in the regex) or does it match "\x01a" (i.e., the \1 is treated as an octal since it appears before its corresponding group)? Unsurprisingly, browsers disagree.
- /(\2(a)){2}/: Now things get really hairy. Does this regex match "aa", "aaa", "\x02aaa", "2aaa", "\x02a\x02a", or "2a2a"? All of these options seem plausible, and browsers disagree on the correct choice.
Additional aspects to worry about in other regex flavors include whether octal escapes go up to \377 (\xFF) or \777 (\u01FF), whether octal values >= \200 refer to encoded bytes or code points, whether single-digit \1 to \9 should always be treated as backreferences if not followed by digits, etc. And all this mess for a regex feature (octals) that no one ever uses, except sometimes \0.
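The first four cases in the list can be checked in a current engine. This sketch assumes an engine that follows the Annex B regex extensions (browsers and Node.js); it deliberately leaves out the multi-zero cases, where the 2010 browser behavior described above may no longer apply.

```javascript
// No capturing group exists, so \1 is a legacy octal escape for U+0001.
const octalNoGroup = /a\1/.test("a\x01");

// Group 1 exists, so \1 is a backreference.
const backref = /(a)\1/.test("aa");

// Inside a character class there are no backreferences, so \1 is an octal.
const inClass = /(a)[\1]/.test("a\x01");

// Only one group exists: \1 is a backreference, \2 falls back to an octal.
const mixed = /(a)\1\2/.test("aa\x02");

// With flag u, the legacy fallback is gone and the pattern is rejected.
let unicodeModeError = null;
try {
  new RegExp("a\\1", "u");
} catch (error) {
  unicodeModeError = error;
}
```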
Fabio Spampinato@fabiospampinato·
Somewhat interestingly /(a)\10/ doesn't match "aa0", because the "\10" bit doesn't mean "backreference to the first capturing group followed by a 0", it means "backreference to the 10th capturing group". According to the "regjsparser" package, what seems to be the most popular package for parsing regexes, that's actually an octal escape, which looks like it got deleted under strict mode or something, as it conflicted with backreferences I guess 🤔
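A quick sketch of how `\10` behaves without flag `u`, assuming an engine that follows Annex B: with fewer than ten groups it falls back to an octal escape (octal 10 is U+0008), and with ten groups it is a backreference. Variable names are arbitrary.

```javascript
// Not "backreference to group 1, then a literal 0".
const backrefThenZero = /(a)\10/.test("aa0");

// With only one group, \10 is read as octal 10, i.e. U+0008 (backspace).
const octalBackspace = /(a)\10/.test("a\x08");

// One way to write "group 1, then a literal 0" unambiguously.
const explicit = /(a)\1(?:0)/.test("aa0");

// With ten groups, \10 is a backreference to the tenth one.
const tenthGroup = /(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)\10/.test("abcdefghijj");
```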
Steven Levithan@slevithan·
@fabiospampinato If we were cleaning up useless/unused regex features in JS, in addition to removing `\cX`, I'd also remove the special meaning of `\b` within character classes to match a backspace control character, and remove all legacy support for octals.
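For reference, a short sketch of the two meanings of `\b` mentioned here: a word boundary outside a character class, and a backspace character (U+0008) inside one.

```javascript
// Outside a class, \b is a zero-width word boundary assertion.
const wordBoundary = /a\b/.test("a");

// Inside a class, \b matches the backspace control character.
const backspaceInClass = /[\b]/.test("\x08");

// So [\b] consumes a character and is not an assertion.
const notAnAssertion = /a[\b]/.test("a");
```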
Steven Levithan@slevithan·
@fabiospampinato Probably you already know, but note that SpiderMonkey uses Irregexp from V8 as its regex implementation, so generally they should act the same.
Fabio Spampinato@fabiospampinato·
Apparently regexes bigger than (2^15)-1 characters are considered "too large" in both V8 and SpiderMonkey 🥲
```js
new RegExp('a'.repeat(2**15)).test(''); // throws
```
Steven Levithan@slevithan·
Yeah, and it's supported in nearly every modern regex flavor. Only exceptions I can think of are RE2 and Python. Unlike JS, most flavors support more characters than just A-Za-z after `\c` (but the meaning isn't always consistent). And some like Oniguruma support uppercase-C `\C-x` and meta `\M-\C-x`. It gets weird, man.
Richard Gibson@gibson042·
@fabiospampinato I think it's from POSIX: pubs.opengroup.org/onlinepubs/979…#tag_19_02_04 (although finding something even older wouldn't surprise me)
Steven Levithan@slevithan·
@fabiospampinato
> I guess it's not super useful.
I've definitely used `??` before in real regexes. But yeah, very rarely :)
Fabio Spampinato@fabiospampinato·
Apparently the optional quantifier can also be lazy: /a?/ optionally matches "a" greedily, and /a??/ optionally matches "a" lazily. This one makes a lot of sense, I'm not sure why I had never thought about that, I guess it's not super useful.
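A small sketch of the difference: the greedy form tries to take the "a" first, while the lazy form tries to skip it first and only takes it when the rest of the pattern requires it.

```javascript
// Greedy: takes the "a" when it can.
const greedy = "aa".match(/a?/)[0];

// Lazy: prefers the empty match.
const lazy = "aa".match(/a??/)[0];

// Lazy still takes the "a" when the rest of the pattern needs it.
const lazyForced = "ab".match(/^a??b/)[0];

// The choice shows up in which group gets the single "a".
const capGreedy = /^(a?)(a?)$/.exec("a").slice(1);
const capLazy = /^(a??)(a?)$/.exec("a").slice(1);
```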
Steven Levithan@slevithan·
I've needed this and variations fairly often when processing regexes (but not for perf).
> I *think* you can detect this without parsing (absent flag `v`)
Also not hard to deal with flag v. My ~600 byte github.com/slevithan/rege… helps with processing JS regex syntax when you don't need a parser/AST. Has utils for finding unescaped instances of a regex pattern, with optional limit to inside/outside v-mode char classes.
Fabio Spampinato@fabiospampinato·
A bit random, but I wouldn't mind a RegExp.prototype.hasCapturingGroups 🤔 Knowing that can sometimes unlock better performance by executing the regex with .test() instead of .exec(), but in general you need to throw a parser at this, and regexp-tree's one is like 100kb. Also, the browser has to parse it already, so why do it twice?
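Absent such a property, one parser-free workaround is to let the engine count the groups: append an empty alternative so the pattern is guaranteed to match the empty string, then read the length of the result array. This is a sketch, not a proposal from the thread; `countCapturingGroups` and `hasCapturingGroups` are hypothetical helper names, and it still compiles the pattern a second time.

```javascript
// Appending "|" adds an empty alternative, so exec("") always succeeds and the
// result array has one slot for the whole match plus one per capturing group.
function countCapturingGroups(regex) {
  const probe = new RegExp(regex.source + "|", regex.flags);
  return probe.exec("").length - 1;
}

// Hypothetical stand-in for the wished-for RegExp.prototype.hasCapturingGroups.
const hasCapturingGroups = (regex) => countCapturingGroups(regex) > 0;

const plain = hasCapturingGroups(/a(?:b)c/);    // non-capturing group only
const escaped = hasCapturingGroups(/\(a\)[(]/); // escaped and in-class parens
const capturing = hasCapturingGroups(/a(b)c/);
const named = hasCapturingGroups(/(?<year>\d{4})/);
const nested = countCapturingGroups(/(a)(b(c))/);
```

Since the count comes from the engine itself, escaped parentheses, character classes, and named groups need no special handling.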
Steven Levithan@slevithan·
Regex+ 6.1.0 released. Adds even cleaner and more robust interpolation of RegExp instances and partial patterns. And improved docs. 😁
Steven Levithan@slevithan·
Aside: There may be interesting perf opportunities by making optimizations in V8/Irregexp that benefit Shiki. CC @antfu
Although VS Code hasn’t adopted Oniguruma-To-ES (and I’m not sure they should), Shiki has. Shiki is massively popular, and since earlier this year has recommended its “JS regex engine” (which wraps Oniguruma-To-ES) when using it in browsers. And since Shiki’s syntax highlighting exclusively uses TextMate grammars, I assume it’s among the most widespread libraries with its level of regex intensity and reliance on native JS regex perf.
The thing that stands out most to me as an obvious and possibly-quick win is if V8 were willing to get ahead of ES standards by adding atomic groups ‘(?>…)’ to Irregexp. There’s a stage 1 proposal at github.com/tc39/proposal-… . Possessive quantifiers (from the same proposal) are great syntax sugar for atomic groups, but if they were left out initially, I think it would be non-controversial to add atomic groups since they share identical syntax and behavior across .NET, Perl, PCRE, Python, Ruby, Java, Oniguruma, Boost, etc.
If Irregexp supported them, Oniguruma-To-ES/Shiki would use them to quickly start providing a perf boost to many sites that use Shiki for syntax highlighting.
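For context, JS has no native `(?>…)` today, and a commonly used emulation is a capturing lookahead followed by a backreference to it. The sketch below only illustrates that general technique with a toy pattern; it is not taken from Oniguruma-To-ES or Shiki, and it assumes the lookahead's group is group 1.

```javascript
// Ordinary group: a+ can give back one "a" so the final "a" can match.
const backtracking = /^(?:a+)a$/.test("aaa");

// Emulated atomic group (?>a+): the lookahead captures the longest run of "a",
// \1 then consumes exactly that text, and the engine cannot backtrack into the
// lookahead to try a shorter run, so the final "a" has nothing left to match.
const atomicLike = /^(?=(a+))\1a$/.test("aaa");

// When no backtracking is needed, both forms agree.
const sameWhenNoBacktrack = /^(?=(a+))\1b$/.test("aaab") === /^(?:a+)b$/.test("aaab");
```

The emulation costs an extra capturing group, which is part of why native atomic groups would be attractive for heavily regex-driven libraries.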
Erik@erikcorry·
I wonder what the typical regexp is that vscode is running when syntax highlighting. @slevithan says V8 is not much faster than the other regexp engine. A little disappointing...