The fast, flexible, and elegant library for parsing and manipulating HTML and XML.
HTML parsing/serialization toolset for Node.js. WHATWG HTML Living Standard (aka HTML5)-compliant.
chore(deps-dev): Bump eslint-plugin-jsdoc from 44.2.7 to 46.1.0 (#1390)
Bumps eslint-plugin-jsdoc from 44.2.7 to 46.1.0.
Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.
You can trigger Dependabot actions by commenting on this PR:
@dependabot rebase will rebase this PR
@dependabot recreate will recreate this PR, overwriting any edits that have been made to it
@dependabot merge will merge this PR after your CI passes on it
@dependabot squash and merge will squash and merge this PR after your CI passes on it
@dependabot cancel merge will cancel a previously requested merge and block automerging
@dependabot reopen will reopen this PR if it is closed
@dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
@dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
@dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
@dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
build(deps-dev): bump eslint-plugin-jsdoc from 44.2.5 to 46.0.0 (#3219)
Bumps eslint-plugin-jsdoc from 44.2.5 to 46.0.0.
docs(readme): Update Sponsors (#3180)
Automated changes by create-pull-request GitHub action
My point is more that some users will not use that.
Agreed. Let's document this properly, so users can make the right choice for themselves.
With <meta charset=utf-8>, that they must always add.
Does the spec mandate a charset? A charset meta tag is also only one of many ways of declaring a document as UTF-8, and it is redundant if a UTF-8 BOM is present.
How does windows-1252 vs. UTF-8 affect this project when the string it is given does or does not start with a UTF-8 BOM (or one of the other two BOMs)?
This project not supporting the default encoding for web content seems like a big deal.
it looks like DOMParser treats BOMs as regular characters, and puts them in body
"\ufeff" only has a special meaning at the start of a file, and will be treated as a regular character otherwise.
chore(deps-dev): bump eslint-plugin-unicorn from 46.0.0 to 47.0.0 (#927)
Co-authored-by: Felix 188768+fb55@users.noreply.github.com
Bumps eslint-plugin-unicorn from 46.0.0 to 47.0.0.
Disable rules, fix where relevant
This functionality is available in Cheerio. The unavailability of new loading methods shouldn't make Cheerio less useful.
But those BOMs should still be in the string no?
They won't be; for example, iconv-lite strips BOMs.
The HTML spec tells users to always use UTF-8.
The HTML spec uses Unicode code points internally, but defaults to windows-1252 as the encoding for the vast majority of locales (see the table at the bottom of https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding).
feels insufficient
Why?
For starters, because UTF-8 isn't the default encoding for HTML.
As far as I am aware, this project does not expose byte information in positional info, it instead uses character offsets the way they work in JS strings?
This project does expose line/col positions as well as offsets.
I was thinking 13.2, but forgot that encoding handling is part of that section. Still, this is quite a complicated subject and should IMHO be handled by a separate module as a pre-processing step (this is what JSDOM and Cheerio do). I maintain https://github.com/fb55/encoding-sniffer as this pre-processing step.
users there often read HTML files from the file system or from over the network
Agreed. But supporting a small subset of possible encodings (only UTF-8 with a BOM) feels insufficient; having a proper split of responsibilities should make it easier for users to know what to expect.
slice/replace messes up positional info, which is also a very useful part that this project provides which is outside of the scope of the HTML spec
This will always be an issue with character encodings — there is currently no easy way to map the original bytes to the code point positions in the input stream.
The expected byte positions also depend on the use-case. Code editors will ignore BOMs when displaying files, so document positions surfaced to users will be wrong if a BOM is taken into account. (And this applies to most character encodings.)
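To make the mismatch concrete, here is a small sketch comparing three ways of locating the same closing tag: as a byte offset in the original file, as a string offset with the BOM kept, and as a string offset with the BOM stripped. The sample markup and variable names are made up for illustration and are not taken from parse5.

```javascript
// Sample input: a UTF-8 BOM followed by markup that contains a non-ASCII character.
const body = "<p>héllo</p>";
const bytes = Buffer.concat([
  Buffer.from([0xef, 0xbb, 0xbf]), // UTF-8 BOM
  Buffer.from(body, "utf8"),
]);

// Buffer#toString keeps the BOM as a single U+FEFF code unit at index 0.
const text = bytes.toString("utf8");

// Offset of "</p>" counted in bytes of the original file.
const byteOffset = bytes.indexOf("</p>");
// Offset of "</p>" counted in UTF-16 code units, BOM included.
const charOffset = text.indexOf("</p>");
// Offset of "</p>" after stripping the BOM, which is what an editor would show.
const strippedOffset = text.replace(/^\uFEFF/, "").indexOf("</p>");
```

All three numbers differ for this input, so whichever one a tool reports has to be documented for users to interpret positions correctly.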
I don't feel like I have a good answer for how to deal with this. I'm open to suggestions.
it’s an understandable request that, [...] at least, it is explained clearly how to use this project from Node.js
Strong yes. Let's figure out how to deal with this issue, then update the docs.
docs(readme): Mention decoding
This library implements the HTML parsing spec, which does not expect a BOM.
If you will only ever encounter a UTF-8 BOM, then a simple String.prototype.replace call suffices (or a .slice if you know the BOM will always be present). A BOM generally points toward a need to support different encodings, which is out of scope for this module.
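As a minimal sketch of those two approaches, assuming the input has already been decoded into a JavaScript string (the helper names stripBom and sliceBom are hypothetical and not part of this library):

```javascript
// U+FEFF is how a UTF-8 BOM (bytes 0xEF 0xBB 0xBF) appears once decoded into a JS string.
const BOM = "\uFEFF";

// Removes a single leading BOM if present; any other input is returned unchanged.
function stripBom(html) {
  return html.replace(/^\uFEFF/, "");
}

// Only safe when the BOM is known to be present: drops the first code unit unconditionally.
function sliceBom(html) {
  return html.slice(1);
}

const withBom = BOM + "<!DOCTYPE html><title>Document</title>";
const cleaned = stripBom(withBom);
```

The replace variant is safe to apply to every input, while the slice variant would eat the leading "<" of a file that has no BOM.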
I have attached two files. One of them starts with a UTF-8 BOM (0xEF, 0xBB, 0xBF), and the other does not. They are otherwise identical, and look like this:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Document</title>
</head>
<body>
Hello world!
</body>
</html>
They look the same in my editor, and they act the same when read and printed back out as files, but when I pass them to serialize(parse(fileContents)), the file without the BOM prints
<!DOCTYPE html><html lang="en"><head>
<meta charset="UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Document</title>
</head>
<body>
Hello world!
</body></html>
This all makes sense so far. However, when I do the exact same process to the file with the BOM I get the output
<html lang="en"><head></head><body>
<meta charset="UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Document</title>
Hello world!
</body></html>
As you can see, the contents of the <head> have magically moved down into the <body>.
Duplicate of https://github.com/inikulin/parse5/issues/111. TL;DR: input stream decoding is out of scope for parse5, but you could use e.g. https://github.com/fb55/encoding-sniffer to decode the input before using parse5.
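For the narrow case where the input is known to be UTF-8, one possible way to do the decoding step before parsing is the standard TextDecoder, which drops a leading BOM by default. This is only a sketch under that assumption and does not replace proper encoding sniffing; decodeUtf8 is a hypothetical helper name.

```javascript
// Decodes UTF-8 bytes into a string. TextDecoder removes a leading BOM
// unless it is constructed with { ignoreBOM: true }.
function decodeUtf8(bytes) {
  return new TextDecoder("utf-8").decode(bytes);
}

const html = "<!DOCTYPE html><html><head><title>Document</title></head></html>";
const withBom = Buffer.concat([
  Buffer.from([0xef, 0xbb, 0xbf]), // UTF-8 BOM
  Buffer.from(html, "utf8"),
]);

// Buffer#toString keeps the BOM as U+FEFF at index 0 ...
const naive = withBom.toString("utf8");
// ... while TextDecoder yields a string that starts with the doctype.
const decoded = decodeUtf8(withBom);
```

The decoded string can then be handed to the parser, whereas the naive string still carries the stray U+FEFF in front of the doctype.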
build(deps-dev): Bump eslint-plugin-n from 15.7.0 to 16.0.0 (#611)
Bumps eslint-plugin-n from 15.7.0 to 16.0.0.