Golang x/net
The html package in golang.org/x/net is the official Go library for parsing HTML.
It is not shipped with Go's standard library, but it is maintained by the Go developers.
Documentation
official docs: https://pkg.go.dev/golang.org/x/net
atom.Atom constants (element types): https://pkg.go.dev/golang.org/x/net@v0.0.0-20220706163947-c90051bbdb60/html/atom#Atom
Install
go get golang.org/x/net
Components
Nodes, NodeTypes
- NodeType describes the type of a node in the DOM (e.g. text, element, doctype, ...)
- Nodes represent XML-like elements
- Atoms represent HTML element types (tag names such as h1 or a)
ElementNodes contain TextNodes
ElementNodes represent an HTML element.
TextNodes store the value of an HTML element (nested under ElementNodes).

    import (
        "golang.org/x/net/html"
        "golang.org/x/net/html/atom"
    )

    headerVal := html.Node{
        Type: html.TextNode,
        Data: "My Page",
    }
    header := html.Node{
        Type:       html.ElementNode,
        DataAtom:   atom.H1,
        Data:       "h1",
        FirstChild: &headerVal,
        LastChild:  &headerVal,
    }

Means the same as

    <h1>My Page</h1>
Parsing/Rendering
Basics
    import (
        "strings"

        "golang.org/x/net/html"
    )

    raw := `
    <html>
      <head>
        <title>foo</title>
      </head>
      <body>
        <h1>Foo</h1>
        <p>hello world</p>
      </body>
    </html>`

    // parse html (error ignored for brevity)
    node, _ := html.Parse(strings.NewReader(raw))

    // render html
    var render strings.Builder
    html.Render(&render, node)
    render.String() // '<html><head>...'

Modifying Parsed HTML
You can mutate Node structs in place.
When adding children, use AppendChild() so that the parent, sibling, and first/last-child pointers all stay consistent (children are stored as a linked list, not a slice).
atom has constants representing every type of HTML element.
- Nodes keep information about their first/last child
- Nodes keep information about their siblings (neighbors under same parent)
To iterate through children, start at the node's first child and loop through its siblings.
Here's a reusable setup:

    type HTML struct{}

    // recurse through all nodes
    func (this *HTML) adjust(node *html.Node) (*html.Node, error) {
        err := this.adjustAnchorNode(node)
        if err != nil {
            return nil, err
        }

        // recurse through and modify children
        for child := node.FirstChild; child != nil; child = child.NextSibling {
            _, err = this.adjust(child)
            if err != nil {
                return nil, err
            }
        }
        return node, nil
    }

    // lower-cases the 'href' of every '<a href="Foo/Bar">'
    func (this *HTML) adjustAnchorNode(node *html.Node) error {
        if node.Type != html.ElementNode {
            return nil
        }
        if node.DataAtom != atom.A {
            return nil
        }

        var attrs []html.Attribute
        for _, attr := range node.Attr {
            if attr.Key == "href" {
                attr.Val = strings.ToLower(attr.Val) // <-- modify attr
            }
            attrs = append(attrs, attr)
        }
        node.Attr = attrs
        return nil
    }