Golang x/net
golang.org/x/net is a set of supplementary Go networking libraries. Its html subpackage is the official library for parsing HTML.
It is not shipped with Go's standard library, but it is maintained by the Go developers.
Documentation
official docs: https://pkg.go.dev/golang.org/x/net
atom.Atom constants (element types): https://pkg.go.dev/golang.org/x/net@v0.0.0-20220706163947-c90051bbdb60/html/atom#Atom
Install
go get golang.org/x/net
Components
Nodes, NodeTypes
- NodeType describes the type of a node in the DOM (ex. text, element, doctype, ..)
- Nodes represent XML-like elements
- Atoms represent HTML element types
ElementNodes contain TextNodes
ElementNodes represent an HTML element.
TextNodes store the value of an HTML element (nested under ElementNodes).

    import "golang.org/x/net/html"
    import "golang.org/x/net/html/atom"

    headerVal := html.Node{
        Type: html.TextNode,
        Data: "My Page",
    }
    header := html.Node{
        Type:       html.ElementNode,
        DataAtom:   atom.H1,
        Data:       "h1",
        FirstChild: &headerVal,
        LastChild:  &headerVal,
    }

Means the same as
<h1>My Page</h1>
Parsing/Rendering
Basics
    import "strings"
    import "golang.org/x/net/html"

    raw := `
    <html>
      <head>
        <title>foo</title>
      </head>
      <body>
        <h1>Foo</h1>
        <p>hello world</p>
      </body>
    </html>`

    // parse html
    node, _ := html.Parse(strings.NewReader(raw))

    // render html
    var render strings.Builder
    html.Render(&render, node)
    render.String() // '<html><head>...'

Modifying Parsed HTML
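The Node data structure is a plain struct with exported fields, and the tree is linked together by pointers (Parent, FirstChild, LastChild, PrevSibling, NextSibling), so you can mutate nodes in place.
If adding to children, make sure to use AppendChild() so that the parent, child and sibling pointers all stay consistent. Children are stored as a linked list, not a slice.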
atom has constants representing the standard HTML element names (a tag without a matching constant, such as a custom element, gets a zero DataAtom).
- Nodes keep information about their first/last child
- Nodes keep information about their siblings (neighbors under same parent)
To iterate through children, start at the node's first child, and loop through its siblings.
Here's a reusable setup:

    // adjustHeadNode, adjustBodyNode and mwdump.Page are not shown here
    func adjust(node *html.Node, page *mwdump.Page) (*html.Node, error) {
        var err error

        // match current node, return new/modified instances where desired
        node = adjustHeadNode(node, page)
        node = adjustBodyNode(node, page)
        node, err = adjustAnchorNode(node)
        if err != nil {
            return nil, err
        }

        // recurse through children
        var children []*html.Node
        for child := node.FirstChild; child != nil; child = child.NextSibling {
            adjusted, err := adjust(child, page)
            if err != nil {
                return nil, err
            }
            children = append(children, adjusted)
        }

        // point Parent/Child/Sibling info in structs to the new children
        if len(children) > 0 {
            node.FirstChild = children[0]
            node.LastChild = children[len(children)-1]
        }
        for index, child := range children {
            child.Parent = node
            child.PrevSibling, child.NextSibling = nil, nil
            if index > 0 {
                child.PrevSibling = children[index-1]
            }
            if index < len(children)-1 {
                child.NextSibling = children[index+1]
            }
        }
        return node, nil
    }

Here's a sample method that mutates a node
    // lower-cases all 'href' links in a '<a href="Foo/Bar">'
    func adjustAnchorNode(node *html.Node) (*html.Node, error) {
        if node.Type != html.ElementNode {
            return node, nil
        }
        if node.DataAtom != atom.A {
            return node, nil
        }

        var attrs []html.Attribute
        for _, attr := range node.Attr {
            if attr.Key != "href" {
                attrs = append(attrs, attr)
                continue
            }
            newAttr := html.Attribute{
                Namespace: attr.Namespace,
                Key:       attr.Key,
                Val:       strings.ToLower(attr.Val),
            }
            attrs = append(attrs, newAttr)
        }

        return &html.Node{
            Parent:      node.Parent,
            FirstChild:  node.FirstChild,
            LastChild:   node.LastChild,
            PrevSibling: node.PrevSibling,
            NextSibling: node.NextSibling,
            Type:        node.Type,
            DataAtom:    node.DataAtom,
            Data:        node.Data,
            Namespace:   node.Namespace,
            Attr:        attrs,
        }, nil
    }