Golang x/net
golang.org/x/net is a collection of supplementary networking packages. Its html subpackage is the official library for parsing HTML.
It is not shipped with Go's standard library, but it is maintained by the Go developers.
Documentation
official docs: https://pkg.go.dev/golang.org/x/net
atom.Atom constants (element types): https://pkg.go.dev/golang.org/x/net@v0.0.0-20220706163947-c90051bbdb60/html/atom#Atom
Install
go get golang.org/x/net
Components
ElementNodes vs TextNodes
ElementNodes represent an HTML element.
TextNodes store the text value of an HTML element (nested under ElementNodes).
atom has constants for all HTML element types.
- NodeType describes the type of node in the DOM (ex. text, element, doctype, ..)
- Nodes represent xml-like elements
import (
	"golang.org/x/net/html"
	"golang.org/x/net/html/atom"
)

headerVal := html.Node{
	Type: html.TextNode,
	Data: "My Page",
}
header := html.Node{
	Type:       html.ElementNode,
	DataAtom:   atom.H1,
	Data:       "h1",
	FirstChild: &headerVal,
	LastChild:  &headerVal,
}

Means the same as
<h1>My Page</h1>
Parsing/Rendering
Basics
import (
	"strings"

	"golang.org/x/net/html"
)

raw := `
<html>
  <head>
    <title>foo</title>
  </head>
  <body>
    <h1>Foo</h1>
    <p>hello world</p>
  </body>
</html>`

// parse html (error ignored here for brevity)
node, _ := html.Parse(strings.NewReader(raw))

// render html
var render strings.Builder
html.Render(&render, node)
render.String() // "<html><head>..."

Modifying Parsed HTML
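Nodes are linked to each other through pointers (*html.Node).
The approach in these notes does not edit a matched node in place. It creates a new instance and connects it to the tree, so the parent/child/sibling pointers must be updated to match.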
atom has constants representing every type of HTML element.
- Nodes keep information about their first/last child
- Nodes keep information about their siblings (neighbors under same parent)
To iterate through children, start at the node's FirstChild, and loop through its siblings.
Here's a reusable setup:

// adjustHeadNode, adjustBodyNode and mwdump.Page are not shown in these notes.
func adjust(node *html.Node, page *mwdump.Page) (*html.Node, error) {
	var err error

	// match current node, return new/modified instances where desired
	node = adjustHeadNode(node, page)
	node = adjustBodyNode(node, page)
	node, err = adjustAnchorNode(node)
	if err != nil {
		return nil, err
	}

	// recurse through children (iterate over the original sibling chain)
	var children []*html.Node
	for child := node.FirstChild; child != nil; child = child.NextSibling {
		newChild, err := adjust(child, page)
		if err != nil {
			return nil, err
		}
		children = append(children, newChild)
	}

	// point Parent/Child/Sibling info in structs to the new children
	if len(children) > 0 {
		node.FirstChild = children[0]
		node.LastChild = children[len(children)-1]
	}
	for index, child := range children {
		child.Parent = node
		child.PrevSibling, child.NextSibling = nil, nil
		if index > 0 {
			child.PrevSibling = children[index-1]
		}
		if index < len(children)-1 {
			child.NextSibling = children[index+1]
		}
	}
	return node, nil
}

Here's a sample method that returns a modified copy of a node:
// lower-cases all 'href' links in a '<a href="Foo/Bar">'
func adjustAnchorNode(node *html.Node) (*html.Node, error) {
	if node.Type != html.ElementNode {
		return node, nil
	}
	if node.DataAtom != atom.A {
		return node, nil
	}

	var attrs []html.Attribute
	for _, attr := range node.Attr {
		if attr.Key != "href" {
			attrs = append(attrs, attr)
			continue
		}
		newAttr := html.Attribute{
			Namespace: attr.Namespace,
			Key:       attr.Key,
			Val:       strings.ToLower(attr.Val),
		}
		attrs = append(attrs, newAttr)
	}

	return &html.Node{
		Parent:      node.Parent,
		FirstChild:  node.FirstChild,
		LastChild:   node.LastChild,
		PrevSibling: node.PrevSibling,
		NextSibling: node.NextSibling,
		Type:        node.Type,
		DataAtom:    node.DataAtom,
		Data:        node.Data,
		Namespace:   node.Namespace,
		Attr:        attrs,
	}, nil
}