Golang x/net

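golang.org/x/net is a collection of supplementary Go networking packages. These notes cover its html subpackage, an HTML5-compliant tokenizer and parser.
It is not shipped with Go's standard library, but it is maintained by the Go developers.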

Documentation

official docs https://pkg.go.dev/golang.org/x/net
atom.Atom constants (element types) https://pkg.go.dev/golang.org/x/net@v0.0.0-20220706163947-c90051bbdb60/html/atom#Atom

Install

go get golang.org/x/net

Components

Nodes, NodeTypes

  • NodeType describes the kind of node in the parse tree (ex. text, element, doctype, comment, ..)
  • Nodes represent the items in the tree (elements, text, comments, ..)
  • Atoms are integer constants for known HTML element and attribute names

ElementNodes contain TextNodes

  • ElementNodes represent an HTML element.
  • TextNodes store the text content of an HTML element (nested under ElementNodes).

import "golang.org/x/net/html"
import "golang.org/x/net/html/atom"

headerVal := html.Node{
    Type: html.TextNode,
    Data: "My Page",
}

header := html.Node{
    Type: html.ElementNode,
    DataAtom: atom.H1,
    Data: "h1",
    FirstChild: &headerVal,
    LastChild: &headerVal,
}

Means the same as

<h1>My Page</h1>

Parsing/Rendering

Basics

import (
    "log"
    "strings"

    "golang.org/x/net/html"
)

raw := `
    <html>
      <head>
        <title>foo</title>
      </head>
      <body>
        <h1>Foo</h1>
        <p>hello world</p>
      </body>
    </html>`

// parse html (returns the root DocumentNode)
node, err := html.Parse(strings.NewReader(raw))
if err != nil {
    log.Fatal(err)
}

// render html
var render strings.Builder
if err := html.Render(&render, node); err != nil {
    log.Fatal(err)
}
render.String()              // "<html><head>..."

Modifying Parsed HTML

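You can mutate Node structs in place.
When adding children, use AppendChild() (or InsertBefore()) rather than setting fields by hand: children are stored as a doubly linked list, not a slice, and these methods keep the Parent, FirstChild/LastChild, and sibling pointers consistent.

  • The atom package has constants for the known HTML element names.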
  • Nodes keep information about their first/last child
  • Nodes keep information about their siblings (neighbors under same parent)

To iterate through children, start at the node's first child and loop through its siblings.
Here's a reusable setup:

import (
    "strings"

    "golang.org/x/net/html"
    "golang.org/x/net/html/atom"
)

type HTML struct{}

// adjust applies adjustAnchorNode to node and, recursively, to all of its descendants
func (h *HTML) adjust(node *html.Node) (*html.Node, error) {
    if err := h.adjustAnchorNode(node); err != nil {
        return nil, err
    }

    // recurse through and modify children
    for child := node.FirstChild; child != nil; child = child.NextSibling {
        if _, err := h.adjust(child); err != nil {
            return nil, err
        }
    }
    return node, nil
}

// lower-cases the 'href' value of anchors, e.g. '<a href="Foo/Bar">'
func (h *HTML) adjustAnchorNode(node *html.Node) error {
    if node.Type != html.ElementNode || node.DataAtom != atom.A {
        return nil
    }
    for i, attr := range node.Attr {
        if attr.Key == "href" {
            node.Attr[i].Val = strings.ToLower(attr.Val)  // <-- modify attr in place
        }
    }
    return nil
}