Golang x/net

From wikinotes

The official library for parsing HTML.
It is not shipped with go's standard library, but it is maintained by the go developers.

Documentation

official docs https://pkg.go.dev/golang.org/x/net
atom.Atom constants (element types) https://pkg.go.dev/golang.org/x/net@v0.0.0-20220706163947-c90051bbdb60/html/atom#Atom

Install

go get golang.org/x/net

Components

Nodes, ElementTypes

  • ElementType describes the type of element in the DOM (ex. text, element, doctype, ..)
  • Nodes represent xml-like elements
  • Atoms represent html element types

ElementNodes contain TextNodes

  • ElementNodes represent an HTML element.
  • TextNodes store the value of an HTML element (nested under ElementNodes).
import "golang.org/x/net/html"
import "golang.org/x/net/html/atom"

headerVal := html.Node{
    Type: html.TextNode,
    Data: "My Page",
}

header := html.Node{
    Type: html.ElementNode,
    DataAtom: atom.H1,
    Data: "h1",
    FirstChild: &headerVal,
    LastChild: &headerVal,
}

Means the same as

<h1>My Page</h1>

Parsing/Rendering

Basics

import "golang.org/x/net/html"

raw := `
    <html>
      <head>
        <title>foo</title>
      </head>
      <body>
        <h1>Foo</h1>
        <p>hello world</p>
      </body>
    </html>`

// parse html
node, _ := html.Parse(strings.NewReader(raw))

// render html
var render strings.Builder
html.Render(&render, node)
render.String()              // '<html><head>...'

Modifying Parsed HTML

The Node datastructure uses value objects,
you can mutate nodes, if adding to children make sure to AppendChild() so it gets added to the array the slice points to.

  • atom has constants representing every type of HTML element.
  • Nodes keep information about their first/last child
  • Nodes keep information about their siblings (neighbors under same parent)

To iterate through children, start at the node's first-child, and loop through it's siblings.
Here's a reusable setup:

func adjust(node *html.Node, page *mwdump.Page) (*html.Node, error) {
    var err error

    // match current node, return new/modified instances where desired
    node = adjustHeadNode(node, page)
    node = adjustBodyNode(node, page)
    node = adjustAnchorNode(node)
    if err != nil {
        return nil, err
    }

    // recurse through children
    var children []*html.Node
    for child := node.FirstChild; child != nil; child = child.NextSibling {
        child, err = adjust(child, page)
        if err != nil {
            return child, err
        }
        children = append(children, child)
    }

    // point Child/Sibling info in structs to the new children
    if len(children) > 0 {
        node.FirstChild = children[0]
        node.LastChild = children[len(children)-1]
    }
    for index, child := range children {
        if 0 < index && index < len(children)-1 {
            child.PrevSibling = children[index-1]
            child.NextSibling = children[index+1]
        }
    }

    return node, nil
}

Here's a sample method that mutates a node

// lower-cases all 'href' links in a '<a href="Foo/Bar">'
func adjustAnchorNode(node *html.Node) (*html.Node, error) {
    if node.Type != html.ElementNode {
        return node, nil
    }
    if node.DataAtom != atom.A {
        return node, nil
    }

    var attrs []html.Attribute
    for _, attr := range node.Attr {
        if attr.Key != "href" {
            attrs = append(attrs, attr)
            continue
        }

        newAttr := html.Attribute{
            Namespace: attr.Namespace,
            Key:       attr.Key,
            Val:       strings.ToLower(attr.Val),
        }
        attrs = append(attrs, newAttr)
    }

    return &html.Node{
        Parent:      node.Parent,
        FirstChild:  node.FirstChild,
        LastChild:   node.LastChild,
        PrevSibling: node.PrevSibling,
        NextSibling: node.NextSibling,
        Type:        node.Type,
        DataAtom:    node.DataAtom,
        Data:        node.Data,
        Namespace:   node.Namespace,
        Attr:        attrs,
    }, nil
}