Golang x/net

From wikinotes

Revision as of 18:54, 24 July 2022

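golang.org/x/net is a collection of supplementary networking packages; these notes cover its html subpackage, the official library for parsing HTML.
It is not shipped with Go's standard library, but it is maintained by the Go developers.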

Documentation

official docs https://pkg.go.dev/golang.org/x/net
atom.Atom constants (element types) https://pkg.go.dev/golang.org/x/net@v0.0.0-20220706163947-c90051bbdb60/html/atom#Atom

Install

go get golang.org/x/net

Components

Nodes, NodeTypes

  • NodeType describes the kind of node in the DOM (ex. text, element, doctype, ..)
  • Nodes represent XML-like elements
  • Atoms represent HTML element types

ElementNodes contain TextNodes

  • ElementNodes represent an HTML element.
  • TextNodes store the text content of an HTML element (nested under ElementNodes).

import (
    "golang.org/x/net/html"
    "golang.org/x/net/html/atom"
)

// the text inside the <h1>
headerVal := html.Node{
    Type: html.TextNode,
    Data: "My Page",
}

// the <h1> element itself, linked to its single child
header := html.Node{
    Type:       html.ElementNode,
    DataAtom:   atom.H1,
    Data:       "h1",
    FirstChild: &headerVal,
    LastChild:  &headerVal,
}
headerVal.Parent = &header // keep the child's back-pointer consistent

This is equivalent to:

<h1>My Page</h1>

Parsing/Rendering

Basics

import (
    "log"
    "strings"

    "golang.org/x/net/html"
)

raw := `
    <html>
      <head>
        <title>foo</title>
      </head>
      <body>
        <h1>Foo</h1>
        <p>hello world</p>
      </body>
    </html>`

// parse html (returns the root document node)
node, err := html.Parse(strings.NewReader(raw))
if err != nil {
    log.Fatal(err)
}

// render html
var render strings.Builder
if err := html.Render(&render, node); err != nil {
    log.Fatal(err)
}
render.String()              // "<html><head>..."

Modifying Parsed HTML

The Node datastructure is a tree linked by pointers (*html.Node), so you can mutate nodes in place.
When adding children, use AppendChild() (or InsertBefore()) so the parent, first/last-child, and sibling pointers all stay consistent.

  • atom has constants representing every type of HTML element.
  • Nodes keep information about their first/last child
  • Nodes keep information about their siblings (neighbors under same parent)

To iterate through children, start at the node's first child and loop through its siblings.
Here's a reusable setup:

func adjust(node *html.Node, page *mwdump.Page) (*html.Node, error) {
    var err error

    // match current node, return new/modified instances where desired
    // (assumes each helper returns (*html.Node, error), like adjustAnchorNode)
    if node, err = adjustHeadNode(node, page); err != nil {
        return nil, err
    }
    if node, err = adjustBodyNode(node, page); err != nil {
        return nil, err
    }
    if node, err = adjustAnchorNode(node); err != nil {
        return nil, err
    }

    // recurse through children
    var children []*html.Node
    for child := node.FirstChild; child != nil; {
        next := child.NextSibling // read before the child may be replaced
        adjusted, err := adjust(child, page)
        if err != nil {
            return nil, err
        }
        children = append(children, adjusted)
        child = next
    }

    // point Parent/Child/Sibling info in structs to the new children
    if len(children) > 0 {
        node.FirstChild = children[0]
        node.LastChild = children[len(children)-1]
    }
    for index, child := range children {
        child.Parent = node
        child.PrevSibling, child.NextSibling = nil, nil
        if index > 0 {
            child.PrevSibling = children[index-1]
        }
        if index < len(children)-1 {
            child.NextSibling = children[index+1]
        }
    }

    return node, nil
}

Here's a sample function that returns a modified copy of a node:

// lower-cases all 'href' links in a '<a href="Foo/Bar">'
func adjustAnchorNode(node *html.Node) (*html.Node, error) {
    if node.Type != html.ElementNode {
        return node, nil
    }
    if node.DataAtom != atom.A {
        return node, nil
    }

    var attrs []html.Attribute
    for _, attr := range node.Attr {
        if attr.Key != "href" {
            attrs = append(attrs, attr)
            continue
        }

        newAttr := html.Attribute{
            Namespace: attr.Namespace,
            Key:       attr.Key,
            Val:       strings.ToLower(attr.Val),
        }
        attrs = append(attrs, newAttr)
    }

    return &html.Node{
        Parent:      node.Parent,
        FirstChild:  node.FirstChild,
        LastChild:   node.LastChild,
        PrevSibling: node.PrevSibling,
        NextSibling: node.NextSibling,
        Type:        node.Type,
        DataAtom:    node.DataAtom,
        Data:        node.Data,
        Namespace:   node.Namespace,
        Attr:        attrs,
    }, nil
}