Crate css_lexer


An implementation of the CSS Syntax Level 3 tokenization algorithm. It is intended as a low-level building block for building parsers for CSS or CSS-like languages (for example SASS).

This crate provides the Lexer struct, which borrows &str and can incrementally produce Tokens. The encoding of the &str is assumed to be UTF-8.

The Lexer may be configured with additional Features to allow for lexing tokens in ways which diverge from the CSS specification (such as tokenizing comments using //). With no additional features this lexer is fully spec compliant.
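
As a minimal sketch of opting into such a feature: the flag name Feature::SingleLineComments and the argument order of Lexer::new_with_features() are assumptions here; consult the Feature and Lexer docs for the exact API.

use css_lexer::*;
// Assumed flag name for `//`-style comments; the real Feature constant may differ.
let mut lexer = Lexer::new_with_features("// a comment", Feature::SingleLineComments);
let token = lexer.advance();
assert_eq!(token, Kind::Comment);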

Tokens are untyped (there are no super-classes like Ident), but they have a Kind which can be used to determine their type. Tokens do not store the underlying character data, nor do they store their offsets. They just provide “facts” about the underlying data. To rebuild a string, each Token needs to be wrapped in a Cursor, which consults the original &str to get the character data. This design allows Tokens to live on the stack, avoiding heap allocation, as they are always 8 bytes (size_of). Likewise, Cursors are always 12 bytes.
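
A quick check of the sizes stated above:

use css_lexer::{Cursor, Token};
// Token is documented as 8 bytes, Cursor as 12 bytes.
assert_eq!(std::mem::size_of::<Token>(), 8);
assert_eq!(std::mem::size_of::<Cursor>(), 12);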

§Limitations

The Lexer has limitations around document sizes and token sizes, in order to keep Token, SourceOffset and Cursor small. It’s very unlikely the average document will run into these limitations, but they’re listed here for completeness:

  • Documents are limited to ~4GB in size. SourceOffset is a u32 so cannot represent larger offsets. Attempting to lex larger documents is considered undefined behaviour.

  • Tokens are limited to ~4GB in length. A Token’s length is a u32, so it cannot represent larger lengths. If the lexer encounters a longer token, this is considered undefined behaviour.

  • Number Tokens are limited to 16,777,216 characters in length. For example, encountering a number with 17 million zeros is considered undefined behaviour. This is not the same as the number’s value, which is an f32. (Note that the CSS spec dictates numbers are f32; CSS does not have larger numbers.)

  • Dimension Tokens are limited to 4,096 numeric characters in length and 4,096 ident characters in length. For example, encountering a dimension with 4,097 zeros is considered undefined behaviour.

§General usage

A parser can be implemented on top of the Lexer by instantiating a Lexer with Lexer::new(), or with Lexer::new_with_features() if you wish to opt into non-spec-compliant features. The Lexer needs to be given a &str, which it will reference to produce Tokens.

Repeatedly calling Lexer::advance() will move the Lexer’s internal position one Token forward and return the newly lexed Token. Once the end of the &str is reached, Lexer::advance() will repeatedly return Token::EOF.
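
A minimal sketch of a parser’s main loop, using only the calls described above (this assumes a Token can be compared directly against Token::EOF):

use css_lexer::*;
let mut lexer = Lexer::new("a { color: red }");
loop {
    let token = lexer.advance();
    // Token::EOF is returned once the end of the source is reached.
    if token == Token::EOF {
        break;
    }
    // ... hand the token to the parser here ...
}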

§Example

use css_lexer::*;
let mut lexer = Lexer::new("width: 1px");
assert_eq!(lexer.offset(), 0);
{
    let token = lexer.advance();
    assert_eq!(token, Kind::Ident);
    let cursor = token.with_cursor(SourceOffset(0));
    assert_eq!(cursor.str_slice(lexer.source()), "width");
}
{
    let token = lexer.advance();
    assert_eq!(token, Kind::Colon);
    assert_eq!(token, ':');
}
{
    let token = lexer.advance();
    assert_eq!(token, Kind::Whitespace);
}
{
    let token = lexer.advance();
    assert_eq!(token, Kind::Dimension);
    assert_eq!(token.dimension_unit(), DimensionUnit::Px);
}

Structs§

AssociatedWhitespaceRules
A bitmask representing rules around the whitespace surrounding a Kind::Delim token.
Cursor
Wraps a Token with a SourceOffset, allowing it to reason about the character data of the source text.
Feature
A set of runtime feature flags which can be enabled individually or in combination, and which change the way an individual Lexer produces Tokens.
KindSet
Match a token against one or more Kinds.
Lexer
The Lexer struct - the core of the library - borrows &str and can incrementally produce Tokens.
SourceCursor
Wraps Cursor with a str that represents the underlying character data for this cursor.
SourceOffset
Represents a position in the underlying source.
Span
Represents a range of text within a document, as a Start and End offset.
Token
An abstract representation of a chunk of the source text, retaining certain “facts” about the source.
Whitespace
A bitmask representing the characters that make up a Kind::Whitespace token.

Enums§

CommentStyle
An enum representing the “Style” the Kind::Comment token represents.
DimensionUnit
Represents a Kind::Dimension’s unit, if it is “known”: defined by the CSS grammar.
Kind
Kind represents the token “Type”, categorised mostly by the token types within the CSS Syntax spec.
PairWise
Represents either the left or right Kind of a PairWise set.
QuoteStyle
An enum representing the “Style” the Kind::String token represents.

Traits§

ToSpan
A trait representing an object that can derive its own Span. This is very similar to From<MyStruct> for Span; however, From<MyStruct> for Span requires Sized, meaning it is not dyn compatible.