Parsing

Chris Dolan

cdolan@cpan.org

February 21, 2012

http://chrisdolan.net/madmongers/parsing.html

Agenda

What is PPI?

What is PPI useful for?

Introspecting Perl code that you do not want to execute

What are the good parts?

What are the negatives?

PPI::Document

 1: use PPI;
 2: 
 3: # empty
 4: my $doc = PPI::Document->new;
 5: 
 6: # parse a string
 7: my $src = 'print "hello world";';
 8: $doc = PPI::Document->new(\$src);
 9: 
10: # parse a file
11: $doc = PPI::Document->new('Foo.pm');

PPI::HTML

This syntax highlighting is done by PPI::HTML, colors are controlled by CSS

1: use PPI::HTML;
2: my $doc = PPI::Document->new('foo.pm');
3: my $formatter = PPI::HTML->new(
4:     line_numbers => 1
5: );
6: my $html = $formatter->html($doc);

find

find() takes either a class name or a function and returns an arrayref or false(!)

1: use 5.14.0;
2: use PPI;
3: my $file = $INC{'PPI/Document.pm'};
4: my $doc = PPI::Document->new($file);
5: my $subs = $doc->find('PPI::Statement::Sub');
6: say $_->name for @{$subs};
BEGIN
new
load
_setattr
set_cache
get_cache
readonly
tab_width
save
serialize

find

 5: $doc->find(sub {
 6:    my ($root,$node) = @_;
 7:    if ($node->isa('PPI::Token::Symbol') &&
 8:        $node->symbol_type eq '$') {
 9:       say ref($node),' ',$node;
10:    }
11: });
PPI::Token::Symbol $VERSION
PPI::Token::Symbol $errstr
PPI::Token::Symbol $CACHE
PPI::Token::Magic $_
PPI::Token::Symbol $class
PPI::Token::Symbol $self
PPI::Token::Symbol $class
PPI::Token::Symbol $self
PPI::Token::Symbol $self
PPI::Token::Symbol $self
PPI::Token::Symbol $source
PPI::Token::Symbol $timeout
PPI::Token::Symbol $timeout
PPI::Token::Symbol $source
PPI::Token::Symbol $class

Inspecting the PDOM

1: use PPI::Dumper;
2: use PPI;
3: my $in = $ARGV[0] || join('', <STDIN>);
4: my $doc = PPI::Document->new(-f $in ? $in : \$in);
5: PPI::Dumper->new($doc)->print;
PPI::Document
  PPI::Statement::Include
    PPI::Token::Word    'use'
    PPI::Token::Whitespace      ' '
    PPI::Token::Word    'PPI::Dumper'
    PPI::Token::Structure       ';'
  PPI::Token::Whitespace        '\n'
  PPI::Statement::Include
    PPI::Token::Word    'use'
    PPI::Token::Whitespace      ' '
    PPI::Token::Word    'PPI'
    PPI::Token::Structure       ';'
  PPI::Token::Whitespace        '\n'
  PPI::Statement::Variable
    PPI::Token::Word    'my'
    PPI::Token::Whitespace      ' '
    PPI::Token::Symbol          '$in'
    PPI::Token::Whitespace      ' '
    PPI::Token::Operator        '='
    PPI::Token::Whitespace      ' '
    PPI::Token::Symbol          '$ARGV'

Analyzing code

This is roughly how Perl::Critic finds violations of the 'ProhibitNestedSubs' policy

 1: use PPI;
 2: my $doc = PPI::Document->new($0);
 3: my $subs = $doc->find('PPI::Statement::Sub');
 4: for my $subr (@{$subs}) {
 5:    die if $subr->find('PPI::Statement::Sub');
 6: }
 7: 
 8: sub foo {
 9:    sub bar {}
10: }

What is parsing?

Example: English

"The quick brown fox jumps over the lazy dog."
Lexing (data -> tokens)

Parsing (tokens -> tree)

Grammar

A Grammar is a declaration of a syntax that encompasses both tokenization and structure

A Syntax error can occur in either the lexer ("thedogisred") or the parser ("the is dog red")

BNF grammars

BNF ("Backus–Naur Form") is a popular syntax for documenting grammars (EBNF adds *, +, and ?)

<syntax> ::= <rule> | <rule> <syntax>
<rule>   ::= <opt-whitespace> "<" <rule-name> ">" 
             <opt-whitespace> "::=" <opt-whitespace>
             <expression> <line-end>
<opt-whitespace> ::= " " <opt-whitespace> | ""
<expression>     ::= <list> | <list> "|" <expression>
<line-end>       ::= <opt-whitespace> <EOL> | <line-end>
                     <line-end>
<list>    ::= <term> | <term> <opt-whitespace> <list>
<term>    ::= <literal> | "<" <rule-name> ">"
<literal> ::= '"' <text> '"' | "'" <text> "'"

Lexing

Scanner
a state machine that validates a token (e.g. whitespace, delimited string, floating point number)
Regular expression
common implementation for a scanner
Tokenizer
classifies scanned elements into labeled tokens

Tools for creating parsers

Non-Perl
Lex, Flex
creates scanners (Perl has its own in toke.c)
Yacc, Bison
"Yet Another Compiler Compiler", used by Perl
Antlr
Given a BNF-like grammar and input stream, makes a parse tree
JavaCC
Like Yacc, outputs Java code

Tools for creating parsers

Perl
Parse::RecDescent, Regexp::Grammars
Pure-Perl5 grammar engines
Marpa (Jeffrey Kegler)
Perl implementation of the Earley algorithm, enhanced to be O(n) for almost all grammars
Perl6 Grammars
Combined lexer and parser with regex-like syntax