Notes on Python Cookbook, Third Edition
You have a string that you want to parse from left to right into a stream of tokens.
Suppose you have a text string like this:
my_str = 'total = 32 + 12 * 5'
To tokenize the string, you need more than just pattern matching; you also have to specify the type of each pattern. For instance, you might want to turn the string into a sequence of pairs like this:
tokens = [('NAME', 'total'), ('EQ', '='), ('NUM', '32'), ('PLUS', '+'), ('NUM', '12'), ('TIMES', '*'), ('NUM', '5')]
To do this kind of splitting, the first step is to define all of the possible tokens, including whitespace, with regular expressions that use named capture groups, like this:
import re

name = r'(?P<NAME>[a-zA-Z_]\w*)'   # identifiers must not start with a digit, so NUM gets the numbers
num = r'(?P<NUM>\d+)'
plus = r'(?P<PLUS>\+)'
times = r'(?P<TIMES>\*)'
eq = r'(?P<EQ>=)'
ws = r'(?P<WS>\s+)'

compiled_pattern = re.compile('|'.join([name, num, plus, times, eq, ws]))
In these patterns, the ?P<TOKENNAME> convention is used to assign a name to the pattern so it can be referred to later. Next, to tokenize, use the little-known scanner() method of compiled pattern objects. This method creates a scanner object in which repeated calls to match() step through the target text one match at a time. Here is an interactive example that shows how a scanner object works:
scanner = compiled_pattern.scanner('foo = 42')
matched = scanner.match()
print(matched.group())     # foo
print(matched.lastgroup)   # NAME
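Since scanner.match() returns None once the input is exhausted, the remaining matches can be driven with iter() using None as the sentinel. A minimal sketch, continuing the session above:

# Keep consuming the same scanner object; match() returns None at end of input
for m in iter(scanner.match, None):
    print(m.lastgroup, repr(m.group()))
# WS ' '
# EQ '='
# WS ' '
# NUM '42'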
Here, matched.lastgroup is the name of the last matched capturing group; it is None if the group has no name, or if no group matched at all.
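A quick standalone illustration of those lastgroup semantics (not from the book):

m = re.match(r'\d+', '42')
print(m.lastgroup)   # None: the pattern contains no named group
m = re.match(r'(?P<NUM>\d+)', '42')
print(m.lastgroup)   # 'NUM': the last (and only) named group that matched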
The scanner() method is undocumented; it is best thought of as a leftover from the low-level SRE engine of older Python versions. For modern code, re.finditer() does the job:
from collections import namedtuple

def generate_tokens(pat, text):
    Token = namedtuple('Token', ['type', 'value'])
    for matched_item in re.finditer(pat, text):
        yield Token(matched_item.lastgroup, matched_item.group())

for tok in generate_tokens(compiled_pattern, 'foo = 32'):
    print(tok)

"""
Token(type='NAME', value='foo')
Token(type='WS', value=' ')
Token(type='EQ', value='=')
Token(type='WS', value=' ')
Token(type='NUM', value='32')
"""
If you want to filter the token stream in some way, you can either define more generator functions or use a generator expression. For example, here is how to filter out all whitespace tokens:
tokens = (tok for tok in generate_tokens(compiled_pattern, 'foo = 32')
          if tok.type != 'WS')
for tok in tokens:
    print(tok)

"""
Token(type='NAME', value='foo')
Token(type='EQ', value='=')
Token(type='NUM', value='32')
"""
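The generator-function alternative mentioned above might look like the following sketch; filter_tokens is a hypothetical helper, not something from the book:

def filter_tokens(tokens, ignore=('WS',)):
    # Pass through every token whose type is not in the ignore list
    for tok in tokens:
        if tok.type not in ignore:
            yield tok

for tok in filter_tokens(generate_tokens(compiled_pattern, 'foo = 32')):
    print(tok)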
Token order matters as well. The re module tries the alternatives in the order they are specified, so if one pattern happens to be a substring of a longer pattern, you must make sure the longer pattern is listed first. For example:
NUM = r'(?P<NUM>\d+)'
LT = r'(?P<LT><)'
LE = r'(?P<LE><=)'
EQ = r'(?P<EQ>=)'

master_pat = re.compile('|'.join([NUM, LE, LT, EQ]))  # correct: longer LE comes before LT
other_pat = re.compile('|'.join([NUM, LT, LE, EQ]))   # wrong: LT shadows LE

def generate_tokens(pat, text):
    Token = namedtuple('Token', ['type', 'value'])
    for matched_item in re.finditer(pat, text):
        yield Token(matched_item.lastgroup, matched_item.group())

for tok in generate_tokens(master_pat, '30 <= 31'):
    print(tok)

"""
Token(type='NUM', value='30')
Token(type='LE', value='<=')
Token(type='NUM', value='31')
"""

# This is not what we want:
for tok in generate_tokens(other_pat, '30 <= 31'):
    print(tok)

"""
Token(type='NUM', value='30')
Token(type='LT', value='<')
Token(type='EQ', value='=')
Token(type='NUM', value='31')
"""
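One defensive habit (an assumption on my part, not something the recipe prescribes) is to build the alternation from literal operators sorted longest first, so that '<=' can never lose to '<':

ops = ['<', '<=', '=', '>', '>=']
# Sorting longest-first guarantees multi-character operators are tried first
ordered = sorted(ops, key=len, reverse=True)
op_pat = re.compile('|'.join(map(re.escape, ordered)))
print(op_pat.findall('30 <= 31'))   # ['<=']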
If you use the scanner() approach, there are a few important details to keep in mind. First, you must make sure that your regular expressions cover every possible text sequence that can appear in the input. If any non-matching text is encountered, scanning simply stops at that point. This is why it was necessary to specify the whitespace token in the examples above. For example:
def generate_tokens(pat, text):
    Token = namedtuple('Token', ['type', 'value'])
    scanner = pat.scanner(text)
    for m in iter(scanner.match, None):
        yield Token(m.lastgroup, m.group())

# Example use: other_pat has no WS pattern, so scanning stops at the first space
for tok in generate_tokens(other_pat, '30 <= 42'):
    print(tok)

"""
Token(type='NUM', value='30')
"""
With re.finditer(), by contrast, we don't have to worry about scanning stopping early; unmatched text is simply skipped:
def generate_tokens(pat, text):
    Token = namedtuple('Token', ['type', 'value'])
    for matched_item in re.finditer(pat, text):
        yield Token(matched_item.lastgroup, matched_item.group())

# The '<' is not matched here; finditer() silently skips it
for tok in generate_tokens(compiled_pattern, '30 <= 31'):
    print(tok)

"""
Token(type='NUM', value='30')
Token(type='WS', value=' ')
Token(type='EQ', value='=')
Token(type='WS', value=' ')
Token(type='NUM', value='31')
"""
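Silently skipping input can hide real errors. If you would rather have finditer()-based tokenizing fail loudly, one option is to check for gaps between consecutive matches; generate_tokens_strict below is a hypothetical sketch, not from the book:

def generate_tokens_strict(pat, text):
    Token = namedtuple('Token', ['type', 'value'])
    pos = 0
    for m in re.finditer(pat, text):
        if m.start() != pos:
            # Some text between matches was not covered by any token pattern
            raise SyntaxError(f'unexpected text {text[pos:m.start()]!r} at index {pos}')
        pos = m.end()
        yield Token(m.lastgroup, m.group())
    if pos != len(text):
        raise SyntaxError(f'unexpected text {text[pos:]!r} at index {pos}')

With compiled_pattern and '30 <= 31', this raises on the stray '<' instead of skipping it.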
Here is a more complete example:
import re
import collections

Token = collections.namedtuple('Token', ['type', 'value', 'line', 'column'])

# A newline '\n' counts as a single character
def tokenize(code):
    keywords = {'IF', 'THEN', 'ENDIF', 'FOR', 'NEXT', 'GOSUB', 'RETURN'}
    token_specification = [
        ('NUMBER',   r'\d+(\.\d*)?'),  # Integer or decimal number
        ('ASSIGN',   r':='),           # Assignment operator
        ('END',      r';'),            # Statement terminator
        ('ID',       r'[A-Za-z]+'),    # Identifiers
        ('OP',       r'[+\-*/]'),      # Arithmetic operators
        ('NEWLINE',  r'\n'),           # Line endings
        ('SKIP',     r'[ \t]+'),       # Skip over spaces and tabs
        ('MISMATCH', r'.'),            # Any other character
    ]
    tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
    line_num = 1
    line_start = 0
    for mo in re.finditer(tok_regex, code):
        kind = mo.lastgroup
        value = mo.group()
        column = mo.start() - line_start
        if kind == 'NUMBER':
            value = float(value) if '.' in value else int(value)
        elif kind == 'ID' and value in keywords:
            kind = value
        elif kind == 'NEWLINE':
            line_start = mo.end()
            line_num += 1
            continue
        elif kind == 'SKIP':
            continue
        elif kind == 'MISMATCH':
            raise RuntimeError(f'{value!r} unexpected on line {line_num}')
        yield Token(kind, value, line_num, column)

statements = '''
    IF quantity THEN
        total := total + price * quantity;
        tax := price * 0.05;
    ENDIF;
'''

for token in tokenize(statements):
    print(token)

"""
Token(type='IF', value='IF', line=2, column=4)
Token(type='ID', value='quantity', line=2, column=7)
Token(type='THEN', value='THEN', line=2, column=16)
Token(type='ID', value='total', line=3, column=8)
Token(type='ASSIGN', value=':=', line=3, column=14)
Token(type='ID', value='total', line=3, column=17)
Token(type='OP', value='+', line=3, column=23)
Token(type='ID', value='price', line=3, column=25)
Token(type='OP', value='*', line=3, column=31)
Token(type='ID', value='quantity', line=3, column=33)
Token(type='END', value=';', line=3, column=41)
Token(type='ID', value='tax', line=4, column=8)
Token(type='ASSIGN', value=':=', line=4, column=12)
Token(type='ID', value='price', line=4, column=15)
Token(type='OP', value='*', line=4, column=21)
Token(type='NUMBER', value=0.05, line=4, column=23)
Token(type='END', value=';', line=4, column=27)
Token(type='ENDIF', value='ENDIF', line=5, column=4)
Token(type='END', value=';', line=5, column=9)
"""