The library I'm having difficulties with has only a small user community, so I'm opening this problem up to the wider community in the hope that someone can offer some advice.
Basically, I'm using a library called nltk (the Natural Language Toolkit) to perform some natural language processing on text files. Before I get into the specifics of the problem, anyone who wishes to help out will need to do the following first:
1. Go to: http://nltk.sourceforge.net/install.html and install the appropriate version of the library.
2. Check out: http://nltk.sourceforge.net/api-1.4/index.html for documentation on all classes within this library.
I will now explain the problem and then supply the code, which you can copy into a Python file and run to see the error message for yourself.
The problem is with the Brill tagger supplied with nltk: I've run into trouble invoking the tag method of the BrillTagger class.
I've managed to train the Brill tagger on the 'treebank' corpus, but when I come to invoke the 'tag' method, I receive the error "KeyError: SUBTOKENS". I understand what the error refers to: the method 'tag(self, token)' requires a token instance whose subtokens it can assign POS tags to. What I can't see is why, for some reason, it's not happy with the variable I'm passing.
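To illustrate what I believe the error means, here is a plain-Python sketch of the kind of dictionary lookup I think is failing inside the tagger. This is only my guess at the mechanism, not actual nltk code: my token was built with its subtokens stored under a 'WORDS' key, and the tagger seems to be looking for them under some default key instead.

```python
# My token, roughly as built by WhitespaceTokenizer(SUBTOKENS='WORDS'):
token = {'TEXT': 'some text', 'WORDS': ['some', 'text']}

def tag(token):
    # If the tagger reads its subtokens under a default property name,
    # a token that only has a 'WORDS' entry would trigger the error I see.
    return token['SUBTOKENS']

try:
    tag(token)
except KeyError as e:
    print("KeyError:", e)  # prints: KeyError: 'SUBTOKENS'
```

If that guess is right, the question becomes how to tell the tagger which property name my subtokens live under, but I haven't found the answer in the API docs.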
Below is the code, which you can copy and run directly to reproduce the error.
import re
import sys
sys.path.append('/home/csunix/extras/nltk/1.4.2/lib/python2.3/site-packages')
sys.path.append('/home/csunix/extras/nltk/1.4.2/lib/python2.3/site-packages/Numeric')
import nltk
from nltk.tokenizer import *
from nltk.corpus import SimpleCorpusReader
from nltk.probability import FreqDist
from nltk.parser import ParserI
from nltk.stemmer.porter import *
from nltk.tagger import *
from nltk.tagger.brill import *
from nltk.corpus import words as w, brown, treebank
corpusTStext = "some text to be assigned part of speech tags. I am using a corpus but for this example might as well just use a small string of text"
# Tokenize string to extract words
corpusTStoken = Token(TEXT=corpusTStext)
# tokenize() adds the subtokens to the token in place, so no assignment is needed
WhitespaceTokenizer(SUBTOKENS='WORDS').tokenize(corpusTStoken)
# Tokenize string to extract bi-grams
# Create bi-grams constructed from current word and word adjacent to the left
corpusTSnglhstoken = Token(TEXT=corpusTStext)
pat = r'\w+\s+\w+'
RegexpTokenizer(pat, negative=False, SUBTOKENS='WORDS').tokenize(corpusTSnglhstoken)
train_tokens = []
items = treebank.items('tagged')
for item in items[:100]:
    item = treebank.read(item)
    for sent in item['SENTS']:
        train_tokens += sent['WORDS']
# Discard tokens whose text starts with a bracketing/annotation character
train_tokens = [tok for tok in train_tokens if tok['TEXT'][0] not in "[]="]
#train_tokens.append(w.read('en_GB.dic'))
trainCutoff = int(len(train_tokens)*0.8)
train_tokens = Token(SUBTOKENS=train_tokens[0:trainCutoff])
# Train a Unigram Tagger
postagger = UnigramTagger(TAG='POS')
postagger.train(train_tokens)
# Train Brill Tagger
templates = [
    SymmetricProximateTokensTemplate(ProximateTagsRule, (1, 1)),
    SymmetricProximateTokensTemplate(ProximateTagsRule, (2, 2)),
    SymmetricProximateTokensTemplate(ProximateTagsRule, (1, 2)),
    SymmetricProximateTokensTemplate(ProximateTagsRule, (1, 3)),
    SymmetricProximateTokensTemplate(ProximateWordsRule, (1, 1)),
    SymmetricProximateTokensTemplate(ProximateWordsRule, (2, 2)),
    SymmetricProximateTokensTemplate(ProximateWordsRule, (1, 2)),
    SymmetricProximateTokensTemplate(ProximateWordsRule, (1, 3)),
    ProximateTokensTemplate(ProximateTagsRule, (-1, -1), (1, 1)),
    ProximateTokensTemplate(ProximateWordsRule, (-1, -1), (1, 1)),
]
trace = 3
brilltrainer = BrillTaggerTrainer(postagger, templates, trace, TAG='POS')
brillrules = brilltrainer.train(train_tokens, max_rules=50, min_score=2)
brillrules = brillrules.rules
# (POS) Tag corpus training set (corpusTS)
brilltagger = BrillTagger(postagger, brillrules)
brilltagger.tag(corpusTStoken)  # this is the line that raises "KeyError: SUBTOKENS"
tagwords = open("taggedwords.txt", "w")
for token in corpusTStoken['WORDS']:
    tagwords.write(token['TEXT'] + "/" + str(token['TAG']) + "\n")
tagwords.close()
I will be very appreciative of any advice anyone is able to offer.
Thanks in advance,
Mark