Clean up data efficiently in Python

I have data in the following format:

(TOP (S (PP-LOC (IN In) (NP (NP (DT an) (NNP Oct.) (CD 19) (NN review) ) (PP (IN of) (NP (`` ``) (NP-TTL (DT The) (NN Misanthrope) ) ('' '') (PP-LOC (IN at) (NP (NP (NNP Chicago) (POS 's) ) (NNP Goodman) (NNP Theatre) )))) (PRN (-LRB- -LRB-) (`` ``) (S-HLN (NP-SBJ (VBN Revitalized) (NNS Classics) ) (VP (VBP Take) (NP (DT the) (NN Stage) ) (PP-LOC (IN in) (NP (NNP Windy) (NNP City) )))) (, ,) ('' '') (NP-TMP (NN Leisure) (CC &) (NNS Arts) ) (-RRB- -RRB-) ))) (, ,) (NP-SBJ-2 (NP (NP (DT the) (NN role) ) (PP (IN of) (NP (NNP Celimene) ))) (, ,) (VP (VBN played) (NP (-NONE- *) ) (PP (IN by) (NP-LGS (NNP Kim) (NNP Cattrall) ))) (, ,) ) (VP (VBD was) (VP (ADVP-MNR (RB mistakenly) ) (VBN attributed) (NP (-NONE- *-2) ) (PP-CLR (TO to) (NP (NNP Christina) (NNP Haag) )))) (. .) ))

(TOP (S (NP-SBJ (NNP Ms.) (NNP Haag) ) (VP (VBZ plays) (NP (NNP Elianti) )) (. .) ))

... (there are 7000 more of these)

This data was taken from a newspaper. Each line is one sentence (starting with "TOP"). From this data, I need only the innermost (tag word) pairs of each sentence, without the surrounding brackets:

(IN In) (DT an) (NNP Oct.) (CD 19) (NN review) (IN of) (`` ``) (DT The) (NN Misanthrope) ('' '') (IN at) (NNP Chicago) (POS 's) (NNP Goodman) (NNP Theatre) (-LRB- -LRB-) (`` ``) (VBN Revitalized) (NNS Classics) (VBP Take) (DT the) (NN Stage) (IN in) (NNP Windy) (NNP City) (, ,) ('' '') (NN Leisure) (CC &) (NNS Arts) (-RRB- -RRB-) (, ,) (DT the) (NN role) (IN of) (NNP Celimene) (, ,) (VBN played) (-NONE- *) (IN by) (NNP Kim) (NNP Cattrall) (, ,) (VBD was) (RB mistakenly) (VBN attributed) (-NONE- *-2) (TO to) (NNP Christina) (NNP Haag) (. .)

(NNP Ms.) (NNP Haag) (VBZ plays) (NNP Elianti) (. .)

      

I tried the following:

f = open('filename')
data = f.readlines()
f.close()

      

This part builds an array that holds, for each line, a list of (tag, word) tuples extracted with a regex:

import re
import numpy

tag_word_train = numpy.empty(5000, dtype='object')
for i in range(5000):
    tag_word_train[i] = re.findall(r'\(([\w.-]+)\s([\w.-]+)\)', data[i])

      

It takes a very long time, so I couldn't tell whether it is correct.

Do you have any idea how to do this efficiently?

Thanks,

Hadas



3 answers


nltk.tree provides functions that read in a parse tree and extract the pairs of words and part-of-speech tags you want in your output:



>>> import nltk.tree
>>> t = nltk.tree.Tree.fromstring("(TOP (S (NP-SBJ (NNP Ms.) (NNP Haag) ) (VP (VBZ plays) (NP (NNP Elianti) )) (. .) ))")
>>> t.pos()
[('Ms.', 'NNP'), ('Haag', 'NNP'), ('plays', 'VBZ'), ('Elianti', 'NNP'), ('.', '.')]

      



nltk has a class, Tree, that probably suits your needs. In particular, you will want to use the class method nltk.tree.Tree.fromstring:



>>> import nltk.tree
>>> nltk.tree.Tree.fromstring("(S (NP (DT The) (N cat)) (VP (V ran)))")
Tree('S', [Tree('NP', [Tree('DT', ['The']), Tree('N', ['cat'])]), Tree('VP', [Tree('V', ['ran'])])])
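
Where nltk is not available and only the (tag, word) leaves are needed, the same extraction can be sketched in plain Python. This is a minimal illustration, not nltk's implementation: `read_leaves` is a hypothetical helper, and it assumes that words never contain literal parentheses (the sample data encodes them as -LRB- and -RRB-).

```python
def read_leaves(line):
    """Collect (tag, word) pairs from one bracketed parse, left to right."""
    # Pad parentheses with spaces so a plain split() yields clean tokens.
    tokens = line.replace("(", " ( ").replace(")", " ) ").split()
    parens = ("(", ")")
    leaves = []
    depth = 0
    for i, tok in enumerate(tokens):
        if tok == "(":
            depth += 1
        elif tok == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced ')'")
            # A leaf is exactly "( TAG WORD )": the token three back is "("
            # and the two tokens in between are not parentheses.
            if (i >= 3 and tokens[i - 3] == "("
                    and tokens[i - 2] not in parens
                    and tokens[i - 1] not in parens):
                leaves.append((tokens[i - 2], tokens[i - 1]))
    if depth != 0:
        raise ValueError("unbalanced '('")
    return leaves


sentence = "(TOP (S (NP-SBJ (NNP Ms.) (NNP Haag) ) (VP (VBZ plays) (NP (NNP Elianti) )) (. .) ))"
print(read_leaves(sentence))
# [('NNP', 'Ms.'), ('NNP', 'Haag'), ('VBZ', 'plays'), ('NNP', 'Elianti'), ('.', '.')]
```

Unlike a Tree object, this keeps no phrase structure. It only reports the leaves and raises ValueError on a line whose brackets do not balance.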

      



Try the following:

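import re

import numpy

f = open('filename')
data = f.readlines()
f.close()

# One slot per line, so the array fits however many sentences the file has.
tag_word_train = numpy.empty(len(data), dtype='object')
# Compile the pattern once; the raw string keeps the backslashes literal.
exp = re.compile(r"\([^()]*\)")

for i, line in enumerate(data):
    # Search the current line, not the whole list of lines.
    tag_word_train[i] = exp.findall(line)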

      

Breaking the regex down:

\( matches a left parenthesis

[^()]* matches zero or more characters that are neither left nor right parentheses

\) matches a right parenthesis

(I am assuming that what you want are the parenthesized terms that do not themselves contain a parenthesized term. If that assumption is wrong, the regex will not do what you want.)
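
The pattern above returns whole strings such as "(NNP Ms.)". If (tag, word) tuples are wanted, as in the question, two capture groups can split each match. The sketch below is one possible variant, and the names `LEAF` and `leaf_pairs` are made up for illustration. It uses "anything except whitespace and parentheses" for both parts, because a class like [\w.-] would skip tokens such as (, ,), (`` ``) or (POS 's).

```python
import re

# Innermost group only: "(" TAG whitespace WORD ")", with no nested parentheses.
LEAF = re.compile(r"\(\s*([^\s()]+)\s+([^\s()]+)\s*\)")


def leaf_pairs(line):
    """Return the (tag, word) tuples of one bracketed sentence."""
    return LEAF.findall(line)


line = "(TOP (S (NP-SBJ (NNP Ms.) (NNP Haag) ) (VP (VBZ plays) (NP (NNP Elianti) )) (. .) ))"
pairs = leaf_pairs(line)
print(pairs)
# [('NNP', 'Ms.'), ('NNP', 'Haag'), ('VBZ', 'plays'), ('NNP', 'Elianti'), ('.', '.')]
```

Compiling the pattern once and calling it per line keeps the work to a single pass over each sentence. Applied to a whole file, it could be used as `[leaf_pairs(l) for l in f]`.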
