Limiting the number of iterations in Stanford NER

I am training a Stanford NER CRF model on my own dataset, but training has now run past 330 iterations and the learning process has taken hours. The following is printed on the terminal:

Iter 335 evals 400 <D> [M 1.000E0] 2.880E3 38054.87s |5.680E1| {6.652E-6} 4.488E-4 - 
Iter 336 evals 401 <D> [M 1.000E0] 2.880E3 38153.66s |1.243E2| {1.456E-5} 4.415E-4 -
...

The properties file I used is below. Is there a way to limit the number of iterations to, say, 20?

#location of the training file
trainFile = TRAIN5000.tsv
#location where you would like to save (serialize to) your
#classifier; adding .gz at the end automatically gzips the file,
#making it faster and smaller
serializeTo = ner-model_TRAIN5000.ser.gz

#structure of your training file; this tells the classifier
#that the word is in column 0 and the correct answer is in
#column 1
map = word=0,answer=1

#these are the features we'd like to train with
#some are discussed below, the rest can be
#understood by looking at NERFeatureFactory
useClassFeature=true
useWord=true
useNGrams=true
#no ngrams will be included that do not contain either the
#beginning or end of the word
noMidNGrams=true
useDisjunctive=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
#the next 4 deal with word shape features
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
saveFeatureIndexToDisk = true
printFeatures=true
useObservedSequencesOnly=true
featureDiffThresh=0.05
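For reference, training is launched with the standard CRFClassifier invocation from the CRF FAQ (the jar and .prop file names here are illustrative and may differ by release):

java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop TRAIN5000.prop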

      


2 answers


I have experimented with training a biomedical NER (BioNER) model with the Stanford CoreNLP CRF classifier on IOB-tagged text, as described at https://nlp.stanford.edu/software/crf-faq.html.

My corpus, built from downloaded sources, was very large (~1.5M lines, 6 entity classes: GENE; ...). Since training seemed to go on indefinitely, I plotted the objective value reported at each iteration to get an idea of the progress:

[Figure: CRF objective value at each training iteration]

By examining the Java source code, I found that the default value of TOL (tolerance; used to decide when to end training) was 1E-6 (0.000001), set in .../CoreNLP/src/edu/stanford/nlp/optimization/QNMinimizer.java.
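For intuition, this kind of check is a relative-improvement test. Here is a paraphrased sketch in Java (illustrative only, not the exact QNMinimizer source), which shows why a tiny tol almost never fires early while a large one can be tripped by early-iteration noise:

// Paraphrased relative-change convergence test, of the kind used by
// quasi-Newton optimizers such as QNMinimizer (illustrative only):
// stop once the objective barely changes between iterations.
static boolean converged(double prevValue, double newValue, double tol) {
  return 2.0 * Math.abs(newValue - prevValue)
      <= tol * (Math.abs(newValue) + Math.abs(prevValue) + 1e-10);
}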

Looking at that plot, my initial training run was never going to finish. [The graph also shows that setting a larger TOL value, e.g. tolerance=0.05, causes premature termination of training, because that TOL is tripped by the "noise" that occurs at the beginning of training. I confirmed this with tolerance=0.05 in my .prop file; however, TOL values of 0.01, 0.005, etc. were OK.]



Adding " maxIterations=20

" to the properties file as described by @StanfordNLPHelp (elsewhere in this thread) seemed to be ignored unless I added and changed the value tolerance=

in my properties file bioner.prop

; eg.

tolerance=0.005
maxIterations=20    ## optional

      

In that case the classifier quickly trained and serialized the model (bioner.ser.gz). [When I added a maxIterations line to my .prop file without also adding a tolerance line, training just kept running "forever", as before.]

A list of parameters that can be included in the .prop file can be found here:

https://nlp.stanford.edu/nlp/javadoc/javanlp-3.5.0/edu/stanford/nlp/ie/NERFeatureFactory.html


Add maxIterations=20 to the properties file.
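For instance, appended to the .prop file shown in the question (20 is just an example value):

# cap the number of optimizer iterations
maxIterations=20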









