Suggestion on how to use for POS tagging

Apr 8, 2013 at 3:07 PM
Hi, I'm currently using an HMM for POS tagging, and CRF is still way better!

I have two questions:

Are the model files (.alpha, .feature) always so huge? Over 200 MB, as in the NER example?
Is there a way to shrink the model down to a manageable size?

If I do POS tagging with these models, how should the training data be laid out?

I currently have a not-so-large annotated corpus (many sequences are simply missing),
and the problem is that there are not many free corpora available in Spanish,
so I want to do some bootstrapping: train from a small core, then read larger data and automatically acquire some extra training material from the automatically detected data. Does this work with CRF?

Many thanks in advance; the software really shines (I had a glimpse inside).

Andres H
Apr 9, 2013 at 10:20 AM
  1. The model file size depends on the size of the feature set: the bigger the feature set, the bigger the model files.
  2. When encoding the model, you can use L1 regularization to generate a sparse model (pass "-regtype L1" as a parameter to CRFSharpConsole.exe -encode), and then use "CRFSharpConsole.exe -shrink" to shrink the model. After that, the size of the model files will be reduced significantly.
  3. For POS tagging, the suitable size actually depends on many factors, such as corpus size, feature set, how well the corpus covers different language phenomena, and others. If you want to train an excellent model, spending time on corpus and feature engineering is necessary.
  4. I think it's a good idea to generate training corpus automatically with a bootstrapping method. At first, maybe you can define a few tag types and build a model, then try to mine some rules from its output and use those rules to label training corpus automatically.
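The bootstrapping idea in point 4 can be sketched as a simple self-training loop. This is only an illustration: the "tagger" below is a toy most-frequent-tag baseline standing in for a real CRF (with CRFSharp you would re-encode the model each round instead of calling `train_baseline`), and the confidence measure and threshold are arbitrary choices, not part of CRFSharp.

```python
from collections import Counter, defaultdict

def train_baseline(labeled):
    """Stand-in for CRF training: remember each word's most frequent tag.

    `labeled` is a list of sentences, each a list of (word, tag) pairs.
    """
    counts = defaultdict(Counter)
    for sentence in labeled:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag_with_confidence(model, sentence):
    """Tag a word list; confidence = fraction of words the model has seen."""
    tags = [model.get(w) for w in sentence]
    conf = sum(t is not None for t in tags) / len(sentence)
    return [(w, t or "UNK") for w, t in zip(sentence, tags)], conf

def self_train(seed, unlabeled, rounds=3, threshold=0.8):
    """Self-training loop: auto-label confident sentences, add them, retrain."""
    labeled = list(seed)
    pool = list(unlabeled)
    for _ in range(rounds):
        model = train_baseline(labeled)
        still_uncertain = []
        for sent in pool:
            tagged, conf = tag_with_confidence(model, sent)
            if conf >= threshold:
                labeled.append(tagged)        # promote confident auto-labels
            else:
                still_uncertain.append(sent)  # keep for a later round
        pool = still_uncertain
    return train_baseline(labeled)
```

The key design point is the confidence gate: only sentences the current model tags confidently are promoted into the training set, which limits (but does not eliminate) the risk of reinforcing the model's own mistakes.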
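On the data-layout question from the original post: toolkits in the CRF++ family, which CRFSharp follows, typically expect one token per line with tab-separated feature columns, the gold tag in the last column, and a blank line between sequences. A small sketch of that layout and a reader for it (the Spanish words and tags are made-up examples, not from any real corpus):

```python
# CRF++-style column layout: one token per line, tab-separated columns,
# tag in the last column, blank line between sequences.
SAMPLE = """\
El\tDET
perro\tNOUN
ladra\tVERB

Hola\tINTJ
mundo\tNOUN
"""

def read_sequences(text):
    """Parse column-format data into a list of [(token, tag), ...] sequences."""
    sequences, current = [], []
    for line in text.splitlines():
        if not line.strip():          # blank line ends a sequence
            if current:
                sequences.append(current)
                current = []
            continue
        *features, tag = line.split("\t")
        current.append((features[0], tag))
    if current:                       # flush a trailing sequence
        sequences.append(current)
    return sequences
```

Extra columns between the token and the tag (lemma, suffix, capitalization, etc.) become additional feature columns that the feature templates can reference.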
If you have any issues, please feel free to let me know. :)

Zhongkai Fu