[To get the latest source code, please visit the source code section and sync the project from the code base]

Project Description
CRFSharp is a Conditional Random Fields (CRF) implementation in .NET (C#), a machine learning algorithm for learning from labeled sequences of examples. It is widely used in Natural Language Processing (NLP) tasks, for example word breaking, part-of-speech tagging, named entity recognition and so on.

CRFSharp (aka CRF#) is based on .NET Framework 4.0, and its main algorithm is similar to that of CRF++ written by Taku Kudo. It encodes model parameters with L-BFGS. Moreover, it has many significant improvements over CRF++, such as fully parallel encoding, optimized memory usage and so on.

Currently, when training on a corpus, CRFSharp can make full use of multi-core CPUs and uses memory effectively compared with CRF++, especially for very large training corpora and tag sets. So, in the same environment, CRFSharp is able to encode much more complex models at a lower cost than CRF++.

The following screenshot shows CRFSharp running on a machine with 16 CPU cores and 96GB of memory.

The training corpus has 1.24 million records with nearly 1.2 billion features. As the screenshot shows, all CPU cores are fully used and memory usage is stable. The average encoding time per iteration is 3 minutes and 33 seconds.

Besides the command line tool, CRFSharp also provides APIs that can be used in other projects and services for key technical tasks. For example, the WordSegment project uses CRFSharp to recognize named entities; the Query Term Analyzer project uses it to analyze query term importance in word formation; and the Geography Coder project uses it to detect geo-entities in text. For detailed information about the APIs, please see the section [Use CRFSharp API in your project] below.

To use CRFSharp, we need to prepare a corpus and design feature templates first. CRFSharp's file formats are compatible with CRF++ (official website: http://crfpp.googlecode.com/svn/trunk/doc/index.html). The following paragraphs introduce the data formats and how to use CRFSharp from both the command line and the APIs.

Training file format

A training corpus contains many records that describe what the model should learn. Each record is split into one or more tokens, and each token has one or more dimensions of features to describe itself.

In the training file, each record is represented as a matrix and ends with an empty line. In the matrix, each row describes one token and its features, and each column represents a feature in one dimension. Across the entire training corpus, the number of columns must be fixed.

When CRFSharp encodes, if the column size is N, then according to what the template file describes, the first N-1 columns are usually used as input data to generate the binary feature set and train the model. The Nth column (aka the last column) is the answer the model should output. That means, for one record, if we have an ideally encoded model, then given all tokens' first N-1 columns, the model should output each token's Nth column as the entire record's answer.
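Because the column count must be constant across the whole corpus, it can be worth validating a file before encoding. The following stand-alone sketch (not part of CRFSharp; all names are illustrative) checks that every non-empty line has the same number of whitespace-separated columns:

```csharp
using System;

// Illustrative only (not part of CRFSharp): check that every non-empty line
// in a training file has the same number of whitespace-separated columns.
// An empty line only separates records, so it is skipped.
bool HasFixedColumnCount(string[] lines)
{
    int expected = -1;
    foreach (var line in lines)
    {
        if (line.Trim().Length == 0) continue; // record separator
        int cols = line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries).Length;
        if (expected < 0) expected = cols;
        else if (cols != expected) return false;
    }
    return true;
}

var sample = new[] { "and CC S", "are VBP S", "", "44 CD S" };
Console.WriteLine(HasFixedColumnCount(sample)); // True
```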

Here is an example (a bigger training example file is available in the download section):

and CC S
are VBP S
major JJ S
financial JJ S
centers NNS S
p FW S
y NN S
h FW S
44 CD S

The example is used to label named entities in records. It has two records, and each token has three columns. The first column is the token's word; the second column is the token's pos-tag; and the third column describes whether the token is a named entity (or part of one) and its type. The first and second columns are the input data for encoding the model, and the third column is the model's ideal output, i.e. the answer.

Test file format

The test file has a similar format to the training file. The only difference between the training and test file is the last column: in the test file, all columns are input features for the CRF model.
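For illustration, the sketch below (not a CRFSharp API; the helper name is made up) derives a test-style record from a training record by dropping the last (answer) column of each token line:

```csharp
using System;
using System.Linq;

// Illustrative only (not a CRFSharp API): build a test-style record from a
// training record by dropping the last (answer) column of every token line.
string[] ToTestRecord(string[] trainingRecord)
{
    return trainingRecord
        .Select(line => string.Join(" ",
            line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries)
                .SkipLast(1)))
        .ToArray();
}

Console.WriteLine(string.Join(" | ", ToTestRecord(new[] { "and CC S", "are VBP S" }))); // and CC | are VBP
```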

CRFSharp command line tools

CRFSharpConsole.exe is a command line tool for encoding and decoding CRF models. By default, its help information is shown as follows:
Linear-chain CRF encoder & decoder by Zhongkai Fu (fuzhongkai@gmail.com)
CRFSharpConsole.exe [parameter list...]
-encode [parameter list...] - Encode CRF model from given training corpus
-decode [parameter list...] - Decode CRF model to label text

As the above information shows, the tool provides two run modes. Encode mode is for training a model, and decode mode is for testing a model. The following paragraphs introduce how to use these two modes.

Encode model

This mode is used to train a CRF model from a training corpus. Besides the -encode parameter, the command line parameters are as follows:
CRFSharpConsole.exe -encode [parameters list]
-template <filename>: template file name
-trainfile <filename>: training corpus file name
-modelfile <filename>: encoded model file name
-maxiter <integer number>: maximum iteration; when the encoding iteration reaches this value, the process ends. Default value is 1000
-minfeafreq <integer number>: minimum feature frequency; if a feature's frequency is less than this value, the feature is dropped. Default value is 2
-mindiff <floating-point number>: minimum diff value; when diff is less than this value 3 consecutive times, the process ends. Default value is 0.0001
-thread <integer number>: the number of threads used to train the model. Default value is 1
-slotrate <floating-point value>: the maximum slot usage rate threshold when building the feature set. It ranges in (0.0, 1.0). A higher value means a longer time to build the feature set, but a smaller feature set size. Default value is 0.95
-hugelexmem <integer number>: build the lexical dictionary in huge mode; shrinking starts when used memory reaches this value. This mode can build more lexical items, but more slowly. The value ranges in [1,100], and the option is disabled by default.
-regtype <type string>: regularization type. L1 and L2 regularization are supported. Default is L2
-debug: encode the model in debug mode

Note: when either -maxiter reaches its set value, or diff stays below the -mindiff value for three consecutive iterations, the training process finishes and the encoded model is saved.

Note: -hugelexmem is only intended for special tasks and is not recommended for common tasks, since memory shrinking costs a lot of time in order to load more lexical features into memory.

A command line example is as follows:
CRFSharpConsole.exe -encode -template template.1 -trainfile ner.train -modelfile ner.model -maxiter 100 -minfeafreq 1 -mindiff 0.0001 -thread 4 -debug

The entire encoding process contains four main steps:
1. Load the training corpus from file, then generate and select the feature set according to the templates.
2. Build the selected feature set index as double array trie-tree data and save it into a file.
3. Run the encoding process iteratively to tune feature values until an end condition is reached.
4. Save the encoded feature values into a file.
In step 3, after each iteration, some detailed encoding information is shown. For example:
M_RANK_1 [FR=47658, TE=54.84%]
M_RANK_2:27.07% M_RANK_0:26.65% E_RANK_0:0.31% B_RANK_0:0.21% E_RANK_1:0.19%
iter=65 terr=0.320290 serr=0.717372 diff=0.0559666295793355 fsize=73762836(1.10% act)
Time span: 00:31:56.4866295, Aver. time span per iter: 00:00:29
The encoding information has two parts. The first part is per-tag information; the second part is overview information.

For each tag, there are two lines of information. The first line shows the total count of this tag (FR) and the current token error rate (TE) for this tag. The second line shows this tag's token error distribution. In the above example, at iteration 65, the M_RANK_1 tag's token error rate is 54.84% in total. Of these token errors, 27.07% were labeled as M_RANK_2, 26.65% as M_RANK_0, and so on.

The second part (overview information) shows some global information:
iter : the number of iterations processed
terr : the overall token error rate
serr : the overall record error rate
diff : the difference between the current and previous iteration
fsize (x% act) : the total size of the feature set; x% act means the percentage of features with non-zero values. With L1 regularization, x% decreases as iter increases. With L2 regularization, x% is always 100%.
Time span : how long the encoding process has taken
Aver. time span per iter : the average time span per iteration
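If you want to track encoding progress programmatically, the overview line can be parsed. Here is a minimal sketch, assuming the exact layout shown above (this parser is illustrative, not part of CRFSharp):

```csharp
using System;
using System.Collections.Generic;

// Illustrative only: parse the key=value fields of the overview line shown
// above. It assumes the layout "iter=... terr=... serr=... diff=... fsize=...";
// the "(x% act)" suffix of fsize is kept as part of its raw value here.
Dictionary<string, string> ParseOverview(string line)
{
    var fields = new Dictionary<string, string>();
    foreach (var token in line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
    {
        int eq = token.IndexOf('=');
        if (eq > 0) fields[token.Substring(0, eq)] = token.Substring(eq + 1);
    }
    return fields;
}

var info = ParseOverview("iter=65 terr=0.320290 serr=0.717372 diff=0.0559666295793355 fsize=73762836(1.10% act)");
Console.WriteLine(info["iter"] + " " + info["terr"]); // 65 0.320290
```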

When the model encoding process finishes, at least three files are generated.
file1: [model file name]
This is the metadata file for the model. It contains global model parameters, feature templates, output tags and so on.
file2: [model file name].feature
This is the feature set list file for the model. It contains all features' strings and their corresponding ids. For high performance, this data is built as a double array trie-tree.
file3: [model file name].alpha
This is the feature set value file for the model. It contains all feature values encoded by CRFSharp.
If debug mode is enabled, an additional file is generated.
file4: [model file name].feature.raw_text
This is the feature set list file for the model in raw text format. It contains the same data as [model file name].feature, but saved in raw text for debugging.

Decode model

This mode is used to decode and test an encoded model. With the -decode parameter, the command line is as follows:
CRFSharpConsole.exe -decode [Encoded CRF model file name] [input file name] [output file name]
[Encoded CRF model file name] is the CRF model encoded by CRFSharp
[input file name] is the test file name
[output file name] is the decoder result file name. Besides this file, the tool generates another file named [output file name].raw which contains the raw result from the decoder.
An example is as follows:
CRFSharpConsole.exe -decode ner.model nertest.txt nertestresult.txt

Shrink model

A model encoded with L1 regularization is usually sparse. The shrink parameter is used to reduce the model file size. With the -shrink parameter, the command line is as follows:
CRFSharpConsole.exe -shrink [Encoded CRF model file name] [Shrinked CRF model file name] [thread num]
An example is as follows:
CRFSharpConsole.exe -shrink ner.model ner_shrinked.model 16
This example shrinks ner.model using 16 working threads.

Feature templates

The CRFSharp template format is fully compatible with CRF++ and is used to generate the feature set from the training and testing corpus.

In the template file, each line describes one template, which consists of a prefix, an id and a rule-string. The prefix indicates the template type. There are two prefixes: U for unigram templates, and B for bigram templates. The id distinguishes different templates. The rule-string guides CRFSharp in generating features.

The rule-string has two forms: one is a constant string, the other is a macro. The simplest macro form is %x[row,col]. Row specifies the offset between the current focused token and the token the feature is generated from. Col specifies the absolute column position in the corpus. Moreover, combined macros are also supported, for example: %x[row1,col1]/%x[row2,col2]. When generating the feature set, each macro is replaced with its specific string. A template file example is as follows:

# Unigram
# Bigram

This template file contains both unigram and bigram templates. Assuming the current focused token is "York NNP E_LOCATION" in the first record of the training corpus above, the generated unigram feature set is as follows:


Although U07 and U08, and likewise U11 and U12, have the same rule-strings, we can still distinguish them by their id strings.
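To make the macro semantics concrete, here is a stand-alone sketch of how %x[row,col] maps to a feature string (illustrative only, not CRFSharp's implementation; the "_B" boundary placeholder is an assumption of this sketch):

```csharp
using System;

// Illustrative only: how a %x[row,col] macro maps to a feature string.
// "row" is an offset from the current token, "col" is an absolute column index.
// The "_B" boundary placeholder below is an assumption for this sketch, not
// necessarily what CRFSharp itself emits for out-of-range rows.
string ExpandMacro(string[][] record, int current, int row, int col)
{
    int r = current + row;
    if (r < 0 || r >= record.Length) return "_B" + row; // out-of-range placeholder
    return record[r][col];
}

var record = new[]
{
    new[] { "and", "CC", "S" },
    new[] { "are", "VBP", "S" },
    new[] { "major", "JJ", "S" },
};
// %x[-1,0] at the token "are" expands to the previous token's word:
Console.WriteLine(ExpandMacro(record, 1, -1, 0)); // and
// A combined macro %x[0,0]/%x[0,1] joins two expansions with '/':
Console.WriteLine(ExpandMacro(record, 1, 0, 0) + "/" + ExpandMacro(record, 1, 0, 1)); // are/VBP
```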

In the encoding process, according to the templates, the encoder generates the feature set (like the example above) from records in the training corpus and saves it into the model file.

In the decoding process, for each test record, the decoder also generates features from the templates and checks whether each feature exists in the model. If it does, the feature's alpha value is applied while computing the cost value.

For each token, how many features will be generated from unigram templates? As said above, if we have M unigram templates, each token will have M features generated from the template set. Moreover, assuming each token has N different output classes, in order to represent all possible statuses with binary functions, we need M*N features per token in total. For a record which contains L tokens, the feature size of this record is M*N*L.

For bigram templates, CRFSharp enumerates all possible combined output classes of two contiguous tokens and generates features for each combination. So, if each token has N different output classes and the number of features generated by the templates is M, the total bigram feature set size is N*N*M. For a record which contains L tokens, the feature size of this record is M*N*N*(L-1).
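The two formulas above can be checked with a quick worked example (the local functions below just restate the formulas from the text; they are not a CRFSharp API):

```csharp
using System;

// Worked example of the counts above: M templates, N output classes, L tokens.
// These local functions just restate the formulas from the text.
long UnigramFeatures(long m, long n, long l) => m * n * l;
long BigramFeatures(long m, long n, long l) => m * n * n * (l - 1);

// e.g. 10 unigram templates, 4 output tags, a 20-token record:
Console.WriteLine(UnigramFeatures(10, 4, 20)); // 800
// and 1 bigram template over the same record:
Console.WriteLine(BigramFeatures(1, 4, 20)); // 304
```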

Use CRFSharp API in your project

Besides the command line tool, to encode and decode CRF models you can also add CRFSharp.dll or CRFSharpWrapper.dll as a reference in your project. CRFSharp.dll contains the core algorithm and provides many low-level interfaces. In contrast, CRFSharpWrapper.dll wraps those low-level interfaces and provides high-level ones.

The following paragraphs show how to use CRFSharpWrapper.dll in a project.

Encode model

1. Add CRFSharpWrapper.dll as reference

2. Add following code snippet

int max_iter = 1000;
int min_feature_freq = 2;
double min_diff = 0.0001;
double slot_usage_rate_threshold = 0.95;
int threads_num = 1;
string strTemplateFileName = null; //template file name
string strTrainingCorpus = null; //training corpus file name
string strEncodedModelFileName = null; //encoded model file name
bool bDebugMode = false;
CRFSharpWrapper.Encoder encoder = new CRFSharpWrapper.Encoder();
//the literal 1 below is the value of the C parameter in Learn's signature
encoder.Learn(strTemplateFileName, strTrainingCorpus, strEncodedModelFileName, max_iter, min_feature_freq, min_diff, 1, threads_num, slot_usage_rate_threshold, bDebugMode);

Encoder.Learn is the wrapped encoder interface. It is defined as follows:

//encode a CRF model from a training corpus
public bool Learn(string templfile, //template file name
         string trainfile, //training corpus file name
         string modelfile, //encoded model file name
         int maxitr, //maximum iteration; when the encoding iteration reaches this value, the process ends
         int freq, //minimum feature frequency; if a feature's frequency is less than this value, the feature is dropped
         double eta, //minimum diff value; when diff is less than this value 3 consecutive times, the process ends
         float C, //cost parameter for regularization
         int thread_num, //the number of threads used to train the model
         double slot_usage_rate_threshold, //the slot usage rate threshold when building the feature set
         bool bDebugMode //encode the model in debug mode if true
         );

Decode model

1. Add CRFSharpWrapper.dll as reference

2. Add following code snippet

//Create a CRFSharp wrapper instance. It is a global instance
CRFSharpWrapper.Decoder crfWrapper = new CRFSharpWrapper.Decoder();

//Load the model from file
if (crfWrapper.LoadModel(strModelFileName) == false)
{
    return false;
}

//Create a decoder tagger instance. In a multi-threaded environment, each thread needs a separate instance
CRFSharpWrapper.SegDecoderTagger tagger = crfWrapper.CreateTagger();
if (tagger == null)
{
    return false;
}

//Build the feature set from the given test text
List<List<string>> featureSet = BuildFeatureSet(strTestText);

//Label tokens according to the feature set and save the result into crf_out
crfWrapper.Segment(crf_out, tagger, featureSet, 1, 0);

//Parse the result and save the N-best results into rstList
List<string> rstList = new List<string>();
for (int i = 0; i < crf_out.nbest; i++)
{
    string strout = "";
    crf_term_out term_out = crf_out.term_buf[i];
    for (int j = 0; j < term_out.Count; j++)
    {
        string str = strTestText.Substring(term_out.offsetList[j], term_out.lengthList[j]);
        string strNE = term_out.nePropList[j].strTag;
        strout += str + "[" + strNE + "] ";
    }
    rstList.Add(strout);
}

//An example of feature set building. It only uses a 1-dimension character-based feature
private static List<List<string>> BuildFeatureSet(string str)
{
    List<List<string>> sinbuf = new List<List<string>>();
    foreach (char ch in str)
    {
        sinbuf.Add(new List<string>());
        sinbuf[sinbuf.Count - 1].Add(ch.ToString());
    }
    return sinbuf;
}

Decoder.Segment is the wrapped decoder interface. It is defined as follows:

 //Segment the given text
 public int Segment(crf_out pout, //segment result
     SegDecoderTagger tagger, //tagger per thread
     List<List<string>> inbuf, //feature set to segment
     int nbest_value, //the number of N-best results needed
     int vlevel_value //0 - no need to calculate probability,
                      //1 - calculate all types of probability
                      //2 - only calculate the named entity's probability
     );

Last edited Jul 27, 2013 at 2:06 AM by monkeyfu, version 10


khaledhejazy86 Sep 11, 2014 at 11:04 AM 
Could you please update the API-Decode section? crf_out and crf_term_out need modification