If you need a reasonably sized English word frequency list, here is a good, free one based on the British National Corpus (BNC). This post is only a pointer to the real resource (Reference 1); the text below is copied from that reference and describes the file structure.
-----------------------------------------------
These are all available in 6 forms:
- sorted alphabetically ("al") or by frequency (highest frequency first) ("num");
- the complete lists, or a smaller file containing only those items occurring over five times (suffix "o5");
- all lists are available compressed using gzip (".gz").
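The naming scheme can be illustrated with a short sketch that rebuilds the file names listed further down in this post. It assumes names of the form `<subset>.<order><suffix>`, as they appear in that list; the function name `list_file_names` is made up for this example and is not part of the original resource.

```python
from itertools import product

# Subsets, sort orders and suffixes as they appear in the file list in this post.
SUBSETS = ("all", "written", "cg", "demog")
ORDERS = ("al", "num")                # alphabetical, or highest frequency first
SUFFIXES = (".gz", ".o5", ".o5.gz")   # the complete list is listed only in gzipped form


def list_file_names(subsets=SUBSETS):
    """Return every file name of the form <subset>.<order><suffix>."""
    return [f"{s}.{o}{x}" for s, o, x in product(subsets, ORDERS, SUFFIXES)]


names = list_file_names()
print(len(names))   # 4 subsets x 2 orders x 3 suffixes
print(names[:6])    # the six forms for the complete BNC
```

Each subset therefore comes in six forms: two sort orders times three suffixes.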
For a list and brief descriptions of CLAWS POS-tags, see the link on the original page (Reference 1).
The format is: four fields, separated by spaces.
1: frequency
2: word
3: pos
4: number of files the word occurs in
For non-orthographic words, spaces are replaced by underscore, giving e.g. "in_spite_of".
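As a rough sketch of how such a file could be read, the Python below splits each line into the four fields described above and handles both plain and gzipped files. The names `Entry`, `parse_line` and `read_entries` are invented for this example, the `latin-1` encoding is an assumption (the reference does not state one), and the sample line uses illustrative values rather than real counts.

```python
import gzip
from typing import Iterator, NamedTuple


class Entry(NamedTuple):
    frequency: int   # field 1
    word: str        # field 2 (multi-word items keep their underscores)
    pos: str         # field 3, a CLAWS POS-tag
    n_files: int     # field 4, number of files the word occurs in


def parse_line(line: str) -> Entry:
    """Split one space-separated line into its four fields."""
    fields = line.split()
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
    freq, word, pos, n_files = fields
    return Entry(int(freq), word, pos, int(n_files))


def read_entries(path: str, encoding: str = "latin-1") -> Iterator[Entry]:
    """Yield entries from a list file; the encoding default is an assumption."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding=encoding) as handle:
        for line in handle:
            if line.strip():          # skip blank lines
                yield parse_line(line)


# Illustrative values only, not taken from the real file.
entry = parse_line("6187267 the at0 4120")
print(entry.word, entry.frequency, entry.pos, entry.n_files)
```

Because the fields are separated by spaces and multi-word items use underscores, a plain `split()` is enough here. If the real files contain header or summary lines in another shape, `parse_line` will raise `ValueError` on them and the caller would need to skip those.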
Lists are provided for the complete BNC (all), and for three subsets, as below:
- cg: 'context-governed' spoken material (e.g. meetings, lectures etc.); 6.2M tokens, 79,906 types
- demog: 'demographic' spoken material (e.g. conversation); 4.2M tokens, 54,652 types
- written: 89.7M tokens, 921,074 types
- all: 100.1M tokens, 939,028 types
File sizes in MB ("al" and "num" variants all the same size) are:
           complete (uncompressed)   complete (.gz)   o5      o5.gz
-------------------------------------------------------------
all        18.1                      4.8              4.0     1.32
cg         1.4                       0.39             0.43    0.15
demog      0.9                       0.26             0.25    0.09
written    17.8                      4.7              3.9     1.30
-------------------------------------------------------------
For all.al.gz click here
For all.al.o5 click here
For all.al.o5.gz click here
For all.num.gz click here
For all.num.o5 click here
For all.num.o5.gz click here
For written.al.gz click here
For written.al.o5 click here
For written.al.o5.gz click here
For written.num.gz click here
For written.num.o5 click here
For written.num.o5.gz click here
For cg.al.gz click here
For cg.al.o5 click here
For cg.al.o5.gz click here
For cg.num.gz click here
For cg.num.o5 click here
For cg.num.o5.gz click here
For demog.al.gz click here
For demog.al.o5 click here
For demog.al.o5.gz click here
For demog.num.gz click here
For demog.num.o5 click here
For demog.num.o5.gz click here
References:
1. http://www.kilgarriff.co.uk/bnc-readme.html