Many of us have at some point needed a reasonably sized English word frequency list. Here is a good, free one based on the
British National Corpus (BNC). This post is just a pointer to the real resource (Reference 1), but I have copied some text from the reference that describes the file structure.
-----------------------------------------------
These are all available in 6 forms:
- sorted alphabetically ("al") or by frequency (highest frequency first) ("num");
- the complete lists, or a smaller file containing only those items occurring over five times (suffix "o5");
- all lists are available compressed using gzip (".gz"). The o5 lists are also available uncompressed (no suffix).
The frequencies are for <CLAWS-word, POS> pairs.
For a list and brief descriptions of CLAWS POS-tags, see the link on the reference page (Reference 1).
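The file names listed further down in this post appear to follow the pattern subset.order[.o5][.gz]. As a minimal sketch under that assumption, the names can be generated rather than typed by hand. The function name `list_file_names` and the constants are my own and are not part of the resource.

```python
from itertools import product

# Subsets and sort orders as named in the readme text.
SUBSETS = ("all", "written", "cg", "demog")
ORDERS = ("al", "num")  # alphabetical, or highest frequency first

# Complete lists are offered gzipped only; o5 lists are offered plain and gzipped.
SUFFIXES = (".gz", ".o5", ".o5.gz")


def list_file_names(subsets=SUBSETS):
    """Return the file names for the given subsets (assumed naming pattern)."""
    return [f"{s}.{o}{x}" for s, o, x in product(subsets, ORDERS, SUFFIXES)]


names = list_file_names()
print(len(names))   # 24
print(names[:6])
# ['all.al.gz', 'all.al.o5', 'all.al.o5.gz', 'all.num.gz', 'all.num.o5', 'all.num.o5.gz']
```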
The format is: four fields, separated by spaces.
1: frequency
2: word
3: pos
4: number of files the word occurs in
For non-orthographic words, spaces are replaced by underscore, giving eg "in_spite_of".
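To make the four-field format concrete, here is a small sketch that parses one line into named fields. The `Entry` type, the helper names, and the sample line (including its numbers) are invented for illustration and are not taken from the actual files.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    frequency: int  # field 1: frequency
    word: str       # field 2: word (multi-word units joined by "_")
    pos: str        # field 3: CLAWS POS tag
    n_files: int    # field 4: number of files the word occurs in


def parse_line(line: str) -> Entry:
    """Split one line on whitespace into the four documented fields."""
    freq, word, pos, n_files = line.split()
    return Entry(int(freq), word, pos, int(n_files))


def display_form(word: str) -> str:
    """Turn underscores back into spaces, e.g. for multi-word units."""
    return word.replace("_", " ")


# Made-up sample line; the values are not real BNC counts.
entry = parse_line("1234 in_spite_of prp 567")
print(entry.frequency, entry.pos, entry.n_files)
print(display_form(entry.word))  # in spite of
```

A line that does not have exactly four fields raises a `ValueError` here, which is a deliberate choice for a sketch: it surfaces unexpected lines instead of silently skipping them.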
Lists are provided for the complete BNC (all), and for three subsets, as below:
cg        'context-governed' spoken material (eg meetings, lectures etc)    6.2M tokens,  79,906 types
demog     'demographic' spoken material (eg conversation)                   4.2M tokens,  54,652 types
written                                                                    89.7M tokens, 921,074 types
all                                                                       100.1M tokens, 939,028 types
File sizes in MB ("al" and "num" variants all the same size) are:
            complete     complete.gz    o5      o5.gz
-------------------------------------------------------------
all         18.1         4.8            4.0     1.32
cg          1.4          0.39           0.43    0.15
demog       0.9          0.26           0.25    0.09
written     17.8         4.7            3.9     1.30
-------------------------------------------------------------
The following files can be downloaded from the reference page (Reference 1):
- all: all.al.gz, all.al.o5, all.al.o5.gz, all.num.gz, all.num.o5, all.num.o5.gz
- written: written.al.gz, written.al.o5, written.al.o5.gz, written.num.gz, written.num.o5, written.num.o5.gz
- cg: cg.al.gz, cg.al.o5, cg.al.o5.gz, cg.num.gz, cg.num.o5, cg.num.o5.gz
- demog: demog.al.gz, demog.al.o5, demog.al.o5.gz, demog.num.gz, demog.num.o5, demog.num.o5.gz
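Once one of the ".gz" lists has been downloaded, it can be read without unpacking it first. The sketch below sums the frequencies of all POS variants of a word into a single count. The function name `load_word_counts` is my own, the path in the usage comment is only an example, and the "latin-1" encoding is an assumption (chosen because it never fails to decode), so check it against the actual file.

```python
import gzip
from collections import Counter


def load_word_counts(path, encoding="latin-1"):
    """Read a gzipped list and return total frequency per word across POS tags.

    The encoding default is an assumption, not something the readme states.
    """
    counts = Counter()
    with gzip.open(path, "rt", encoding=encoding) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) != 4:
                continue  # skip anything that is not a four-field line
            freq, word, _pos, _n_files = fields
            counts[word] += int(freq)
    return counts


# Example usage, assuming all.num.o5.gz has been saved next to the script:
# counts = load_word_counts("all.num.o5.gz")
# print(counts.most_common(10))
```

Summing over POS tags loses the distinction between, say, a noun and a verb reading of the same spelling; keep the (word, pos) pair as the key if that distinction matters.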
References:
1. http://www.kilgarriff.co.uk/bnc-readme.html