DragonFly On-Line Manual Pages
COMBIG(1) User Contributed Perl Documentation COMBIG(1)
NAME
combig.pl - Combine frequency counts to determine co-occurrence
SYNOPSIS
Combines (sums) the frequency counts of bigrams made up of the same
pair of words in either possible order. It will count the number of
time two words occur together in a bigram regardless of which one
comes first.
DESCRIPTION
USAGE
combig.pl [OPTIONS] BIGRAM
INPUT PARAMETERS
o BIGRAM
Specify a file of bigram counts created by NSP programs count.pl.
The entries in BIGRAM will be formatted as follows:
word1<>word2<>n11 n1p np1
Here, word1 is followed by word2 n11 times. word1 occurs as the 1st
word in total n1p bigrams and word2 occurs as the 2nd word in np1
bigrams.
o OPTIONS
--help
Displays this message.
--version
Displays the version information.
OUTPUT
combig.pl produces a count of the number of times two words make up a
bigram in either order, whereas count.pl produces counts for a single
fixed ordering. In other words, combig.pl combines the counts of
bigrams that are composed of the same words but in reverse order. While
the BIGRAM shows pairs of words forming bigrams, output of combig will
show the pairs of words that are co-occurrences or that co-occur
irrespective of their order.
e.g. if bigrams
word1<>word2<>n11 n1p np1
and
word2<>word1<>m11 m1p mp1
are found in BIGRAM file, then combig.pl treats these as a single
unordered bigram
word1<>word2<>n11+m11 n1p+mp1 np1+m1p
where the new bigram will show a combined contingency table in which
the order of words doesn't matter.
word2 ~word2
___________________________________________________________
word1 | n11+m11 n12+m21 | n1p+mp1
| |
~word1 | n21+m12 n22+m22-n | n2p+mp2-n
___________________________________________________
np1+m1p np2+m2p-n | n
here the entry
o (word1,word2)=n11+m11
shows the number of bigrams having both word1 and word2 in either
order
i.e. word1<>word2 + word2<>word1
o (word1,~word2)=n12+m21
shows the number of bigrams having word1 but not word2 at either
position
i.e. word1<>~word2 + ~word2<>word1
o (~word1,word2)=n21+m12
shows the number of bigrams having word2 but not word1 at either
position
i.e. ~word1<>word2 + word2<>~word1
o (~word1,~word2)=n22+m22
shows the number of bigrams not having word1 nor word2 at either
position
i.e. ~word1<>~word2 + ~word2<>~word1 - n
where n=total number of bigrams
The mathematical proof of how the cell counts in the above
contingency table are counted is explained in section Proof.
When a bigram appears in only one order i.e.
word1<>word2<>n11 n1p np1
appears but
word2<>word1<>m11 m1p mp1
does not, then the combined bigram will be same as the original bigram
that appears. Or in other words,
word1<>word2<>n11 n1p np1
is displayed as it is.
PROOF OF CORRECTNESS
A bigram word1<>word2<>n11 n1p np1 represents a contingency table
word2 ~word2
--------------------------------------
word1 n11 | n12 | n1p
| |
~word1 n21 | n22 | n2p
--------------------------------------
np1 | np2 | n
while a bigram word2<>word1<>m11 m1p mp1 represents a contingency table
word1 ~word1
--------------------------------------
word2 m11 | m12 | m1p
| |
~word2 m21 | m22 | m2p
--------------------------------------
mp1 | mp2 | n
Here,
n11+n12+n21+n22 = n
Also,
m11+m12+m21+m22 = n
combig.pl combines bigram counts into a single order independant word
pair
word1<>word2<>n11+m11 n12+m21 n21+m12
And the corresponding contingency table will be shown as
word2 ~word2
-----------------------------------------
word1 n11+m11 | n12+m21 | n1p+mp1
| |
~word1 n21+m12 | n22+m22-n | n2p+mp2
-----------------------------------------
np1+m1p | np2+m2p | n
The first cell (n11+m11) shows the #bigrams having word1 and word2
(irrespective of their positions) i.e. word1<>word2 or word2<>word1
which is n11+m11.
The second cell (n12+m21) shows the #bigrams having word1 but not word2
at any position i.e. word1<>~word2 or ~word2<>word1 which is n12+m21.
The third cell (n21+m12) shows the #bigrams having word2 but not word1
at any position i.e. ~word1<>word2 or word2<>~word1 which is n21+m12.
The fourth cell (m22+n22-n) shows the #bigrams not having word1 nor
word2 at any position which
= n - (n11+m11) - (n12+m21) - (n21+m12)
= n - (n11+n12+n21) - (m11+m12+m21)
= n - (n-n22) - (n-m22)
= n22 + m22 - n
Alternative proof -
n22 = m11 + m12 + m21 + X (a)
m22 = n11 + n12 + n21 + X (b)
where X = #bigrams not having either word1 or word2.
as both n22 and m22 have some terms in common which show the bigrams
not having either word1 or word2. But,
m11+m12+m21 = n - m22
Substituting this in eqn (a)
n22 = n - m22 + X
Or
X = n22 + m22 - n
Or add (a) and (b) to get
n22+m22 = (n11+m11) + (n12+m21) + (n21+m12) + 2X
rearranging terms,
n22+m22 = (n11+n12+n21) + (m11+m12+m21) + 2X
but
n11+n12+n21 = n - n22 and
m11+m12+m21 = n - m22
Hence,
n22+m22 = (n-n22) + (n-m22) + 2X
2(n22+m22-n) = 2X
Or
(n22+m22-n) = X
which is the fourth cell count.
Viewing Bigrams as Graphs
In bigrams, the order of words is important. Bigram word1<>word2 shows
that word2 follows word1. Bigrams can be viewed as a directed graph
where a bigram word1<>word2 will represent a directed edge e from
initial vertex word1 to terminal vertex word2(word1->word2).
In this case,
n11, which is the number of times bigram word1<>word2 occurs, becomes
the weight of the directed edge word1->word2.
n1p, which is the number of bigrams having word1 at 1st position,
becomes the out degree of vertex word1
and
np1, which is the number of bigrams having word2 at 2nd position,
becomes the in degree of vertex word2
combig.pl creates a new list of word pairs from these bigrams such that
the order of words can be ignored. Viewed another way, it converts the
directed graph of given bigrams to an undirected graph showing new word
pairs.
A pair say
word1<>word2<>n11 n1p np1
shown in the output of combig can be viewed as an undirected edge
joining word1 and word2 having weight n11. If we count the degree of
vertex word1 it will be n1p and degree of vertex word2 will be np1.
AUTHORS
Amruta Purandare, pura0010@d.umn.edu
Ted Pedersen, tpederse@d.umn.edu
Last update 03/22/04 by ADP
This work has been partially supported by a National Science Foundation
Faculty Early CAREER Development award (#0092784).
BUGS
SEE ALSO
http://www.d.umn.edu/~tpederse/nsp.html
COPYRIGHT
Copyright (c) 2004, Amruta Purandare and Ted Pedersen
This program is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
Free Software Foundation; either version 2 of the License, or (at your
option) any later version.
This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.
You should have received a copy of the GNU General Public License along
with this program; if not, write to
The Free Software Foundation, Inc., 59 Temple Place - Suite 330,
Boston, MA 02111-1307, USA.
perl v5.20.2 2008-03-24 COMBIG(1)