I write application for plagiarism detection in big text files. After reading many articles about it i decided to use Winnowing algorithm (with Karp-Rabin rolling hash function), but i have some problems with it.
Data:
I have two simple text files - first is bigger one, second is just one paragraph from first one.
Used algorithm:
This is algorithm which i used to select my fingerprints from all hashes.
void winnow(int w /*window size*/) {
// circular buffer implementing window of size w
hash_t h[w];
for (int i=0; i<w; ++i) h[i] = INT_MAX;
int r = 0; // window right end
int min = 0; // index of minimum hash
// At the end of each iteration, min holds the
// position of the rightmost minimal hash in the
// current window. record(x) is called only the
// first time an instance of x is selected as the
// rightmost minimal hash of a window.
while (true) {
r = (r + 1) % w; // shift the window by one
h[r] = next_hash(); // and add one new hash, if hash = -1, then it's end of file
if(h[r] == -1)
break;
if (min == r) {
// The previous minimum is no longer in this
// window. Scan h leftward starting from r
// for the rightmost minimal hash. Note min
// starts with the index of the rightmost
// hash.
for(int i=(r-1)%w; i!=r; i=(i-1+w)%w)
if (h[i] < h[min]) min = i;
record(h[min], global_pos(min, r, w));
} else {
// Otherwise, the previous minimum is still in
// this window. Compare against the new value
// and update min if necessary.
if (h[r] <= h[min]) { // (*)
min = r;
record(h[min], global_pos(min, r, w));
}
}
}
}
Next, to detect if we have same text in both files i just compare fingerprints from both textes to check if we have matches. So to detect plagiarism, algorithm has to take hashes which will start exactly in the same place in text, for example:
Text1: A do run |t^his my check your.
Text2: My bla lol |t^his my dasd chicken.
To get correct hashes, which will have the same values (which also means that we have the same text), algorithm should take fingerprints from places which i pointed by '|' or '^' (i assume that we take 5 characters to compute hash, without spaces). It can't take hash from '|' in text 1 and '^' in text 2 because these two hashes will be different and plagiarism won't be detected.
Problem:
To detect if this paragraph was copied from text number 1 i have to have two same fingerprints, somewhere in both texts. Problem is algorithm choose that fingerprints, which don't fit eachother, i mean they just miss, even in much bigger texts.
Question:
Have you got any ideas how i can improve this algorithm (which actually brings down to correct algorithm of takin fingerprints), that it would have more probability of finding plagiarisms?
My thoughts:
I thought about run winnow function couple times, for different window sizes (which will cause that different hashes would be taken), but for large texts on which this program will have to work (like 2MB just text) this will take too much time.
If you have running window over which you are computing the hash, you can update the hash value in constant time when window moves. The method is called Rabin fingerprint (see also). This should allow you to compute all fingerprints of size X in O(n) running time (n is a size of an input document). I guess the paper you cite is some advanced extension of this method and when implemented correctly it should also give you a similar running time. The key is to update the hash not recompute it.