Plagiarism detection - winnowing algorithm - fingerprint mismatch

I am writing an application for plagiarism detection in big text files. After reading many articles about it, I decided to use the winnowing algorithm (with the Karp-Rabin rolling hash function), but I have some problems with it.

Data:

I have two simple text files: the first is the bigger one, and the second is just one paragraph taken from the first.

Used algorithm:

This is the algorithm I use to select my fingerprints from all the hashes.

void winnow(int w /*window size*/) {
    // circular buffer implementing a window of size w
    // (VLA, needs C99; INT_MAX needs <limits.h>)
    hash_t h[w];
    for (int i = 0; i < w; ++i) h[i] = INT_MAX;
    int r = 0;   // window right end
    int min = 0; // index of minimum hash
    // At the end of each iteration, min holds the
    // position of the rightmost minimal hash in the
    // current window. record(x) is called only the
    // first time an instance of x is selected as the
    // rightmost minimal hash of a window.
    while (true) {
        r = (r + 1) % w;    // shift the window by one
        h[r] = next_hash(); // and add one new hash; -1 signals end of file
        if (h[r] == -1)
            break;
        if (min == r) {
            // The previous minimum is no longer in this
            // window. Scan h leftward starting from r
            // for the rightmost minimal hash. Note min
            // starts with the index of the rightmost
            // hash. The "+ w" keeps the index non-negative:
            // in C, (r - 1) % w is negative when r == 0,
            // which would read out of bounds.
            for (int i = (r - 1 + w) % w; i != r; i = (i - 1 + w) % w)
                if (h[i] < h[min]) min = i;
            // record once, after the scan (not inside the loop)
            record(h[min], global_pos(min, r, w));
        } else {
            // Otherwise, the previous minimum is still in
            // this window. Compare against the new value
            // and update min if necessary.
            if (h[r] <= h[min]) { // (*)
                min = r;
                record(h[min], global_pos(min, r, w));
            }
        }
    }
}

Next, to detect whether the same text appears in both files, I just compare the fingerprints from both texts and check for matches. So to detect plagiarism, the algorithm has to take hashes that start at exactly the same place in the text, for example:

Text1: A do run |t^his my check your.

Text2: My bla lol |t^his my dasd chicken.

To get matching hashes (which would mean we have the same text), the algorithm should take fingerprints from the places I marked with '|' or '^' (I assume that we take 5 characters to compute a hash, skipping spaces). It can't take the hash at '|' in text 1 and the hash at '^' in text 2, because those two hashes will be different and the plagiarism won't be detected.

Problem:

To detect that this paragraph was copied from text number 1, I need two identical fingerprints to appear somewhere in both texts. The problem is that the algorithm chooses fingerprints that don't match each other; they just miss, even in much bigger texts.

Question:

Do you have any ideas how I can improve this algorithm (which really comes down to a correct way of selecting fingerprints) so that it has a higher probability of finding plagiarism?

My thoughts:

I thought about running the winnow function several times with different window sizes (so that different hashes would be selected), but for the large texts this program will have to handle (around 2 MB of plain text) that would take too much time.

If you have a running window over which you compute the hash, you can update the hash value in constant time when the window moves. The method is called a Rabin fingerprint. This should allow you to compute all fingerprints of size X in O(n) running time (n is the size of the input document). I guess the paper you cite is some advanced extension of this method, and when implemented correctly it should also give you a similar running time. The key is to update the hash, not recompute it.

Yes, I totally agree, but that's not the problem. Running time and memory requirements are my last worries and I don't have to care about them (at least not now), so if you have something in mind, I'll be happy to listen (or actually read :D).
Maybe I don't understand what you wrote, but I already compute the hashes in linear time, because I use a rolling hash function. My problem is selecting the hashes that will let me find plagiarism without keeping too many of them.
But if one text is short, you can compute and store all hashes from the short text. Then, while computing the rolling hash over the large text, you just check whether each new hash value equals one of the hash values from the short text (you can do that in O(1) with a hash table). Is that right, or did I misunderstand something?
Actually, that was just an example in which the second text is one paragraph from the first. In reality, both texts can have over half a million characters. I'm sorry if I put it misleadingly. By the way, +1 for trying to help.
OK, now I understand the issue. I remember watching a talk about detecting similar documents among a large set of documents with Rabin fingerprints. The algorithm was probabilistic and randomized (so a person who created a document could not guess which fingerprints would be taken), but the method did not guarantee to always find a match. I guess you need to rely on the fact that in large plagiarized documents you will have many matches. If you need an algorithm that guarantees to find a match, I guess you have to sacrifice running time and memory requirements.