Re: approximate string matching (Torben AEgidius Mogensen)
19 Jan 2000 01:15:39 -0500

          From comp.compilers


From: Torben AEgidius Mogensen
Newsgroups: comp.compilers
Date: 19 Jan 2000 01:15:39 -0500
Organization: Department of Computer Science, U of Copenhagen
References: 00-01-044
Keywords: theory

Martin Okrslar writes:

>I would like to 'cluster' some files regarding their syntactic
>similarity. (I am correcting the homework of some students, and since
>I showed them with 'diff', that they did a simple cp they started to
>change their files minimally, so that one does not see on the first
>diff-glance, that the file is a copy. Note: We are not allowed to
>decrease the score of anybody, just because we 'suspect' copying. I
>just want to show the students, that we are not completely dumb.)

I read some months ago that a professor at Glasgow University wrote
a program to find similarities in student reports and found some
cases of cheating this way. He apparently intends to sell this
program to other universities.

A simple test that handles things like interchanged sections and
systematic replacement of one word by another is to gzip both files
and check whether they are of nearly equal size afterwards. This can
obviously give a lot of false positives, but it might work as an
initial coarse filter.

This procedure is probably best for program texts, where most
systematic wrangling won't change the gzipped size very much.
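
The gzip size test described above could be sketched in Python roughly
as follows. This is only an illustration: the function names and the
0.95 threshold are my own assumptions, not part of any existing tool,
and the threshold would need tuning on real submissions.

```python
import gzip
from pathlib import Path


def gzipped_size(data: bytes) -> int:
    """Return the size in bytes of `data` after gzip compression."""
    return len(gzip.compress(data, compresslevel=9))


def size_ratio(a: bytes, b: bytes) -> float:
    """Smaller gzipped size divided by the larger one; 1.0 means equal sizes."""
    size_a, size_b = gzipped_size(a), gzipped_size(b)
    return min(size_a, size_b) / max(size_a, size_b)


def looks_similar(a: bytes, b: bytes, threshold: float = 0.95) -> bool:
    """Coarse filter only: True when the gzipped sizes are nearly equal.

    The default threshold is an arbitrary assumption, not a tested value.
    """
    return size_ratio(a, b) >= threshold


def files_look_similar(path_a: str, path_b: str, threshold: float = 0.95) -> bool:
    """Apply the same size test to two files on disk (hypothetical helper)."""
    return looks_similar(
        Path(path_a).read_bytes(), Path(path_b).read_bytes(), threshold
    )
```

Pairs that pass this test still need a closer look, for instance with
diff, since unrelated files can happen to compress to the same size.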

Torben Mogensen
