Re: approximate string matching

torbenm@diku.dk (Torben AEgidius Mogensen)
19 Jan 2000 01:15:39 -0500

From comp.compilers



From:	torbenm@diku.dk (Torben AEgidius Mogensen)
Newsgroups:	comp.compilers
Date:	19 Jan 2000 01:15:39 -0500
Organization:	Department of Computer Science, U of Copenhagen
References:	00-01-044
Keywords:	theory

Martin Okrslar <okrslar@informatik.uni-muenchen.de> writes:

>I would like to 'cluster' some files regarding their syntactic
>similarity. (I am correcting the homework of some students, and since
>I showed them with 'diff', that they did a simple cp they started to
>change their files minimally, so that one does not see on the first
>diff-glance, that the file is a copy. Note: We are not allowed to
>decrease the score of anybody, just because we 'suspect' copying. I
>just want to show the students, that we are not completely dumb.)

I read (in http://www.theregister.co.uk) some months ago that a
professor at Glasgow University wrote a program to find similarities
in student reports and found some cases of cheating this way. He
apparently intends to sell this program to other universities.

A simple test that copes with things like interchanged sections or
systematic replacement of one word by another is to gzip both files
and check whether the compressed sizes are nearly equal. This can
obviously give a lot of false positives, but it might work as an
initial coarse sorting.

This procedure is probably best for program texts, where most
systematic wrangling won't change the gzipped size very much.
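
A minimal sketch of the size comparison described above, assuming
Python and its standard gzip module. The function names and the 5%
tolerance are illustrative choices, not part of the original
suggestion; a sensible cutoff would have to be found by trying it on
real submissions.

```python
import gzip


def gzipped_size(data: bytes) -> int:
    """Return the number of bytes in the gzip-compressed form of data."""
    return len(gzip.compress(data))


def size_gap(a: bytes, b: bytes) -> float:
    """Relative difference between the two gzipped sizes (0.0 means equal)."""
    size_a, size_b = gzipped_size(a), gzipped_size(b)
    # Even empty input compresses to a non-empty gzip stream (header and
    # trailer), so the denominator is never zero.
    return abs(size_a - size_b) / max(size_a, size_b)


def nearly_equal_size(a: bytes, b: bytes, tolerance: float = 0.05) -> bool:
    """True if the gzipped sizes differ by at most `tolerance`.

    The default of 0.05 is an arbitrary assumption, not a recommended value.
    """
    return size_gap(a, b) <= tolerance
```

To compare two submissions, read each file as bytes (for example with
pathlib.Path(name).read_bytes()) and pass both to nearly_equal_size.
A True result only marks the pair as worth a closer look, since
unrelated files of similar length and style will often pass as well.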

Torben Mogensen (torbenm@diku.dk)
