|
|
Programming questions
-
09-02-2007, 12:45 PM |
-
09-02-2007, 7:53 PM |
-
Rory
-
-

-
Joined on 08-14-2006
-
Portland, OR
-
Posts 22,180
-
Points 320,535
-
|
Re: Programming questions
1) In a world of six billion people, there's a market for anything.
2) Speed, if I understand your question correctly. String operations are expensive in every language in the universe. Working with a hash is going to be significantly faster. Since you don't need to compare the files line by line, but just check instead if you get the same hash, you're cutting out a lot of overhead.
3) I don't know about "cunning" :) I'd suggest you start off with something simple and that you make it modular so that you can have your choice of multiple methods. The simple one - maybe doing that slow text comparison - might trigger other ideas. You're a freakishly clever guy - I don't doubt you'll come up with something interesting. If you want a jumpstart, download the source for one of these apps and see how they're doing it. Looking at someone else's implementation gives you something to approach critically. Look for any possible mistakes and then try to improve upon them.
That's my version of what-to-do. Given that this is a development issue, there are widely different views out there just waiting to invalidate every aspect of mine :)
- Rory - Owner of Neopoleon - Monster in the sack
|
|
-
09-02-2007, 10:10 PM |
-
GuyIncognito
-
-
-
Joined on 08-15-2006
-
Parts Uknown
-
Posts 285
-
Points 3,745
-
|
Re: Programming questions
Who is this aimed at? Home computer users? Business computer users?
What kind of duplicate files? Document files? Music or video files? Any kind of files?
I thought I read about something recently about new software finding duplicate files on a NAS (network area storage) --
http://download.microsoft.com/download/A/6/4/A6403FBB-8CDB-4826-BF8F-56F79CB5A184/CaseFSCUsingNAS.ppt http://www.infoworld.com/article/06/09/18/38TCstorageserver_1.html
The mind plays tricks on you. You play tricks back! It's like you're unraveling a big cable-knit sweater that someone keeps knitting and knitting and knitting and knitting and knitting and knitting...
|
|
-
09-03-2007, 1:27 AM |
-
punky
-
-

-
Joined on 08-16-2006
-
Oslo, Norway
-
Posts 284
-
Points 3,185
-
|
Re: Programming questions
Appreciate the feedback, always interesting to hear what clever people have to say about things like this. I guess I should have shared a bit more of my own thinking first, but I sort of wanted that un-punky-tainted response. Thanks.
1) I guess so. I wrote my initial naive version (sans GUI) in python in, like, 15 minutes, of which, like, 14 were spent looking up stuff in a paper brick. I'm not going to be selling my own version though -- I just think it's crazy that people can charge money for such a simple tool.
2) I think that's the rationale as well, but I also think it's wrong. Calculating an MD5 hash a) requires you to read an entire file and b) perform a non-trivial calculation upon it. Certainly something to avoid if possible. While string handling is expensive, so is file I/O. I think binary comparison, bit by bit, is going to be much faster, because you can terminate on as soon as there's a difference, and there's no math involved. For two different files, a differing bit should occur pretty soon. My current thinking is actually to have two passes over the directory (or set of directories) in which to find duplicates. In pass#1, I simply keep track of file sizes, and register sizes that occur more than once (this is to avoid keeping track of all the paths of all the files, 99% of which will be unique). Certainly, for two files to (binary) equal, they need to be of identical size. In pass#2, I gather up files that have one of the sizes found in pass#1. For each size, I get a set of files that are potentially equal. To verify equality, I compare the files bit-by-bit. Of course, equality is transitive, so if file A is equal to B and C, B is also equal to C. Furthermore, if A is equal to B but not to C, B is not equal to C. So in practice, there won't be that many comparisons to make.
3) I've actually been looking (a bit haphazardly) for an open-source implementation, but I haven't been able to find one. It's probably because authors can charge thirty dollars if they close the source ;-) It's good advice, though. Wish I could find one.
Why don't I challenge the Four Other Foreign Men and One Hot Girl to write a better/more clever tool than mine? Because they probably have better things to do with their time? Oh.
"If it was so, it might be; and if it were so, it would be; but as it isn't, it ain't. That's logic." - Tweedledee
|
|
-
09-03-2007, 1:37 AM |
-
punky
-
-

-
Joined on 08-16-2006
-
Oslo, Norway
-
Posts 284
-
Points 3,185
-
|
Re: Programming questions
GuyIncognito:Who is this aimed at?
It's not really aimed -- I'm just writing it to learn a bit of python. So I guess the answer is "myself". I'm not going to sell it. If it becomes any good at all, it might end up as one of those 12 downloads sourceforge thingies ;-) GuyIncognito:What kind of duplicate files? Document files? Music or video files? Any kind of files?
I'm thinking strict binary duplication. No intelligent diffing of text files, no fuzzy logic, nothing but bits.
"If it was so, it might be; and if it were so, it would be; but as it isn't, it ain't. That's logic." - Tweedledee
|
|
-
09-03-2007, 4:19 AM |
-
punky
-
-

-
Joined on 08-16-2006
-
Oslo, Norway
-
Posts 284
-
Points 3,185
-
|
Re: Programming questions
Having thought (thunk?) a bit more about the speed issue, I guess that the hash approach might perform better than mine in the worst case, but worse in the average and best case (I guess I should measure this to have some hard facts to back my claims -- I just might do that).
The worst case scenario for my approach is the situation where you have several (N) actual duplicate files, and no false duplicates (i.e., files that are different but happen to have the same size). To discover actual duplicates, there's no way to avoid reading the entire file at least once -- regardless of whether you calculate a hash or just compare the bits as you go. My current approach would lead me to read the first file N-1 times, and the N-1 other files one time, for a total of 2N-2 entire file reads. However, I think I've come up with an improvement that would have me perform no more than N file reads... I'll post an update when I get around to implementing it (and find out that the added complexity outweighs any theoretical performance improvements! :-P)
"If it was so, it might be; and if it were so, it would be; but as it isn't, it ain't. That's logic." - Tweedledee
|
|
-
09-03-2007, 8:51 AM |
-
Massif
-
-

-
Joined on 10-24-2006
-
Bristol
-
Posts 288
-
Points 2,655
-
|
Re: Programming questions
Surely read through once and create hashes for all files (N file reads for N files), find duplicate hashes (all in memory) and display?
Better would be if the OS provided some sort of unique file ID that you could get without opening the file, but I imagine any unique file ID will be entirely arbitrary and not related to the contents of the file.
Still, you could at least query the filesystem for sizes, use that as a first pass and then perform hash-type comparisons on any files of identical sizes.
The one, the only, the undisputed king of the world.
|
|
-
09-03-2007, 7:19 PM |
-
punky
-
-

-
Joined on 08-16-2006
-
Oslo, Norway
-
Posts 284
-
Points 3,185
-
|
Re: Programming questions
Massif:Surely read through once and create hashes for all files (N file reads for N files), find duplicate hashes (all in memory) and display?
Oh, you're right. I'm an Ass, I should not try to be clever. For N "false positives" (equal size, yet all different), pairwise bit-by-bit comparison requires N! file accesses! While each access is likely to be short (files will differ early on), surely that's prohibitive. I will do some performance tests to verify it though. And, because I remain an Ass, I will try an overly complex hybrid approach as well... Which I expect will come in in third place!
Massif:Better would be if the OS provided some sort of unique file ID that you could get without opening the file, but I imagine any unique file ID will be entirely arbitrary and not related to the contents of the file.
Well, couldn't the file system keep an MD5 hash of each file?
Massif:Still, you could at least query the filesystem for sizes, use that as a first pass and then perform hash-type comparisons on any files of identical sizes.
Certainly, this is what I'm doing.
"If it was so, it might be; and if it were so, it would be; but as it isn't, it ain't. That's logic." - Tweedledee
|
|
-
09-04-2007, 12:45 AM |
-
Massif
-
-

-
Joined on 10-24-2006
-
Bristol
-
Posts 288
-
Points 2,655
-
|
Re: Programming questions
Why would the filesystem want a hash of each file?
In my defence, I was thinking out loud (thinking on-type?) for most of that rambling.
The one, the only, the undisputed king of the world.
|
|
-
09-04-2007, 1:06 AM |
-
punky
-
-

-
Joined on 08-16-2006
-
Oslo, Norway
-
Posts 284
-
Points 3,185
-
|
Re: Programming questions
Massif:Why would the filesystem want a hash of each file?
To help me solve my problem :-P Actually, the file system could use a hash or other unique ID to solve the problem of duplication/data bloat once and for all; see the links provided by Guy. Massif:In my defence, I was thinking out loud (thinking on-type?) for most of that rambling.
I like that, it's what I've been doing myself :-) It's a bit sad if we need to think through things ten times before posting. I find it much more efficient to have someone else point out the flaws in my logic than to have to do so myself.
"If it was so, it might be; and if it were so, it would be; but as it isn't, it ain't. That's logic." - Tweedledee
|
|
-
09-06-2007, 12:33 AM |
-
punky
-
-

-
Joined on 08-16-2006
-
Oslo, Norway
-
Posts 284
-
Points 3,185
-
|
Re: Programming questions
OK. A word of advice to anyone still bothering to read this thread: Never do math in public!
For some reason, my brain conjectured that the numbers N-1, N-2 ... 1 should be multiplied, when in fact they should be added. I hate it when my brain undersmarts me.
Of course, this is living proof that no-one is, in fact, reading this thread anymore -- or else I would have been corrected, right?
Anyways, for N "false positives", the actual number of file accesses is the sum of n for n = 1 to n = N-1, which is N*(N-1)/2. In other words, O(N²) in big Oh notation. So - quite a bit less horrible. In fact, the first few tests I ran yesternight seemed to imply that bit-by-bit fared better than MD5. But I'll be more specific and scientific about it when I get a bit more time on my hands.
"If it was so, it might be; and if it were so, it would be; but as it isn't, it ain't. That's logic." - Tweedledee
|
|
-
01-12-2008, 2:59 PM |
-
NickInTheUk
-
-
-
Joined on 01-13-2008
-
-
Posts 1
-
Points 5
-
|
Re: Programming questions
Rory,
If your algorithm uses a first pass file size check, ensire you do it right, I'm guessing there is a potential problem checking files across different volumes (formatted differently, specifically with different allocation unit sizes).
Just my 2p worth.
Nick
|
|
-
-
|
|