in Search
Welcome to Neopoleon - Sign in | Join | Help
Navigation: Home | Forums | Galleries

Programming questions

Last post 03-14-2008, 6:20 AM by Photochicken. 13 replies.
Sort Posts: Previous Next
  •  09-02-2007, 12:45 PM 27023

    Programming questions

    Hi,

    As a simple self-educational exercise to learn a new language, I'm writing an app that finds file duplicates. A couple of questions that the readers of a "Tech Nerd Talk" forum might be able to answer:

    1) Are there really people who are willing to pay thirty bucks for shareware apps that do this?

    2) Why do so many of the available apps use MD5 hashes or CRC32 checks to find duplicates?

    3) Do you have any suggestions to a cunning algorithm for such an app?

    "If it was so, it might be; and if it were so, it would be; but as it isn't, it ain't. That's logic." - Tweedledee
  •  09-02-2007, 7:53 PM 27034 in reply to 27023

    Re: Programming questions

    1) In a world of six billion people, there's a market for anything.

    2) Speed, if I understand your question correctly. String operations are expensive in every language in the universe. Working with a hash is going to be significantly faster. Since you don't need to compare the files line by line, but just check instead if you get the same hash, you're cutting out a lot of overhead.

    3) I don't know about "cunning" :) I'd suggest you start off with something simple and that you make it modular so that you can have your choice of multiple methods. The simple one - maybe doing that slow text comparison - might trigger other ideas. You're a freakishly clever guy - I don't doubt you'll come up with something interesting. If you want a jumpstart, download the source for one of these apps and see how they're doing it. Looking at someone else's implementation gives you something to approach critically. Look for any possible mistakes and then try to improve upon them.

    That's my version of what-to-do. Given that this is a development issue, there are widely different views out there just waiting to invalidate every aspect of mine :)
    - Rory
    - Owner of Neopoleon
    - Monster in the sack
  •  09-02-2007, 10:10 PM 27036 in reply to 27034

    Re: Programming questions

    Who is this aimed at?  Home computer users?  Business computer users?

    What kind of duplicate files?  Document files?  Music or video files?  Any kind of files?

    I thought I read about something recently about new software finding duplicate files on a NAS (network area storage) --

    http://download.microsoft.com/download/A/6/4/A6403FBB-8CDB-4826-BF8F-56F79CB5A184/CaseFSCUsingNAS.ppt
    http://www.infoworld.com/article/06/09/18/38TCstorageserver_1.html


    The mind plays tricks on you. You play tricks back! It's like you're unraveling a big cable-knit sweater that someone keeps knitting and knitting and knitting and knitting and knitting and knitting...
  •  09-03-2007, 1:27 AM 27042 in reply to 27034

    Re: Programming questions

    Appreciate the feedback, always interesting to hear what clever people have to say about things like this. I guess I should have shared a bit more of my own thinking first, but I sort of wanted that un-punky-tainted response. Thanks.

    1) I guess so. I wrote my initial naive version (sans GUI) in python in, like, 15 minutes, of which, like, 14 were spent looking up stuff in a paper brick. I'm not going to be selling my own version though -- I just think it's crazy that people can charge money for such a simple tool.

    2) I think that's the rationale as well, but I also think it's wrong. Calculating an MD5 hash a) requires you to read an entire file and b) perform a non-trivial calculation upon it. Certainly something to avoid if possible. While string handling is expensive, so is file I/O. I think binary comparison, bit by bit, is going to be much faster, because you can terminate on as soon as there's a difference, and there's no math involved. For two different files, a differing bit should occur pretty soon. My current thinking is actually to have two passes over the directory (or set of directories) in which to find duplicates. In pass#1, I simply keep track of file sizes, and register sizes that occur more than once (this is to avoid keeping track of all the paths of all the files, 99% of which will be unique). Certainly, for two files to (binary) equal, they need to be of identical size. In pass#2, I gather up files that have one of the sizes found in pass#1. For each size, I get a set of files that are potentially equal. To verify equality, I compare the files bit-by-bit. Of course, equality is transitive, so if file A is equal to B and C, B is also equal to C. Furthermore, if A is equal to B but not to C, B is not equal to C. So in practice, there won't be that many comparisons to make.

    3) I've actually been looking (a bit haphazardly) for an open-source implementation, but I haven't been able to find one. It's probably because authors can charge thirty dollars if they close the source ;-) It's good advice, though. Wish I could find one.

    Why don't I challenge the Four Other Foreign Men and One Hot Girl to write a better/more clever tool than mine? Because they probably have better things to do with their time? Oh.


    "If it was so, it might be; and if it were so, it would be; but as it isn't, it ain't. That's logic." - Tweedledee
  •  09-03-2007, 1:37 AM 27045 in reply to 27036

    Re: Programming questions

    GuyIncognito:
    Who is this aimed at?

    It's not really aimed -- I'm just writing it to learn a bit of python. So I guess the answer is "myself". I'm not going to sell it. If it becomes any good at all, it might end up as one of those 12 downloads sourceforge thingies ;-)

    GuyIncognito:
    What kind of duplicate files?  Document files?  Music or video files?  Any kind of files?

    I'm thinking strict binary duplication. No intelligent diffing of text files, no fuzzy logic, nothing but bits.

    "If it was so, it might be; and if it were so, it would be; but as it isn't, it ain't. That's logic." - Tweedledee
  •  09-03-2007, 4:19 AM 27052 in reply to 27045

    Re: Programming questions

    Having thought (thunk?) a bit more about the speed issue, I guess that the hash approach might perform better than mine in the worst case, but worse in the average and best case (I guess I should measure this to have some hard facts to back my claims -- I just might do that).

    The worst case scenario for my approach is the situation where you have several (N) actual duplicate files, and no false duplicates (i.e., files that are different but happen to have the same size). To discover actual duplicates, there's no way to avoid reading the entire file at least once -- regardless of whether you calculate a hash or just compare the bits as you go. My current approach would lead me to read the first file N-1 times, and the N-1 other files one time, for a total of 2N-2 entire file reads. However, I think I've come up with an improvement that would have me perform no more than N file reads... I'll post an update when I get around to implementing it (and find out that the added complexity outweighs any theoretical performance improvements! :-P)

    "If it was so, it might be; and if it were so, it would be; but as it isn't, it ain't. That's logic." - Tweedledee
  •  09-03-2007, 8:51 AM 27056 in reply to 27052

    Re: Programming questions

    Surely read through once and create hashes for all files (N file reads for N files), find duplicate hashes (all in memory) and display?

    Better would be if the OS provided some sort of unique file ID that you could get without opening the file, but I imagine any unique file ID will be entirely arbitrary and not related to the contents of the file.

    Still, you could at least query the filesystem for sizes, use that as a first pass and then perform hash-type comparisons on any files of identical sizes.


    The one, the only, the undisputed king of the world.
  •  09-03-2007, 7:19 PM 27060 in reply to 27056

    Re: Programming questions

    Massif:
    Surely read through once and create hashes for all files (N file reads for N files), find duplicate hashes (all in memory) and display?

    Oh, you're right. I'm an Ass, I should not try to be clever. For N "false positives" (equal size, yet all different), pairwise bit-by-bit comparison requires N! file accesses! While each access is likely to be short (files will differ early on), surely that's prohibitive. I will do some performance tests to verify it though. And, because I remain an Ass, I will try an overly complex hybrid approach as well... Which I expect will come in in third place!

    Massif:
    Better would be if the OS provided some sort of unique file ID that you could get without opening the file, but I imagine any unique file ID will be entirely arbitrary and not related to the contents of the file.

    Well, couldn't the file system keep an MD5 hash of each file?

    Massif:
    Still, you could at least query the filesystem for sizes, use that as a first pass and then perform hash-type comparisons on any files of identical sizes.

    Certainly, this is what I'm doing.
    "If it was so, it might be; and if it were so, it would be; but as it isn't, it ain't. That's logic." - Tweedledee
  •  09-04-2007, 12:45 AM 27072 in reply to 27060

    Re: Programming questions

    Why would the filesystem want a hash of each file?

    In my defence, I was thinking out loud (thinking on-type?) for most of that rambling.


    The one, the only, the undisputed king of the world.
  •  09-04-2007, 1:06 AM 27075 in reply to 27072

    Re: Programming questions

    Massif:
    Why would the filesystem want a hash of each file?

    To help me solve my problem :-P
    Actually, the file system could use a hash or other unique ID to solve the problem of duplication/data bloat once and for all; see the links provided by Guy.

    Massif:
    In my defence, I was thinking out loud (thinking on-type?) for most of that rambling.

    I like that, it's what I've been doing myself :-)
    It's a bit sad if we need to think through things ten times before posting. I find it much more efficient to have someone else point out the flaws in my logic than to have to do so myself.

    "If it was so, it might be; and if it were so, it would be; but as it isn't, it ain't. That's logic." - Tweedledee
  •  09-06-2007, 12:33 AM 27093 in reply to 27075

    Re: Programming questions

    OK. A word of advice to anyone still bothering to read this thread: Never do math in public!

    For some reason, my brain conjectured that the numbers N-1, N-2 ... 1 should be multiplied, when in fact they should be added. I hate it when my brain undersmarts me.

    Of course, this is living proof that no-one is, in fact, reading this thread anymore -- or else I would have been corrected, right?

    Anyways, for N "false positives", the actual number of file accesses is the sum of n for n = 1 to n = N-1, which is N*(N-1)/2. In other words, O(N²) in big Oh notation. So - quite a bit less horrible. In fact, the first few tests I ran yesternight seemed to imply that bit-by-bit fared better than MD5. But I'll be more specific and scientific about it when I get a bit more time on my hands.

    "If it was so, it might be; and if it were so, it would be; but as it isn't, it ain't. That's logic." - Tweedledee
  •  01-12-2008, 2:59 PM 28788 in reply to 27023

    Re: Programming questions

    Rory,

    If your algorithm uses a first pass file size check, ensire you do it right, I'm guessing there is a potential problem checking files across different volumes (formatted differently, specifically with different allocation unit sizes).

    Just my 2p worth.

     

    Nick

  •  03-11-2008, 7:54 AM 29707 in reply to 27023

    Re: Programming questions

    This is in no way a real programming question because I'm not a programmer but maybe one of you can help me figure this out.

    I created a World Crossing forum to help raise money for a friend and wanted to do it as securely as possible. The choice I made was to make a nonprofit business account with Amazon.com so people from all around the world can donate with their credit cards and checking accounts securely. After setting it all up, I posted a html-based widget that consequently does nothing. Unfortunately, I can't get help making this thing work without paying Amazon Web Services. I have access keys and thought I did what I was supposed to do. I emailed Amazon asking why the widget seems inactive. No work, yet. Help??



    "If you plan to be offended, then don't ask."
  •  03-14-2008, 6:20 AM 29856 in reply to 29707

    Re: Programming questions

    FYI: I got my answer to this query from the Amazonians. I have to convince the WX server support staff that I am worthy of their attempts at helping me by getting them to actually incorporate the Amazon payment widget code into the entire page code. Ugh. ~not holding breath~







    "If you plan to be offended, then don't ask."
View as RSS news feed in XML