Add a smarter dupe check

Post by **lysp** » November 28th, 2011, 8:20 pm

Rather than keeping a copy of existing nzb files to check for duplicates, is it possible to use another method?

1 - a unique identifier in the nzb file (if there is one)
2 - an md5 of the nzb file

I'd rather keep a list of IDs than need to keep a copy of all the nzbs I've downloaded (including their names), for privacy reasons.
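
Option 2 could look roughly like the sketch below. It assumes the nzb is available as raw bytes and that the "list of IDs" is just a set of hex digests; `file_digest` and `is_duplicate` are made-up names for illustration, not anything SABnzbd provides.

```python
import hashlib


def file_digest(data: bytes) -> str:
    """Return the md5 hex digest of the raw nzb bytes."""
    return hashlib.md5(data).hexdigest()


def is_duplicate(data: bytes, seen: set) -> bool:
    """Check the digest against the seen set and record it if it is new.

    `seen` is a hypothetical store of digests; only hashes are kept,
    never the nzb itself or its file name.
    """
    digest = file_digest(data)
    if digest in seen:
        return True
    seen.add(digest)
    return False
```

One caveat: a whole-file hash only matches byte-identical files, so the same post fetched from two sites that add different metadata would probably not be caught.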

Post by **shypike** » November 29th, 2011, 4:36 am

lysp wrote: rather than needing to keep a copy of all the nzbs I've downloaded (including their names) for privacy reasons.

So you're not saving the downloads either?

We're looking at an improved method, but it will take a while.

Post by **lysp** » November 29th, 2011, 8:38 pm

You have me there.

However, most of my files are automatically renamed to remove groups etc. and moved to another server.

I'm guessing that, like most people, I'd prefer not to keep my complete download history in a single folder, but still have a check to make sure I (or my automation routines) don't get the same thing twice and waste bandwidth.

I had a quick look and saw that an nzb has a list of message-IDs, but not a single ID that identifies the nzb as a whole.

Possibly implement something like:

Hash(MessageID1 + MessageID2 + MessageID3 + MessageID4 ... MessageIDX) to identify a unique nzb.

An added benefit is that it won't take up much space either.
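
A rough sketch of that idea, assuming the usual nzb layout where each `<segment>` element's text is a message-ID. The function name `nzb_fingerprint` is made up, and sorting the IDs before hashing is my own assumption so that the order of files inside the nzb doesn't change the result.

```python
import hashlib
import xml.etree.ElementTree as ET


def nzb_fingerprint(nzb_bytes: bytes) -> str:
    """Hash all segment message-IDs of an nzb into one hex string."""
    root = ET.fromstring(nzb_bytes)
    # Tags look like "{namespace}segment"; compare on the local name only.
    message_ids = sorted(
        elem.text.strip()
        for elem in root.iter()
        if elem.tag.rsplit("}", 1)[-1] == "segment" and elem.text
    )
    digest = hashlib.sha1()
    for message_id in message_ids:
        digest.update(message_id.encode("utf-8"))
        digest.update(b"\n")  # separator so IDs cannot run together
    return digest.hexdigest()
```

Two nzbs that list the same articles would then get the same 40-character fingerprint, whatever the file is called.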

Another issue I found (which is kind of a bug) is that nzbs with the same name get rejected by the dupe check even though their contents may be different.

E.g. if a website standardises every download to "Cart.NZB", then it may be rejected. The above hash method would bypass this "bug".

Post by **shypike** » November 30th, 2011, 11:35 am

"Duplicate" is not a well-defined term in this case.
Most don't want to download the same item twice, not even from different release groups.
So a name is a valid criterion and so is a content check.
But neither is very reliable.
Duplicate detection is mostly effective against the same items sneaking in due to
different RSS feeds carrying the same posts.

We are looking at detection based on the methods used by the Sorting functions.
But of course that's less useful for unique items.

Post by **lysp** » January 10th, 2012, 7:47 pm

That's probably taking it one step further.

Currently the dupe check works on the release name, and most scene releases include the group name as part of the release name.

So dupe checking across different groups currently will not work.

However, changing it to a hash would keep the checking functionality/logic the same, but without exposing the file/release names to people browsing the computer.
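
Something along these lines is what I mean. It is only a sketch: `name_key` is a made-up helper, and the lower-casing and whitespace trimming are my own assumptions about how names would be compared.

```python
import hashlib


def name_key(release_name: str) -> str:
    """Return a hash to store in place of the plain release name."""
    # Normalise so trivial differences in case/whitespace still match.
    normalised = release_name.strip().lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()
```

The dupe check would then compare `name_key(new_name)` against the stored keys instead of comparing names directly. Someone who already knows a release name could still hash it and look for a match, so this hides the history from casual browsing rather than making it secret.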
