Add a smarter dupe check

Want something added? Ask for it here.
lysp
Newbie
Posts: 8
Joined: November 28th, 2011, 8:17 pm

Add a smarter dupe check

Post by lysp »

Rather than keeping a copy of existing NZB files to check for duplicates, is it possible to use another method? For example:

1 - a unique identifier in the nzb file (if there is one)
2 - an md5 of the nzb file

I'd rather keep a list of IDs than keep a copy of all the NZBs I've downloaded (including their names), for privacy reasons.
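A minimal sketch of option 2, assuming the NZB contents are available as bytes. The function names and the in-memory `seen` set are hypothetical and only illustrate the idea; this is not how SABnzbd implements its check.

```python
import hashlib


def nzb_md5(nzb_bytes: bytes) -> str:
    """Return the MD5 hex digest of the raw NZB file contents."""
    return hashlib.md5(nzb_bytes).hexdigest()


def check_and_record(nzb_bytes: bytes, seen: set) -> bool:
    """Return True if this exact file was seen before; otherwise record it."""
    digest = nzb_md5(nzb_bytes)
    if digest in seen:
        return True
    # Only the digest is stored, never the file name or the NZB itself.
    seen.add(digest)
    return False
```

Note that a whole-file hash only catches byte-identical NZBs. The same post fetched from two indexers with different metadata or whitespace would produce different digests.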
shypike
Administrator
Posts: 19774
Joined: January 18th, 2008, 12:49 pm

Re: Add a smarter dupe check

Post by shypike »

lysp wrote: rather than keep a copy of all the NZBs I've downloaded (including their names), for privacy reasons.
So you're not saving the downloads either :)
We're looking at an improved method, but it will take a while.
lysp
Newbie
Posts: 8
Joined: November 28th, 2011, 8:17 pm

Re: Add a smarter dupe check

Post by lysp »

You have me there :)

However, most of my files are automatically renamed to remove groups etc. and moved to another server.

I'm guessing that, like most people, I'd prefer not to keep my complete download history in a single folder, but I'd still like a check to make sure I (or my automation routines) don't get the same thing twice and waste bandwidth.

I had a quick follow-up look and saw that an NZB has a list of message-IDs, but not a single ID that identifies the NZB as a whole.

Possibly implement something like:

Hash(MessageID1 + MessageID2 + MessageID3 + MessageID4 ... MessageIDX) to identify a unique NZB.

An added benefit is that it won't take up much space either.

Another issue I found (which is kind of a bug) is that NZBs with the same name get rejected by the dupe check even though their contents may be different.

E.g. if a website standardises its download name to "Cart.NZB", then a later download may be rejected. The hash method above would avoid this "bug".
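A rough sketch of the message-ID hash, assuming the usual NZB layout in which each `<segment>` element's text is one message-ID. The function name `nzb_fingerprint`, the choice of SHA-1, and the sorting of IDs are my own assumptions for illustration, not anything SABnzbd is known to do.

```python
import hashlib
import xml.etree.ElementTree as ET


def nzb_fingerprint(nzb_xml: bytes) -> str:
    """Hash all segment message-IDs of an NZB into one short identifier."""
    root = ET.fromstring(nzb_xml)
    # Match <segment> with or without an XML namespace prefix on the tag.
    message_ids = sorted(
        (el.text or "").strip()
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "segment"
    )
    digest = hashlib.sha1()
    for message_id in message_ids:
        digest.update(message_id.encode("utf-8"))
        digest.update(b"\n")  # separator so "ab"+"c" differs from "a"+"bc"
    return digest.hexdigest()
```

Because only the message-IDs are hashed, two files both named "Cart.NZB" with different contents get different fingerprints, while the same post saved under two names gets the same one. Sorting the IDs makes the result independent of the order in which the files are listed.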
shypike
Administrator
Posts: 19774
Joined: January 18th, 2008, 12:49 pm

Re: Add a smarter dupe check

Post by shypike »

"Duplicate" is not a well-defined term in this case.
Most people don't want to download the same item twice, not even from different release groups.
So a name is a valid criterion and so is a content check.
But neither is very reliable.
Duplicate detection is mostly effective against the same items sneaking in because
different RSS feeds carry the same posts.

We are looking at detection based on the methods used by the Sorting functions.
But of course that's less useful for unique items.
lysp
Newbie
Posts: 8
Joined: November 28th, 2011, 8:17 pm

Re: Add a smarter dupe check

Post by lysp »

That's probably taking it one step further.

Currently the dupe check works on the release name, and most scene releases include the group name as part of the release name.

So dupe checking across different groups doesn't work at the moment.

However, changing it to a hash would keep the checking logic the same, but without exposing the file/release names to people browsing the computer.