Duplicate Detection for RSS feeds

jt25741
Newbie
Posts: 12
Joined: March 25th, 2009, 2:48 pm

Duplicate Detection for RSS feeds

Post by jt25741 »

Hi All,

I have some questions about the built-in duplicate detection introduced in 0.4.6. I understand from other posts that it cannot be disabled, but where is the cache of NZBs stored for this feature to work?

What is the limit of nzbs that will be cached for duplicate detection?

Will duplicate detection persist across reboots?

How does the config switch "backup folder for nzbs" work with this feature? Does the built-in dup detection go first, then the system checks the backup folder as a second pass for duplicates?

I had the system running for many days with no duplicates, without using the backup directory either. After a rather ungraceful shutdown, upon reboot SAB started to download all kinds of stuff that had already been downloaded -- effectively, duplicate detection was no longer covering my earlier runs after this reboot. Is it supposed to be persistent?

I have since activated the backup folder to hopefully stop this behavior from recurring, but I wanted to better understand how these features work, both independently and together.

Thanks for any additional clarity, and congratulations again on SAB -- it is great!

jt
switch
Moderator
Posts: 1380
Joined: January 17th, 2008, 3:55 pm
Location: UK

Re: Duplicate Detection for RSS feeds

Post by switch »

Assuming you are using RSS, there are currently two levels of duplicate detection.

1) RSS feed detection

SABnzbd keeps a per-feed record of every item that has been downloaded from that feed for a period of 4 weeks. This record is kept in memory and is written to disk on program exit. It is likely this file got corrupted during your bad shutdown and SABnzbd created a new blank version.
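
Roughly, the idea looks like this (a minimal sketch, not SABnzbd's actual code -- all the names and structures here are illustrative):

[code]
# Illustrative sketch of per-feed duplicate tracking with a 4-week window.
# None of these names are SABnzbd's own.
import pickle
import time

FOUR_WEEKS = 4 * 7 * 24 * 3600  # retention window in seconds

# feed name -> {nzb title: timestamp of first download}
downloaded = {}

def is_duplicate(feed, title):
    """True if this feed already fetched the title; otherwise record it."""
    now = time.time()
    items = downloaded.setdefault(feed, {})
    # Prune entries older than four weeks.
    for old_title, stamp in list(items.items()):
        if now - stamp > FOUR_WEEKS:
            del items[old_title]
    if title in items:
        return True
    items[title] = now
    return False

def save_on_exit(path):
    """Persist the record at shutdown. If this write is interrupted,
    the file can end up corrupt and detection starts from scratch."""
    with open(path, "wb") as f:
        pickle.dump(downloaded, f)
[/code]

The save-on-exit step is exactly where an ungraceful shutdown hurts: if the file never gets written, or gets half-written, the in-memory record is lost and everything looks new again.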


2) Backup Folder detection

You need to have a backup folder set for this to work. SABnzbd will scan the backup folder for any NZB name matching the one you are about to download; if it matches, that NZB will be blocked.
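
In spirit, the check is as simple as this (illustrative only; if the backups are stored compressed, as noted later in the thread, a real check would also have to try the compressed name):

[code]
# Illustrative backup-folder check: block an NZB when a file of the
# same name already sits in the backup folder.
import os

def in_backup_folder(backup_dir, nzb_name):
    """True if an NZB with this filename was downloaded before."""
    candidates = (nzb_name, nzb_name + ".gz")  # .gz variant is an assumption
    return any(os.path.exists(os.path.join(backup_dir, c)) for c in candidates)

# Usage:
# if in_backup_folder("/path/to/nzbbackup", "some.release.nzb"):
#     skip the download
[/code]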

As for your specific questions:
where is the cache of nzbs stored for this feature to work?
The names of downloaded NZB files are stored in the file rss_data.sab in your cache folder; however, it is in a Python-specific format, so it is not easily viewable or editable (a sketch for peeking inside it follows this list).
What is the limit of nzbs that will be cached for duplicate detection?
Unlimited, though entries older than 4 weeks are pruned.
Will duplicate detection persist across reboots?
Yes
How does the config switch "backup folder for nzbs" work with this feature? Does the built-in dup detection go first, then the system checks the backup folder as a second pass for duplicates?
The RSS duplicate detection is done first, checking the name of the NZB against those that have been downloaded from that feed. If it passes that check, the NZB is downloaded, and its filename is matched against the files in the nzbbackup folder.
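
If you want to peek inside rss_data.sab, something like this may work -- assuming the Python format mentioned above is a standard pickle, which is an assumption, as are the path and the layout of what comes out:

[code]
# Hedged sketch: dump the contents of rss_data.sab, assuming it is a
# plain Python pickle. The path and internal layout are assumptions.
import pickle
import pprint

CACHE_FILE = "/path/to/cache/rss_data.sab"  # adjust to your cache folder

with open(CACHE_FILE, "rb") as f:
    data = pickle.load(f)

pprint.pprint(data)  # expected: per-feed collections of downloaded item names
[/code]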

I do think the duplicate detection needs some work. I have coded improvements to #1 to store downloaded items in a saner format; however, I couldn't decide between a .csv file and a database. Also, once the backup folder detection feature came out, I felt the improvements were less important.
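
For illustration, the .csv option could be as simple as one row per item (purely a sketch, not the improvements actually coded):

[code]
# Purely illustrative sketch of the ".csv file" option: one row per
# downloaded item, readable and editable in any text editor.
import csv
import time

def append_item(path, feed, title):
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([feed, title, int(time.time())])

def load_items(path):
    with open(path, newline="") as f:
        return [(feed, title, int(stamp)) for feed, title, stamp in csv.reader(f)]
[/code]

Unlike a pickled file, a flat CSV survives version changes and can be fixed by hand if a bad shutdown truncates it.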

I also looked at searching the history to use for duplicate detection, but that was shelved because users don't keep their history between major version changes, or they keep their history cleared of items. This behaviour may change with 0.5, however, as there is more value in keeping the history, and it shouldn't be cleared upon major revisions any more.
Last edited by switch on April 29th, 2009, 12:54 pm, edited 1 time in total.
jt25741
Newbie
Posts: 12
Joined: March 25th, 2009, 2:48 pm

Re: Duplicate Detection for RSS feeds

Post by jt25741 »

Switch, thanks for your comprehensive reply.  This helps quite a bit.

It seems my files must have been corrupted somehow. 

I would think that using the existing history for dup-detection might be the most elegant approach, thereby not adding extra complexity like a database or more flat files just for this feature. One can be told to retain history between upgrades if they care -- I certainly would if I were told to, as dup-detection is great.

In the meantime, I guess I will just let the backup directory grow bigger and bigger, but it seems quite a waste to have all those compressed NZBs building up when all that was really needed was the NZB filename (if that is all dup-detection uses anyway). Even though the current implementation has persistence, I would like it to last longer than the specified 4 weeks, and as a safeguard I'll use the backup directory for when shutting down doesn't go quite as cleanly as needed.

Thanks again!

JT
jt25741
Newbie
Posts: 12
Joined: March 25th, 2009, 2:48 pm

Re: Duplicate Detection for RSS feeds

Post by jt25741 »

On second thought :) The history covers all activity, right? That is, both manual NZB operations and automated RSS. I only value dup-detection for RSS feeds, because I generally know what I am attempting to download when/if I do it manually; it is the automated RSS feeds that can get out of hand. Consequently, I suppose the idea of a separate database is a good one. That way it can have persistence and performance for retention longer than 4 weeks (which is not nearly long enough IMO), and the history for everything, including ad-hoc downloads, can be purged at will, independently. If there were already an RSS history maintained inside SAB that could grow large, I guess I would like the original idea of using the history for dup-detection purposes.

Your thoughts, help and support on this stuff is extremely appreciated.

Best

JT