indexing items by ID rather than URL

Yesterday I made Forumzilla index items in its datastore by ID rather than URL. I did it to fix bug 12132, which causes Forumzilla to ignore multiple items with the same URL in feeds like the wiki.mozilla.org Recent Changes feed.

This is a pretty significant change to take on the stable 0.5 branch, but I’m doing so because I think it’s the minimal fix necessary for important feeds. Since not all feeds provide unique IDs for their items, I generate one when necessary by taking the SHA-1 digest of the item’s URL, date, title, and description.
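As a rough sketch of that fallback (not Forumzilla's actual code), the ID generation might look something like the following. The `getItemID` helper and the item field names are hypothetical, Node's built-in `crypto` module stands in for the XPCOM SHA-1 component, and the length-prefixing of fields is an assumption made here to keep field boundaries unambiguous.

```javascript
const crypto = require("crypto");

// Hypothetical helper; the item shape {id, url, date, title, description}
// is an assumption made for this sketch.
function getItemID(item) {
  // Prefer the ID the feed supplies, when there is one.
  if (item.id) {
    return item.id;
  }

  // Otherwise hash the fields that together distinguish one item from another.
  const hash = crypto.createHash("sha1");
  for (const field of [item.url, item.date, item.title, item.description]) {
    const text = field == null ? "" : String(field);
    // Length-prefix each field so ("ab", "c") and ("a", "bc") hash differently.
    hash.update(text.length + ":" + text, "utf8");
  }
  return hash.digest("hex"); // 40 hexadecimal characters
}

// Two wiki changes that share a URL but differ in date and title.
const first = { url: "http://example.org/Page", date: "2005-11-20", title: "Edit 1", description: "" };
const second = { url: "http://example.org/Page", date: "2005-11-21", title: "Edit 2", description: "" };
console.log(getItemID(first) !== getItemID(second)); // true
```

Because the date and title go into the digest along with the URL, two changes to the same wiki page get different IDs even though their URLs are identical.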

To get SHA-1 digest functionality, I converted Paul Johnston’s JavaScript implementation of the SHA-1 algorithm to a JavaScript XPCOM component which implements the nsISHA1Service interface. I might have used his implementation of the MD5 algorithm instead, but its BSD license comes with an advertising clause (I think because he’s reusing some code that includes such a clause), while his SHA-1 implementation’s BSD license contains no such clause.

To prevent Forumzilla from redownloading already downloaded items, I wrote some code that checks to see if new items are actually old items that were previously indexed by URL. If so, I just convert the old record to a new, ID-indexed one and don’t redownload the item.
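The check can be sketched roughly as follows. This is only an illustration under stated assumptions: a plain `Map` stands in for the datastore, and `needsDownload` and the record shape are hypothetical names, not Forumzilla's real ones.

```javascript
// Hypothetical stand-in for the datastore: a Map from index key to record.
// Legacy records are keyed by URL; new records are keyed by item ID.
function needsDownload(store, item, id) {
  if (store.has(id)) {
    return false; // already indexed by ID
  }
  if (store.has(item.url)) {
    // Legacy record: re-key it under the ID and keep the downloaded content.
    const record = store.get(item.url);
    store.delete(item.url);
    store.set(id, record);
    return false;
  }
  return true; // a new item
}

const store = new Map([["http://example.org/Page", { title: "Edit 1" }]]);
console.log(needsDownload(store, { url: "http://example.org/Page" }, "id-1")); // false
console.log(store.has("id-1"), store.has("http://example.org/Page")); // true false
console.log(needsDownload(store, { url: "http://example.org/Page" }, "id-2")); // true
```

In this sketch, once the legacy record has been converted, a second item with the same URL but a different ID is treated as new and downloaded.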

With these changes, the aforementioned wiki feed now works, and my regular subscriptions all seem to work, too. The changes still need more testing to make sure I haven’t regressed any feeds in the process, so I built a 0.5 branch development package you can use to test them. I’ve tested it in Thunderbird 1.0, Thunderbird 1.5 (release candidates), and Mozilla 1.7 on Linux. Give it a try in your own favorite compatible mail client, and let me know how it works for you.

2 thoughts on “indexing items by ID rather than URL”

Dan Veditz says:

2005-11-21 at 21:40

If you didn’t care about the 1.0 branch you could use the built-in scriptable nsICryptoHash interface. As an example see the update code in mozilla/toolkit.

Myk says:

2005-11-22 at 02:43

Thanks Dan, nsICryptoHash works great! I’m using it now if available, otherwise falling back to the code I picked up from Paul Johnston for compatibility with earlier versions of Thunderbird that don’t have nsICryptoHash.
