
Cluster A Email Service Issues

At approximately 17:30 UTC on October 6, Resellers on Cluster ‘A’ began experiencing issues accessing mailboxes. We’ve put together some questions and answers to better inform Resellers about what happened and what we’re doing to restore full service as quickly as possible.

The latest status updates are available at the OpenSRS Status Page and we will continue to update you here as well.

Q: How can I tell whether I’m affected?

A: Only Resellers on Cluster ‘A’ of our email service are affected. If you log in to the Mail Administration Center (MAC) at http://admin.hostedemail.com/ or http://admin.a.hostedemail.com/, you are on Cluster ‘A’. If you log in at http://admin.b.hostedemail.com/, you are on Cluster ‘B’ and you are not affected.
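
For Resellers who want to script this check, here is a minimal sketch in Python. It assumes only the three MAC hostnames listed in the answer above; the function names and the lookup table are illustrative and not part of any OpenSRS tool or API.

```python
from urllib.parse import urlparse

# Hostnames come from the answer above; the lookup table itself is an
# illustrative assumption, not an official OpenSRS mapping.
CLUSTER_BY_HOST = {
    "admin.hostedemail.com": "A",
    "admin.a.hostedemail.com": "A",
    "admin.b.hostedemail.com": "B",
}


def cluster_for_login_url(url):
    """Return 'A' or 'B' for a known MAC login URL, or None if unrecognised."""
    host = (urlparse(url).hostname or "").lower()
    return CLUSTER_BY_HOST.get(host)


def is_affected(url):
    """Only Cluster 'A' is affected by this incident."""
    return cluster_for_login_url(url) == "A"
```

Calling `is_affected("http://admin.b.hostedemail.com/")` would report that a Cluster ‘B’ login is not affected; any URL outside the three listed hostnames is treated as unknown rather than guessed.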

Q: Does this affect both inbound and outbound mail?

A: No. Only inbound email is affected. Outbound messages will be sent as normal.

Q: What will end users see?

A: At this point, there are only intermittent issues accessing mail. Users logging into webmail may see a “Service Unavailable” error message. Users accessing mail via POP or IMAP may experience a denied login depending on the specific email client being used.

Q: When will this be fixed?

A: There’s no firm restore time at this point. Our Network Operations Center is working on the issue and the goal is to restore service for all users as soon as possible.

Q: Where can I find out more information about what’s happening?

A: You can track the status of the email service at http://status.opensrs.com/ where we post all status updates for OpenSRS services. You might also want to ensure that your contact information in the Mail Administration Center (MAC) is complete and accurate as status updates are automatically sent to the Emergency and Maintenance contacts listed in the MAC.

We’ll also post to the Reseller Blog when we need to provide more information than the status website allows.

Q: What about data? Is user data safe and what’s happening to incoming mail?

A: All user data (mail, contacts, filters, etc.) is safe. Incoming mail is being queued locally, on our system, and will be delivered as soon as possible.

Q: Is this related to the Cluster ‘A’ service interruption you experienced in August?

A: No. While it IS on the same Cluster, we have no indication that this is in any way related.

76 Comments

  1. And yet earlier one of the status messages indicated that this was again related to the (NetApp?) storage system which WAS the issue before.

    I realize you need to keep certain information quiet, however after two major failures on the same cluster this close together, this time I hope you come forward with more details.

    What I want to know:
    (1) A real root cause analysis report - what exactly happened and why?
    (2) What exactly is done to prevent it happening again (not just assurances that action is being taken, that’s not enough this time)
    (3) Why was cluster b not affected?
    (4) Why is your storage vendor so slow to assist with resolution of these problems? Last time I believe it was mentioned that NetApp was your storage vendor. Is that a failing of the vendor or because you don’t have a high enough maintenance contract to get immediate turnaround? If parts have to be shipped, are they FedEx overnight, or airport counter-to-counter?
    (5) Are you going to offer affected users the option to move to cluster b?

    Thank you.

    Comment by gsyoungblood — October 7, 2008 @ 1:22 pm

  2. Thanks for your comment. I realize you want a full report, but since we’re still in the middle of investigating the root cause, we can’t provide it just yet. I will say that despite the fact that Cluster A has been affected twice, there is absolutely no difference in the hardware or software used for each cluster. Moving customers would have no effect on future reliability.

    We hope to have more information for you soon. We are updating our status message approximately every two hours, so that’s your best bet for up-to-date information.

    Comment by James McNally — October 7, 2008 @ 9:56 pm

  3. Thank you for replying.

    I’m still down. At this point it’s a good thing I have a gmail account, but that doesn’t help my users.

    I realize you’re still in the thick of it. I wasn’t asking for a report during the problem, but after the service is restored.

    And, I really hope you will answer my questions in the previous comment.

    Thanks.

    Comment by gsyoungblood — October 7, 2008 @ 10:57 pm

  4. Would this be causing any entire domains to lose webmail access? 2 of my domains do not appear to be able to access webmail now (receive a Server not Found error like there is a DNS problem). Many other domains are accessing web mail just fine. The status message didn’t indicate anything about this and I can’t seem to access Support right now so I thought I’d post here and see if anyone is reading in case something else is going on as well.

    Thanks, and I’m crossing my fingers for an accelerated restore process tonight! Continue posting the updates - the more communication the better.

    Comment by jsdwd — October 8, 2008 @ 1:23 am

  5. Maybe just a temporary DNS glitch. My phone browser can now access it, but desktop cannot (using a different DNS server but the same wifi). But I wanted to put it out there in case there were some unseen DNS demons wreaking havoc in the midst of the other issues. Thanks again,

    Comment by jsdwd — October 8, 2008 @ 1:38 am

  6. I have been a Tucows email reseller for over 6 years, and many of my clients have stuck with this service through thick and thin, despite the huge number of problems we’ve seen, especially with the old system. I understand and accept that email failures will occur, but they need to be handled QUICKLY! It took over a week to fully restore service during the last outage, which in itself is completely unacceptable. One of my biggest international clients has not had access to their email for 2 days! They do humanitarian work in third world countries and email is absolutely essential. After all they’ve been through for the past several years, and especially after these past two outages, how can I possibly tell them that they should continue to use this service?

    If Tucows is serious about offering a business level, state of the art email service, you need to reevaluate your entire system and ensure that you can handle failures within minutes or hours, not days!

    Comment by John Pansewicz — October 8, 2008 @ 2:50 am

  7. I agree with all of Mr. McNally’s points except his assertion “I realize you need to keep certain information quiet…”. As resellers, our reputations are on the line and two major failures extending not only hours but days in length is not what I would expect from the Tucows organization. As resellers, we need ALL information relating to this outage including cause and the specific steps that have been taken to ensure this doesn’t happen again.

    The fact that an outage of this type can occur at all seems to indicate a major design flaw in your system. Hardware is cheap when compared to lost revenues and your solution should include a more advanced RAID type configuration where failure of a hard drive can be detected and repaired without ANY downtime. If the issue is the load balancer then obviously more redundancy is needed.

    As a reseller, my reputation to my clients is on the line. In over 10 years of providing web hosting/email services to our customers utilizing our own colocated servers, we have NEVER had an outage of more than a couple of hours. I’m going to need to be able to present compelling assurances to my users that they can expect better reliability.

    Comment by Dave Wiese — October 8, 2008 @ 4:43 am

  8. Had an account working last night that is now down this morning.

    Found on the net:
    “I am delighted that I bit the bullet and left this sinking ship. Anybody who gave Tucows the benefit of the doubt back in August should now be thinking long and hard about why on earth they would depend on this outfit… Why they would pay them good money…” Comment posted 2008-10-07 at: http://www.thomascrampton.com/internet/netidentity-email-outage-19-hours-and-counting/

    See Also: http://en.wikipedia.org/wiki/Tucows#History_of_email_problems

    Comment by gsyoungblood — October 8, 2008 @ 8:48 am

  9. I agree with gsyoungblood — we’re going to need some answers after this mess, and some kind of real assurance that it won’t happen again.

    I want to know why other email vendors like Mailtrust (who I luckily have my primary email account with) don’t have these (almost) monthly reliability issues.

    There has to be some fundamental flaw with OpenSRS email (hardware, management, something?) to account for so many occurrences of intermittent service or downtime.

    Also, what kind of compensation will be provided to resellers for having to put up with these headaches?

    Comment by sb — October 8, 2008 @ 12:49 pm

  10. I agree with the above comment.

    I think Tucows owes its customers a solid explanation and a plan of how to prevent future incidents.

    2 x 2 complete days of e-mail outage in two months is very hard to take and is causing many phone calls on my end.

    Thanks

    Comment by Ralph — October 8, 2008 @ 4:09 pm

  11. That is the second major incident in two months!

    Comment by Ralph — October 8, 2008 @ 6:28 pm

  12. I share the disappointment of gsyoungblood in the performance of the OpenSRS email services and customer service. After watching the resolution time on the latest email outage be continually pushed back on the status web page, I finally called a support tech, who told me that the cause of the issue was “confidential”.

    Although we would like to consider competencies such as provisioned email services as failsafe, occasional outages can be endured. Two outages of more than two days in a few months is on the fringe of incompetence.

    If Tucows chooses not to disclose the cause and resolution to multiple catastrophic events, my confidence in their ability diminishes. How can wholesale customers be assured that the issues will not continue to arise, and how can we justify our choice of service provider to our retail clients?

    Comment by dc — October 8, 2008 @ 8:10 pm

  13. Hi all,

    First off, my sincere apologies for the email problems we’re having and for the time it’s taking to resolve them.

    I wanted to clarify that we DO provide detailed reports on what the cause of a major service interruption is, but we do this after we’ve resolved the issue. We want the engineers and developers working on fixing the problem rather than doing detailed briefings.

    Having said that, I do understand that Resellers need confidence that we’re doing everything possible and also need information they can pass on to their customers. We’re going to be launching a new Status Page before the end of the year and with that new page we’re looking at offering much more real-time information on what we’re doing. That’s a big change for the team here but we’re beefing up the communications team to help out in getting detailed technical information with minimal impact on our engineers’ ability to get the job done.

    Cheers,

    Ken Schafer
    VP Product Management
    OpenSRS

    Comment by Ken Schafer — October 8, 2008 @ 10:08 pm

  14. So does this latest “lead” mean that service will be restored this evening? The previous report sounded like the remaining boxes were about to come back online. And now that has been halted?? Please elaborate. I have clients waiting to access critical emails. There needs to be resolution this evening. And better communication would be great.

    Thanks.

    Comment by jsdwd — October 8, 2008 @ 10:22 pm

  15. @jsdwd - We’re all hoping for swift resolution but we need to let the engineers focus on making that happen.

    They feel that we’ll be better off by testing a possible solution to the root cause rather than rolling out the additional hardware which was more about easing pressure on the system rather than eliminating the issue completely.

    We’re not comfortable saying “It will..” rather than “we hope..” because with any kind of troubleshooting like this there is always a chance that there is another unseen hurdle beyond the one you’re leaping over at the moment.

    We’re all working on only a few hours sleep since Monday (as I’m sure many of you are), so apologies if I’m mangling my metaphors!

    Ken Schafer
    VP, Product Management
    OpenSRS

    Comment by Ken Schafer — October 8, 2008 @ 10:44 pm

  16. Thank you for the response. And I can certainly relate to the lack of sleep the past few days. The frustration levels are running high and I just hope there is a team(I hope a big one) working on this and close to resolution. Keep the communication flowing. I’m hopeful the engineers can get this one solved quickly.

    Comment by jsdwd — October 8, 2008 @ 10:49 pm

  17. Thanks for the encouraging words.

    The team working on this is pretty big. I haven’t counted exactly because we’re sending people home for a few hours’ sleep now and then, but at the “War Room Briefing” we hold every two hours we typically have between 15 and 25 people representing Dev, Ops, NOC, Communications, Support, Account Management, most of the execs, and others.

    We’ve also pulled in several vendors and members of the open source community for help on specific issues.

    Comment by Ken Schafer — October 8, 2008 @ 10:57 pm

  18. We just posted a status update. We’ve found the root cause and are rolling out the fix now.

    Comment by Ken Schafer — October 8, 2008 @ 11:32 pm

  19. Best news I’ve heard all nite. Hope this gets it. Thanks.

    Comment by jsdwd — October 8, 2008 @ 11:33 pm

  20. Thank goodness. I’m really hopeful that this is the real root cause and that it will be the fix we are all hoping for.

    Comment by gsyoungblood — October 9, 2008 @ 12:24 am

  21. We just posted to the Status page that the issue has been resolved. All mailboxes are accessible, new mail is flowing, queued mail will deliver over the next six or so hours.

    Thanks again for your patience and words of support.

    Comment by Ken Schafer — October 9, 2008 @ 1:47 am

  22. Assuring me that there’s going to be a new status page is laughable! Let me be clear, the ONLY reason why I didn’t jump ship last year (or even 2 years ago) was because it’s a huge pain in the ass to move email accounts. If it wasn’t, Tucows would be out of the email business! I’m relieved that this failure has been resolved, but some of my clients have been without email since Monday morning. Four days without email for this professional organization is enough for them to pull the plug on me.

    Comment by JP — October 9, 2008 @ 3:14 am

  23. Following the “resolved” notification at 1am, we continue to experience complete outages for a portion of our user base. This includes the inability to receive email via POP, IMAP or through the Web Interface. “Service Unavailable” messages are still prevalent. This does not appear to be over by any means.

    Comment by rrmsc — October 9, 2008 @ 9:34 am

  24. Down again… It’s absolutely killing us. You have provided next to no info on what’s going on. You are absolutely killing us…

    Comment by Pat — October 9, 2008 @ 11:08 am

  25. They say that a fix has been implemented but my customers and I still cannot access webmail or receive emails. Doesn’t seem like there is a real fix in place. I don’t think the status page is telling me anything that I don’t know myself. Tell us something that we don’t know (i.e., when we will actually be back online).

    This is ridiculous.

    Comment by john — October 9, 2008 @ 11:18 am

  26. Forget what you write on the Status page. The service is not fully operational yet. Some of my mailboxes are still getting a Service Is Unavailable message and queued mail still isn’t reaching others. Emailing support is not very useful either… need I go on?

    After years of hosting our own mail servers, we thought we were moving our clients to something better, but we never knew outages of days before. Our clients are now asking to revert to the old service or else…

    I can understand technical faults. They happen to everybody, however, this is something else.

    Comment by TheSentinel — October 9, 2008 @ 11:32 am

  27. That’s not the case Ken. The status page says “At this time, all inbound mail to users on Cluster A is being queued. No mail will be lost. Additionally, some users may be experiencing issues logging in and viewing mail.”

    I have at least two customers that can’t access their mailbox. The status page doesn’t provide any information other than some mailboxes can’t be accessed.

    We need to know what to tell these folks.

    Comment by Mike Masin — October 9, 2008 @ 11:37 am

  28. I’m guessing it didn’t go so well… :(

    Still down. And no updates here at the blog and a cryptic message on status. Does not bode well. Though it does look like status was updated again since I last looked.

    Still, either way for me, I’m down. :(

    Comment by gsyoungblood — October 9, 2008 @ 12:30 pm

  29. Unfortunately the comment I made at #15 came to pass. We did resolve the root cause issue and things were proceeding as expected until about 6:00AM when we start to hit peak hours for European customers.

    Our analysis showed that about 25,000 files needed to be reindexed and we worked overnight to make sure those were ready for the morning in advance of North American peak times.

    Unfortunately MANY more files need reindexing than we had originally estimated. That’s causing the extra load that we’re now dealing with.

    Once again, Status is the best place to get updates but I’m also trying to see if we can get an Engineer to take a break from the work to add a bit of background colour here. No guarantee but I’ll see what we can do.

    Comment by Ken Schafer — October 9, 2008 @ 12:55 pm

  30. This is beyond ridiculous, unacceptable, and unprofessional. You cannot offer email services to professional companies that then are without service for 2-5 days at a time… with no way to notify their client base that their emails are not being received.

    The clients are calling claiming that we (the company) are not responding in a timely manner, because you have severed our main means of communication.

    Not only do you make your company look bad - but you make every company that uses your services look incompetent and unprofessional.

    With 2 outages in 2 months totaling (at this time) close to 8 business days, which is equivalent to over 10% of business time… I too am curious how you are going to be able not only to compensate all the users affected, but to retain your client base.

    Maybe if they all leave, you won’t have the stress on your servers and the hardware failures, and those few clients that remain can have reliable service.

    Comment by Dacy Juarez — October 9, 2008 @ 1:10 pm

  31. I’m confident that everyone is working their asses off to get this fixed, and I appreciate that. But I think it’s time for Tucows to issue a statement that we can pass onto our customers that addresses the following questions:

    1) What is cause of the current outage?

    2) What is being done to fix this outage, and when will it be fixed? We don’t want to hear “I don’t know.” Tucows needs to give us a time frame, otherwise it tells me and my customers that Tucows doesn’t really understand the root cause of the problem.

    3) What is being done across the board to ensure that any failure can be resolved quickly? I mean, within a matter of minutes or a couple hours at the most?

    The last two outages, combined with the previous 2 years of outages in the old system, are the Tucows record, and most of my clients have felt the pain of that record. Many of my customers have had enough and they want me to move them to a new service, which takes enormous time and effort, for which I will not be compensated. If I can’t ensure that Tucows email service will never go down like this again, I have no choice but to move them to mailtrust.com.

    Comment by JP — October 9, 2008 @ 1:12 pm

  32. I am very angry this is happening again !
    Besides my accounts at OpenSRS I use the free Gmail (which can host my domain too) and they haven’t had any outage in two years’ time. I’m seriously thinking of moving all my OpenSRS mail accounts over to Gmail because two long outages in two / three months is too much for me. This sucks big time …

    Comment by Mark — October 9, 2008 @ 1:45 pm

  33. One of my customers just called me and said “Tucows? The name alone makes me wonder. I mean, why don’t they just call themselves TuTurtles? Why aren’t we using a system called ‘Cheetah’ or something?”

    Comment by JP — October 9, 2008 @ 1:49 pm

  34. I have cancellations starting to come in from my client base. It is obvious that this is not a dependable system and takes way too long to resolve any problems. We’ll be taking our clients elsewhere even if it involves additional cost.

    Comment by Dave Wiese — October 9, 2008 @ 2:06 pm

  35. This is becoming a farce. I sympathize and understand that developing and maintaining a large scale software system is difficult, but clearly there is a problem with the capability of the engineering team and/or management involved with this project. Wholesale customers should be able to expect that commercial SaaS systems have a disaster recovery plan that can be implemented in less than three days. Excuses like “oops, we fixed one issue but forgot about the loads that occur at peak hours” are not acceptable and indicate a lack of professionalism.

    Once the technology issues are resolved, I will be curious to observe the executive team’s business disaster recovery plan.

    Comment by dc — October 9, 2008 @ 2:06 pm

  36. BTW, can anyone recommend a good alternative email provisioning wholesaler?

    Comment by dc — October 9, 2008 @ 2:09 pm

  37. Ken,

    It’s obvious you know nothing as to what is going on and the same holds true for the Tucows development staff. You are a marketing guy who is there to put a nice spin on the problem. Yet the reality is that you are making yourself look dumber and dumber by posting incorrect information.

    Comment by JP — October 9, 2008 @ 3:11 pm

  38. Ken Schafer, this is completely unacceptable. Tucows has now cost my company several long standing and loyal clients.

    While you are waiting for the technicians to resolve the issue, what is the Tucows reparation plan for those who have and are still losing revenue due to your service outages?

    Comment by DashOne — October 9, 2008 @ 4:10 pm

  39. We just posted a video that should be very useful if you’re looking for more background on what we’re doing:

    http://blip.tv/file/1341576

    Comment by Ken Schafer — October 9, 2008 @ 4:12 pm

  40. We just completed a review with our client base and over 80% are in the locked out status. One client is a real estate title closing company who uses e-mail for property closing transactions. This situation is now keeping them and their customers from property closings, which also affects the mortgages and dollars involved. SEE THE PROBLEMS this is causing?

    The new video provides additional tech details but does nothing to resolve the business issues and impacts.

    Comment by DashOne — October 9, 2008 @ 4:39 pm

  41. I just finished watching the video. I wouldn’t like to be this guy; talk about having a bullseye on your back. Not a fun position to be in.

    I’ve been in tech for the last 20+ years and have dealt with my share of fires, from lightning strikes to literal fires. Getting through them is hard, but learning from them is critical!

    A few thoughts:

    (1) Write up what he said in the video, release it publicly as a report. The video was nice, and it included some real information for the first time — much appreciated. However, put it in writing. I would much rather read the report.
    (2) Why did it take from Monday to Wednesday night to figure out the problem was NFS and not hardware?
    (3) If this is a known issue for NetApp with Linux and NFS, why wasn’t there an advisory telling users to avoid it or watch for it? Or was there and did that advisory get missed by someone at Tucows?
    (4) What version of Linux? What distro?
    (5) Most important of all - What steps are being taken to prevent similar problems from happening again? At the very least, it seems like you need to examine IO capacity and make sure you have room to handle spikes.

    At the very least, it appears you need a total review and update of your disaster recovery plans for the production environment. Right now it appears you either don’t have one or the one you have is outdated and doesn’t work for your current systems or load.

    Comment by gsyoungblood — October 9, 2008 @ 4:47 pm

  42. Thanks for the video Ken.

    Bottom line, if Tucows is serious about providing this service and want to continue in the future, you must be better equipped. Guess what, bugs will always exist, hardware will always fail and people will make mistakes. You just need to be better prepared to tolerate more than just a hit and have the capacity and foresight to deal with it without shutting everybody out until you fix it.

    Of course, you have little choice once you are in the middle of a crisis; however, Tucows must decide if it wants to truly offer a world class service or not.

    My company lost clients over this event. Clients whom I brought to Tucows and whom I have now let down. I don’t think there is any way to repair the damage. Certainly it is in times like these that you appreciate the value of contingency.

    Comment by The Sentinel — October 9, 2008 @ 4:54 pm

  43. Great information in the video. However, the explanation exposes a negligence on the part of Tucows and NetApp in terms of identifying how this known bug in Linux could impact the entire email system. I understand the difficulties asserted in that statement, but I also think that a company shouldn’t deploy a mission critical system with so many users without intimately knowing every point of failure, and how to first prevent it, and second recover from it. It makes me wonder what else is unknown.

    Comment by JP — October 9, 2008 @ 5:35 pm

  44. Slick video. However, a bug in somebody else’s system does not redeem Tucows (news flash - all software has bugs.)

    The depth of a SaaS provider is measured by monitoring, planning and execution of contingency/recovery strategies. It seems that it took three days after the system fell over to realize that there was a resource leak.

    I may not be familiar with the system in question, but I do have an extensive background in performance management (dev on the MS Win2K performance management core). Resource leaks are an early warning sign and should be proactively monitored and dealt with before they eventually tear down the system.

    If the cause was from a) a known bug, and b) eventually identified by a resource leak, it was 99.9% preventable. It looks like some heads should roll.

    Comment by dc — October 9, 2008 @ 5:36 pm

  45. In addition to the previous questions, I’d like to know more details about the Linux bug this is being attributed to. Specifically what the actual bug really is.

    Comment by gsyoungblood — October 9, 2008 @ 6:21 pm

  46. when this complete farce is over will Tucows please take their own company e-mail off-line for 10 days (preferably in 3 random blocks during normal office hours) so that they can see the effect it has on their customers - please do this with without telling your staff that it’s going to happen - and make sure that you randomly allow some staff (but none at management level) to access their mailboxes for short periods - also make sure that you only post occasional status updates (obviously with no helpful explanations) and once you get 60% of your staff mailboxes back on-line take them all off-line again for the best part of a day

    of course at the end of this experiment you will need to tell all your email staff that they don’t really have to worry about it happening again - as all the resellers have gone out of business, because their customers don’t have any faith in them running a bath, let alone a business critical service

    failing that please explain to all our customers, that if their e-mail system only costs them a fraction of their mobile phone costs, then they are going to have to expect the manure to hit the air conditioning occasionally

    i’m presuming that all resellers will now have a complete refund of all email costs from August to December to enable us to try and restore our customer’s faith in our ability to choose competent suppliers ourselves

    thank goodness for global warming and international banks proving that not everything is 100% predictable / reliable

    Comment by unknown — October 9, 2008 @ 6:29 pm

  47. As I mentioned before, we will publish a detailed incident report but we won’t do that until we have the all clear and people have had a chance to regroup.

    The bug involves an interaction between Linux NFS and NetApp that was not publicly documented. We worked with one of the top Linux contributors who recognized the interactions we were experiencing as something he’d developed a patch for but it had not yet been submitted.

    Therefore it was pretty much impossible for us to be aware of the bug.

    We’ll include more details on the bug in the incident report.

    Comment by Ken Schafer — October 9, 2008 @ 7:33 pm

  48. ROFL - status page update @ 23:39 UTC says
    >>>
    We will continue to provide regular updates as more information becomes available.The next update is at approximately 25:30 UTC.
    <<<

    are you also working on an 8 day week?

    has your tech support team been moonlighting / running any investment banks in between these major outages?

    i posted a comment to this page a few hours ago and notice that it still hasn’t appeared yet - perhaps i need to click on a confirmation link in an e-mail being sent to my mailbox on cluster A - hopefully it will appear within the next 6 days that this article is still open for comments to be posted to (at least somebody is optimistic enough to think that everything will be OK by next week)

    is there any way of reading the comments posted to the August 13-18 blogs - as it looks like they are unavailable now except for
    http://opensrs.com/blog/2008/08/closing-notes-on-the-cluster-a-email-service-interruption/#comments

    Comment by unknown — October 9, 2008 @ 8:17 pm

  49. @unknown @46 - We use the OpenSRS Email Service for all Tucows communication. That means we’re experiencing exactly what your customers are experiencing.

    Comment by Ken Schafer — October 9, 2008 @ 11:01 pm

  50. @unknown @48 - You’re not seeing comments because there were no comments. We don’t delete comments.

    Comment by Ken Schafer — October 9, 2008 @ 11:02 pm

  51. I’d just like to reiterate that we hear you and we completely understand why you are so angry. We are aware of the problems this creates for you, our Resellers, and for your customers.

    We agree that this is not an acceptable level of service and we will work hard to ensure that the service meets and exceeds your expectations.

    Most of your requests for details on the issues at hand and our response will be addressed but that will have to wait until we release our incident report which we’ll publish after the service is stable and people have had a chance to catch their breath.

    Comment by Ken Schafer — October 9, 2008 @ 11:22 pm

  52. Well in section V of the Services Agreement, it states that if service level is below 98% for three months, then Tucows is in Breach of Contract. Already service level is below that in this one incident.

    This is the way out for all sane customers to jump ship as soon as possible. Too bad it is nearly as painful to change email servers as it is to stay here. But I guess it will only sting for a little while

    Comment by Paul Johnson — October 9, 2008 @ 11:26 pm

  53. “We agree that this is not an acceptable level of service and we will work hard to ensure that the service meets and exceeds your expectations.”

    We all can agree with that sentiment. However, after this incident it will be your actions that determine the trustworthiness of that statement.

    I’ve been with OpenSRS since almost the beginning and with the exception of a few bumps early on and email issues this year, it has been a good relationship. The bad thing is that these problems with email shake the overall confidence in related services; instead of renewals they become transfers.

    “Most of your requests for details on the issues at hand and our response will be addressed but that will have to wait until we release our incident report which we’ll publish after the service is stable and people have had a chance to catch their breath.”

    Consider the shoe on the other foot. Suppose your data center lost Internet connectivity and you were down. Completely. If you were then given only updates that said little more than “we’re aware of the problem and are working on it” every 4 to 8 hours, would you be satisfied with that?

    How about not being told any real details until after 3-4 days of downtime?

    At this point, you need to be much more forthcoming with what’s going on. Your incident report needs to include real details about the problem and how it is being addressed. It also needs to include details about how recovering from problems in the future will be handled in order to prevent extended outages in the future. And, finally, there needs to be some serious compensation to everyone affected. And not just token credits — your resellers are losing customers and more. If they lose their customers then what need do they have for your services?

    From what I gleaned in the video you are running your clusters so loaded that they cannot survive minor spikes, changes in behavior, or other issues. Perhaps it’s just I/O bottlenecks, or perhaps it’s something else.

    You need to have some way to restore email accounts so that they get new messages nearly immediately if something happens in the future. Then allow old messages to be fed in to the accounts as they are able to be recovered. It wouldn’t be good, but at least during the recovery users would be able to communicate instead of being totally down. Even if that switch isn’t thrown until the outage reaches 2 hours, that would be an improvement. Even if it means more work on the backend for your team to finish the recovery process, you can’t leave people languishing for days.

    In August it reportedly took 3+ days (I think, perhaps it was just 2+) for the reindexing. It sounds like it is taking similar time now. Seems like there should be a better way.

    On another front, what Linux distro and version are you using? Is it old? Or bleeding edge? Perhaps you should take a look at Solaris. Even though I’ve been primarily using Linux for 14+ years, I’ve seen places where Solaris’ NFSv4 works better than Linux’s. If you’re a heavy NFS shop, perhaps you should consider it, or at least evaluate it and see how it works out.

    Comment by gsyoungblood — October 10, 2008 @ 12:02 am

  54. Sorry, I couldn’t resist. For some reason when I saw this I couldn’t help but laugh.

    i·ro·ny –noun, plural -nies. 1. the use of words to convey a meaning that is the opposite of its literal meaning.

    OpenSRS Reseller Services - Email

    A competitive email service will make your customers more loyal, and improve your chances of selling additional services to them.

    Reliability: Going down or losing email is not an option.

    Comment by gsyoungblood — October 10, 2008 @ 12:36 am

  55. As I’m a small business who is relying on e-mail communications with customers, i desperately NEED MY MAIL NOW !!!!! It is MY MAIL, which has been sent TO ME and I can’t reach it because of some sh***y problems in your datacenter. Still, 24 hours later, I can’t login to my mailbox. Do I need to contact my lawyers and sue you ?

    Comment by Mark — October 10, 2008 @ 2:24 am

  56. 24 hours? That’s nothing. We haven’t had email in days. I can kiss the rest of my customers that I didn’t lose after the August outage good-bye. Thanks a lot Tucows.

    Two multi-day outages in less than 3 months. Inexcusable.

    Comment by john — October 10, 2008 @ 2:44 am

  57. if the latest status update at 09:44 UTC says

    We continue to monitor the e-mail service very closely and are emphasizing mailbox access over speed of delivery for new messages at this time.

    does it mean the same as

    Thank you for buying your new car from us - please note the tank is empty and the nearest petrol station is nowhere near here

    or

    Please take a seat in our restaurant - we don’t have any food and nobody knows if or when the chef might turn up

    it might be OK for you in the wild west to downgrade your e-mail systems overnight - but here in Europe we are in peak office hours and losing customers every hour

    you are killing our reputation and our business

    Comment by unknown — October 10, 2008 @ 6:47 am

  58. Sense of humour is priceless in times like these.

    Once this ordeal is over, I would really like to see this blog develop in the direction of what actions and changes Tucows will make to ensure a higher level of service in the future. For our part, we need to be convinced that this will not happen again any time soon. I think it would be the minimum to expect out of this, short of any form of compensation.

    Comment by TheSentinel — October 10, 2008 @ 8:07 am

  59. Ken,

    Your comment that you don’t delete Comments is a lie. You have someone monitor the comments and delete ones that are strongly worded. I should know because a comment I placed before has disappeared. If you can’t be truthful about that, how can we believe you about the other statements you’ve made?

    Comment by JP — October 10, 2008 @ 8:22 am

  60. can you stop queuing e-mails and start bouncing them

    our customers are complaining that they are being chased for replies to e-mails their own customers have sent to them - but of course they haven’t received them yet

    if these e-mails were bounced back then at least the sender would know our customer had an e-mail problem and wasn’t ignoring them

    what can be done by openSRS about secondary MX records

    whilst e-mails are being queued by the primary MX i presume that nothing would go to a secondary MX
    this just appears to be an ill thought out strategy for dealing with a major crisis - perhaps things will speed up now that the Americans are awake now and will start complaining again soon

    is there any way of telling us which of our customers are in the 40% that still have the problem - i’m fed up of finding out who they are when they phone up asking to speak to the clown running the show and demanding to be transferred away immediately

    this is no way to run a business - at least gmail / hotmail are free - oh to have understood your SLA before we signed up all our customers with you - you must be heading for a 5% downtime by now - we’ve never seen anybody offer less than 99.5% uptime for anything else

    Comment by unknown — October 10, 2008 @ 9:24 am

  61. The majority of my clients were in the “40%”. Telling them that the service is restored, but that they are receiving no new email is not news … it’s more of the same junk I have been telling them all week. An email account that can’t receive new mail is not of much use to my clients who are trying to submit bids for new work or trying to get closing documents for real estate transactions.
    Please, for God’s sake, put the resources in this and get a structure that can handle the volume. Your promises from the last incident seem to have been empty.

    Comment by Steven Bryan — October 10, 2008 @ 10:22 am

  62. This is my last post on this blog. We have just lost our biggest corporate customer, with significant financial damage to my business.

    Quote “… this incident has critically impacted on our perception of your company as our supplier, and although we are aware you have been let down in turn by your supplier, however this shows poor judgment in your choice of technical services and contingency planning.”

    Having lived through the last days and the way Tucows has handled the incident, I honestly believe that this company is not up to the job and is too far from even starting to deliver the services it promises. It has clearly not fulfilled its obligations and has cheated us and all its clients to the point that the damages incurred are now beyond any remedy.

    As of this moment, we have started procedures to transfer our services to another provider and will no longer represent Tucows. I urge all resellers who care about their own brand and customers to follow suit.

    Thanks Tucows… thanks a lot!

    Comment by TheSentinel — October 10, 2008 @ 10:34 am

  63. We have over 100 domains, representing over 400 accounts with opensrs, and have had complaints from the majority of our customers - not the 40% that is mentioned, but closer to 80%.

    Since this week’s outage began - and many of our clients were reporting problems well before anything was listed on the status page - we’ve been inundated with phone calls demanding to know when the system will be operational. Most of them have had no emails for three days, some of the lucky ones have managed to get some earlier emails, but none in the last twelve hours.

    This is obviously not just affecting our clients’ businesses, but our own business too - we’ve now spent the last three days answering support calls, and we’ll be spending the next two to three weeks explaining to our clients what happened, and assisting with them moving elsewhere, none of which we’re going to be paid for. For every day of downtime at opensrs we’re looking at losing the equivalent of a week’s work - for a small business this is the difference between continuing in business or being forced to close.

    Is there any chance of getting some better indication as to how quickly things are actually progressing?

    The status updates don’t contain enough information to explain to clients what is going wrong, what is being done, or what progress is being made. Even knowing that the inbound mail queue is reducing in size would be something positive to pass on to clients; an estimate as to how long it should be before the queues are running normally would be helpful.

    We have clients who are calling on an hourly basis to demand an update, and they’re getting increasingly frustrated that I can’t tell them anything different, especially as we’ve now been telling them the same thing for the last three days…

    Comment by David Masters — October 10, 2008 @ 10:38 am

  64. On the Status Page it says “All users continue to have access to their mailboxes.”
    No we don’t. Can’t get access through webmail or POP3.

    As far as we are concerned, Tucows you have not fixed anything yet.

    Comment by john — October 10, 2008 @ 10:55 am

  65. Customers are reporting that they can login but when they click on a message they’re returned to the login screen. It’s still broken.

    Comment by Mike Masin — October 10, 2008 @ 11:05 am

  66. Repeating Mike Masin’s comments — affected accounts still not functional. Login may or may not work, and if it does access to email does not.

    Sentinel, Sorry to hear about the loss of that customer. I know that hurts.

    Tucows, I would assume everyone here is dealing with situations like Sentinel’s in one way or another. This is what I alluded to when I said renewals were turning into transfers. Not just for email, but for everything.

    Fortunately, email is not a core service I offer, so while this is painful for me and my users, it’s not as devastating as it could be. As for myself, I’m still on the fence. I’ve run my own mail servers for 17 years (yes, since 1991, remember bang path addressing?) until I outsourced it late last year/early this year to OpenSRS. While I miss the reliability I used to have, I don’t miss the occasional headaches or dealing with insanely increasing spam volume.

    Comment by gsyoungblood — October 10, 2008 @ 11:42 am

  67. @JP @59 - We moderate first-time posts by new commenters but we let everything through that is not offensive, illegal or abusive. Pretty much everything goes through.

    Your message may have been awaiting moderation or may have been flagged as offensive but was definitely not removed because we didn’t like the content (see above for confirmation we’re okay with negative comments).

    Comment by Ken Schafer — October 10, 2008 @ 11:58 am

  68. I would like to comment on what Ken Schafer said about launching a new Status Page before the end of the year. Who cares about that? If the service was rock solid, it wouldn’t need a status page. I say fix this crappy service or we will have to find a more reliable email service somewhere else. I really would not have expected this out of Tucows. My customers demand more, therefore I must demand more, and I don’t get a warm and fuzzy about all of this.

    Comment by Walt Dix — October 10, 2008 @ 12:03 pm

  69. Hi Everyone,

    Thanks for the feedback on your end-user experience details. It’s best if you submit those to Support rather than posting here as it will get into the proper workflow MUCH quicker.

    Similarly, don’t count on my comments here as representative of CURRENT status - it’s best to use Status Page for that.

    Thanks for the technical suggestions as well. Most of what you’re suggesting has been investigated but I’ll make sure the engineers know of any new suggestions I find in the comments.

    Once again, I’m very sorry that we haven’t been able to resolve this more quickly. We know it is a big problem for you and your customers and we sincerely apologize.

    Comment by Ken Schafer — October 10, 2008 @ 12:15 pm

  70. @Ken @67 >>>we’re okay with negative comments<<<

    that seems to be the problem - you’re just getting flak - whilst we’re losing our customers and our reputations

    i doubt if openSRS will even notice (let alone care about) 1,000 customers leaving - but for some of us even 1 major customer leaving can be a disaster - especially as they will tell lots of other current & potential customers how bad our (not your) service has been

    Comment by unknown — October 10, 2008 @ 12:16 pm

  71. @Walt @68 Agreed.

    Introducing a new status page is by no means a replacement for having everything run the way it should.

    The status page we use now is pretty primitive and I’m really excited about the new approach we’re taking. It’s going to allow us to give you much better visibility and access to historic information about particular incidents.

    Comment by Ken Schafer — October 10, 2008 @ 12:22 pm

  72. Ken, request you get your counterpart from the Reseller/Business management or higher on this blog thread to address the business issues and questions being listed. The technical issues are one thing, literally allowing Tucows to trash our business, livelihoods and reputations is yet another.

    Failing to address the business impact is exposing Tucows to serious legal implications.

    Comment by DashOne — October 10, 2008 @ 12:35 pm

  73. For anyone on LinkedIn I just started an OpenSRS Resellers and Users group.

    http://www.linkedin.com/e/gis/1012737

    I thought it would be good for there to be an independent place for us to collaborate, network, and generally discuss things of interest.

    Comment by gsyoungblood — October 10, 2008 @ 1:15 pm

  74. This is mind-boggling. When Gmail or Hotmail goes down for an hour or two, it’s a major headline.

    This isn’t a hardware issue, a software issue, a load issue, a security issue, or in any way a technical issue.

    This is a MAJOR design flaw. Let me introduce you to a new concept (careful–long, technical term coming…): REDUNDANCY.

    I don’t pretend to be able to design the level of redundancy required to reliably support tens of thousands of mailboxes. This is why I’m not in the business of hosting tens of thousands of mailboxes.

    Let me introduce you to another concept: goodbye.

    Comment by Wayne Saucier — October 10, 2008 @ 4:44 pm

  75. Another reindexing? Is it going to be another week before we can get email?

    The level of incompetence being exhibited by Tucows is award winning.

    The sad thing is that for a lot of us it’s a moot point because we’ve lost the bulk of our customers or will be welcomed with nasty emails by our customers notifying us that they are transferring to another provider when we finally are able to get access to our emails.

    Comment by john — October 10, 2008 @ 4:47 pm

  76. @John @75 - Yes, another reindexing.

    What we found today is that the software we use for our mail server had an undocumented bug in it that made the reindexing we did last night to fix the undocumented bug in the Linux kernel ineffective.

    Had I not lived through it I would not have believed it possible for two such incredibly obscure bugs to interact like that.

    We ended up working with the actual developers of both pieces of software to get patches developed.

    We now have the mail server patch tested and in place. We’ll know in the next few hours what the pace will be for the reindexing and once we have some confidence around that number we’ll post an ETA on Status.

    Comment by Ken Schafer — October 10, 2008 @ 5:43 pm
