Archive for the 'Disruptions & Downtime' Category

18 Apr 15

Asset Cluster Maintenance Complete

CLUSTER MAINTENANCE
On April 16th, OSG undertook emergency cluster maintenance to update core cluster software packages and resolve critical bugs.
INVENTORY AND ASSETS
After the cluster maintenance and restart, many of you began noticing “missing asset” errors when rezzing recent items.
Based on the evidence and our investigation, these missing asset messages are not related to the updates applied on the 16th.
It appears that for the last month or so, the asset server was unable to write assets to the database, and it failed in a way that did *not* raise an exception or alert the asset server to trouble.
Asset updates appear to have been silently discarded by one or both of the asset clusters during that time.
These silent discards may have been “masked” by the various viewer, region, and nginx caches, so that these “ghost assets” still appeared viable… until the asset server was restarted for emergency patching.
CHANGES
While investigating what happened and trying to determine how this was even possible, we made several changes to the asset cluster to prevent a repeat of this type of event, or to detect and report it if it starts to happen again.
OSG has changed the database indexing system and added new logging, exception reporting, and database write reporting.
Additional asset write-and-verify tests are being implemented.
At this point, the asset cluster appears to be functioning correctly.
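
For readers curious what a write-and-verify test looks like in principle, here is a minimal sketch. It is not OSG's actual test code: an in-memory SQLite table stands in for the asset database, and the names `open_store`, `write_asset`, `write_and_verify`, and `AssetVerifyError` are hypothetical. The idea is simply to read every asset back after writing it and compare checksums, so that a silently discarded write becomes a loud error.

```python
import hashlib
import sqlite3


class AssetVerifyError(Exception):
    """Raised when an asset read back from the store does not match what was written."""


def open_store() -> sqlite3.Connection:
    # Assumption: an in-memory SQLite table stands in for the real asset database.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE assets (id TEXT PRIMARY KEY, data BLOB NOT NULL)")
    return conn


def write_asset(conn: sqlite3.Connection, asset_id: str, data: bytes) -> None:
    # Hypothetical writer; a real asset service would have its own write path.
    conn.execute(
        "INSERT OR REPLACE INTO assets (id, data) VALUES (?, ?)", (asset_id, data)
    )
    conn.commit()


def write_and_verify(conn, asset_id: str, data: bytes, writer=write_asset) -> str:
    """Write an asset, read it back, and compare SHA-256 digests."""
    expected = hashlib.sha256(data).hexdigest()
    writer(conn, asset_id, data)

    row = conn.execute(
        "SELECT data FROM assets WHERE id = ?", (asset_id,)
    ).fetchone()
    if row is None:
        # The write reported no error, yet nothing was stored: a silent discard.
        raise AssetVerifyError(f"asset {asset_id} missing after write")

    actual = hashlib.sha256(row[0]).hexdigest()
    if actual != expected:
        raise AssetVerifyError(f"asset {asset_id} checksum mismatch after write")
    return expected


if __name__ == "__main__":
    store = open_store()
    print("verified:", write_and_verify(store, "example-asset", b"example payload"))
```

Passing a writer that does nothing (simulating the silent discard described above) makes `write_and_verify` raise `AssetVerifyError` instead of returning normally.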
GOING FORWARD
Please examine your viewer and region inventory.
Clear viewer caches and reload them, and delete (or move to another folder) any inventory items that report missing assets.
Region operators can use the fcache assets, fcache status, and backup commands to check their region cache, then restart the region and watch for new or different asset errors.
Any assets that still appear missing may need to be reloaded from their source, or from an IAR/OAR made before or during the time the asset database was silently discarding writes.
FINAL NOTE
We feel your pain with these “missing asset” messages, as the OSG admin inventories and assets were not immune and are experiencing the same issues you are.

We deeply apologize for any inventory or asset issues that you have encountered.

01 Mar 15

OSgrid Logins Reopen

Melanie from Avination investigated the stuck asset service process and found an old SRAS asset was giving the asset code fits.
The offending asset was corrected and OSgrid logins have been reopened.
We now know what to look for and where if there are any future recurrences.
Thankfully, the problem looks to be very rare, based on the grid’s behavior so far.
As reopening tests continue, we are working on how to safely scan the entire set of recovered SRAS assets to see if any more of these surprises are hiding.
This is exactly like finding one kind of needle in a stack of millions of similar needles.
Please bear with us as the reopening continues.

And, again, we are sorry for the inconvenience.

01 Mar 15

OSgrid Logins Temporarily Closed for Maintenance

We are sorry for the inconvenience, but we are temporarily shutting off logins for maintenance on the grid.

We have a stuck asset server process and have shut off logins as a precaution to prevent any possible asset damage.

Also, until we give the all clear, please do not load or save IARs or OARs.

Check back with us Sunday March 1 for additional updates.

25 Feb 15

OSgrid ONLINE!

OSgrid is back online and open!

We know it’s been a long, painful, frustrating outage, and we do appreciate your patience, support, and encouragement through some rather bleak months.

But the wait is over – logins to OSgrid are now open again and the grid is back online!

Many people noticed our initial testing today. Initial admin logins, plaza and region restarts, and hypergrid logins have all been working as expected.

Given the large number of people and the amount of testing, it was decided to try to hold an office hours meeting on Wright Plaza. Despite a funky garbage collector, 17 or more people were able to join.

With these initial successes, it was decided that OSgrid was healthy enough to open up for wider testing.

We’re not anticipating any further major outages or issues, but there may still be an occasional short outage or downtime over the coming weeks as OSgrid settles back into normal usage and we adjust a setting here or there with the new asset services or updated plazas.

REGION RESTARTS
As you begin reconnecting regions, you may notice an initial failure when trying to teleport to them.

You may need to purge the region using your login to the osgrid.org website, then restart the region itself in order to fully reconnect it after the outage.

INVENTORY AND ASSETS
While we have done a large amount of testing already, there may still be an issue here or there with specific assets.

In our testing so far, the few asset glitches found do not indicate a problem with the asset recovery, new asset cluster, or new FSassets service.

Initial investigation points to small data changes or format issues that were present in specific assets before the outage and are now caught by the XML serializer.
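
As a rough illustration of how format issues like these can be surfaced, the sketch below runs each XML asset payload through a strict parser and collects the ones that fail. This is only an assumption-laden example: it checks XML well-formedness and nothing else, it does not use the OpenSim serializer, and `find_malformed` plus the sample payloads are hypothetical. The exact nature of the real asset glitches is not known from this post.

```python
import xml.etree.ElementTree as ET


def find_malformed(assets: dict) -> list:
    """Return (asset_id, error message) pairs for payloads that are not well-formed XML.

    `assets` maps a hypothetical asset id to its raw XML payload as bytes.
    """
    problems = []
    for asset_id, payload in assets.items():
        try:
            ET.fromstring(payload)
        except ET.ParseError as exc:
            # A lenient reader might have tolerated this; a strict parser will not.
            problems.append((asset_id, str(exc)))
    return problems


if __name__ == "__main__":
    sample = {
        # Made-up payloads for illustration only.
        "good-asset": b"<SceneObjectGroup><Name>box</Name></SceneObjectGroup>",
        "bad-asset": b"<SceneObjectGroup><Name>box</SceneObjectGroup>",
    }
    for asset_id, error in find_malformed(sample):
        print(asset_id, "failed to parse:", error)
```

A check like this only flags candidates for a closer look; an asset can be well-formed XML and still contain data the grid's own serializer rejects.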

If you see widespread issues, please report them – but also try to have patience as we may be swamped for a while getting everything fully rolling for everyone again.

In closing, many thanks again to everyone who helped OSgrid through this, and continued to support and encourage us.

Let me be the first to say … welcome back!

OSgrid is now open!

23 Feb 15

OSgrid Update for 2015-02-23

ASSET SERVER STATUS
Melanie Thielker of Avination donated considerable code, design, time, and effort to build a new asset cluster for OSgrid, based on her FSasset service which she is also contributing to OpenSim core.
Following the additional datacenter networking changes requested by Hiro and Melanie’s final cluster configuration and testing, both cluster nodes are up and running with the recovered assets.
Melanie pronounced her initial asset requests and the new cluster a complete success!

OSGRID RESTART
The final steps to re-opening are finally underway.
The inventory table was cleared during an earlier attempt to return the grid to service with a new asset and inventory database, back when it looked like the recovery service would not be able to restore the drives.
With the recovered assets in place, the matching inventory tables are being loaded, so that your login will have everything it needs to find your assets again.
Once the inventory tables complete their restore, return-to-service testing will begin, with Hiro announcing an ETA for reopening shortly after!

SPECIAL THANKS
OSgrid would like to thank Melanie, Justin, Diva, our supporters, and everyone else who has gotten behind OSgrid during this catastrophe.
OSgrid would like to extend an extra thanks to Melanie from Avination for her stellar asset code contribution and cluster design and configuration to pull OSgrid back from the abyss.
OSgrid’s return would not have been possible without her continued support!

Next Update 2015-03-01 or 2015-03-02

16 Feb 15

OSgrid Update for 2015-02-16

DRIVE RECOVERY STATUS
The recovered drive assets have been fully copied back over to the new asset servers without further NIC driver weirdness.

ASSET SERVER STATUS
Now that the recovery loading is done, tickets are open with the data center to reconfigure the underlying network connectivity to finish the asset server cluster setup.
We had to hold off on reconfiguration until the asset recovery from the drives was complete, to avoid a possible outage during the loading.

OSGRID RESTART
Still no ETA yet, sorry! Getting closer, step by step, though! Keep checking this space!

OSGRID RESTRUCTURING
Background discussions on OSG restructuring and service changes continue, but there are no definite plans yet. Other than removing the jump regions, most discussions revolve around changing how existing spaces are managed, with an eye to consolidation and some modernization, rather than something more drastic.
More news here once plans are definitive.

SPECIAL THANKS
OSgrid would like to thank Melanie, Justin, Diva, our supporters, and everyone else who has gotten behind OSgrid during this catastrophe.
The assistance, the patience, and the good wishes are all very much the rays of light we need to keep pushing forward in an otherwise very awful time.

Thank you all!

Next Update 2015-02-22 or 2015-02-23

09 Feb 15

OSgrid Update for 2015-02-09

DRIVE RECOVERY STATUS
The recovered drive could not be copied directly into the database because USB weirdness kept causing the disk to go offline.
However, once the drive was moved to a different server and exported over the network from there, the restore proceeded pretty smoothly until yesterday…
The database restore was well past 90% when a pair of ethernet NIC outages struck.

ASSET SERVER STATUS
The constant load of recovering the assets from disk to the database cluster over the network exposed an ethernet driver bug, delaying the completion of the copy.
We do not expect that the bug would have affected production use; we simply hit the network hard enough during recovery to uncover it.
BIOS settings to disable power management for the NIC and an updated OS NIC driver were put in place today, and will hopefully resolve that NIC error permanently.
The recovery to the database has been restarted and we’re all watching to see if there are further NIC driver issues.

OSGRID RESTART
Everyone is getting pretty hopeful that we’re close enough to announce an actual ETA, but we’re still just not quite there yet.
But, thanks to the constant, patient help of Melanie from Avination, the recovery and restart are looking very good.

OSGRID RESTRUCTURING
Background discussions on OSG restructuring and service changes continue, but there are no definite plans yet, other than to remove the jump regions once the grid is live and otherwise consolidate and shuffle regions as needed.
More news here once plans are more definite.

SPECIAL THANKS
OSgrid would like to thank Melanie, Justin, Diva, our supporters, and everyone else who has gotten behind OSgrid during this catastrophe.
The assistance, the patience, and the good wishes are all very much the rays of light we need to keep pushing forward in an otherwise very awful time.
Thank you all!

Next Update 2015-02-15 or 2015-02-16




Copyright © 2007-2016 OSGrid, Inc. - All rights reserved, except where noted.

The OSgrid Logo, and the word 'OSgrid' are trademarks of OSGrid, Inc. Usage of these terms elsewhere is allowed under certain conditions.