OpenSolaris


Thread: zpool unimportable (corrupt zpool metadata??) but no zdb -l device problems


Replies: 75 - Last Post: Feb 7, 2009 5:54 AM by: gino
gmvasile

Posts: 6
From:

Registered: 10/1/08
zpool unimportable (corrupt zpool metadata??) but no zdb -l device problems
Posted: Oct 1, 2008 2:20 AM
To: Communities » zfs » discuss

Hi,
I am running snv_90. I have a 6x1TB pool, configured as raidz. After a computer crash (root is NOT on the pool, only data) the pool showed FAULTED status.
I exported it and tried to reimport it, with the following result:
================
# zpool import
  pool: ztank
    id: 12125153257763159358
 state: FAULTED
status: The pool metadata is corrupted.
action: The pool cannot be imported due to damaged devices or data.
        The pool may be active on another system, but can be imported using
        the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-72
config:

        ztank       FAULTED   corrupted data
          raidz1    ONLINE
            c1t6d0  ONLINE
            c1t5d0  ONLINE
            c1t4d0  ONLINE
            c1t3d0  ONLINE
            c1t2d0  ONLINE
            c1t1d0  ONLINE
================

I searched Google and ran zdb -l on every pool device. The results follow below. To me it appears that all disks are OK and that zdb can read the zpool structure from each of them (at least that is how I interpret the messages), but zpool still reports corrupt pool metadata :-(

Any ideas as to what I might be able to do to salvage the data? Restoring from backup is not an option (yes, I know :() - as this is a personal project I hoped the raidz would be enough :-(

The output for each of the disks is more or less identical, all labels are accessible.

# zdb -l /dev/dsk/c1t6d0s0
--------------------------------------------
LABEL 0
--------------------------------------------
version=10
name='ztank'
state=0
txg=207161
pool_guid=12125153257763159358
hostid=628051022
hostname='zfssrv'
top_guid=763279656890868029
guid=10947029755543026189
vdev_tree
type='raidz'
id=0
guid=763279656890868029
nparity=1
metaslab_array=14
metaslab_shift=35
ashift=9
asize=6001149345792
is_log=0
children[0]
type='disk'
id=0
guid=10947029755543026189
path='/dev/dsk/c1t1d0s0'
devid='id1,sd@f0000000048455c81000880330000/a'
phys_path='/pci@0,0/pci1000,30@10/sd@1,0:a'
whole_disk=1
DTL=193
children[1]
type='disk'
id=1
guid=2640926618230776740
path='/dev/dsk/c1t2d0s0'
devid='id1,sd@f0000000048455c81000992690001/a'
phys_path='/pci@0,0/pci1000,30@10/sd@2,0:a'
whole_disk=1
DTL=192
children[2]
type='disk'
id=2
guid=8982722125061616789
path='/dev/dsk/c1t3d0s0'
devid='id1,sd@f0000000048455c81000ae8610002/a'
phys_path='/pci@0,0/pci1000,30@10/sd@3,0:a'
whole_disk=1
DTL=191
children[3]
type='disk'
id=3
guid=7263648809970512976
path='/dev/dsk/c1t4d0s0'
devid='id1,sd@f0000000048455c81000bb2cf0003/a'
phys_path='/pci@0,0/pci1000,30@10/sd@4,0:a'
whole_disk=1
DTL=190
children[4]
type='disk'
id=4
guid=5275414937202266822
path='/dev/dsk/c1t5d0s0'
devid='id1,sd@f0000000048455c81000ca3c40004/a'
phys_path='/pci@0,0/pci1000,30@10/sd@5,0:a'
whole_disk=1
DTL=189
children[5]
type='disk'
id=5
guid=8503895341004279533
path='/dev/dsk/c1t6d0s0'
devid='id1,sd@f0000000048455c81000d49220005/a'
phys_path='/pci@0,0/pci1000,30@10/sd@6,0:a'
whole_disk=1
DTL=188
--------------------------------------------
LABEL 1
--------------------------------------------
version=10
name='ztank'
state=0
txg=207161
pool_guid=12125153257763159358
hostid=628051022
hostname='zfssrv'
top_guid=763279656890868029
guid=10947029755543026189
vdev_tree
type='raidz'
id=0
guid=763279656890868029
nparity=1
metaslab_array=14
metaslab_shift=35
ashift=9
asize=6001149345792
is_log=0
children[0]
type='disk'
id=0
guid=10947029755543026189
path='/dev/dsk/c1t1d0s0'
devid='id1,sd@f0000000048455c81000880330000/a'
phys_path='/pci@0,0/pci1000,30@10/sd@1,0:a'
whole_disk=1
DTL=193
children[1]
type='disk'
id=1
guid=2640926618230776740
path='/dev/dsk/c1t2d0s0'
devid='id1,sd@f0000000048455c81000992690001/a'
phys_path='/pci@0,0/pci1000,30@10/sd@2,0:a'
whole_disk=1
DTL=192
children[2]
type='disk'
id=2
guid=8982722125061616789
path='/dev/dsk/c1t3d0s0'
devid='id1,sd@f0000000048455c81000ae8610002/a'
phys_path='/pci@0,0/pci1000,30@10/sd@3,0:a'
whole_disk=1
DTL=191
children[3]
type='disk'
id=3
guid=7263648809970512976
path='/dev/dsk/c1t4d0s0'
devid='id1,sd@f0000000048455c81000bb2cf0003/a'
phys_path='/pci@0,0/pci1000,30@10/sd@4,0:a'
whole_disk=1
DTL=190
children[4]
type='disk'
id=4
guid=5275414937202266822
path='/dev/dsk/c1t5d0s0'
devid='id1,sd@f0000000048455c81000ca3c40004/a'
phys_path='/pci@0,0/pci1000,30@10/sd@5,0:a'
whole_disk=1
DTL=189
children[5]
type='disk'
id=5
guid=8503895341004279533
path='/dev/dsk/c1t6d0s0'
devid='id1,sd@f0000000048455c81000d49220005/a'
phys_path='/pci@0,0/pci1000,30@10/sd@6,0:a'
whole_disk=1
DTL=188
--------------------------------------------
LABEL 2
--------------------------------------------
version=10
name='ztank'
state=0
txg=207161
pool_guid=12125153257763159358
hostid=628051022
hostname='zfssrv'
top_guid=763279656890868029
guid=10947029755543026189
vdev_tree
type='raidz'
id=0
guid=763279656890868029
nparity=1
metaslab_array=14
metaslab_shift=35
ashift=9
asize=6001149345792
is_log=0
children[0]
type='disk'
id=0
guid=10947029755543026189
path='/dev/dsk/c1t1d0s0'
devid='id1,sd@f0000000048455c81000880330000/a'
phys_path='/pci@0,0/pci1000,30@10/sd@1,0:a'
whole_disk=1
DTL=193
children[1]
type='disk'
id=1
guid=2640926618230776740
path='/dev/dsk/c1t2d0s0'
devid='id1,sd@f0000000048455c81000992690001/a'
phys_path='/pci@0,0/pci1000,30@10/sd@2,0:a'
whole_disk=1
DTL=192
children[2]
type='disk'
id=2
guid=8982722125061616789
path='/dev/dsk/c1t3d0s0'
devid='id1,sd@f0000000048455c81000ae8610002/a'
phys_path='/pci@0,0/pci1000,30@10/sd@3,0:a'
whole_disk=1
DTL=191
children[3]
type='disk'
id=3
guid=7263648809970512976
path='/dev/dsk/c1t4d0s0'
devid='id1,sd@f0000000048455c81000bb2cf0003/a'
phys_path='/pci@0,0/pci1000,30@10/sd@4,0:a'
whole_disk=1
DTL=190
children[4]
type='disk'
id=4
guid=5275414937202266822
path='/dev/dsk/c1t5d0s0'
devid='id1,sd@f0000000048455c81000ca3c40004/a'
phys_path='/pci@0,0/pci1000,30@10/sd@5,0:a'
whole_disk=1
DTL=189
children[5]
type='disk'
id=5
guid=8503895341004279533
path='/dev/dsk/c1t6d0s0'
devid='id1,sd@f0000000048455c81000d49220005/a'
phys_path='/pci@0,0/pci1000,30@10/sd@6,0:a'
whole_disk=1
DTL=188
--------------------------------------------
LABEL 3
--------------------------------------------
version=10
name='ztank'
state=0
txg=207161
pool_guid=12125153257763159358
hostid=628051022
hostname='zfssrv'
top_guid=763279656890868029
guid=10947029755543026189
vdev_tree
type='raidz'
id=0
guid=763279656890868029
nparity=1
metaslab_array=14
metaslab_shift=35
ashift=9
asize=6001149345792
is_log=0
children[0]
type='disk'
id=0
guid=10947029755543026189
path='/dev/dsk/c1t1d0s0'
devid='id1,sd@f0000000048455c81000880330000/a'
phys_path='/pci@0,0/pci1000,30@10/sd@1,0:a'
whole_disk=1
DTL=193
children[1]
type='disk'
id=1
guid=2640926618230776740
path='/dev/dsk/c1t2d0s0'
devid='id1,sd@f0000000048455c81000992690001/a'
phys_path='/pci@0,0/pci1000,30@10/sd@2,0:a'
whole_disk=1
DTL=192
children[2]
type='disk'
id=2
guid=8982722125061616789
path='/dev/dsk/c1t3d0s0'
devid='id1,sd@f0000000048455c81000ae8610002/a'
phys_path='/pci@0,0/pci1000,30@10/sd@3,0:a'
whole_disk=1
DTL=191
children[3]
type='disk'
id=3
guid=7263648809970512976
path='/dev/dsk/c1t4d0s0'
devid='id1,sd@f0000000048455c81000bb2cf0003/a'
phys_path='/pci@0,0/pci1000,30@10/sd@4,0:a'
whole_disk=1
DTL=190
children[4]
type='disk'
id=4
guid=5275414937202266822
path='/dev/dsk/c1t5d0s0'
devid='id1,sd@f0000000048455c81000ca3c40004/a'
phys_path='/pci@0,0/pci1000,30@10/sd@5,0:a'
whole_disk=1
DTL=189
children[5]
type='disk'
id=5
guid=8503895341004279533
path='/dev/dsk/c1t6d0s0'
devid='id1,sd@f0000000048455c81000d49220005/a'
phys_path='/pci@0,0/pci1000,30@10/sd@6,0:a'
whole_disk=1
DTL=188
================
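
A minimal sketch, assuming the flattened key=value layout of the zdb -l output shown above, of checking that every label reports the same txg and pool_guid. The function names label_summary and labels_agree are hypothetical, and the code only parses text that has already been captured; it does not run zdb itself.

```python
import re


def label_summary(zdb_output):
    """Return {label_number: {"txg": int, "pool_guid": int}} from `zdb -l` text."""
    labels = {}
    current = None
    for raw in zdb_output.splitlines():
        line = raw.strip()
        header = re.fullmatch(r"LABEL (\d+)", line)
        if header:
            current = int(header.group(1))
            labels[current] = {}
            continue
        if current is None or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Record only the first occurrence of each field within a label.
        if key in ("txg", "pool_guid") and key not in labels[current]:
            labels[current][key] = int(value)
    return labels


def labels_agree(labels):
    """True if every label reports the same (txg, pool_guid) pair."""
    distinct = {(v.get("txg"), v.get("pool_guid")) for v in labels.values()}
    return len(distinct) == 1


# Shortened sample in the same format as the output above.
sample = """\
--------------------------------------------
LABEL 0
--------------------------------------------
version=10
name='ztank'
txg=207161
pool_guid=12125153257763159358
--------------------------------------------
LABEL 1
--------------------------------------------
version=10
name='ztank'
txg=207161
pool_guid=12125153257763159358
"""

summary = label_summary(sample)
print(summary, labels_agree(summary))
```

Matching labels only show that the label configuration is readable on that device. They say nothing about the uberblocks or the metadata those point to, which fits with the import still failing here.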

gmvasile

Re: zpool unimportable (corrupt zpool metadata??) but no zdb -l device problems
Posted: Oct 1, 2008 3:42 AM   in response to: gmvasile

An update to the above: I ran zdb -e on the pool ID, and here is the result:
# zdb -e 12125153257763159358
zdb: can't open 12125153257763159358: I/O error

NB: zdb seems to recognize the ID, because running it with an incorrect ID gives me a different error:
# zdb -e 12125153257763159354
zdb: can't open 12125153257763159354: No such file or directory

Also zdb -e with the ID of the syspool works:
# zdb -e 8843238790372298114
Uberblock

magic = 0000000000bab10c
version = 10
txg = 317369
guid_sum = 14131844542001965925
timestamp = 1222857640 UTC = Wed Oct 1 12:40:40 2008

Dataset mos [META], ID 0, cr_txg 4, 2.76M, 244 objects
Dataset 8843238790372298114/export/home [ZPL], ID 60, cr_txg 721, 1.21G, 55 objects
Dataset 8843238790372298114/export [ZPL], ID 54, cr_txg 718, 19.0K, 5 objects
Dataset 8843238790372298114/swap [ZVOL], ID 28, cr_txg 15, 519M, 3 objects
Dataset 8843238790372298114/ROOT/snv_90 [ZPL], ID 48, cr_txg 710, 6.85G, 254748 objects
Dataset 8843238790372298114/ROOT [ZPL], ID 22, cr_txg 12, 18.0K, 4 objects
Dataset 8843238790372298114/dump [ZVOL], ID 34, cr_txg 18, 512M, 3 objects
Dataset 8843238790372298114 [ZPL], ID 5, cr_txg 4, 39.5K, 13 objects

etc etc.
=============

Any ideas? Could this be a hardware problem? I have no idea what to do next :-(

thanks for your help!
Vasile

gmvasile

Re: one step forward - pinging Lukas Karwacki (kangurek)
Posted: Oct 1, 2008 11:24 AM   in response to: gmvasile

On the advice of okona in the freenode.net #opensolaris channel I booted the latest OpenSolaris live CD and tried to import the pool. No luck. However, I then tried the trick from Lukas's post that allowed him to import his pool, and I made some progress.

After doing the mdb wizardry he described, I was able to run zpool import with the following result:
  pool: ztank
    id: whatever
 state: ONLINE
status: The pool was last accessed by another system.
   see: http://www.sun.com/msg/ZFS-8000-EY
config:

        ztank       ONLINE
          raidz1    ONLINE
            c4t0d0  ONLINE
            c4t1d0  ONLINE
            c4t2d0  ONLINE
            c4t3d0  ONLINE
            c4t4d0  ONLINE
            c4t5d0  ONLINE

HOWEVER.
When I try again to open the pool using zdb -e ztank
I still get zdb: can't open ztank: I/O error
and zpool import -f, whilst it starts and seems to access the disks sequentially, stops at the 3rd one (not sure which precisely). It spins the disk up and the process stops right there, and the system will not reboot when asked to (shutdown -g0 -y -i5).
So there's some slight progress here.

I would really appreciate ideas from you guys!

Thanks
Vasile

okona

Posts: 7
From:

Registered: 6/9/08
Re: one step forward - pinging Lukas Karwacki (kangurek)
Posted: Oct 2, 2008 7:37 AM   in response to: gmvasile

> When I attempt again to import using zdb -e ztank
> I still get zdb: can't open ztank: I/O error
> and zpool import -f, whilst it starts and seems to
> access the disks sequentially, it stops at the 3rd
> one (not sure which precisely - it spins it up and the
> process stops right there, and the system will not
> reboot when asked to (shutdown -g0 -y -i5)
> so there's some slight progress here.

How about just removing that disk and trying to import?

gmvasile

Re: one step forward - pinging Lukas Karwacki (kangurek)
Posted: Oct 2, 2008 1:32 PM   in response to: okona

Thanks Martin,
Yeah, tried it but no luck :-( In fact I tried removing every disk one by one, with no luck, which is why I do not think it is a hardware problem...
Kind regards
Vasile

gmvasile

Re: Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted: Oct 3, 2008 7:42 AM   in response to: okona

Hi folks,

I just wanted to share the end of my "adventure" here and especially take the time to thank Victor for helping me out of this mess.

I will let him explain the technical details (I am out of my depth here), but the bottom line is that he spent a couple of hours with me on the machine and sorted me out. His explanation: he invalidated the incorrect uberblocks and forced ZFS to revert to an earlier state that was consistent.

The machine is now in the process of doing a full scrub and the first order of business tomorrow will be to do a full backup :-)

According to his explanation, the reason for my troubles was that Solaris was running in a VM on my Debian server, and the VM was not shut down properly when the Debian server did a controlled shutdown following a UPS event.

The Solaris machine was shut down abruptly, and because it was not in control of the entire chain down to the bare hardware, it appears that some writes were in fact still held by Debian when Solaris believed them safely committed.

This left the zpool in question in a state that even raidz1 could not help with.
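
To make the idea of "reverting to an earlier state" more concrete, here is a toy model in Python. It is only a sketch: it is not ZFS code, and the actual procedure Victor used is not described in this thread. It assumes the general design in which ZFS keeps several uberblocks and uses the one with the highest txg whose checksum verifies. The class name, the function names, the tree_ok flag and the two earlier txg values (207159, 207160) are hypothetical; 207161 is the txg reported by the labels above.

```python
from dataclasses import dataclass


@dataclass
class Uberblock:
    txg: int
    checksum_ok: bool  # would the uberblock's own checksum verify?
    tree_ok: bool      # hypothetical: is the metadata it points to intact?


def active_uberblock(ring):
    """Toy selection rule: the valid uberblock with the highest txg wins."""
    valid = [ub for ub in ring if ub.checksum_ok]
    return max(valid, key=lambda ub: ub.txg) if valid else None


def invalidate_newer_than(ring, txg):
    """Mark every uberblock newer than `txg` invalid so selection falls back."""
    for ub in ring:
        if ub.txg > txg:
            ub.checksum_ok = False


ring = [
    Uberblock(207159, True, True),
    Uberblock(207160, True, False),  # points at writes that never reached disk
    Uberblock(207161, True, False),
]

# Selection only looks at the uberblock itself, so the newest one is chosen
# even though the metadata behind it is damaged.
chosen = active_uberblock(ring)
print(chosen.txg, chosen.tree_ok)

# After invalidating the newer uberblocks, the older consistent one is used.
invalidate_newer_than(ring, 207159)
chosen = active_uberblock(ring)
print(chosen.txg, chosen.tree_ok)
```

The point of the model is that the pool can look healthy at the label level while the newest uberblock leads to metadata that was never written, and that stepping back a few transactions can give a consistent pool at the cost of the most recent changes.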

Anyway, again, lots and lots of thanks to Victor!!!

kind regards
Vasile

Darren J Moffat
darrenm@opensolaris....
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted: Oct 3, 2008 7:50 AM   in response to: gmvasile


Vasile Dumitrescu wrote:
> Hi folks,
>
> I just wanted to share the end of my "adventure" here and especially take the time to thank Victor for helping me out of this mess.
>
> I will let him explain the technical details (I am out of my depth here) but bottom line he spent a couple of hours with me on the machine and sorted me out. His explanation: he invalidated the incorrect uberblocks and forced zfs to revert to an earlier state that was consistent.
>
> The machine is now in the process of doing a full scrub and the first order of business tomorrow will be to do a full backup :-)
>
> According to his explanation, the reason for the troubles I had was that Solaris was running in a VM on my Debian server and it was not shut down properly when the Debian server did a controlled shutdown following a UPS event.

Which VM solution was this? VMware, VirtualBox, Xen, other? How were
the "disks" presented to the guest? What are the "disks" in the host:
real disks, files, something else?


--
Darren J Moffat
_______________________________________________
zfs-discuss mailing list
zfs-discuss at opensolaris dot org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


gmvasile

Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted: Oct 3, 2008 8:37 AM   in response to: Darren J Moffat

> Which VM solution was this? VMware, VirtualBox, Xen, other? How were
> the "disks" presented to the guest? What are the "disks" in the host:
> real disks, files, something else?

VMware 6.0.4 running on Debian unstable,
Linux bigsrv 2.6.26-1-amd64 #1 SMP Wed Sep 24 13:59:41 UTC 2008 x86_64 GNU/Linux

Solaris is vanilla snv_90 installed with no GUI.

Here is the content of the .vmx file in question:
================================================
#!/usr/bin/vmware
config.version = "8"
virtualHW.version = "6"
scsi0.present = "TRUE"
scsi0.virtualDev = "lsilogic"

memsize = "4096"
MemAllowAutoScaleDown = "FALSE"
MemTrimRate = "0"
sched.mem.pshare.enable = "FALSE"
sched.mem.minsize = "3062"
sched.mem.max = "7000"
sched.mem.maxmemctl = "0"
sched.mem.shares = "100000"

scsi0:0.present = "TRUE"
scsi0:0.fileName = "/home/vasile/vmware/solsrv/OpenSolaris64.vmdk"
ide1:0.present = "TRUE"
ide1:0.autodetect = "TRUE"
ide1:0.deviceType = "cdrom-image"
floppy0.startConnected = "FALSE"
floppy0.autodetect = "TRUE"
ethernet0.present = "TRUE"
ethernet0.virtualDev = "e1000"
ethernet0.wakeOnPcktRcv = "TRUE"
sound.present = "FALSE"
sound.fileName = "-1"
sound.autodetect = "TRUE"
svga.autodetect = "FALSE"
pciBridge0.present = "TRUE"
displayName = "zfssrv"
guestOS = "solaris10-64"
nvram = "Solaris 10 64-bit.nvram"
deploymentPlatform = "windows"
virtualHW.productCompatibility = "hosted"
RemoteDisplay.vnc.port = "0"
tools.upgrade.policy = "useGlobal"

floppy0.fileName = "/dev/fd0"
extendedConfigFile = "Solaris 10 64-bit.vmxf"

ide1:0.fileName = ""
floppy0.present = "FALSE"
gui.powerOnAtStartup = "TRUE"

ide1:0.startConnected = "TRUE"
ethernet0.addressType = "generated"
uuid.location = "56 4d da 02 a4 a0 78 74-2e 09 90 62 45 bb c4 94"
uuid.bios = "56 4d da 02 a4 a0 78 74-2e 09 90 62 45 bb c4 94"
scsi0:0.redo = ""
pciBridge0.pciSlotNumber = "17"
scsi0.pciSlotNumber = "16"
ethernet0.pciSlotNumber = "32"
sound.pciSlotNumber = "-1"
ethernet0.generatedAddress = "00:0c:29:bb:c4:94"
ethernet0.generatedAddressOffset = "0"
tools.syncTime = "FALSE"

svga.maxWidth = "1024"
svga.maxHeight = "768"
svga.vramSize = "3145728"

scsi0:1.present = "TRUE"
scsi0:1.fileName = "ztank-sda.vmdk"
scsi0:1.mode = "independent-persistent"
scsi0:1.deviceType = "rawDisk"
scsi0:2.present = "TRUE"
scsi0:2.fileName = "ztank-sdb.vmdk"
scsi0:2.mode = "independent-persistent"
scsi0:2.deviceType = "rawDisk"
scsi0:3.present = "TRUE"
scsi0:3.fileName = "ztank-sdc.vmdk"
scsi0:3.mode = "independent-persistent"
scsi0:3.deviceType = "rawDisk"
scsi0:4.present = "TRUE"
scsi0:4.fileName = "ztank-sdd.vmdk"
scsi0:4.mode = "independent-persistent"
scsi0:4.deviceType = "rawDisk"
scsi0:5.present = "TRUE"
scsi0:5.fileName = "ztank-sde.vmdk"
scsi0:5.mode = "independent-persistent"
scsi0:5.deviceType = "rawDisk"
scsi0:6.present = "TRUE"
scsi0:6.fileName = "ztank-sdf.vmdk"
scsi0:6.mode = "independent-persistent"
scsi0:6.deviceType = "rawDisk"

scsi0:1.redo = ""
scsi0:2.redo = ""
scsi0:3.redo = ""
scsi0:4.redo = ""
scsi0:5.redo = ""
scsi0:6.redo = ""

isolation.tools.dnd.disable = "TRUE"
snapshot.disabled = "TRUE"

scsi0:0.mode = "independent-persistent"

isolation.tools.copy.disable = "FALSE"
isolation.tools.paste.disable = "FALSE"

tools.remindInstall = "TRUE"
================================================

in summary: physical disks, assigned 100% to the VM
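
A small sketch, assuming the simple key = "value" syntax of the .vmx file above, that lists which virtual SCSI slots are backed by raw disks. The function name raw_disks is hypothetical, and the sample is a shortened excerpt of the file above.

```python
import re


def raw_disks(vmx_text):
    """Map device slot (e.g. 'scsi0:1') to its backing file for rawDisk entries."""
    settings = {}
    for line in vmx_text.splitlines():
        # Lines look like: key = "value"; the leading "#!" line does not match.
        m = re.match(r'\s*([^#=\s]+)\s*=\s*"(.*)"\s*$', line)
        if m:
            settings[m.group(1)] = m.group(2)
    disks = {}
    for key, value in settings.items():
        if key.endswith(".deviceType") and value == "rawDisk":
            slot = key[: -len(".deviceType")]
            disks[slot] = settings.get(slot + ".fileName")
    return disks


sample = '''\
#!/usr/bin/vmware
scsi0:0.present = "TRUE"
scsi0:0.fileName = "/home/vasile/vmware/solsrv/OpenSolaris64.vmdk"
scsi0:1.present = "TRUE"
scsi0:1.fileName = "ztank-sda.vmdk"
scsi0:1.mode = "independent-persistent"
scsi0:1.deviceType = "rawDisk"
scsi0:2.present = "TRUE"
scsi0:2.fileName = "ztank-sdb.vmdk"
scsi0:2.mode = "independent-persistent"
scsi0:2.deviceType = "rawDisk"
'''

print(raw_disks(sample))
```

Run against the full file above, this would list scsi0:1 through scsi0:6 and leave out scsi0:0, the boot disk, which has no rawDisk deviceType line.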

HTH

kind regards
Vasile

fajar

Posts: 226
From:

Registered: 5/14/08
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted: Oct 4, 2008 12:19 AM   in response to: gmvasile


On Fri, Oct 3, 2008 at 10:37 PM, Vasile Dumitrescu
<vasiledumitrescu at gmail dot com> wrote:

> VMWare 6.0.4 running on Debian unstable,
> Linux bigsrv 2.6.26-1-amd64 #1 SMP Wed Sep 24 13:59:41 UTC 2008 x86_64 GNU/Linux
>
> Solaris is vanilla snv_90 installed with no GUI.


>
> in summary: physical disks, assigned 100% to the VM

That's weird. I thought one of the points of using physical disks
instead of files was to avoid problems caused by caching on host/dom0?


Darren J Moffat
darrenm@opensolaris....
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted: Oct 6, 2008 2:39 AM   in response to: fajar


Fajar A. Nugraha wrote:
> On Fri, Oct 3, 2008 at 10:37 PM, Vasile Dumitrescu
> <vasiledumitrescu at gmail dot com> wrote:
>
>> VMWare 6.0.4 running on Debian unstable,
>> Linux bigsrv 2.6.26-1-amd64 #1 SMP Wed Sep 24 13:59:41 UTC 2008 x86_64 GNU/Linux
>>
>> Solaris is vanilla snv_90 installed with no GUI.
>
>
>> in summary: physical disks, assigned 100% to the VM
>
> That's weird. I thought one of the point of using physical disks
> instead of files was to avoid problems caused by caching on host/dom0?

The data still flows through the host/dom0 device drivers and is thus at
the mercy of the commands they issue to the physical devices.
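
As a rough illustration of that point, here is a toy model of a host layer that acknowledges writes before they are durable. It is a sketch under stated assumptions, not a description of what VMware or the Linux drivers actually did in this case: the class name CachingHost, the block names, and the out-of-order write-back are all hypothetical.

```python
class CachingHost:
    """Toy host layer that reports success before data reaches the disk."""

    def __init__(self):
        self.cache = {}  # acknowledged to the guest, not yet durable
        self.disk = {}   # actually on stable storage

    def write(self, block, data):
        self.cache[block] = data
        return True      # the guest sees success immediately

    def write_back(self, block):
        # The host chooses when, and in which order, blocks become durable.
        self.disk[block] = self.cache.pop(block)

    def power_loss(self):
        self.cache.clear()  # anything not written back is gone


host = CachingHost()
host.write("metadata@txg2", "new block tree")
host.write("uberblock", "points to metadata@txg2")

# Assumption for the sake of the example: the uberblock is persisted first.
host.write_back("uberblock")
host.power_loss()

# The durable uberblock now refers to metadata that never reached the disk.
dangling = "uberblock" in host.disk and "metadata@txg2" not in host.disk
print(dangling)
```

In this model the guest did everything in the right order and was told that both writes succeeded, yet the disk ends up in a state the guest never intended.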

--
Darren J Moffat


rwarner2

Posts: 2
From:

Registered: 4/28/07
Re: Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted: Oct 9, 2008 2:53 AM   in response to: gmvasile

> His explanation: he invalidated the incorrect
> uberblocks and forced zfs to revert to an earlier
> state that was consistent.

Would someone be willing to document the steps required in order to do this please?

I have a disk in a similar state:

# zpool import
  pool: tank
    id: 13234439337856002730
 state: FAULTED
status: The pool metadata is corrupted.
action: The pool cannot be imported due to damaged devices or data.
        The pool may be active on another system, but can be imported using
        the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-72
config:

        tank        FAULTED   corrupted data
          c7d0      ONLINE

This happened after I foolishly began trusting zfs-fuse with some large but relatively unimportant data on a big, empty single disk zpool in my home machine and then suffered a power cut before I got around to backing it up.

OpenSolaris can't import the pool either, so the drive is sat on a shelf waiting till a method for fixing it is published.

While it's clearly my own fault for taking the risks I did, it's still pretty frustrating knowing that all my data is likely still intact and nicely checksummed on the disk but that none of it is accessible due to some tiny filesystem inconsistency. With pretty much any other FS I think I could get most of it back.

Clearly such a small number of occurrences in what were admittedly precarious configurations aren't going to be particularly convincing motivators to provide a general solution, but I'd feel a whole lot better about using ZFS if I knew that there were some documented steps or a tool (zfsck? ;) that could help to recover from this kind of metadata corruption in the unlikely event of it happening.

cheers,

Rob

mgerdts

Posts: 1,361
From: US

Registered: 8/5/05
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted: Oct 9, 2008 4:37 AM   in response to: rwarner2


On Thu, Oct 9, 2008 at 4:53 AM, . <osl at boymonkey dot com> wrote:
> While it's clearly my own fault for taking the risks I did, it's
> still pretty frustrating knowing that all my data is likely still
> intact and nicely checksummed on the disk but that none of it is
> accessible due to some tiny filesystem inconsistency. With pretty
> much any other FS I think I could get most of it back.
>
> Clearly such a small number of occurrences in what were admittedly
> precarious configurations aren't going to be particularly convincing
> motivators to provide a general solution, but I'd feel a whole lot
> better about using ZFS if I knew that there were some documented
> steps or a tool (zfsck? ;) that could help to recover from this kind
> of metadata corruption in the unlikely event of it happening.

Well said. You have hit on my #1 concern with deploying ZFS.

FWIW, I believe that I have hit the same type of bug as the OP in the
following combinations:

- T2000, LDoms 1.0, various builds of Nevada in control and guest
domains.
- Laptop, VirtualBox 1.6.2, Windows XP SP2 host, OpenSolaris 2008.05 @
build 97 guest

In the past year I've lost more ZFS file systems than I have any other
type of file system in the past 5 years. With other file systems I
can almost always get some data back. With ZFS I can't get any back.

--
Mike Gerdts
http://mgerdts.blogspot.com/


Wilkinson, Alex
alex.wilkinson@dsto....
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted: Oct 9, 2008 4:46 AM   in response to: mgerdts



On Thu, Oct 09, 2008 at 06:37:23AM -0500, Mike Gerdts wrote:

>FWIW, I believe that I have hit the same type of bug as the OP in the
>following combinations:
>
>- T2000, LDoms 1.0, various builds of Nevada in control and guest
> domains.
>- Laptop, VirtualBox 1.6.2, Windows XP SP2 host, OpenSolaris 2008.05 @
> build 97 guest
>
>In the past year I've lost more ZFS file systems than I have any other
>type of file system in the past 5 years. With other file systems I
>can almost always get some data back. With ZFS I can't get any back.

That's scary to hear!

-aW



Ahmed Kamal
email.ahmedkamal@goo...
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted: Oct 9, 2008 5:44 AM   in response to: Wilkinson, Alex



>> In the past year I've lost more ZFS file systems than I have any other
>> type of file system in the past 5 years. With other file systems I
>> can almost always get some data back. With ZFS I can't get any back.
>
> Thats scary to hear!

I am really scared now! I was the one trying to quantify ZFS reliability, and that is surely bad to hear!


mgerdts

Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted: Oct 9, 2008 6:22 AM   in response to: Ahmed Kamal


On Thu, Oct 9, 2008 at 7:44 AM, Ahmed Kamal
<email dot ahmedkamal at googlemail dot com> wrote:
>>> In the past year I've lost more ZFS file systems than I have any other
>>> type of file system in the past 5 years. With other file systems I
>>> can almost always get some data back. With ZFS I can't get any back.
>>
>> Thats scary to hear!
>
> I am really scared now! I was the one trying to quantify ZFS reliability,
> and that is surely bad to hear!

The circumstances where I have lost data have been when ZFS has not
handled a layer of redundancy. However, I am not terribly optimistic
of the prospects of ZFS on any device that hasn't committed writes
that ZFS thinks are committed. Mirrors and raidz would also be
vulnerable to such failures.

I also have run into other failures that have gone unanswered on the
lists. It makes me wary about using zfs without a support contract
that allows me to escalate to engineering. Patching-only support
won't help.

http://mail.opensolaris.org/pipermail/zfs-discuss/2007-December/044984.html
Hang only after I mirrored the zpool, no response on the list

http://mail.opensolaris.org/pipermail/zfs-discuss/2008-June/048255.html
I think this is fixed around snv_98, but the zfs-discuss list was
surprisingly silent on acknowledging it as a problem - I had no
idea that it was being worked until I saw the commit. The panic
seemed to be caused by dtrace - core developers of dtrace
were quite interested in the kernel crash dump.

http://mail.opensolaris.org/pipermail/zfs-discuss/2008-September/051109.html
Panic during ON build. Pool was lost, no response from list.

--
Mike Gerdts
http://mgerdts.blogspot.com/


Timh Bergström
timh.bergstrom@diino...
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted: Oct 9, 2008 7:50 AM   in response to: mgerdts


Unfortunately I can only agree with the doubts about running ZFS in
production environments. I've lost ditto blocks, I've gotten
corrupted pools and a bunch of other failures, even in
mirror/raidz/raidz2 setups with or without hardware mirrors/raid5/6.
Plus there is the uncertainty of whether a sudden crash/reboot will
corrupt or even destroy the pools, with "restore from backup" as the
only advice. I've been lucky so far in getting my pools back thanks to
people like Victor.

What is needed is a proper fsck for ZFS which can resolve "minor"
data corruption. Tools for rebuilding, resizing and moving the data
about on pools are also needed, and even for recovering data from
faulted pools, like there are for ext2/3/ufs/ntfs.

All in all, a great FS but not production ready until the tools are in
place or it gets really, really resilient to minor failures and/or
crashes in both software and hardware. For now I'll stick to XFS/UFS
and sw/hw-raid and live with the restrictions of such filesystems.

//T


--
Timh Bergström
System Administrator
Diino AB - www.diino.com
:wq


shawga

Posts: 102
From: Louisville, CO

Registered: 3/12/06
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted: Oct 9, 2008 8:10 AM   in response to: Timh Bergström


Perhaps I misunderstand, but the issues below are all based on Nevada,
not Solaris 10.

Nevada isn't production code. For real ZFS testing, you must use a
production release, currently Solaris 10 (update 5, soon to be update 6).

In the last 2 years, I've stored everything in my environment (home
directory, builds, etc.) on ZFS on multiple types of storage subsystems
without issues. All of this has been on Solaris 10, however.

Btw, I completely agree on the panic issue. If I have a large DB
server with many pools, and one inconsequential pool fails, I lose the
entire DB server. I'd really like to see an option at the zpool level
directing what to do in a panic for a particular pool. Perhaps this
is in the latest bits; if so, sorry, I'm running old stuff. :-)

I also run ZFS on my mac. While not production quality, some of the
panic errors dealing with external (firewire, usb, esata) are very
irritating. A hiccup due to a jostled cable, and the entire box
panics. That's frustrating.



mgerdts

Posts: 1,361
From: US

Registered: 8/5/05
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted: Oct 9, 2008 8:18 AM   in response to: shawga


On Thu, Oct 9, 2008 at 10:10 AM, Greg Shaw <Greg dot Shaw at sun dot com> wrote:
> Nevada isn't production code. For real ZFS testing, you must use a
> production release, currently Solaris 10 (update 5, soon to be update 6).

I misstated before in my LDoms case. The corrupted pool was on
Solaris 10, with LDoms 1.0. The control domain was SX*E, but the
zpool there showed no problems. I got into a panic loop with dangling
dbufs. My understanding is that this was caused by a bug in the LDoms
manager 1.0 code that has been fixed in a later release. It was a
supported configuration; I pushed for and got a fix. However, that
pool was still lost.

--
Mike Gerdts
http://mgerdts.blogspot.com/


mgerdts

Posts: 1,361
From: US

Registered: 8/5/05
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted: Oct 9, 2008 8:33 PM   in response to: mgerdts


On Thu, Oct 9, 2008 at 10:18 AM, Mike Gerdts <mgerdts at gmail dot com> wrote:
> On Thu, Oct 9, 2008 at 10:10 AM, Greg Shaw <Greg dot Shaw at sun dot com> wrote:
>> Nevada isn't production code. For real ZFS testing, you must use a
>> production release, currently Solaris 10 (update 5, soon to be update 6).
>
> I misstated before in my LDoms case. The corrupted pool was on
> Solaris 10, with LDoms 1.0. The control domain was SX*E, but the
> zpool there showed no problems. I got into a panic loop with dangling
> dbufs. My understanding is that this was caused by a bug in the LDoms
> manager 1.0 code that has been fixed in a later release. It was a
> supported configuration, I pushed for and got a fix. However, that
> pool was still lost.

Or maybe it wasn't fixed yet. I see that this was committed just today.

6684721 file backed virtual i/o should be synchronous

http://hg.genunix.org/onnv-gate.hg/rev/eb40ff0c92ec

--
Mike Gerdts
http://mgerdts.blogspot.com/


mgerdts

Posts: 1,361
From: US

Registered: 8/5/05
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted: Oct 13, 2008 9:58 AM   in response to: mgerdts


On Thu, Oct 9, 2008 at 10:33 PM, Mike Gerdts <mgerdts at gmail dot com> wrote:
> On Thu, Oct 9, 2008 at 10:18 AM, Mike Gerdts <mgerdts at gmail dot com> wrote:
>> On Thu, Oct 9, 2008 at 10:10 AM, Greg Shaw <Greg dot Shaw at sun dot com> wrote:
>>> Nevada isn't production code. For real ZFS testing, you must use a
>>> production release, currently Solaris 10 (update 5, soon to be update 6).
>>
>> I misstated before in my LDoms case. The corrupted pool was on
>> Solaris 10, with LDoms 1.0. The control domain was SX*E, but the
>> zpool there showed no problems. I got into a panic loop with dangling
>> dbufs. My understanding is that this was caused by a bug in the LDoms
>> manager 1.0 code that has been fixed in a later release. It was a
>> supported configuration, I pushed for and got a fix. However, that
>> pool was still lost.
>
> Or maybe it wasn't fixed yet. I see that this was committed just today.
>
> 6684721 file backed virtual i/o should be synchronous
>
> http://hg.genunix.org/onnv-gate.hg/rev/eb40ff0c92ec

The related information from the LDoms Manager 1.1 Early Access
release notes (820-4914-10):

Data Might Not Be Written Immediately to the Virtual Disk Backend If
Virtual I/O Is Backed by a File or Volume

Bug ID 6684721: When a file or volume is exported as a virtual disk,
then the service domain exporting that file or volume is acting as a
storage cache for the virtual disk. In that case, data written to the
virtual disk might get cached into the service domain memory instead
of being immediately written to the virtual disk backend. Data are not
cached if the virtual disk backend is a physical disk or slice, or if
it is a volume device exported as a single-slice disk.

Workaround: If the virtual disk backend is a file or a volume device
exported as a full disk, then you can prevent data from being cached
into the service domain memory and have data written immediately to
the virtual disk backend by adding the following line to the
/etc/system file on the service domain.

set vds:vd_file_write_flags = 0

Note – Setting this tunable flag does have an impact on performance
when writing to a virtual disk, but it does ensure that data are
written immediately to the virtual disk backend.

--
Mike Gerdts
http://mgerdts.blogspot.com/


Miles Nordin
carton@Ivy.NET
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted: Oct 9, 2008 11:38 AM   in response to: shawga


>>>>> "gs" == Greg Shaw <Greg dot Shaw at Sun dot COM> writes:

gs> Nevada isn't production code. For real ZFS testing, you must
gs> use a production release, currently Solaris 10 (update 5, soon
gs> to be update 6).

Based on list feedback, my impression is that the results of a
``test'' confined to s10, particularly s10u4 (the latest available
during most of Mike's experience), would be worse than Nevada
experience over the same period, but I doubt either matches UFS+SVM
or ext3+LVM2. The on-disk format with ``ditto blocks'' and ``always
consistent'' may be fantastic, but the code for reading it is not.

Maybe the code is stellar, and the problem really is underlying
storage stacks that fail to respect write barriers. If so, ZFS needs
to include a storage stack qualification tool. For me it doesn't
strain credibility to believe these problems might be rampant in VM
stacks and SANs, nor do I find it unacceptable if ZFS is vastly more
sensitive to them than any other filesystem. If this speculation
turns out to really be the case, I imagine the two going together: the
problems are rampant because they don't bother other filesystems too
catastrophically. If this is really the situation, then ZFS needs to
give the sysadmin a way to isolate and fix the problems
deterministically before filling the pool with data, not just blame
the sysadmin based on nebulous speculatory hindsight gremlins.

And if it's NOT the case, the ZFS problems need to be acknowledged and
fixed.

To my view, the above is *IN ADDITION* to developing a
recovery/forensic/``fsck'' tool, not either/or. The pools should not
be getting corrupt in the first place, and pulling the cord should not
mean you have to settle for best-effort. None of the modern
filesystems demand an fsck after unclean shutdown.

The current procedure for qualifying a platform seems to be: (1)
subject it to heavy write activity, (2) pull the cord, (3) repeat.
Ahmed, maybe you should use that test to ``quantify'' filesystem
reliability. You can try it with ZFS, then reinstall the machine with
CentOS and try the same test with ext3+LVM2 or xfs+areca. The numbers
you get are how many times you can pull the cord before you lose
something, and how much you lose. Here's a really old test of that
sort comparing Linux filesystems, which is something like what I have
in mind:

https://www.redhat.com/archives/fedora-list/2004-July/msg00418.html

So you see he got two sets of numbers---number of reboots and amount
of corruption. For reiserfs and JFS he lost their equivalent of ``the
whole pool'', and for ext3 and XFS he got corruption but never lost
the pool. It's not clear to me the filesystems ever claimed to
prevent corruption in his test scenario (was he calling fsync() after
each log write? syslog does that sometimes, and if so, they do claim
it, but if he's just writing with some silly script they don't), but
definitely they do all claim you won't lose the whole pool in a power
outage, and only two out of four delivered on that. I base my choice
of Linux filesystem on this test, and wish I'd done such a test before
converting things to ZFS.
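
For anyone who wants to try the "heavy writes, pull the cord, repeat" test above, here is a minimal sketch of a workload that makes the damage countable afterwards. It is not taken from the linked Red Hat post. The record layout, the function names (`write_records`, `verify_records`) and the 4096-byte payload are all assumptions made up for illustration. The writer appends checksummed, numbered records and calls fsync(); after the reboot the verifier counts how many records survived intact.

```python
import hashlib
import os
import struct

RECORD_PAYLOAD = 4096  # bytes of payload per record (arbitrary choice)
HEADER = struct.Struct("<Q32s")  # sequence number + SHA-256 of the payload


def make_record(seq: int) -> bytes:
    """Build one self-describing record: header plus deterministic payload."""
    payload = hashlib.sha256(str(seq).encode()).digest() * (RECORD_PAYLOAD // 32)
    return HEADER.pack(seq, hashlib.sha256(payload).digest()) + payload


def write_records(path: str, count: int, sync_every: int = 1) -> int:
    """Append `count` records, calling fsync() every `sync_every` records.

    Returns the highest sequence number covered by an fsync(), i.e. the
    last record the storage stack has promised is durable (-1 if none).
    """
    last_synced = -1
    with open(path, "ab") as f:
        for seq in range(count):
            f.write(make_record(seq))
            if (seq + 1) % sync_every == 0:
                f.flush()
                os.fsync(f.fileno())
                last_synced = seq
                # In a real run, report last_synced to another machine here,
                # so the promise survives the power cut.
    return last_synced


def verify_records(path: str) -> tuple[int, int]:
    """Scan the file after the crash and return (intact, damaged) counts."""
    intact = damaged = 0
    size = HEADER.size + RECORD_PAYLOAD
    with open(path, "rb") as f:
        expected_seq = 0
        while chunk := f.read(size):
            ok = False
            if len(chunk) == size:
                seq, digest = HEADER.unpack_from(chunk)
                payload = chunk[HEADER.size:]
                ok = seq == expected_seq and hashlib.sha256(payload).digest() == digest
            intact += ok
            damaged += not ok
            expected_seq += 1
    return intact, damaged
```

The interesting comparison is between the last sequence number that was reported as synced before the cord was pulled and the `intact` count after reboot: if a synced record is missing or damaged, something in the stack did not honour the flush. Run it against a fresh file each time, since sequence numbers restart at zero.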


bfriesen

Posts: 874
From: US

Registered: 8/19/08
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted: Oct 9, 2008 12:06 PM   in response to: Miles Nordin


On Thu, 9 Oct 2008, Miles Nordin wrote:
>
> catastrophically. If this is really the situation, then ZFS needs to
> give the sysadmin a way to isolate and fix the problems
> deterministically before filling the pool with data, not just blame
> the sysadmin based on nebulous speculatory hindsight gremlins.
>
> And if it's NOT the case, the ZFS problems need to be acknowledged and
> fixed.

Can you provide any supportive evidence that ZFS is as fragile as you
describe?

From recent opinions expressed here, properly-designed ZFS pools must
be inexplicably permanently cratering each and every day.

Bob
======================================
Bob Friesenhahn
bfriesen at simple dot dallas dot tx dot us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/



Timh Bergström
timh.bergstrom@diino...
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted: Oct 10, 2008 12:38 AM   in response to: bfriesen


2008/10/9 Bob Friesenhahn <bfriesen at simple dot dallas dot tx dot us>:
> On Thu, 9 Oct 2008, Miles Nordin wrote:
>>
>> catastrophically. If this is really the situation, then ZFS needs to
>> give the sysadmin a way to isolate and fix the problems
>> deterministically before filling the pool with data, not just blame
>> the sysadmin based on nebulous speculatory hindsight gremlins.
>>
>> And if it's NOT the case, the ZFS problems need to be acknowledged and
>> fixed.
>
> Can you provide any supportive evidence that ZFS is as fragile as you
> describe?

The hundreds of sysadmins seeing their pools go byebye after normal
operations in a production environment is evidence enough. And the
number of times people like Victor have saved our asses.

>
> From recent opinions expressed here, properly-designed ZFS pools must
> be inexplicably permanently cratering each and every day.



--
Timh Bergström
System Administrator
Diino AB - www.diino.com
:wq


jnickels

Posts: 22
From:

Registered: 2/1/08
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun /
Posted: Oct 11, 2008 11:06 AM   in response to: Timh Bergström


"Timh Bergström" <timh dot bergstrom at diino dot net> writes:

> Unfortunately I can only agree to the doubts about running ZFS in
> production environments, I've lost ditto-blocks, I've gotten
> corrupted pools and a bunch of other failures even in
> mirror/raidz/raidz2 setups with or without hardware mirrors/raid5/6.
> Plus the insecurity of a sudden crash/reboot will corrupt or even
> destroy the pools with "restore from backup" as the only advice. I've
> been lucky so far about getting my pools back thanks to people like
> Victor.

With which release was that? Solaris 10 or OpenSolaris?

Regards, Juergen.


bonwick

Posts: 129
From: US

Registered: 3/9/05
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted: Oct 10, 2008 1:26 AM   in response to: mgerdts


> The circumstances where I have lost data have been when ZFS has not
> handled a layer of redundancy. However, I am not terribly optimistic
> of the prospects of ZFS on any device that hasn't committed writes
> that ZFS thinks are committed.

FYI, I'm working on a workaround for broken devices. As you note,
some disks flat-out lie: you issue the synchronize-cache command,
they say "got it, boss", yet the data is still not on stable storage.
Why do they do this? Because "it performs better". Well, duh --
you can make stuff *really* fast if it doesn't have to be correct.

Before I explain how ZFS can fix this, I need to get something off my
chest: people who knowingly make such disks should be in federal prison.
It is *fraud* to win benchmarks this way. Doing so causes real harm
to real people. Same goes for NFS implementations that ignore sync.
We have specifications for a reason. People assume that you honor them,
and build higher-level systems on top of them. Change the mass of
the proton by a few percent, and the stars explode. It is impossible
to build a functioning civil society in a culture that tolerates lies.
We need a little more Code of Hammurabi in the storage industry.

Now:

The uberblock ring buffer in ZFS gives us a way to cope with this,
as long as we don't reuse freed blocks for a few transaction groups.
The basic idea: if we can't read the pool starting from the most
recent uberblock, then we should be able to use the one before it,
or the one before that, etc, as long as we haven't yet reused any
blocks that were freed in those earlier txgs. This allows us to
use the normal load on the pool, plus the passage of time, as a
displacement flush for disk caches that ignore the sync command.

If we go back far enough in (txg) time, we will eventually find an
uberblock all of whose dependent data blocks have made it to disk.
I'll run tests with known-broken disks to determine how far back we
need to go in practice -- I'll bet one txg is almost always enough.
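
To make the rollback idea concrete, here is a toy model. It is only a sketch and not the ZFS implementation: the `Disk` dictionary, `BlockPtr`, `write_block` and `newest_usable_uberblock` are hypothetical names invented for illustration. It assumes a copy-on-write tree in which each parent records its children's checksums, and that freed addresses are never reused. The `persisted=False` flag stands in for a disk that acknowledged a write it never made durable.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BlockPtr:
    addr: int        # where the block was (supposedly) written
    checksum: str    # checksum the parent recorded for that block


@dataclass(frozen=True)
class Uberblock:
    txg: int
    root: BlockPtr


# Toy disk: addr -> (data, child pointers). Hypothetical layout, not ZFS's.
Disk = dict


def block_checksum(data: bytes, children: tuple) -> str:
    h = hashlib.sha256(data)
    for child in children:
        h.update(child.checksum.encode())
    return h.hexdigest()


def write_block(disk: Disk, addr: int, data: bytes,
                children: tuple = (), persisted: bool = True) -> BlockPtr:
    """Copy-on-write style write to a fresh address.

    persisted=False models a device that acknowledged the write (and the
    later cache flush) without putting the data on stable storage.
    """
    if persisted:
        disk[addr] = (data, children)
    return BlockPtr(addr, block_checksum(data, children))


def tree_is_readable(disk: Disk, ptr: BlockPtr) -> bool:
    """True if the block and everything below it is present with matching checksums."""
    block = disk.get(ptr.addr)
    if block is None:
        return False
    data, children = block
    if block_checksum(data, children) != ptr.checksum:
        return False
    return all(tree_is_readable(disk, child) for child in children)


def newest_usable_uberblock(ring: list, disk: Disk) -> Optional[Uberblock]:
    """Walk the ring from the highest txg down; return the first intact one."""
    for ub in sorted(ring, key=lambda u: u.txg, reverse=True):
        if tree_is_readable(disk, ub.root):
            return ub
    return None


disk: Disk = {}
leaf10 = write_block(disk, 1, b"file contents v1")
ring = [Uberblock(10, write_block(disk, 2, b"root", (leaf10,)))]

# txg 11: the new leaf never reaches stable storage, but its parent does.
leaf11 = write_block(disk, 3, b"file contents v2", persisted=False)
ring.append(Uberblock(11, write_block(disk, 4, b"root", (leaf11,))))

chosen = newest_usable_uberblock(ring, disk)
print(chosen.txg)  # 10
```

In this model the txg 11 uberblock points at a leaf that is missing, so the search falls back one txg and finds an intact tree. That only works because address 1 was not reused in the meantime, which is the "don't reuse freed blocks for a few transaction groups" condition.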

Jeff


myxiplx

Posts: 883
From: GB

Registered: 10/24/07
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted: Oct 10, 2008 2:29 AM   in response to: bonwick

That sounds like a great idea for a tool, Jeff. Would it be possible to build that in as a "zpool recover" command?

Being able to run a tool like that and see just how bad the corruption is, but know it's possible to recover an older version would be great. Is there any chance of outputting details so the sysadmin can know roughly how much was lost?

My thoughts are going to be very rough (I don't know much about zfs internals), but I'm wondering if something like this would work, where all bad blocks are reported, along with the latest 3 good ones:

*************************************8
# zpool recover <pool>
......... pool details ...........

Finding and testing uberblocks...
1. block a date/time: xxxxx/xxxx
CORRUPTED
2. block b date/time: yyyyy/yyyy
CORRUPTED
3. block c date/time: zzzzz/zzzz
Appears OK
4. block d date/time: zzzzz/zzzz
Appears OK
5. block e date/time: zzzzz/zzzz
Appears OK

>
*************************************8

Victor was talking in another thread about using zdb to check the pool before doing an import of a damaged pool. Might it be possible for the next stage of the recovery process to give the user an option of testing or importing the pool for any particular uberblock?

It does sound like testing can take a long time, so this would need to be something that can be cancelled, and you would also need a way to mark uberblocks as bad should problems be found with either the test or import.

This would be a great addition to ZFS though, and would hopefully save Victor a bit of time ;-)

Ross

Ricardo M. Corr...
Ricardo.M.Correia@Su...
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted: Oct 10, 2008 2:48 AM   in response to: bonwick


Hi Jeff,

On Fri, 2008-10-10 at 01:26 -0700, Jeff Bonwick wrote:
> > The circumstances where I have lost data have been when ZFS has not
> > handled a layer of redundancy. However, I am not terribly optimistic
> > of the prospects of ZFS on any device that hasn't committed writes
> > that ZFS thinks are committed.
>
> FYI, I'm working on a workaround for broken devices. As you note,
> some disks flat-out lie: you issue the synchronize-cache command,
> they say "got it, boss", yet the data is still not on stable storage.

It's not just about ignoring the synchronize-cache command, there's also
another weak spot.

ZFS is quite resilient against so-called phantom writes, provided that
they occur sporadically - let's say, if the disk decides to _randomly_
ignore writes 10% of the time, ZFS could probably survive that pretty
well even on single-vdev pools, due to ditto blocks.

However, it is not so resilient when the storage system suffers hiccups
which cause phantom writes to occur continuously, even if for a small
period of time (say less than 10 seconds), and then return to normal.
This could happen for several reasons, including network problems, bugs
in software or even firmware, etc.

I think in this case, going back to a previous uberblock could also be
enough to recover from such a scenario most of the time, unless perhaps
the error occurred too long ago, and the unwritten metadata got flushed
out of the ARC and didn't have a chance to get rewritten.
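
A back-of-the-envelope way to see the difference between the two cases, as a sketch under simplifying assumptions that are mine and not measured ZFS behaviour: drops are independent in the first function, and in the second the copies of a block are issued back to back, so that a short burst can swallow all of them. The function names are made up.

```python
def all_copies_lost_probability(drop_rate: float, copies: int) -> float:
    """Chance that every copy of one block is dropped, if drops are independent."""
    return drop_rate ** copies


def blocks_lost_in_burst(n_blocks: int, copies: int,
                         burst_start: int, burst_len: int) -> int:
    """Count blocks that lose *all* copies during one contiguous burst.

    Writes are numbered in issue order; copy c of block b is write number
    b * copies + c, so the copies of a block are adjacent in time.
    """
    dropped = range(burst_start, burst_start + burst_len)
    lost = 0
    for b in range(n_blocks):
        writes = range(b * copies, (b + 1) * copies)
        if all(w in dropped for w in writes):
            lost += 1
    return lost


# Independent 10% drops: two copies are both lost about 1% of the time,
# three copies about 0.1% of the time.
print(all_copies_lost_probability(0.1, 2))
print(all_copies_lost_probability(0.1, 3))

# One burst covering writes 20..39 out of 200 (also 10% of all writes):
print(blocks_lost_in_burst(100, 2, 20, 20))  # 10
```

With the same overall fraction of dropped writes, the burst wipes out every copy of ten blocks outright, while under independent drops a given block loses both copies only about one time in a hundred and can otherwise be read from a surviving copy.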

In any case, a more generic solution to repair all kinds of metadata
corruption, such as (e.g.) space map corruption, would be very
desirable, as I think everyone can agree.

Best regards,
Ricardo




byleal

Posts: 417
From: BR

Registered: 7/18/06
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted: Oct 10, 2008 6:15 AM   in response to: Ricardo M. Corr...

Hello all,
I think the problem here is ZFS's capacity to recover from a failure. Forgive me, but in trying to write code "without failures", maybe the hackers forgot that other people can make mistakes (even if they can't).
- "ZFS does not need fsck".
OK, that's a great statement, but I think ZFS needs one. It really does. And in my opinion an enhanced zdb would be the solution. Flexibility. Options.
- "I have 90% of something I think is your filesystem, do you want it?"
I think software is only as good as its ability to recover from failures. And I don't want to know who failed; I'm not going to send anyone to jail, I'm not a lawyer. I agree with Jeff, I really do, but that is "another" problem...
The solution Jeff is working on is, I think, really great, since it is NOT "all or nothing" again... I don't know about you, but A LOT of times I was saved by the "Lost and Found" directory! All the beauty of a UNIX system is doing "rm /etc/passwd" after having edited it, and getting the whole file back with "cat /dev/mem". ;-)
I think a lot of parts of the ZFS design remind me of when you see something left on the floor at home, so you ask your son why he did not pick it up, and he says "it was not me".
peace.

Leal.

eschrock

Posts: 804
From: Menlo Park, CA

Registered: 3/9/05
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted: Oct 10, 2008 11:23 AM   in response to: byleal


On Fri, Oct 10, 2008 at 06:15:16AM -0700, Marcelo Leal wrote:
> - "ZFS does not need fsck".
> Ok, that's a great statement, but i think ZFS needs one. Really does.
> And in my opinion an enhanced zdb would be the solution. Flexibility.
> Options.

About 99% of the problems reported as "I need ZFS fsck" can be summed up
by two ZFS bugs:

1. If a toplevel vdev fails to open, we should be able to pull
information from necessary ditto blocks to open the pool and make
what progress we can. Right now, the root vdev code assumes "can't
open = faulted pool," which results in failure scenarios that are
perfectly recoverable most of the time. This needs to be fixed
so that pool failure is only determined by the ability to read
critical metadata (such as the root of the DSL).

2. If an uberblock ends up with an inconsistent view of the world (due
to failure of DKIOCFLUSHWRITECACHE, for example), we should be able
to go back to previous uberblocks to find a good view of our pool.
This is the failure mode described by Jeff.
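
As a toy illustration of item 1, the difference between "can't open = faulted pool" and "faulted only if critical metadata is unreadable" can be sketched like this. The vdev names, the metadata names and both functions are invented for the example; this is not the root vdev code, and it assumes copies of critical metadata are spread across top-level vdevs.

```python
def naive_can_open(all_vdevs: set, open_vdevs: set) -> bool:
    """The behaviour described as a bug: any unopenable top-level vdev faults the pool."""
    return all_vdevs <= open_vdevs


def metadata_can_open(open_vdevs: set, critical: dict) -> bool:
    """Open the pool if every critical metadata block has a copy on an open vdev.

    `critical` maps a (made-up) metadata name to the vdevs holding its copies.
    """
    return all(
        any(vdev in open_vdevs for vdev in holders)
        for holders in critical.values()
    )


all_vdevs = {"vdev0", "vdev1", "vdev2"}
open_vdevs = {"vdev0", "vdev2"}  # vdev1 failed to open
critical = {
    "mos_root": ["vdev0", "vdev1", "vdev2"],
    "dsl_root": ["vdev1", "vdev2"],
}

print(naive_can_open(all_vdevs, open_vdevs))     # False
print(metadata_can_open(open_vdevs, critical))   # True
```

Data that lived only on the failed vdev is still gone in this model, but the pool opens and whatever is reachable can be read.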

These are both bugs in ZFS and will be fixed. The other 1% of the
complaints are usually of the form "I created my pool on top of my old
one" or "I imported a LUN on two different systems at the same time".
It's unclear what a 'fsck' tool could do in this scenario, if anything.
Due to a variety of reasons (hierarchical nature of ZFS, variable block
sizes, RAID-Z, compression, etc.), it's difficult to even *identify* a
ZFS block, let alone determine its validity and associate it in some
larger construct.

There are some interesting possibilities for limited forensic tools - in
particular, I like the idea of a mdb backend for reading and writing ZFS
pools[1]. But I haven't actually heard a reasonable proposal for what a
fsck-like tool (i.e. one that could "repair" things automatically) would
actually *do*, let alone how it would work in the variety of situations
it needs to (compressed RAID-Z?) where the standard ZFS infrastructure
fails.

- Eric

[1] http://mbruning.blogspot.com/2008/08/recovering-removed-file-on-zfs-disk.html

--
Eric Schrock, Fishworks http://blogs.sun.com/eschrock


iktorn

Posts: 221
From: RU

Registered: 3/27/06
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted: Oct 10, 2008 12:48 PM   in response to: eschrock


Eric Schrock wrote:
> On Fri, Oct 10, 2008 at 06:15:16AM -0700, Marcelo Leal wrote:
>> - "ZFS does not need fsck".
>> Ok, that's a great statement, but i think ZFS needs one. Really does.
>> And in my opinion an enhanced zdb would be the solution. Flexibility.
>> Options.
>
> About 99% of the problems reported as "I need ZFS fsck" can be summed up
> by two ZFS bugs:
>
> 1. If a toplevel vdev fails to open, we should be able to pull
> information from necessary ditto blocks to open the pool and make
> what progress we can. Right now, the root vdev code assumes "can't
> open = faulted pool," which results in failure scenarios that are
> perfectly recoverable most of the time. This needs to be fixed
> so that pool failure is only determined by the ability to read
> critical metadata (such as the root of the DSL).
>
> 2. If an uberblock ends up with an inconsistent view of the world (due
> to failure of DKIOCFLUSHWRITECACHE, for example), we should be able
> to go back to previous uberblocks to find a good view of our pool.
> This is the failure mode described by Jeff.

I've mostly seen (2), because despite all the best practices out there,
single vdev pools are quite common. In all such cases that I had my
hands on it was possible to recover the pool by going back one or two txgs.

> These are both bugs in ZFS and will be fixed. The other 1% of the
> complaints are usually of the form "I created my pool on top of my old
> one" or "I imported a LUN on two different systems at the same time".

Of these two, the former is not easy because it requires searching through
the entire disk space for root block candidates and trying each of them.
The latter is not catastrophic if there was little to no activity
from one of the systems. In this case one of the first things to suffer is
the pool config object, and corruption of it prevents the pool from opening.

Fortunately, after the putback of

6733970 assertion failure in dbuf_dirty() via spa_sync_nvlist()

in build 99, a corrupted pool config object is rewritten during open in such
a way that the old corrupted copy is not read in, and in most cases this
allows the pool to be imported and most of the data saved. zdb is useful for
understanding how much is corrupted and how much is recovered. If nothing
else is corrupted, the pool may be available for further use without
recreation. Again, in every case I had my hands on, it was possible to
either recover the pool completely or at least save most of the data.

> It's unclear what a 'fsck' tool could do in this scenario, if anything.
> Due to a variety of reasons (hierarchical nature of ZFS, variable block
> sizes, RAID-Z, compression, etc.), it's difficult to even *identify* a
> ZFS block, let alone determine its validity and associate it in some
> larger construct.

Indeed. In the "more ZFS recovery" case, involving a 42TB pool with about 8TB
used, zdb -bv alone took several hours to walk the block tree and verify
the consistency of block pointers, and zdb -bcv took a couple of days to
verify all user data blocks as well. Different checksums and gang
blocks, in addition to all the other dynamic features mentioned, complicate
the task of identifying ZFS blocks and linking those blocks into a tree,
and make it really time (and space) consuming.

> There are some interesting possibilities for limited forensic tools - in
> particular, I like the idea of a mdb backend for reading and writing ZFS
> pools[1]. But I haven't actually heard a reasonable proposal for what a
> fsck-like tool (i.e. one that could "repair" things automatically) would
> actually *do*, let alone how it would work in the variety of situations
> it needs to (compressed RAID-Z?) where the standard ZFS infrastructure
> fails.

There are a number of bugs and RFEs to improve the usefulness of zdb for
field use, e.g.

6720637 want zdb -l option to dump uberblock arrays as well
6709782 issues running zdb with -p and -e options
6736356 zdb -R needs to work with exported pools
6720907 zdb should handle errors while dumping datasets and objects
6746101 zdb command to search for ZFS labels in a device
6757444 want zdb -R to support decompression, checksumming and raid-z
6757430 want an option for zdb to disable space map loading and leak tracking

Hth,
Victor

> - Eric
>
> [1] http://mbruning.blogspot.com/2008/08/recovering-removed-file-on-zfs-disk.html
>


David Magda
dmagda@ee.ryerson.ca
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted: Oct 10, 2008 6:55 PM   in response to: iktorn

On Oct 10, 2008, at 15:48, Victor Latushkin wrote:

> I've mostly seen (2), because despite all the best practices out
> there,
> single vdev pools are quite common. In all such cases that I had my
> hands on it was possible to recover pool by going back by one or two
> txgs.

For better or worse this is the case where I work.

Most of our storage is on SANs (EMC and NetApp), and so if we need
more space we ask for it and we get a giant LUN given to us (usually
multi-pathed). We also have a lot of Veritas VxVM and VxFS for Oracle,
and so even if we're running Solaris 10, we're not using ZFS in that
case.

SAN space is also allocated to Windows and VMware ESX machines, so
it's not like we can ask for the disks in the SAN to be exported raw,
as that would complicate management for the other OSes. (We have a
very small global storage / backup team, and I really don't want to
add more to their workload.)

If someone finds themselves in this position, what advice can be
followed to minimize risks?

For example, is having checksums enabled a good idea? If you have no
redundancy and an error occurs, the system will panic by default
(configurable in newer builds of OpenSolaris, but not in Solaris
'proper' yet). But if the system is ignoring checksums, you're no
worse off than most other file systems (but still get all the other
features of ZFS).

Or is there a way to mitigate a checksum error on non-redundant zpool?



bonwick

Posts: 129
From: US

Registered: 3/9/05
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted: Oct 10, 2008 7:14 PM   in response to: David Magda

> Or is there a way to mitigate a checksum error on non-redundant zpool?

It's just like the difference between non-parity, parity, and ECC memory.
Most filesystems don't have checksums (non-parity), so they don't even
know when they're returning corrupt data. ZFS without any replication
can detect errors, but can't fix them (like parity memory). ZFS with
mirroring or RAID-Z can both detect and correct (like ECC memory).

Note: even in a single-device pool, ZFS metadata is replicated via
ditto blocks at two or three different places on the device, so that
a localized media failure can be both detected and corrected.
If you have two or more devices, even without any mirroring
or RAID-Z, ZFS metadata is mirrored (again via ditto blocks)
across those devices.
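
To make the memory analogy concrete, here is a minimal Python sketch. It is
not ZFS code: it assumes a block is stored as one or more copies plus a
checksum kept elsewhere, and SHA-256, checksum() and read_block() are
stand-ins chosen only for illustration.

```python
import hashlib


def checksum(data: bytes) -> bytes:
    # SHA-256 is a stand-in for whatever checksum the filesystem uses.
    return hashlib.sha256(data).digest()


def read_block(copies, expected):
    """Return (data, status) for a block stored as one or more copies.

    status is "ok" if the first copy verified, "corrected" if an earlier
    copy failed but a later one verified, and "detected-unrecoverable"
    if no copy matches the expected checksum.
    """
    bad = 0
    for data in copies:
        if checksum(data) == expected:
            return data, ("ok" if bad == 0 else "corrected")
        bad += 1
    return None, "detected-unrecoverable"


good = b"file contents"
rotten = b"file c0ntents"  # one corrupted byte
want = checksum(good)

# No checksum at all (non-parity): the caller would just get `rotten` back.
# Checksum, single copy (parity): the error is detected but cannot be fixed.
single = read_block([rotten], want)
# Checksum plus a second copy (ECC): the error is detected and corrected.
mirrored = read_block([rotten, good], want)
```

The second copy can come from a mirror, RAID-Z reconstruction, or a ditto
block; the sketch treats them all as "another place to read from".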

Jeff


mgerdts

Posts: 1,361
From: US

Registered: 8/5/05
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted: Oct 10, 2008 8:59 PM   in response to: bonwick

On Fri, Oct 10, 2008 at 9:14 PM, Jeff Bonwick <Jeff dot Bonwick at sun dot com> wrote:
> Note: even in a single-device pool, ZFS metadata is replicated via
> ditto blocks at two or three different places on the device, so that
> a localized media failure can be both detected and corrected.
> If you have two or more devices, even without any mirroring
> or RAID-Z, ZFS metadata is mirrored (again via ditto blocks)
> across those devices.

And in the event that you have a pool that is mostly not very
important but some of it is important, you can have data mirrored on a
per dataset level via copies=n.

If we can avoid losing an entire pool by rolling back a txg or two,
the biggest source of data loss and frustration is taken care of.
Ditto blocks for metadata should take care of most other cases that
would result in widespread loss. Normal bit rot that causes you to
lose blocks here and there is somewhat likely to take out a small
minority of files and spit warnings along the way. If there are some
files that are more important to you than others (e.g. losing files in
rpool/home may have more impact than rpool/ROOT) copies=2 can
help there.

And for those places where losing a txg or two is a mortal sin, don't
use flaky hardware and allow zfs to handle a layer of redundancy.

This gets me thinking that it may be worthwhile to have a small (<100
MB x 2) rescue boot environment with copies=2 (as well as rpool/b4 GB) boot environment from booting.

--
Mike Gerdts
http://mgerdts.blogspot.com/


Miles Nordin
carton@Ivy.NET
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted: Oct 13, 2008 10:50 AM   in response to: bonwick

>>>>> "dm" == David Magda <dmagda at ee dot ryerson dot ca> writes:
>>>>> "jb" == Jeff Bonwick <Jeff dot Bonwick at sun dot com> writes:
>>>>> "mg" == Mike Gerdts <mgerdts at gmail dot com> writes:

dm> If you have no redundancy and an error occurs, the system will
dm> panic by default (configurable in newer builds of OpenSolaris,
dm> but not in Solaris 'proper' yet). But if the system is
dm> ignoring checksums, you're no worse off than most other file
dm> systems

It's not safe to assume the checksum errors are silent corruption.
Most or all of the checksum errors I've seen on my system come from
ZFS failing to fully resilver a temporarily-broken mirror.

It's not safe to assume failmode=<!panic> will stop your box from
freezing. Problems with one zpool can cause problems with other
unaffected pools. Problems at the storage driver level can cause one
bad disk to freeze other good disks. Problems with the user interface
generally make it impossible to offline a known-bad device because the
user interface is frozen, or you get some catchall error like ``no
valid replicas'' because who-knows-what, or ``I/O error'' because the
user interface can't mark the failed drive as offline in the copy of
the label stored on the failed drive---if metastat behaved that way?!

I've also had problems with iscsiadm and format pausing for minutes
because a discovery-address is not responding, which could turn into
hours if I had a hundred iSCSI targets---if I could just edit a damned
text file like on a real Unix, I wouldn't have to put up with these
needlessly-complex state machines and multiplicative timeouts. NFS
can freeze entirely if any exported filesystem has problems.

Yes, some of the panics reported may come from failmode, but if you
look through bugs.opensolaris.org and the list you'll see many
different kinds of assertion-failure panics that aren't controlled by
the failmode knob, usually panic-on-import or freeze-on-import, but
sometimes other kinds.

To my view, the good news for ZFS is that most other things suck
almost as much, so there is only a little catching-up to do before
it's competitive. OTOH it looks like an unworkable disaster
w.r.t. the promised future environment where pools have hundreds of
disks, always some of them failing. The exception handling is a mess,
the timers are attached to accidental hodge-podge ``layered'' state
machines for which no one will accept ultimate responsibility, and the
locking of various user interfaces and subsystems is coarse because
it's built either for correctness/simplicity/deadlines, or for a
mistaken, outdated goal: high-performance,
assuming-a-fully-working-system, otherwise-fix-your-hardware.

jb> ditto blocks
mg> copies=n.

neither of which applies to the situations Victor helped recover from.
It's possible ditto blocks are quietly helping people, but I've not
read on the list of one scenario where something bad happened and the
resolution was ``you should have used copies=n''.

The OP is asking about best practices that mitigate known problems,
not a repeat of the standard list of bullet point features and their
hypothetical virtues.

mg> And for those places where losing a txg or two is a mortal
mg> sin, don't use flaky hardware and allow zfs to handle a layer
mg> of redundancy.

It is a mortal sin for a filesystem in all places. It's just much
less bad than losing the entire pool. To be a safe backing-store for
databases or email, ZFS needs to have implementable best-practices
that stop this from happening, not just recover from it. Whatever
recovery there is, certainly should not be silent and maybe should not
be automatic.


khb

Posts: 121
From:

Registered: 4/27/05
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted: Oct 11, 2008 7:36 PM   in response to: David Magda


On Oct 10, 2008, at 7:55 PM, David Magda wrote:

> If someone finds themselves in this position, what advice can be
> followed to minimize risks?

Can you ask for two LUNs on different physical SAN devices and have
an expectation of getting it?

--
Keith H. Bierman khbkhb at gmail dot com | AIM kbiermank
5430 Nassau Circle East |
Cherry Hills Village, CO 80113 | 303-997-2749
<speaking for myself*> Copyright 2008






wstuart

Posts: 125
From: MPLS

Registered: 1/5/07
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted: Oct 13, 2008 8:46 AM   in response to: khb



zfs-discuss-bounces at opensolaris dot org wrote on 10/11/2008 09:36:02 PM:

> On Oct 10, 2008, at 7:55 PM, David Magda wrote:
>
> > If someone finds themselves in this position, what advice can be
> > followed to minimize risks?
>
> Can you ask for two LUNs on different physical SAN devices and have
> an expectation of getting it?

Better yet also ask for multiple paths over different SAN infrastructure to
each. Then again, I would hope you don't need to ask your SAN folks for
that?

-Wade



byleal

Posts: 417
From: BR

Registered: 7/18/06
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted: Oct 10, 2008 1:29 PM   in response to: eschrock
To: Communities » zfs » discuss

> On Fri, Oct 10, 2008 at 06:15:16AM -0700, Marcelo Leal wrote:
> > - "ZFS does not need fsck".
> > Ok, that's a great statement, but i think ZFS needs one. Really does.
> > And in my opinion a enhanced zdb would be the solution. Flexibility.
> > Options.
>
> About 99% of the problems reported as "I need ZFS fsck" can be summed up
> by two ZFS bugs:
>
> 1. If a toplevel vdev fails to open, we should be able to pull
> information from necessary ditto blocks to open the pool and make
> what progress we can. Right now, the root vdev code assumes "can't
> open = faulted pool," which results in failure scenarios that are
> perfectly recoverable most of the time. This needs to be fixed
> so that pool failure is only determined by the ability to read
> critical metadata (such as the root of the DSL).
>
> 2. If an uberblock ends up with an inconsistent view of the world (due
> to failure of DKIOCFLUSHWRITECACHE, for example), we should be able
> to go back to previous uberblocks to find a good view of our pool.
> This is the failure mode described by Jeff.
>
> These are both bugs in ZFS and will be fixed.

That's it! It's 100% for me! ;-)
One is the "all-or-nothing" problem, and the other is about guilt... ;-))

> There are some interesting possibilities for limited forensic tools - in
> particular, I like the idea of a mdb backend for reading and writing ZFS
> pools[1].

In my opinion it would be great to have the whole functionality in zdb. It's simple, and the concepts are clear in the tool. mdb is a debugger, and it needs concepts that I think are different from those of a tool for reading/fixing filesystems. Just an opinion... which does not mean we cannot have both. Like I said, flexibility, options... ;-)

> But I haven't actually heard a reasonable proposal for what a
> fsck-like tool

I think we must NOT get stuck on the word "fsck"; I used it just as an example (Lost and Found), and I think other users used it just as an example too. The important thing is the two points you have described very *well*.

> (i.e. one that could "repair" things automatically) would
> actually *do*, let alone how it would work in the variety of situations
> it needs to (compressed RAID-Z?) where the standard ZFS infrastructure
> fails.
>
> - Eric
>
> [1] http://mbruning.blogspot.com/2008/08/recovering-removed-file-on-zfs-disk.html
>
> --
> Eric Schrock, Fishworks http://blogs.sun.com/eschrock

Many thanks for your answer!
Leal.

Ricardo M. Corr...
Ricardo.M.Correia@Su...
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted: Oct 10, 2008 1:42 PM   in response to: eschrock

On Fri, 2008-10-10 at 11:23 -0700, Eric Schrock wrote:
> But I haven't actually heard a reasonable proposal for what a
> fsck-like tool (i.e. one that could "repair" things automatically) would
> actually *do*, let alone how it would work in the variety of situations
> it needs to (compressed RAID-Z?) where the standard ZFS infrastructure
> fails.

I'd say an fsck-like tool for ZFS should not worry much about compression,
checksums, RAID-Z and whatnot. In essence, it would try to do what an
fsck tool does for a typical filesystem, and so would be mostly
oblivious to the layout or encoding of the blocks, perhaps treating
blocks with failed checksums as blocks full of zeros.

Here's how it could work (of course, this is all easier said than done):

1) Open all the devices specified by the user. Optionally, take just a
pool name/guid and scan for the right devices in /dev/[r]dsk.

2) Verify if the pool configuration read from the devices is sane -- if
not, try to generate a consistent configuration. Some elements of the
pool configuration, such as the correct pool version, could be checked
in later steps, depending on features that were found.

3) Starting from the last uberblock, fully traverse a few levels down
the tree. If less than 100% of the blocks could be read without errors,
do the same for previous uberblocks and offer the user the choice of
which uberblock to use, or if running non-interactively, choose the one
with the best success rate.

4) Traverse the list/tree of filesystems, snapshots and clones. Make
sure that they are well-connected. For each filesystem, try to replay
the ZILs, clean them out.

5) Now fully traverse the pool. Compute the space maps and FS space
usage on-the-go, as blocks are read.

6) For each metadata block read, check whether the fields are sane, fix
them/zero them out if they're not. Basically we're assuming here that we
may have corrupted metadata with correct checksums.

If some metadata block can not be read due to a failed checksum, assume
the block is full of zeros, and fix it.

By the way, this includes every field of every kind of metadata block,
including ZAPs, ACLs, FID maps, znode fields, everything.

For fields that reference other objects, make sure that the object they
reference is of the correct type and that the object itself is correct.

For objects that are missing, create empty ones if necessary.

7) Check that every object is referenced somewhere and link unreferenced
objects to /lost+found/object-type/, or similar.

8) Probably do other things that I'm forgetting.

9) In the end, check if the space maps are consistent with the ones
computed, write correct ones if not. Check that space
usage/reservations/quotas are correct.
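
The non-interactive choice in step 3 could be sketched roughly as follows.
This is a Python illustration, not real zdb or libzpool code: Candidate,
blocks_tried and blocks_ok are hypothetical names, and the counts are
assumed to come from a partial traversal of the tree under each uberblock.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    txg: int           # transaction group of this uberblock
    blocks_tried: int  # blocks visited in the first few tree levels
    blocks_ok: int     # blocks that were read without errors

    @property
    def success_rate(self) -> float:
        return self.blocks_ok / self.blocks_tried if self.blocks_tried else 0.0


def choose_uberblock(candidates):
    """Pick the newest fully readable uberblock, else the best success rate.

    Ties on success rate are broken in favour of the newer txg, so that as
    little recent data as possible is discarded. Returns None if there are
    no candidates.
    """
    ordered = sorted(candidates, key=lambda c: c.txg, reverse=True)
    for c in ordered:
        if c.blocks_tried and c.blocks_ok == c.blocks_tried:
            return c
    return max(ordered, key=lambda c: (c.success_rate, c.txg), default=None)


candidates = [
    Candidate(txg=102, blocks_tried=40, blocks_ok=31),
    Candidate(txg=101, blocks_tried=40, blocks_ok=40),
    Candidate(txg=100, blocks_tried=40, blocks_ok=40),
]
best = choose_uberblock(candidates)
```

An interactive tool would print the same per-uberblock numbers and let the
user decide instead of calling choose_uberblock().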

Essentially, the goal is that at the end of this process, the pool
should contain consistent information, should have as much data as could
be recovered and should never cause any further errors in ZFS due to
invalid metadata/fields; either when importing it, reading from it or
writing/modifying it (except that it would still return EIO errors when
trying to read corrupted file data blocks, of course).

Now, a problem with fsck-like tools, and perhaps especially with ZFS, is
that some of these steps may either require lots of memory or multiple
filesystem/pool traversals.

I'd say having such a tool, even if it required additional temporary
storage for operation (hopefully not a very large fraction of the pool
size), would be *very* useful and would clear up any worries that people
currently have.

Kind regards,
Ricardo



gino

Posts: 168
From:

Registered: 7/20/06
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted: Nov 29, 2008 3:49 AM   in response to: eschrock
To: Communities » zfs » discuss

> About 99% of the problems reported as "I need ZFS fsck" can be summed up
> by two ZFS bugs:
>
> 1. If a toplevel vdev fails to open, we should be able to pull
> information from necessary ditto blocks to open the pool and make
> what progress we can. Right now, the root vdev code assumes "can't
> open = faulted pool," which results in failure scenarios that are
> perfectly recoverable most of the time. This needs to be fixed
> so that pool failure is only determined by the ability to read
> critical metadata (such as the root of the DSL).
>
> 2. If an uberblock ends up with an inconsistent view of the world (due
> to failure of DKIOCFLUSHWRITECACHE, for example), we should be able
> to go back to previous uberblocks to find a good view of our pool.
> This is the failure mode described by Jeff.
>
> *These are both bugs in ZFS and will be fixed.*

I totally agree that these cover most of the corruptions we have had in the past.
Any news about those bugs in recent Nevada releases?

Can anyone provide us with a detailed procedure to "go back to previous uberblocks to find a good view of our pool" as described by Jeff?

Thanks
gino

Miles Nordin
carton@Ivy.NET
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted: Oct 10, 2008 10:58 AM   in response to: Ricardo M. Corr...

>>>>> "jb" == Jeff Bonwick <Jeff dot Bonwick at sun dot com> writes:
>>>>> "rmc" == Ricardo M Correia <Ricardo dot M dot Correia at Sun dot COM> writes:

jb> We need a little more Code of Hammurabi in the storage
jb> industry.

It seems like most of the work people have to do now is cleaning up
after the sloppiness of others. At least it takes the longest.

You could always mention which disks you found ignoring the
command---wouldn't that help the overall problem? I understand
there's a pervasive ``i don' wan' any trouble, mistah'' attitude, but
I don't understand where it comes from.

http://www.ferris.edu/news/jimcrow/tom/

jb> displacement flush for disk caches that ignore the sync
jb> command.

Sounds like a good idea but:

(1) won't this break the NFS guarantees you were just saying should
never be broken?

I get it, someone else is breaking a standard so how can ZFS be
expected to yadda yadda yadda. But I fear it will just push
``blame the sysadmin'' one step further out. ex., Q. ``with ZFS
all my NFS clients become unstable after the server reboots,'' or
``I'm getting silent corruption with NFS''. A. ``your drives
might have gremlins in them, no way to know,'' and ``well what do
you expect without a single integrity domain and TCP's weak
checksums. / no i'm using a crossover cable, and FCS is not
weak. / ZFS managing a layer of redundancy it is probably your
RAM or corruption on the uh, between the Ethernet MAC chip and
the PCI slot''

(1a) I'm concerned about how it'll be reported when it happens.

(a) if it's not reported at all, then ZFS is hiding the fact
that fsync() is not working. Also, other journaling
filesystems sometimes report when they find
``unexpected'' corruption, which is useful for finding
both hardware and software problems.

I'm already concerned ZFS is not reporting enough, like
when it says a vdev component is ONLINE, but 'zpool
offline pool <component>' says 'no valid replicas', then
after a scrub there is no change to zpool status, but
zpool offline works again.

ZFS should not ``simplify'' the user interface to the
point that it's hiding problems with itself and its
environment to the ends of avoiding discussion.

(b) if it is reported, then whenever the reporter-blob
raises its hand it will have the effect of exonerating
ZFS in most people's minds, like the stupid CKSUM column
does right now. ``ZFS-FEED-B33F error? oh yeah that's
the new ueberblock search code. that means your disks
are ignoring the SYNCHRONIZE CACHE command. thank GOD
you have ZFS with ANY OTHER FILESYSTEM all bets would be
totally off. lucky you. / I have tried ten different
models from all four brands. / yeah sucks don't it?
flagrant violation of the standard, industry wide. / my
linux testing tool says they're obeying the command fine
/ linux is **** / i added a patch to solaris to block
the SYNC CACHE command and the disks got faster so I
think it's not being ignored / well the stack is
complicated and flushing happens at many levels, like
think about controller performance, and that's
completely unsupported you are doing something REALLY
UNSAFE there you should NOT DO THAT it is STUPID'' and
so on, stalling the actual fix literally for years.

The right way to exonerate ZFS is to make a diagnosis
tool for the disks which proves they're broken, and then
don't buy those disks. not to make a new class of ZFS
fault report that could potentially capture all kinds of
problems, then hazily assign blame to an untestable
quantity.

(2) disks are probably not the only thing dropping the write
barriers. So far, we're also suspecting (unproven!) iSCSI
targets/initiators, particularly around a TCP reconnection event
or target reboot. and VM stacks, both VirtualBox and the HVM in
UltraSPARC T1. probably other stuff.

I'm concerned that assumptions you'll find safe to make about
disks after you get started, like nothing is more than 1s stale,
or send a CDB to size the on-disk cache and imagine it's a FIFO
and it'll be no worse than that, or ``you can get an fsync by
pausing reads for 500ms'' or whatever, will add robustness for
current and future broken disks but won't apply to other types of
broken storage layer.

rmc> However, it is not so resilient when the storage system
rmc> suffers hiccups which cause phantom writes to occur
rmc> continuously, even if for a small period of time (say less
rmc> than 10 seconds), and then return to normal.

ha! that is a great idea. temporal ditto blocks: Important writes
should be written, aged in RAM for 1 minute, then rewritten. :) This
will help with latent sector errors caused by powersag/vibration
too. but...Even I will admit at some point you have to give up and
let the filesystem get corrupted.

actually I'm more in the camp of making ZFS fragile to incorrect
storage stacks, and offering an offline recovery tool that treats the
corrupt pool as read-only and copies it into a new filesystem (so you
need a second same-size empty pool to use the tool). I like this
painful way better than fsck-like things, and much better than silent
workarounds. but i'm probably in the wrong camp on this one.

My reasoning is, we will not be ultimately happy with a filesystem
where fsync() is broken, and that's the best you can do. To compete
with Netapp, we need to bang on this thing until it's actually
working. So far I think sysadmins are receptive to the idea they need
to fix <...> about their setup, or make purchases with extreme care,
or do testing before production. We are not lazy and do not expect an
appliance-on-a-CD.

it's just that pass-the-buck won't ever deliver something useful.
When ext3 was corrupting filesystems on laptops, ext3 got blamed, and
ext3 was not at the root of the problem. But no one _accepted_ that
ext3 was correctly-coded until the overall problem was fixed. (IIRC
it was: you need to send drives a stop-unit command before sending the
ACPI powerdown, because even if they ignore synchronize-cache they do
still flush when told to stop-unit)

It's proper to have a strict separation between ``unclean shutdown''
and ``recovery from corruption''. UFS does have the separation
between log-rolling and fsck-ing, but ZFS could detect the difference
between unclean shutdown and corruption a lot better than UFS, and
that's good. Currently ZFS seems to detect it by telling you ``pool's
corrupt. <shrug>, destroy it.''---the fact that the recovery tool is
entirely absent isn't good, but keeping recovery actions like this
ueberblock-search strictly separate makes delivering something truly
correct on the ``unclean shutdown'' front more likely.

I think, if iSCSI target/initiator combinations are silently
discarding 10sec worth of writes (ex., when they drop and reconnect
their TCP session), then this needs to be proven and their
implementation can be and needs to be corrected, not speculated on and
then worked around.

And I bet this same beefing-up performance numbers by discarding cache
flushes is as rampant in the virtualization game as in the hard disk
game.


rnsc

Posts: 130
From:

Registered: 12/20/07
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted: Nov 30, 2008 8:22 AM   in response to: bonwick
To: Communities » zfs » discuss

It would be extremely helpful to know what brands/models of disks lie and which don't. This information could be provided diplomatically simply as threads documenting problems you are working on, stating the facts. Use of a specific string of words would make searching for it easy. There should be no liability, since you are simply documenting compatibility with zfs.

Or perhaps if the lawyers let you, you could simply publish a compatibility/incompatibility list. These ARE facts.

If there is a way to make a detection tool, that would be very useful too, although after the purchase is made, it could be hard to send it back. However that info could be fed into the database as that drive/model being incompatible with zfs.

As Solaris / zfs gains ground, this could become a strong driver in the industry.

Re: I'll run tests with known-broken disks to determine how far back we
need to go in practice -- I'll bet one txg is almost always enough.

So go back three - we are using zfs because we want absolute reliability (or at least as close as we can get).

--Ray

gino

Posts: 168
From:

Registered: 7/20/06
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted: Feb 7, 2009 5:54 AM   in response to: bonwick
To: Communities » zfs » discuss

> FYI, I'm working on a workaround for broken devices. As you note,
> some disks flat-out lie: you issue the synchronize-cache command,
> they say "got it, boss", yet the data is still not on stable storage.
> Why do they do this? Because "it performs better". Well, duh --
> you can make stuff *really* fast if it doesn't have to be correct.
>
> The uberblock ring buffer in ZFS gives us a way to cope with this,
> as long as we don't reuse freed blocks for a few transaction groups.
> The basic idea: if we can't read the pool starting from the most
> recent uberblock, then we should be able to use the one before it,
> or the one before that, etc, as long as we haven't yet reused any
> blocks that were freed in those earlier txgs. This allows us to
> use the normal load on the pool, plus the passage of time, as a
> displacement flush for disk caches that ignore the sync command.
>
> If we go back far enough in (txg) time, we will eventually find an
> uberblock all of whose dependent data blocks have made it to disk.
> I'll run tests with known-broken disks to determine how far back we
> need to go in practice -- I'll bet one txg is almost always enough.
>
> Jeff
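
The idea quoted above can be modelled with a toy Python sketch. It is not the
real import code and all names are hypothetical: ring is assumed to list each
uberblock's txg together with the blocks it depends on, and durable is
assumed to map each block to the txg whose contents are actually on disk.

```python
def newest_usable_txg(ring, durable):
    """Return the newest txg whose uberblock can be trusted, or None.

    ring:    iterable of (txg, set_of_block_ids) pairs, one per uberblock.
    durable: dict mapping block id -> txg that last wrote it to stable
             storage. A block missing from the dict never left the disk
             cache; a block stamped with a txg newer than the uberblock
             was freed and then reused, so its old contents are gone.
    """
    for txg, blocks in sorted(ring, key=lambda entry: entry[0], reverse=True):
        if all(b in durable and durable[b] <= txg for b in blocks):
            return txg
    return None


ring = [
    (98, {"a", "b"}),
    (99, {"a", "c"}),
    (100, {"a", "d"}),
]

# Block "d" was acknowledged by the disk but never written: fall back to 99.
lost_last_write = newest_usable_txg(ring, {"a": 90, "b": 98, "c": 99})

# Same loss, but "c" was reused right after being freed: 99 is gone too.
reused_too_early = newest_usable_txg(ring, {"a": 90, "b": 98, "c": 101})
```

The second case shows why freed blocks must not be reused for a few
transaction groups: every reuse removes one older uberblock from the set of
possible fallbacks.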

Hi Jeff,
we just lost 2 pools on snv91.
Any news about your workaround to recover pools by discarding the last txg?

thanks
gino

Miles Nordin
carton@Ivy.NET
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted: Dec 12, 2008 12:10 PM   in response to: mgerdts

>>>>> "tt" == Toby Thain <toby at telegraphics dot com dot au> writes:
>>>>> "mg" == Mike Gerdts <mgerdts at gmail dot com> writes:

tt> I think we have to assume Anton was joking - otherwise his
tt> measure is uselessly unscientific.

I think it's rude to talk about someone who's present in the third
person, especially when you're trying to minimize his view. Were you
joking, Anton? :)

0. The reports I read were not useless in the way some have stated,
because for example Mike sampled his own observations:

mg> In the past year I've lost more ZFS file systems than I have
mg> any other type of file system in the past 5 years. With other
mg> file systems I can almost always get some data back. With ZFS
mg> I can't get any back.

It's not just bloggers and pundits sampling mailing list traffic. I
thought there was at least one other post like this but could not
find it.


1. I don't think your impressions nor Anton's and mine are ``useless''


2. I don't think your positive impression is any more scientific than
his and my skeptical one.


3. I'm in general troubled by reports of corruption that aren't
well-investigated, because this will stop young, fragile
filesystems from becoming old and robust. BUT....


4. I'm less troubled by (3) because a few of the corruption reports
were well-investigated by Victor, and he recovered them manually
and posted a summary here:

http://mail.opensolaris.org/pipermail/zfs-discuss/2008-October/051643.html

and how the experience might inform ZFS improvements:

http://mail.opensolaris.org/pipermail/zfs-discuss/2008-October/051667.html


5. I'm more troubled again because everyone seems to have forgotten
(4). Mike, Victor, and others can't necessarily repeat themselves
every time this thread's resurrected. If yapping mailing list
monkeys like me don't remember this experience, invested-wishing
and marketing white papers will drown out the experience we're
getting.

I've pointed straight at an unfixed corruption problem that's
biting ZFS users, and the discussion about where to place the blame
and how to fix it. It is not fixed now, yet pundits on-list and
all over the Interweb like here:

http://www.kev009.com/wp/2008/11/on-file-systems/

talk about corruption bugs hazily and say ``most of all that's been
fixed'' when it's not so hazy and hasn't been, then focus on
theoretical unrealized capabilities of the on-disk format and
minimize this clear experience into ghostly distant-past rumor.

I don't see when the single-LUN SAN corruption problems were fixed. I
think the supposed ``silent FC bit flipping'' basis for the ``use
multiple SAN LUN's'' best-practice is revoltingly dishonest, that we
_know_ better. I'm not saying devices aren't guilty---Sun's sun4v IO
virtualizer was documented as guilty of ignoring cache flushes to
inflate performance just like the loomingly-unnamed models of lying
SATA drives:

http://mail.opensolaris.org/pipermail/zfs-discuss/2008-October/051735.html

Is a storage-stack-related version of this problem the cause of lost
single-LUN SAN pools? Maybe, maybe not, but either way we need an
end-to-end solution. I don't currently see an end-to-end solution to
this pervasive blame-the-device mantra every time a pool goes bad.

I keep digging through the archives to post messages like this because
I feel like everyone only wants to have happy memories, and that it's
going to bring about a sad end.
_______________________________________________
zfs-discuss mailing list
zfs-discuss at opensolaris dot org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


hartz

Posts: 208
From: Cape Town, South Africa

Registered: 11/18/06
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted: Dec 12, 2008 12:38 PM   in response to: Miles Nordin




On Fri, Dec 12, 2008 at 10:10 PM, Miles Nordin <carton at ivy dot net> wrote:


0. The reports I read were not useless in the way some have stated,
  because for example Mike sampled his own observations:
[snip]
 
I don't see when the single-LUN SAN corruption problems were fixed.  I
think the supposed ``silent FC bit flipping'' basis for the ``use
multiple SAN LUN's'' best-practice is revoltingly dishonest, that we
_know_ better.  I'm not saying devices aren't guilty---Sun's sun4v IO
virtualizer was documented as guilty of ignoring cache flushes to
inflate performance just like the loomingly-unnamed models of lying
SATA drives:

 http://mail.opensolaris.org/pipermail/zfs-discuss/2008-October/051735.html

Is a storage-stack-related version of this problem the cause of lost
single-LUN SAN pools?  Maybe, maybe not, but either way we need an
end-to-end solution.  I don't currently see an end-to-end solution to
this pervasive blame-the-device mantra every time a pool goes bad.

I keep digging through the archives to post messages like this because
I feel like everyone only wants to have happy memories, and that it's
going to bring about a sad end.

Thank you.

There are so many unsupported claims and so much noise on both sides that everybody sounds like a bunch of fanboys.

The only bit that I understand about why HW raid "might" be bad is that if ZFS had access to the disks behind a HW RAID LUN, then _IF_ zfs were to encounter corrupted data in a read, it would probably be able to re-construct that data.  This is at the cost of doing the parity calculations on a general purpose CPU, and then sending that parity data, as well as the data to write, across the wire.  Some of that cost may be offset against Raid-Z's optimizations over raid-5 in some situations, but all of this is pretty much if-then-maybe type situations.

I also understand that HW raid arrays have some vulnerabilities and weaknesses, but those seem to be offset against ZFS' notorious instability during error conditions.  I say notorious, because of all the open bug reports and reports on the list of I/O hanging and/or systems panicking while waiting for ZFS to realize that something has gone wrong.

I think that if this last point can be addressed, by making ZFS respond MUCH faster to failures, it will go a long way toward ZFS being more readily adopted.


--
Any sufficiently advanced technology is indistinguishable from magic.
   Arthur C. Clarke

My blog: http://initialprogramload.blogspot.com


qu1j0t3

Posts: 126
From: CA

Registered: 1/19/06
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted: Dec 12, 2008 12:51 PM   in response to: hartz



On 12-Dec-08, at 3:38 PM, Johan Hartzenberg wrote:

...
The only bit that I understand about why HW raid "might" be bad is that if it had access to the disks behind a HW RAID LUN, then _IF_ zfs were to encounter corrupted data in a read, it will probably be able to re-construct that data.  This is at the cost of doing the parity calculations on a general purpose CPU, 

Except that it's not just parity - ZFS checksums where RAID-N does not (although I've heard that some RAID systems checksum "somewhere" - not end-to-end of course).
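
A minimal sketch of that difference, in Python. This is a toy model only, not ZFS code, and `Mirror`, `write_block` and `read_block` are made-up names. The checksum is held by the caller (standing in for the parent block pointer) rather than stored next to the data, so a read can tell which copy is bad and repair it from the other one. A plain mirror can notice that two copies disagree, but not which one is right.

```python
import hashlib


class Mirror:
    """Toy two-way mirror; each 'disk' is a dict of block number -> bytes."""

    def __init__(self):
        self.disks = [{}, {}]

    def write_block(self, blkno, data):
        for disk in self.disks:
            disk[blkno] = data
        # The checksum goes back to the caller (the "parent block pointer"),
        # it is not stored alongside the data it protects.
        return hashlib.sha256(data).digest()

    def read_block(self, blkno, expected):
        for disk in self.disks:
            data = disk.get(blkno)
            if data is not None and hashlib.sha256(data).digest() == expected:
                # Good copy found: rewrite it over any copy that differs.
                for other in self.disks:
                    if other.get(blkno) != data:
                        other[blkno] = data
                return data
        raise IOError("no copy of block %d matches its checksum" % blkno)


m = Mirror()
cksum = m.write_block(7, b"important data")
m.disks[0][7] = b"important dat\x00"       # simulate silent corruption
assert m.read_block(7, cksum) == b"important data"
assert m.disks[0][7] == b"important data"  # the bad copy was repaired
```

If both copies fail the check, the read raises an error instead of returning bad data, which is the other half of the argument.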

Call me a fanboy if you will, but ZFS is different from hw RAID. I am not an "automatic denier" of ZFS bugs or flaws, but I do acknowledge it's more revolution than evolution. It's software. We only need be patient while it matures. :)

--Toby



bfriesen

Posts: 874
From: US

Registered: 8/19/08
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted: Dec 12, 2008 1:16 PM   in response to: qu1j0t3


On Fri, 12 Dec 2008, Toby Thain wrote:
>
> Except that it's not just parity - ZFS checksums where RAID-N does not
> (although I've heard that some RAID systems checksum "somewhere" - not
> end-to-end of course).

It will soon be quite easy to build a RAID system like this using
OpenSolaris and a sub-project known as COMSTAR. The checksums will be
done using a storage technology called ZFS.

Bob
======================================
Bob Friesenhahn
bfriesen at simple dot dallas dot tx dot us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/



tcook

Posts: 649
From: US

Registered: 8/21/06
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted: Dec 12, 2008 1:30 PM   in response to: qu1j0t3




On Fri, Dec 12, 2008 at 2:51 PM, Toby Thain <toby at telegraphics dot com dot au> wrote:

On 12-Dec-08, at 3:38 PM, Johan Hartzenberg wrote:

...
The only bit that I understand about why HW raid "might" be bad is that if it had access to the disks behind a HW RAID LUN, then _IF_ zfs were to encounter corrupted data in a read, it will probably be able to re-construct that data.  This is at the cost of doing the parity calculations on a general purpose CPU, 

Except that it's not just parity - ZFS checksums where RAID-N does not (although I've heard that some RAID systems checksum "somewhere" - not end-to-end of course).

Call me a fanboy if you will, but ZFS is different from hw RAID. I am not an "automatic denier" of ZFS bugs or flaws, but I do acknowledge it's more revolution than evolution. It's software. We only need be patient while it matures. :)

--Toby


I'm going to pitch in here as devil's advocate and say this is hardly revolution.  99% of what zfs is attempting to do is something NetApp and WAFL have been doing for 15 years+.  Regardless of the merits of their patents and prior art, etc., this is not something revolutionarily new.  It may be "revolution" in the sense that it's the first time it's come to open source software and been given away, but it's hardly "revolutionary" in file systems as a whole.

--Tim



ian

Posts: 1,760
From: NZ

Registered: 4/27/05
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted: Dec 12, 2008 1:36 PM   in response to: tcook


Tim wrote:
>
>
> On Fri, Dec 12, 2008 at 2:51 PM, Toby Thain <toby at telegraphics dot com dot au
> <mailto:toby at telegraphics dot com dot au>> wrote:
>
>
> On 12-Dec-08, at 3:38 PM, Johan Hartzenberg wrote:
>
>> ...
>> The only bit that I understand about why HW raid "might" be bad
>> is that if it had access to the disks behind a HW RAID LUN, then
>> _IF_ zfs were to encounter corrupted data in a read, it will
>> probably be able to re-construct that data. This is at the cost
>> of doing the parity calculations on a general purpose CPU,
>
> Except that it's /not just parity/ - ZFS checksums where RAID-N
> does not (although I've heard that some RAID systems checksum
> "somewhere" - not end-to-end of course).
>
> Call me a fanboy if you will, but ZFS is different from hw RAID. I
> am not an "automatic denier" of ZFS bugs or flaws, but I do
> acknowledge it's more /revolution/ than evolution. It's software.
> We only need be patient while it matures. :)
>
> --Toby
>
>
> I'm going to pitch in here as devil's advocate and say this is hardly
> revolution. 99% of what zfs is attempting to do is something NetApp
> and WAFL have been doing for 15 years+.

The ideas aren't new, but the combination of the ideas is. NetApp is
still a box at the end of a bit of wire that the OS has to blindly trust.

--
Ian.



tcook

Posts: 649
From: US

Registered: 8/21/06
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted: Dec 12, 2008 2:00 PM   in response to: ian




On Fri, Dec 12, 2008 at 3:36 PM, Ian Collins <ian at ianshome dot com> wrote:


The ideas aren't new, but the combination of the ideas is.  NetApp is
still a box at the end of a bit of wire that the OS has to blindly trust.

--
Ian.



I'm not aware of many, if any, large shops that are moving to a model of "all internal disk with applications running on them".  The sun box will just be "a box at the end of the wire", a la storage 7000, when it's an nfs/cifs/iscsi target.  Centralized storage is a *good thing*.

--Tim


ian

Posts: 1,760
From: NZ

Registered: 4/27/05
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted: Dec 12, 2008 2:11 PM   in response to: tcook


Tim wrote:
>
>
> On Fri, Dec 12, 2008 at 3:36 PM, Ian Collins <ian at ianshome dot com
> <mailto:ian at ianshome dot com>> wrote:
>
>
>
> The ideas aren't new, but the combination of the ideas is. NetApp is
> still a box at the end of a bit of wire that the OS has to blindly
> trust.
>
> --
> Ian.
>
>
>
> I'm not aware of many, if any large shops that are moving to a model
> of "all internal disk with applications running on them". The sun box
> will just be "a box at the end of the wire", a-la storage 7000 when
> it's an nfs/cifs/iscsi target. Centralized storage is a *good thing*.
>
Maybe, but I'm sure that will change as the performance of the storage
subsystems continues to exceed the performance of the bit of wire.

That's where the revolution bit comes in; applications can now coexist
with NetApp quality storage management.

--
Ian.



bonwick

Posts: 129
From: US

Registered: 3/9/05
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted: Dec 12, 2008 6:16 PM   in response to: tcook


> I'm going to pitch in here as devil's advocate and say this is hardly
> revolution. 99% of what zfs is attempting to do is something NetApp and
> WAFL have been doing for 15 years+. Regardless of the merits of their
> patents and prior art, etc., this is not something revolutionarily new. It
> may be "revolution" in the sense that it's the first time it's come to open
> source software and been given away, but it's hardly "revolutionary" in file
> systems as a whole.

"99% of what ZFS is attempting to do?" Hmm, OK -- let's make a list:

end-to-end checksums
unlimited snapshots and clones
O(1) snapshot creation
O(delta) snapshot deletion
O(delta) incremental generation
transactionally safe RAID without NVRAM
variable blocksize
block-level compression
dynamic striping
intelligent prefetch with automatic length and stride detection
ditto blocks to increase metadata replication
delegated administration
scalability to many cores
scalability to huge datasets
hybrid storage pools (flash/disk mix) that optimize price/performance

How many of those does NetApp have? I believe the correct answer is 0%.

Jeff


tcook

Posts: 649
From: US

Registered: 8/21/06
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted: Dec 13, 2008 1:15 AM   in response to: bonwick




On Fri, Dec 12, 2008 at 8:16 PM, Jeff Bonwick <Jeff dot Bonwick at sun dot com> wrote:
> I'm going to pitch in here as devil's advocate and say this is hardly
> revolution.  99% of what zfs is attempting to do is something NetApp and
> WAFL have been doing for 15 years+.  Regardless of the merits of their
> patents and prior art, etc., this is not something revolutionarily new.  It
> may be "revolution" in the sense that it's the first time it's come to open
> source software and been given away, but it's hardly "revolutionary" in file
> systems as a whole.

"99% of what ZFS is attempting to do?"  Hmm, OK -- let's make a list:

       end-to-end checksums
       unlimited snapshots and clones
       O(1) snapshot creation
       O(delta) snapshot deletion
       O(delta) incremental generation
       transactionally safe RAID without NVRAM
       variable blocksize
       block-level compression
       dynamic striping
       intelligent prefetch with automatic length and stride detection
       ditto blocks to increase metadata replication
       delegated administration
       scalability to many cores
       scalability to huge datasets
       hybrid storage pools (flash/disk mix) that optimize price/performance

How many of those does NetApp have?  I believe the correct answer is 0%.

Jeff

Seriously?  Do you know anything about the NetApp platform?  I'm hoping this is a genuine question...

Off the top of my head nearly all of them.  Some of them have artificial limitations because they learned the hard way that if you give customers enough rope they'll hang themselves.  For instance "unlimited snapshots".  Do I even need to begin to tell you what a horrible, HORRIBLE idea that is?  "Why can't I get my space back?"  Oh, just do a snapshot list and figure out which one is still holding the data.  What?  Your console locks up for 8 hours when you try to list out the snapshots?  Huh... that's weird.

It's sort of like that whole "unlimited filesystems" thing.  Just don't ever reboot your server, right?  Or "you can have 40pb in one pool!!!".  How do you back it up?  Oh, just mirror it to another system?  And when you hit a bug that toasts both of them you can just start restoring from tape for the next 8 years, right?  Or if by some luck we get a zfsiron, you can walk the metadata for the next 5 years.

NVRAM has been replaced by flash drives in a ZFS world to get any kind of performance... so you're trading one high priced storage for another.  Your snapshot creation and deletion is identical.  Your incremental generations is identical.  End-to-end checksums?  Yup.

Let's see... they don't have block-level compression, they chose dedup instead which nets better results.  "Hybrid storage pool" is achieved through PAM modules.  Outside of that... I don't see ANYTHING in your list they didn't do first.


--Tim


bonwick

Posts: 129
From: US

Registered: 3/9/05
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted: Dec 13, 2008 2:01 AM   in response to: tcook


> Off the top of my head nearly all of them. Some of them have artificial
> limitations because they learned the hard way that if you give customers
> enough rope they'll hang themselves. For instance "unlimited snapshots".

Oh, that's precious! It's not an arbitrary limit, it's a safety feature!

> Outside of that... I don't see ANYTHING in your list they didn't do first.

Then you don't know ANYTHING about either platform. Constant-time
snapshots, for example. ZFS has them; NetApp's are O(N), where N is
the total number of blocks, because that's how big their bitmaps are.
If you think O(1) is not a revolutionary improvement over O(N),
then not only do you not know much about either snapshot algorithm,
you don't know much about computing.
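
To illustrate the complexity argument only, here is a toy Python sketch.
Neither class is ZFS or WAFL code, and all the names are invented. In a
copy-on-write store a snapshot just keeps a reference to the current
root, while the per-block bitmap scheme described above has to copy an
entry for every block in the pool.

```python
class CowPool:
    """Copy-on-write store: the live state is an immutable tuple of blocks."""

    def __init__(self, nblocks):
        self.root = (b"",) * nblocks
        self.snapshots = {}

    def write(self, blkno, data):
        # Never modify in place; publish a new root instead. (A real tree
        # would copy only the path to the changed block, not the whole array.)
        blocks = list(self.root)
        blocks[blkno] = data
        self.root = tuple(blocks)

    def snapshot(self, name):
        # O(1): remember the current root, whatever the pool size.
        self.snapshots[name] = self.root


class BitmapPool:
    """Tracks snapshot membership with one bitmap entry per block."""

    def __init__(self, nblocks):
        self.in_use = [False] * nblocks
        self.snapshots = {}

    def snapshot(self, name):
        # O(N): copies an entry for every block in the pool.
        self.snapshots[name] = list(self.in_use)


pool = CowPool(8)
pool.write(3, b"v1")
pool.snapshot("before")
pool.write(3, b"v2")
assert pool.snapshots["before"][3] == b"v1"  # snapshot still sees old data
assert pool.root[3] == b"v2"                 # live view sees the new data
```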

Sorry, everyone else, for feeding the troll. Chum the water all you like,
I'm done with this thread.

Jeff


Bryan Cantrill
bmc@eng.sun.com
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted: Dec 13, 2008 7:54 AM   in response to: tcook



> Seriously? Do you know anything about the NetApp platform? I'm hoping this
> is a genuine question...
>
> Off the top of my head nearly all of them. Some of them have artificial
> limitations because they learned the hard way that if you give customers
> enough rope they'll hang themselves. For instance "unlimited snapshots".
> Do I even need to begin to tell you what a horrible, HORRIBLE idea that is?
> "Why can't I get my space back?" Oh, just do a snapshot list and figure out
> which one is still holding the data. What? Your console locks up for 8
> hours when you try to list out the snapshots? Huh... that's weird.
>
> It's sort of like that whole "unlimited filesystems" thing. Just don't ever
> reboot your server, right? Or "you can have 40pb in one pool!!!". How do
> you back it up? Oh, just mirror it to another system? And when you hit a
> bug that toasts both of them you can just start restoring from tape for the
> next 8 years, right? Or if by some luck we get a zfsiron, you can walk the
> metadata for the next 5 years.
>
> NVRAM has been replaced by flash drives in a ZFS world to get any kind of
> performance... so you're trading one high priced storage for another. Your
> snapshot creation and deletion is identical. Your incremental generations
> is identical. End-to-end checksums? Yup.
>
> Let's see... they don't have block-level compression, they chose dedup
> instead which nets better results. "Hybrid storage pool" is achieved
> through PAM modules. Outside of that... I don't see ANYTHING in your list
> they didn't do first.

Wow -- I've spoken to many NetApp partisans over the years, but you might
just take the cake. Of course, most of the people I talk to are actually
_using_ NetApp's technology, a practice that tends to leave even the most
stalwart proponents realistic about the (many) limitations of NetApp's
technology...

For example, take the PAM. Do you actually have one of these, or are you
basing your thoughts on reading whitepapers? I ask because (1) they are
horrifically expensive (2) they don't perform that well (especially
considering that they're DRAM!) (3) they're grossly undersized (a 6000
series can still only max out at a paltry 96G -- and that's with virtually
no slots left for I/O) and (4) they're not selling well. So if you
actually bought a PAM, that already puts you in a razor-thin minority of
NetApp customers (most of whom see through the PAM and recognize it for
the kludge that it is); if you bought a PAM and think that it's somehow a
replacement for the ZFS hybrid storage pool (which has an order of magnitude
more cache), then I'm sure NetApp loves you: you must be the dumbest,
richest customer that ever fell in their lap!

- Bryan

--------------------------------------------------------------------------
Bryan Cantrill, Sun Microsystems Fishworks. http://blogs.sun.com/bmc


bfriesen

Posts: 874
From: US

Registered: 8/19/08
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted: Dec 13, 2008 8:03 AM   in response to: tcook


On Sat, 13 Dec 2008, Tim wrote:
>
> Seriously? Do you know anything about the NetApp platform? I'm hoping this
> is a genuine question...

I believe that esteemed Sun engineers like Jeff are quite familiar
with the NetApp platform. Besides NetApp being one of the primary
storage competitors, it is a virtual minefield out there and one must
take great care not to step on other companies' patents.

> Off the top of my head nearly all of them. Some of them have artificial
> limitations because they learned the hard way that if you give customers
> enough rope they'll hang themselves. For instance "unlimited snapshots".
> Do I even need to begin to tell you what a horrible, HORRIBLE idea that is?
> "Why can't I get my space back?" Oh, just do a snapshot list and figure out
> which one is still holding the data. What? Your console locks up for 8
> hours when you try to list out the snapshots? Huh... that's weird.

I suggest that you retire to the safety of the rubber room while the
rest of us enjoy these zfs features. By the same measures, you would
advocate that people should never be allowed to go outside due to the
wide open spaces. Perhaps people will wander outside their homes and
forget how to make it back. Or perhaps there will be gravity failure
and some of the people outside will be lost in space.

There is some activity off the starboard bow, perhaps you should check
it out ...

Bob
======================================
Bob Friesenhahn
bfriesen at simple dot dallas dot tx dot us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/



Joseph Zhou
jz@excelsioritsoluti...
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted: Dec 13, 2008 3:14 PM   in response to: bfriesen


Hi Bob, Tim, Jeff, you are all my friends, and you all know what you are
talking about.
As a friend, and trusting your personal integrity, I ask you, please, don't
get mad, enjoy the open discussion.

(ok, ok, O(N) is revolutionary in tech thinking, just not revolutionary in
end customer value. And safety features are important in risk management
for enterprises.)

I have friends at NetApp, and there are people there that I don't give a
****.

I am an enterprise architect; I don't care about the little environments
that can be fulfilled most effectively by any one operating environment's
applications. They are not enterprises and are risky in that business model
in economic downturns.

In that spirit, and looking at the NetApp virtual server support
architecture, I would say --
as much as the ONTAP/WAFL thing (even with GX integration) is elegant, it
would make more sense to utilize the file system capabilities with kernel
integration to hypervisors, in virtual server deployments, instead of
promoting a storage-device-based file system and data management solution
(more proprietary at the solution level).

So, in my position, NetApp PiT is not as good as ZFS PiT, because it is too
far from the hypervisor.
You can support me or attack me with more technical details (if you know
NetApp is developing an API for all server hypervisors, I don't).
And don't worry, I have the biggest eagle, but so far, no one has been able
to hurt that. ;-)

Best,
z

----- Original Message -----
From: "Bob Friesenhahn" <bfriesen at simple dot dallas dot tx dot us>
To: "Tim" <tim at tcsac dot net>
Cc: <zfs-discuss at opensolaris dot org>
Sent: Saturday, December 13, 2008 11:03 AM
Subject: Re: [zfs-discuss] Split responsibility for data with ZFS


> On Sat, 13 Dec 2008, Tim wrote:
>>
>> Seriously? Do you know anything about the NetApp platform? I'm hoping
>> this
>> is a genuine question...
>
> I believe that esteemed Sun engineers like Jeff are quite familiar
> with the NetApp platform. Besides NetApp being one of the primary
> storage competitors, it is a virtual minefield out there and one must
> take great care not to step on other company's patents.
>
>> Off the top of my head nearly all of them. Some of them have artificial
>> limitations because they learned the hard way that if you give customers
>> enough rope they'll hang themselves. For instance "unlimited snapshots".
>> Do I even need to begin to tell you what a horrible, HORRIBLE idea that
>> is?
>> "Why can't I get my space back?" Oh, just do a snapshot list and figure
>> out
>> which one is still holding the data. What? Your console locks up for 8
>> hours when you try to list out the snapshots? Huh... that's weird.
>
> I suggest that you retire to the safety of the rubber room while the
> rest of us enjoy these zfs features. By the same measures, you would
> advocate that people should never be allowed to go outside due to the
> wide open spaces. Perhaps people will wander outside their homes and
> forget how to make it back. Or perhaps there will be gravity failure
> and some of the people outside will be lost in space.
>
> There is some activity off the starboard bow, perhaps you should check
> it out ...
>
> Bob
> ======================================
> Bob Friesenhahn
> bfriesen at simple dot dallas dot tx dot us, http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
>


bfriesen

Posts: 874
From: US

Registered: 8/19/08
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted: Dec 13, 2008 9:45 PM   in response to: Joseph Zhou


On Sat, 13 Dec 2008, Joseph Zhou wrote:
>
> In that spirit, and looking at the NetApp virtual server support
> architecture, I would say --
> as much as the ONTAP/WAFL thing (even with GX integration) is elegant, it
> would make more sense to utilize the file system capabilities with kernel
> integration to hypervisors, in virtual server deployments, instead of
> promoting a storage-device-based file system and data management solution
> (more proprietary at the solution level).

I am not an enterprise architect but I do agree that when multiple
client OSs are involved it is still useful if storage looks like a
legacy disk drive. Luckily Solaris already offers iSCSI in Solaris 10
and OpenSolaris is now able to offer high performance fiber channel
target and fiber channel over ethernet layers on top of reliable ZFS.
The full benefit of ZFS is not provided, but the storage is
successfully divorced from the client with a higher degree of data
reliability and performance than is available from current firmware
based RAID arrays.

Bob
======================================
Bob Friesenhahn
bfriesen at simple dot dallas dot tx dot us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/



Miles Nordin
carton@Ivy.NET
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted: Dec 15, 2008 2:04 PM   in response to: Joseph Zhou


>>>>> "bc" == Bryan Cantrill <bmc at eng dot sun dot com> writes:
>>>>> "jz" == Joseph Zhou <jz at excelsioritsolutions dot com> writes:

bc> most of the people I talk to are actually _using_ NetApp's
bc> technology, a practice that tends to leave even the most
bc> stalwart proponents realistic about the (many) limitations of
bc> NetApp's

same applies to ZFS pundits!

As Tim said, the one-filesystem-per-user thing is not working out.
O(1) for number of filesystems would be great but isn't there.

Maybe the format allows unlimited O(1) snapshots, but it's at best
O(1) to take them. All over the place it's probably O(n) or worse to
_have_ them: to boot with them, to scrub with them.

I think the winning snapshot architecture is more like source code
revision control: take infinitely-granular snapshots, a continuous
line, and run a cron service to trim the line into a series of points.
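
A rough sketch of that trimming step in Python. The retention tiers are
made up for illustration, and nothing here calls zfs: given snapshot
timestamps, keep the newest one in each time bucket and report the rest
for deletion.

```python
from datetime import datetime, timedelta

# (maximum age, bucket width): a hypothetical policy, with finer buckets
# for recent history and coarser ones further back.
TIERS = [
    (timedelta(days=1), timedelta(hours=1)),
    (timedelta(days=30), timedelta(days=1)),
    (timedelta(days=365), timedelta(days=30)),
]


def thin(snapshots, now):
    """Split snapshot datetimes into (keep, delete), each newest first."""
    keep, delete = [], []
    seen = set()
    for ts in sorted(snapshots, reverse=True):
        age = now - ts
        for tier, (max_age, width) in enumerate(TIERS):
            if age <= max_age:
                bucket = (tier, age // width)
                if bucket in seen:
                    delete.append(ts)   # a newer snapshot owns this bucket
                else:
                    seen.add(bucket)
                    keep.append(ts)
                break
        else:
            delete.append(ts)           # older than the last tier
    return keep, delete
```

A cron job would feed this the creation times of the existing snapshots
and destroy whatever lands in the second list.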

The management can be delegated, but inspection commands are not safe
and can lock the whole filesystem, and 'zfs recv'ing certain streams
panics the whole box so backup cannot really be safely delegated either.
The panic-on-import problems are bad for delegation because you can't
safely let users mount things, which to my view is where delegated
administration begins. It's too unstable to think of delegating
anything---it's all just UI baloney until the panics are fixed and
failures are contained within one pool.

The scalability to multiple cores goals are admirable, but only
certain things are parallelized. You can only replace one device at a
time, which some day will not be enough to keep up with natural
failure rates. I think 'zfs send' does not use multiple cores well,
right? AIUI people are getting non-scaling performance in send/recv
while the ordinary filesystem performance does scale, and thus getting
painted into a corner.

Yeah there's compression, but as Tim said people are getting more
savings from dedup, which goes naturally with writeable clones too.
Also the NetApp dedup is a background thread while the ZFS compression
is synchronous with writing, as well as not scaling to multiple cores
and seeming to have some bugs in the gzip version.

Yeah there is some hierarchical storage in it, but after half a year
still a slog cannot be removed?

In general I think ZFS pundits compliment the architecture and not the
implementation.

The big compliment I have for it is just that the ZFS piece is free
software, even though large chunks of OpenSolaris aren't. That's a
gigantic advantage, especially over NetApp, which probably has about
as much long-term future as Lisp.

jz> As a friend, and trusting your personal integrity, I ask you,
jz> please, don't get mad, enjoy the open discussion.

Joseph, I don't see the problem and think it's fine to get excited so long
as actual information comes out. There's nothing ad-hominem in the
discussion yet, and being ordered not to get mad will make any normal
person furious, especially if you make the order based on ``trust''
and ``personal integrity''---why bring up such things at all? I
almost feel like you're baiting them! I know it's normal for
sysadmins to be dry and menial, but it's still a technical discussion,
so I hope it doesn't upset anyone because it's not boring.


nico

Posts: 3,468
From: US

Registered: 6/15/05
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted: Dec 15, 2008 2:12 PM   in response to: Miles Nordin


On Mon, Dec 15, 2008 at 05:04:03PM -0500, Miles Nordin wrote:
> As Tim said, the one-filesystem-per-user thing is not working out.

For NFSv3 clients that truncate MOUNT protocol answers (and v4 clients
that still rely on the MOUNT protocol), yes, one-filesystem-per-user is
a problem. For NFSv4 clients that support mirror mounts it's not a
problem at all. You're not required to go with one-filesystem-per-user
though! That's only if you want to approximate quotas.

> O(1) for number of filesystems would be great but isn't there.

It is O(1) for filesystems (parts of the system could be parallelized
more, but the on-disk data format is O(1) for filesystem creation and
mounting, just like it is for snapshots and clones).

> Maybe the format allows unlimited O(1) snapshots, but it's at best
> O(1) to take them. All over the place it's probably O(n) or worse to
> _have_ them. to boot with them, to scrub with them.

It's NOT O(N) to boot because of snapshots, nor to scrub. Scrub and
resilver are O(N) where N is the amount used (as opposed to O(N) where N
is the size of the volume, for HW RAID and the like).

Nico
--


Miles Nordin
carton@Ivy.NET
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted: Dec 16, 2008 11:00 AM   in response to: nico


>>>>> "nw" == Nicolas Williams <Nicolas dot Williams at sun dot com> writes:

nw> For NFSv4 clients that support mirror mounts it's not a problem
nw> at all.

no, 3000 - 10,000 users is common for a large campus, and according to
posters here, sometimes that many users actually can fit into the
bandwidth of a single pool. But ZFS is not useable with that many
filesystems. booting, 'zfs create', 'zfs list', all take hours. see
list archives.

If the on-disk format is theoretically capable of achieving O(1) for
number of filesystems, that's nice! It's just not an advantage over
NetApp when it's not working yet. And, with any project, sometimes the
last 5% of the work never gets done.

so I'm making a desperate call to start basing punditry on experience
rather than white papers and optimistic architecture documents.
OpenSolaris could have an advantage here---it's much easier to get
experience with Solaris than NetApp because it's not (a) expensive and
(b) locked behind a bunch of licenses, agreements and contracts,
unshareable documentation, private censored web forums (NOW site),
u.s.w., so OpenSolaris punditry could one day become a lot more
trustworthy than NetApp punditry.

nw> You're not required to go with one-filesystem-per-user though!

It was pitched as an architectural advantage, but never fully
delivered, and worse, used to justify removing traditional Unix
quotas. Consequently, quota-wise, ZFS becomes a regression w.r.t. UFS
rather than an evolution, because of over-focusing on the virtues of
the architecture rather than the delivered implementation.

I don't use quotas and don't care, but it's a good example of broken
advocacy.

nw> It's NOT O(N) to boot because of snapshots, nor to scrub.

I think it is. try it and see. :/

That was Tim's point as I read it. Jeff claimed ``unlimited snapshots
and clones'' as a ZFS advantage over NetApp, and Tim said open bugs or
subtle limitations make the supposed advantage a fantasy, even a
liability:

``"unlimited snapshots". Do I even need to begin to tell you what a
horrible, HORRIBLE idea that is? "Why can't I get my space back?"
Oh, just do a snapshot list and figure out which one is still
holding the data. What? Your console locks up for 8 hours when
you try to list out the snapshots? Huh... that's weird.''

...and to add to that, the snapshot list in ZFS does a better job of
showing which one's using the space if there are fewer snapshots.
with hundreds of snapshots 'zfs list' shows a USED column full of
zeroes, correctly, because you won't save any space by deleting just
one---you have to delete a range of snapshots to get some space back.
Of course that's not the same thing as being O(N), that's just
annoying.
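
To make the USED point concrete, here is a toy Python model. It is not ZFS code: the snapshot names and block numbers are invented, and real ZFS space accounting is done differently. The only assumption is that a block is freed once no remaining snapshot references it.

```python
# Toy model: each snapshot is the set of block IDs it references.
# Hypothetical data for illustration only.
snaps = {
    "mon": {1, 2, 3},
    "tue": {1, 2, 3, 4},
    "wed": {2, 3, 4},
}

def unique_space(snapshots, name):
    """Blocks referenced by `name` and by no other snapshot (the USED idea)."""
    others = set().union(*(b for n, b in snapshots.items() if n != name))
    return snapshots[name] - others

def space_freed_by(snapshots, doomed):
    """Blocks freed if every snapshot in `doomed` is destroyed together."""
    kept = set().union(*(b for n, b in snapshots.items() if n not in doomed))
    gone = set().union(*(snapshots[n] for n in doomed))
    return gone - kept
```

In this model every snapshot has zero unique space, so destroying any single one frees nothing, yet destroying "mon" and "tue" together frees block 1.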

and I don't know that it's really O(N)---it could be better or worse
than O(N). It's not O(1) though, to boot, list, or scrub snapshots.

and if it's not O(1) because of some unnecessary high-level ioctl
accidentally called in some obscure, abstract library by the
``simple'' user interface, it's still not O(1)! For practical users,
that library could remain suboptimal for the next two years, and I
don't want to spend those two years enduring a bunch of blogging about
nonexistent O(1) snapshots just because the on-disk format
theoretically doesn't impede delivering them.


jkaitsch

Posts: 11
From: CA

Registered: 11/17/05
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted: Dec 16, 2008 11:22 AM   in response to: Miles Nordin


Miles Nordin wrote:
>>>>>> "nw" == Nicolas Williams <Nicolas dot Williams at sun dot com> writes:
>
>
> nw> You're not required to go with one-filesystem-per-user though!
>
> It was pitched as an architectural advantage, but never fully
> delivered, and worse, used to justify removing traditional Unix
> quotas. Consequently, quota-wise, ZFS becomes a regression w.r.t. UFS
> rather than an evolution, because of over-focusing on the virtues of
> the architecture rather than the delivered implementation.
>
>



Precisely.

The issue of quotas for ZFS on a per-user basis was pointed
out several years ago at FAST, when some of the Sun folks showed
up to discuss ZFS in a late evening meeting. A file-system-per-user
approach is not very viable when you have tens of thousands
of users.

It was my hope that Sun would get that message by now, as I
consider it one of the major problems with ZFS.


qu1j0t3

Posts: 126
From: CA

Registered: 1/19/06
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted: Dec 15, 2008 5:43 PM   in response to: Miles Nordin


>
> Maybe the format allows unlimited O(1) snapshots, but it's at best
> O(1) to take them. All over the place it's probably O(n) or worse to
> _have_ them. to boot with them, to scrub with them.

Why would a scrub be O(n snapshots)?

The O(n filesystems) effects reported from time to time in
OpenSolaris seem due to code that iterates over them. The new ability
to create huge numbers of them puts stress on assumptions valid in
more traditional UNIX configurations, right?

--Toby


rang

Posts: 295
From: US

Registered: 3/9/05
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted: Dec 13, 2008 10:20 PM   in response to: qu1j0t3
To: Communities » zfs » discuss

Some RAID systems compare checksums on reads, though this is usually only for RAID-4 configurations (e.g. DataDirect) because of the performance hit otherwise.

End-to-end checksums are not yet common. The SCSI committee recently ratified T10 DIF, which allows either an operating system or application to supply checksums and have them stored and retrieved with data. Oracle has been working to add support for this to Linux, and several array and drive vendors have committed to implementing it. So one could say that ZFS is ahead of the curve here.

ZFS is not particularly revolutionary: software RAID has been around since the invention of the term; end-to-end checksums to disk have been used since the 1960s (though more often in databases, tape, and optical media); WAFL-like file structures may pre-date NetApp. It does put these together for the first time in a widely available system, though, which is certainly innovative and useful. It will be more useful when it has a more complete disaster recovery model than 'restore from backup.'

relling

Posts: 2,083
From: US

Registered: 6/17/05
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted: Dec 13, 2008 11:11 PM   in response to: rang


Anton B. Rang wrote:
> Some RAID systems compare checksums on reads, though this is usually only for RAID-4 configurations (e.g. DataDirect) because of the performance hit otherwise.
>

For the record, Solaris had a (mirrored) RAID system which would compare
data from both sides of the mirror upon read. It never achieved significant
market penetration and was subsequently scrapped. Many of the reasons that
the market did not accept it are solved by the method used by ZFS, which is
far superior.

> End-to-end checksums are not yet common. The SCSI committee recently ratified T10 DIF, which allows either an operating system or application to supply checksums and have them stored and retrieved with data. Oracle has been working to add support for this to Linux, and several array and drive vendors have committed to implementing it. So one could say that ZFS is ahead of the curve here.
>

Oracle also has data checksumming enabled by default for later releases.
I look forward to any field data analysis they may publish :-)

> ZFS is not particularly revolutionary: software RAID has been around since the invention of the term; end-to-end checksums to disk have been used since the 1960s (though more often in databases, tape, and optical media); WAFL-like file structures may pre-date NetApp. It does put these together for the first time in a widely available system, though, which is certainly innovative and useful. It will be more useful when it has a more complete disaster recovery model than 'restore from backup.'
>

If you wish to implement a disaster recovery model, then you should look far
beyond what ZFS (or any file system) can provide. Effective disaster
recovery requires significant attention to process.
-- richard



relling

Posts: 2,083
From: US

Registered: 6/17/05
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted: Dec 12, 2008 1:30 PM   in response to: hartz


Johan Hartzenberg wrote:
> There are so many unsupported claims and so much noise on both sides that
> everybody is sounding like a bunch of fanboys.

I don't think there are two sides. Anyone who has been around computing
for any length of time has lost data due to various failures. The
question isn't about losing data, it is about how to proceed when your
data is damaged.

>
> The only bit that I understand about why HW raid "might" be bad is
> that if it had access to the disks behind a HW RAID LUN, then _IF_ zfs
> were to encounter corrupted data in a read, it will probably be able
> to re-construct that data. This is at the cost of doing the parity
> calculations on a general purpose CPU, and then sending that parity
> data, as well as the data to write, across the wire. Some of that
> cost may be offset against Raid-Z's optimizations over raid-5 in some
> situations, but all of this is pretty much if-then-maybe type situations.

OK, repeat after me: there is no such thing as hardware RAID, there is no
such thing as hardware RAID, there is no such thing as hardware RAID.
There is only software RAID. If you believe any software is infallible,
then you will be hurt. Even beyond RAID, there is quite sophisticated
software on your disks, and anyone who has had to upgrade disk firmware
will attest that disk firmware is not infallible.

> I also understand that HW raid arrays have some vulnerabilities and
> weaknesses, but those seem to be offset against ZFS' notorious
> instability during error conditions. I say notorious, because of all
> the open bug reports and reports on the list of I/O hanging and/or
> systems panicking while waiting for ZFS to realize that something has
> gone wrong.
>
> I think if this last point can be addressed - make ZFS respond MUCH
> faster to failures, then it will go a long way to make ZFS be more
> readily adopted.

However, you can't respond too fast -- something which seems to get lost
in these conversations. If you declare a disk dead too fast, then you get
caught in a bind by things like Seagate disks which "freeze" for a few
seconds. It may be much better to ride through such things than initiate a
reconfiguration action (as described in the article below).
http://blogs.zdnet.com/storage/?p=369&tag=nl.e539

Note: as of b97, it is now possible to set per-device retries in the sd and
ssd drivers. This is a good start towards satisfying those who are
fed up with the default sd/ssd retry logic. See sd(7d)
http://opensolaris.org/os/community/arc/caselog/2007/505/

-- richard



qu1j0t3

Posts: 126
From: CA

Registered: 1/19/06
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted: Dec 12, 2008 12:44 PM   in response to: Miles Nordin



On 12-Dec-08, at 3:10 PM, Miles Nordin wrote:

>>>>>> "tt" == Toby Thain <toby at telegraphics dot com dot au> writes:
>>>>>> "mg" == Mike Gerdts <mgerdts at gmail dot com> writes:
>
> tt> I think we have to assume Anton was joking - otherwise his
> tt> measure is uselessly unscientific.
>
> I think it's rude to talk about someone who's present in the third
> person, especially when you're trying to minimize his view. Were you
> joking, Anton? :)
> ....
>
> 1. I don't think your impressions nor Anton's and mine are ``useless''

Alright, I agree I should retract the 'useless' but I would keep the
'unscientific'.

--Toby


bfriesen

Posts: 874
From: US

Registered: 8/19/08
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted: Dec 12, 2008 1:11 PM   in response to: qu1j0t3


On Fri, 12 Dec 2008, Toby Thain wrote:
>>
>> 1. I don't think your impressions nor Anton's and mine are ``useless''
>
> Alright, I agree I should retract the 'useless' but I would keep the
> 'unscientific'.

There is no need to retract the 'useless'. By the same useless
measure, George Bush Jr has done a fantastic job at dealing with world
terror since there has not been a serious attack on US soil by Islamic
terrorists since 2002. One might think that this impression is
significant yet it is not since the previous attack on US soil was in
1993, which was about 9 years and we have only gone about 6 thus far.
By statistical measures, George Bush Jr could have done absolutely
nothing and it is likely that nothing bad would have happened at all.
There is insufficient evidence to suggest one conclusion vs another.

This example shows the dangers of using illogical thinking to
presumably reach a logical conclusion. It is particularly dangerous
to exhibit illogical thinking in public where everyone can see.

Bob
======================================
Bob Friesenhahn
bfriesen at simple dot dallas dot tx dot us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/



rang

Posts: 295
From: US

Registered: 3/9/05
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted: Dec 13, 2008 10:07 PM   in response to: Miles Nordin
To: Communities » zfs » discuss

I wasn't joking, though as is well known, the plural of anecdote is not data.

Both UFS and ZFS, in common with all file systems, have design flaws and bugs.

To lose an entire UFS file system (barring the loss of the entire underlying storage) requires a great deal of corruption; there are multiple copies of the superblock, cylinder headers and their inodes are stored in a regular pattern and easily found by recovery tools, and the UFS file system check utility, while not perfect, can repair almost any corruption. There are third party tools which can perform much more analysis and recovery in a worst-case scenario. A single bad bloc

To lose an entire ZFS pool requires that the most recent uberblock, or one of the top-level blocks to which it points, be damaged. There are currently no recovery tools (at least, none of which I am aware).

I find it naïve to imagine that Sun customers "expect" their UFS (or other) file systems to be unrecoverable. Any case where fsck failed quickly became an escalation to the sustaining engineering organization. Restoring from backup is almost never a satisfactory answer for a commercial enterprise.

As usual, the disclaimer; I now work for another storage company, and while I've been on the teams developing and maintaining a number of commercial file systems (including two of Sun's), ZFS has not been one of them.

relling

Posts: 2,083
From: US

Registered: 6/17/05
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted: Dec 13, 2008 11:02 PM   in response to: rang


Anton B. Rang wrote:
> I find it naïve to imagine that Sun customers "expect" their UFS (or other) file systems to be unrecoverable.

OK, I'll bite. If we believe the disk vendors who rate their disks as
having an unrecoverable error rate of 1 bit per 10^14 bits read, and
knowing that UFS has absolutely no data protection of its data, why would
you think that it is naive to think that a disk system with UFS cannot
lose data? Rather, I would say it has a distinctly calculable
probability. Similarly, for ZFS, the checksum is not perfect, so there is
a calculable probability that the ZFS checksum will not detect an
unrecoverable (read) error. The difference is that the probability that
ZFS will not detect an error is considerably smaller than that of UFS
(or FAT, or HSFS, or ...)
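
As a rough worked example of that "calculable probability", here is a minimal Python sketch. It assumes bit errors are independent and occur at exactly the quoted rate of 1 per 10^14 bits read, which is a simplification of how real disks fail. The 6 TB figure is taken from the 6x1TB pool that started this thread, and the function name is made up for illustration.

```python
import math

def p_at_least_one_ure(bytes_read, ber=1e-14):
    """Probability of at least one unrecoverable bit error while reading
    `bytes_read` bytes, assuming independent errors at rate `ber` per bit."""
    bits = bytes_read * 8
    # 1 - (1 - ber)**bits, computed stably for very small ber
    return -math.expm1(bits * math.log1p(-ber))

# Reading six 1 TB disks end to end (decimal terabytes)
full_pool_read = p_at_least_one_ure(6 * 10**12)
```

Under those assumptions a single full read of 6 TB comes out at roughly a 0.38 chance of hitting at least one unrecoverable error, which is why detecting such errors matters at this scale.
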
> Any case where fsck failed quickly became an escalation to the sustaining engineering organization. Restoring from backup is almost never a satisfactory answer for a commercial enterprise.
>

I agree. However, I've personally experienced well over 100 fsck failures
over the years, and while I was always unsatisfied, I didn't always lose
data[1]. When I did lose data, perhaps it was data I could live without,
but that was my call. Would you rather that ZFS should simply say, "hey
you lost some data, but we won't tell you where... ?"

[1] once upon a time, I used a [vendor-name-elided] disk for a 2,300
user e-mail message store. I upgraded the OS, which implemented some new
SCSI options. The disk's firmware didn't handle those options properly
and would wait about 7 hours before corrupting the UFS file system
containing the message store, requiring a full restore. So, how many
shifts do you think it took to fail, recover, and ultimately resolve the
disk firmware issue? Hint: the firmware rev arrived via UPS.

Personally, I'm very glad that a file system has come along that
verifies data... and that feature seems to be catching, as other file
systems seem to be doing the same. Hopefully, in a few years silent data
corruption will be a footnote on the lore of computing.
-- richard



myxiplx

Posts: 883
From: GB

Registered: 10/24/07
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted: Dec 15, 2008 2:13 AM   in response to: relling
To: Communities » zfs » discuss

I think the problem for me is not that there's a risk of data loss if a pool becomes corrupt, but that there are no recovery tools available. With UFS, people expect that if the worst happens, fsck will be able to recover their data in most cases.

With ZFS you have no such tools, yet Victor has on at least two occasions shown that it's quite possible to recover pools that were completely unusable (I believe by making use of old / backup copies of the uberblock).

My concern is that ZFS has all this information on disk, it has the ability to know exactly what is and isn't corrupted, and it should (at least for a system with snapshots) have many, many potential uberblocks to try. It should be far, far better than UFS at recovering from these things, but for a certain class of faults, when it hits a problem it just stops dead.

That's what frustrates me - knowing that there's potential to have all my data there, stored safely away, but having it completely inaccessible due to a lack of recovery tools.

casper

Posts: 3,458
From: NL

Registered: 3/9/05
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted: Dec 15, 2008 2:30 AM   in response to: myxiplx



>I think the problem for me is not that there's a risk of data loss if
>a pool becomes corrupt, but that there are no recovery tools
>available. With UFS, people expect that if the worst happens, fsck
>will be able to recover their data in most cases.

Except, of course, that fsck lies. It "fixes" the metadata and the
quality of the rest is unknown.

Anyone using UFS knows that UFS file corruption is common; specifically,
when using a "UFS root" and the system panics when trying to
install a device driver, there's a good chance that some files in
/etc are corrupt. Some were application problems (some code used
fsync(fileno(fp)); fclose(fp); it doesn't guarantee anything).
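
The trouble with that C sequence is that fsync() acts on the file descriptor while data may still be sitting in the stdio buffer, which fclose() only writes out afterwards. The same trap exists with Python's buffered file objects, so here is a minimal sketch of the safer ordering (flush the user-space buffer first, then fsync). The helper name durable_write is made up, and even this only helps if the device honors the resulting cache flush.

```python
import os

def durable_write(path, data):
    """Write bytes to `path` and ask the OS to push them to stable storage."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()              # drain the user-space buffer into the kernel
        os.fsync(f.fileno())   # then ask the kernel to flush to the device
```

Calling os.fsync() before f.flush() would mirror the broken C pattern: the descriptor gets synced while the data is still buffered in the process.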


>With ZFS you have no such tools, yet Victor has on at least two occasions
>shown that it's quite possible to recover pools that were completely unusable
>(I believe by making use of old / backup copies of the uberblock).

True; and certainly ZFS should be able to backtrack. But it's
much more likely to happen "automatically" than by using a recovery
tool.

See, fsck could only be written because specific corruptions and
their patterns are known. With ZFS, you can only back up to
a certain uberblock and the pattern will be a surprise.

Casper


myxiplx

Posts: 883
From: GB

Registered: 10/24/07
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted: Dec 15, 2008 2:59 AM   in response to: casper


Forgive me for not understanding the details, but couldn't you also
work backwards through the blocks with ZFS and attempt to recreate the
uberblock?

So if you lost the uberblock, could you (memory and time allowing)
start scanning the disk, looking for orphan blocks that aren't
referenced anywhere else and piece together the top of the tree?

Or roll back to a previous uberblock (or a snapshot uberblock), and
then look to see what blocks are on the disk but not referenced
anywhere. Is there any way to intelligently work out where those
blocks would be linked by looking at how they interact with the known
data?

Of course, rolling back to a previous uberblock would still be a
massive step forward, and something I think would do much to improve
the perception of ZFS as a tool to reliably store data.
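
To sketch what "roll back to a previous uberblock" could look like, here is a toy Python model. Everything in it is hypothetical: the Uberblock record, its fields, and the tree_ok callback are invented for illustration and do not reflect the real ZFS on-disk format or import code. The idea is simply to try candidates from newest to oldest and take the first one whose own checksum verifies and whose block tree can be read.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

@dataclass(frozen=True)
class Uberblock:
    """Hypothetical, heavily simplified uberblock record."""
    txg: int            # transaction group number; higher means newer
    checksum_ok: bool   # whether the uberblock itself verified

def pick_uberblock(
    candidates: Iterable[Uberblock],
    tree_ok: Callable[[Uberblock], bool],
) -> Optional[Uberblock]:
    """Return the newest candidate that verifies and whose tree is readable."""
    for ub in sorted(candidates, key=lambda u: u.txg, reverse=True):
        if ub.checksum_ok and tree_ok(ub):
            return ub
    return None  # nothing usable: report why rather than just stopping dead
```

In this model, a pool whose newest uberblock points at damaged metadata would come up a few transaction groups in the past instead of being unimportable.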

You cannot overstate the difference to the end user between a file
system that on boot says:
"Sorry, can't read your data pool."

and one that says:
"Whoops, the uberblock, and all the backups are borked. Would you
like to roll back to a backup uberblock, or leave the filesystem
offline to repair manually?"

As much as anything else, a simple statement explaining *why* a pool
is inaccessible, and saying just how badly things have gone wrong
helps tons. Being able to recover anything after that is just the
icing on the cake, especially if it can be done automatically.

Ross

PS. Sorry for the duplicate Casper, I forgot to cc the list.



On Mon, Dec 15, 2008 at 10:30 AM, <Casper.***@sun.com> wrote:
>
>>I think the problem for me is not that there's a risk of data loss if
>>a pool becomes corrupt, but that there are no recovery tools
>>available. With UFS, people expect that if the worst happens, fsck
>>will be able to recover their data in most cases.
>
> Except, of course, that fsck lies. In "fixes" the meta data and the
> quality of the rest is unknown.
>
> Anyone using UFS knows that UFS file corruption are common; specifically,
> when using a "UFS root" and the system panic's when trying to
> install a device driver, there's a good chance that some files in
> /etc are corrupt. Some were application problems (some code used
> fsync(fileno(fp)); fclose(fp); it doesn't guarantee anything)
>
>
>>With ZFS you have no such tools, yet Victor has on at least two occasions
>>shown that it's quite possible to recover pools that were completely unusable
>>(I believe by making use of old / backup copies of the uberblock).
>
> True; and certainly ZFS should be able backtrack. But it's
> much more likely to happen "automatically" then using a recovery
> tool.
>
> See, fsck could only be written because specific corruption are known
> and the patterns they have. With ZFS, you can only backup to
> a certain uberblock and the pattern will be a surprise.
>
> Casper
>


bfriesen

Posts: 874
From: US

Registered: 8/19/08
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted: Dec 15, 2008 10:34 AM   in response to: myxiplx


On Mon, 15 Dec 2008, Ross wrote:

> My concern is that ZFS has all this information on disk, it has the
> ability to know exactly what is and isn't corrupted, and it should
> (at least for a system with snapshots) have many, many potential
> uberblocks to try. It should be far, far better than UFS at
> recovering from these things, but for a certain class of faults,
> when it hits a problem it just stops dead.

While ZFS knows if a data block is retrieved correctly from disk, a
correctly retrieved data block does not indicate that the pool isn't
"corrupted". A block written in the wrong order is a form of
corruption.

Bob
======================================
Bob Friesenhahn
bfriesen at simple dot dallas dot tx dot us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/



myxiplx

Posts: 883
From: GB

Registered: 10/24/07
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted: Dec 15, 2008 11:19 AM   in response to: bfriesen


I'm not sure I follow how that can happen, I thought ZFS writes were
designed to be atomic? They either commit properly on disk or they
don't?


On Mon, Dec 15, 2008 at 6:34 PM, Bob Friesenhahn
<bfriesen at simple dot dallas dot tx dot us> wrote:
> On Mon, 15 Dec 2008, Ross wrote:
>
>> My concern is that ZFS has all this information on disk, it has the
>> ability to know exactly what is and isn't corrupted, and it should (at least
>> for a system with snapshots) have many, many potential uberblocks to try.
>> It should be far, far better than UFS at recovering from these things, but
>> for a certain class of faults, when it hits a problem it just stops dead.
>
> While ZFS knows if a data block is retrieved correctly from disk, a
> correctly retrieved data block does not indicate that the pool isn't
> "corrupted". A block written in the wrong order is a form of corruption.
>
> Bob
> ======================================
> Bob Friesenhahn
> bfriesen at simple dot dallas dot tx dot us, http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
>
>


bfriesen

Posts: 874
From: US

Registered: 8/19/08
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted: Dec 15, 2008 11:36 AM   in response to: myxiplx


On Mon, 15 Dec 2008, Ross Smith wrote:

> I'm not sure I follow how that can happen, I thought ZFS writes were
> designed to be atomic? They either commit properly on disk or they
> don't?

Yes, this is true. One reason why people complain about corrupted ZFS
pools is because they have hardware which writes data in a different
order than what was requested. Some hardware claims to have written
the data but instead it has been secretly cached for later (or perhaps
for never) and data blocks get written in some other order. It seems
that ZFS is capable of working reliably with "cheap" hardware but not
with wrongly designed hardware.

Bob
======================================
Bob Friesenhahn
bfriesen at simple dot dallas dot tx dot us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/



nico

Posts: 3,468
From: US

Registered: 6/15/05
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted: Dec 15, 2008 11:46 AM   in response to: bfriesen


On Mon, Dec 15, 2008 at 01:36:46PM -0600, Bob Friesenhahn wrote:
> On Mon, 15 Dec 2008, Ross Smith wrote:
>
> > I'm not sure I follow how that can happen, I thought ZFS writes were
> > designed to be atomic? They either commit properly on disk or they
> > don't?
>
> Yes, this is true. One reason why people complain about corrupted ZFS
> pools is because they have hardware which writes data in a different
> order than what was requested. Some hardware claims to have written
> the data but instead it has been secretly cached for later (or perhaps
> for never) and data blocks get written in some other order. It seems
> that ZFS is capable of working reliably with "cheap" hardware but not
> with wrongly designed hardware.

Order of writes matters between transactions, not inside transactions,
and at the boundary is a cache flush. Thus what matters really isn't
write order as much as whether the devices lie about cache flushes.
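
A toy Python simulation of that point, under loud assumptions: the Disk class, the "uberblock" key and the commit() sequence are invented for illustration and are not how ZFS is implemented. The model writes the data blocks, issues a flush as a barrier, then writes the uberblock and flushes again. An honest device makes cached writes durable on flush; a lying device acknowledges the flush and keeps everything in its volatile cache, so after power loss any subset of those writes may be on the media.

```python
class Disk:
    """Toy disk with a volatile write cache (hypothetical model)."""

    def __init__(self, honors_flush):
        self.honors_flush = honors_flush
        self.cache = []   # pending (key, value) writes, lost on power failure
        self.media = {}   # what survives power loss

    def write(self, key, value):
        self.cache.append((key, value))

    def flush(self):
        if self.honors_flush:
            for key, value in self.cache:
                self.media[key] = value
            self.cache.clear()
        # a lying device reports success and leaves the data in its cache

    def crash(self, survivors):
        """Power loss: only cached writes whose keys are in `survivors`
        happened to reach the media (models arbitrary reordering)."""
        for key, value in self.cache:
            if key in survivors:
                self.media[key] = value
        self.cache.clear()

def commit(disk, txg, blocks):
    """One transaction: data first, flush barrier, then the uberblock."""
    for key, value in blocks.items():
        disk.write(key, value)
    disk.flush()
    disk.write("uberblock", {"txg": txg, "root": sorted(blocks)})
    disk.flush()

def consistent(disk):
    """True if every block the on-media uberblock points at is on the media."""
    ub = disk.media.get("uberblock")
    return ub is None or all(key in disk.media for key in ub["root"])
```

With an honest disk, the uberblock can never reach the media ahead of the blocks it references. With the lying disk, a crash in which only the uberblock made it out of the cache leaves a pool whose newest uberblock points at blocks that were never written.
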




