|
Replies:
75
-
Last Post:
Feb 7, 2009 5:54 AM
by: gino
|
|
|
Posts:
6
From:
Registered:
10/1/08
|
|
|
|
zpool unimportable (corrupt zpool metadata??) but no zdb -l device problems
Posted:
Oct 1, 2008 2:20 AM
To: Communities » zfs » discuss
|
|
Hi, I am running snv_90. I have a 6x1TB pool configured as raidz. After a computer crash (root is NOT on the pool, only data) the pool showed FAULTED status. I exported it and tried to reimport it, with the following result:
================
# zpool import
  pool: ztank
    id: 12125153257763159358
 state: FAULTED
status: The pool metadata is corrupted.
action: The pool cannot be imported due to damaged devices or data.
        The pool may be active on another system, but can be imported using
        the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-72
config:

        ztank       FAULTED   corrupted data
          raidz1    ONLINE
            c1t6d0  ONLINE
            c1t5d0  ONLINE
            c1t4d0  ONLINE
            c1t3d0  ONLINE
            c1t2d0  ONLINE
            c1t1d0  ONLINE
================
I searched Google and ran zdb -l on every pool device. The results follow below. To me it appears that all disks are OK and zdb can see the zpool structure on each of them (at least that is how I interpret the messages), but zpool still says the pool metadata is corrupt :-(
Any ideas as to what I might be able to do to salvage the data? restoring from backup is not an option (yes, I know :() - as this is a personal project I hoped the raidz would be enough :-(
The output for each of the disks is more or less identical, all labels are accessible.
# zdb -l /dev/dsk/c1t6d0s0
--------------------------------------------
LABEL 0
--------------------------------------------
    version=10
    name='ztank'
    state=0
    txg=207161
    pool_guid=12125153257763159358
    hostid=628051022
    hostname='zfssrv'
    top_guid=763279656890868029
    guid=10947029755543026189
    vdev_tree
        type='raidz'
        id=0
        guid=763279656890868029
        nparity=1
        metaslab_array=14
        metaslab_shift=35
        ashift=9
        asize=6001149345792
        is_log=0
        children[0]
                type='disk'
                id=0
                guid=10947029755543026189
                path='/dev/dsk/c1t1d0s0'
                devid='id1,sd@f0000000048455c81000880330000/a'
                phys_path='/pci@0,0/pci1000,30@10/sd@1,0:a'
                whole_disk=1
                DTL=193
        children[1]
                type='disk'
                id=1
                guid=2640926618230776740
                path='/dev/dsk/c1t2d0s0'
                devid='id1,sd@f0000000048455c81000992690001/a'
                phys_path='/pci@0,0/pci1000,30@10/sd@2,0:a'
                whole_disk=1
                DTL=192
        children[2]
                type='disk'
                id=2
                guid=8982722125061616789
                path='/dev/dsk/c1t3d0s0'
                devid='id1,sd@f0000000048455c81000ae8610002/a'
                phys_path='/pci@0,0/pci1000,30@10/sd@3,0:a'
                whole_disk=1
                DTL=191
        children[3]
                type='disk'
                id=3
                guid=7263648809970512976
                path='/dev/dsk/c1t4d0s0'
                devid='id1,sd@f0000000048455c81000bb2cf0003/a'
                phys_path='/pci@0,0/pci1000,30@10/sd@4,0:a'
                whole_disk=1
                DTL=190
        children[4]
                type='disk'
                id=4
                guid=5275414937202266822
                path='/dev/dsk/c1t5d0s0'
                devid='id1,sd@f0000000048455c81000ca3c40004/a'
                phys_path='/pci@0,0/pci1000,30@10/sd@5,0:a'
                whole_disk=1
                DTL=189
        children[5]
                type='disk'
                id=5
                guid=8503895341004279533
                path='/dev/dsk/c1t6d0s0'
                devid='id1,sd@f0000000048455c81000d49220005/a'
                phys_path='/pci@0,0/pci1000,30@10/sd@6,0:a'
                whole_disk=1
                DTL=188
--------------------------------------------
LABEL 1
--------------------------------------------
    [contents identical to LABEL 0]
--------------------------------------------
LABEL 2
--------------------------------------------
    [contents identical to LABEL 0]
--------------------------------------------
LABEL 3
--------------------------------------------
    [contents identical to LABEL 0]
================
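Whether the four labels really agree can be checked mechanically instead of by eye. Below is a minimal Python sketch, assuming the zdb -l output has been saved as text in the key=value form shown above. The helper names (split_labels, label_summary, labels_agree) are made up for this illustration and are not part of any ZFS tool, and the sample is shortened to two labels.

```python
import re

# Shortened sample in the same key=value form as the zdb -l output above.
SAMPLE = """
--------------------------------------------
LABEL 0
--------------------------------------------
version=10 name='ztank' state=0 txg=207161 pool_guid=12125153257763159358
--------------------------------------------
LABEL 1
--------------------------------------------
version=10 name='ztank' state=0 txg=207161 pool_guid=12125153257763159358
"""


def split_labels(text):
    """Return {label_number: label_body} for each 'LABEL n' section."""
    parts = re.split(r"-+\s*LABEL (\d+)\s*-+", text)
    # parts looks like [preamble, '0', body0, '1', body1, ...]
    return {int(parts[i]): parts[i + 1] for i in range(1, len(parts) - 1, 2)}


def label_summary(body):
    """Pick out the top-level fields that identify the pool state."""
    fields = {}
    for key in ("version", "name", "txg", "pool_guid"):
        # Values may or may not be wrapped in single quotes.
        match = re.search(r"\b%s=('?)([^'\s]+)\1" % key, body)
        if match:
            fields[key] = match.group(2)
    return fields


def labels_agree(text):
    """True when every label reports the same version, name, txg and pool_guid."""
    summaries = [label_summary(body) for body in split_labels(text).values()]
    return len(summaries) > 0 and all(s == summaries[0] for s in summaries)
```

Agreement between labels only shows that the labels themselves are readable and consistent; it says nothing about the state of the metadata that the active uberblock points to, which is where a pool like this one can still be damaged.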
|
|
|
Posts:
6
From:
Registered:
10/1/08
|
|
|
|
Re: zpool unimportable (corrupt zpool metadata??) but no zdb -l device problems
Posted:
Oct 1, 2008 3:42 AM
in response to: gmvasile
To: Communities » zfs » discuss
|
|
An update to the above: I tried to run zdb -e on the pool ID, and here is the result:
# zdb -e 12125153257763159358
zdb: can't open 12125153257763159358: I/O error
NB: zdb seems to recognize the ID, because running it with an incorrect ID gives me a different error:
# zdb -e 12125153257763159354
zdb: can't open 12125153257763159354: No such file or directory
Also, zdb -e with the ID of the syspool works:
# zdb -e 8843238790372298114
Uberblock

        magic = 0000000000bab10c
        version = 10
        txg = 317369
        guid_sum = 14131844542001965925
        timestamp = 1222857640 UTC = Wed Oct 1 12:40:40 2008

Dataset mos [META], ID 0, cr_txg 4, 2.76M, 244 objects
Dataset 8843238790372298114/export/home [ZPL], ID 60, cr_txg 721, 1.21G, 55 objects
Dataset 8843238790372298114/export [ZPL], ID 54, cr_txg 718, 19.0K, 5 objects
Dataset 8843238790372298114/swap [ZVOL], ID 28, cr_txg 15, 519M, 3 objects
Dataset 8843238790372298114/ROOT/snv_90 [ZPL], ID 48, cr_txg 710, 6.85G, 254748 objects
Dataset 8843238790372298114/ROOT [ZPL], ID 22, cr_txg 12, 18.0K, 4 objects
Dataset 8843238790372298114/dump [ZVOL], ID 34, cr_txg 18, 512M, 3 objects
Dataset 8843238790372298114 [ZPL], ID 5, cr_txg 4, 39.5K, 13 objects
etc etc.
=============
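The uberblock timestamp that zdb prints is a count of seconds since the Unix epoch, so it can be converted independently of zdb. A small Python sketch follows; the function name uberblock_time is made up for illustration.

```python
from datetime import datetime, timezone


def uberblock_time(timestamp):
    """Convert an uberblock timestamp (seconds since the Unix epoch) to UTC."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


# Prints 2008-10-01T10:40:40+00:00
print(uberblock_time(1222857640).isoformat())
```

For the value above this gives 10:40:40 UTC, two hours earlier than the clock time in the zdb output. That may mean zdb formatted the time in the machine's local time zone; this is a guess, not something the thread confirms.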
Any ideas? Could this be a hardware problem? I have no idea what to do next :-(
thanks for your help! Vasile
|
|
|
|
Posts:
6
From:
Registered:
10/1/08
|
|
|
|
Re: one step forward - pinging Lukas Karwacki (kangurek)
Posted:
Oct 1, 2008 11:24 AM
in response to: gmvasile
To: Communities » zfs » discuss
|
|
On the advice of Okana in the freenode.net #opensolaris channel I booted the latest OpenSolaris live CD and tried to import the pool. No luck. However, I then tried the trick from Lukas's post that allowed him to import his pool, and I had the beginnings of some luck.
By doing the mdb wizardry he described, I was able to run zpool import with the following result:
  pool: ztank
    id: whatever
 state: ONLINE
status: The pool was last accessed by another system.
   see: http://www.sun.com/msg/ZFS-8000-EY
config:

        ztank       ONLINE
          raidz1    ONLINE
            c4t0d0  ONLINE
            c4t1d0  ONLINE
            c4t2d0  ONLINE
            c4t3d0  ONLINE
            c4t4d0  ONLINE
            c4t5d0  ONLINE
HOWEVER, when I attempt again to import using zdb -e ztank I still get
zdb: can't open ztank: I/O error
and zpool import -f, whilst it starts and seems to access the disks sequentially, stops at the third one (not sure which one precisely; it spins it up and the process stops right there), and the system will not reboot when asked to (shutdown -g0 -y -i5). So there is some slight progress here.
I would really appreciate ideas from you guys!
Thanks Vasile
|
|
|
|
Posts:
7
From:
Registered:
6/9/08
|
|
|
|
Re: one step forward - pinging Lukas Karwacki (kangurek)
Posted:
Oct 2, 2008 7:37 AM
in response to: gmvasile
To: Communities » zfs » discuss
|
|
> When I attempt again to import using zdb -e ztank
> I still get zdb: can't open ztank: I/O error
> and zpool import -f, whilst it starts and seems to
> access the disks sequentially, it stops al the 3rd
> one (no sure which precisely - it spins it up and the
> process stops right there, and the system will not
> reboot when asked to (shutdown -g0 -y -i5)
> so there's some slight progress here.
How about just removing that disk and trying the import again?
|
|
|
|
Posts:
6
From:
Registered:
10/1/08
|
|
|
|
Re: one step forward - pinging Lukas Karwacki (kangurek)
Posted:
Oct 2, 2008 1:32 PM
in response to: okona
To: Communities » zfs » discuss
|
|
Thanks Martin,
Yeah, tried it but no luck :-( In fact I tried removing every disk one by one, with no luck, which is why I think it is not a hardware problem...
Kind regards
Vasile
|
|
|
|
Posts:
6
From:
Registered:
10/1/08
|
|
|
|
Re: Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted:
Oct 3, 2008 7:42 AM
in response to: okona
To: Communities » zfs » discuss
|
|
Hi folks,
I just wanted to share the end of my "adventure" here and especially take the time to thank Victor for helping me out of this mess.
I will let him explain the technical details (I am out of my depth here) but bottom line he spent a couple of hours with me on the machine and sorted me out. His explanation: he invalidated the incorrect uberblocks and forced zfs to revert to an earlier state that was consistent.
The machine is now in the process of doing a full scrub and the first order of business tomorrow will be to do a full backup :-)
According to his explanation, the reason for the troubles I had was that Solaris was running in a VM on my Debian server and it was not shut down properly when the Debian server did a controlled shutdown following a UPS event.
The Solaris machine was abruptly shut down, but because it was not in control of the entire chain down to the bare hardware, it appears that some writes were in fact still held by Debian when Solaris thought they had been safely executed.
This left the zpool in question in a state that even raidz1 did not help with.
Anyway, again, lots and lots of thanks to Victor!!!
kind regards Vasile
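The rollback described in this post can be pictured with a toy model. The sketch below is not Victor's actual procedure (the thread does not document it) and it does not touch real on-disk structures. The Uberblock class, its tree_intact flag and both functions are hypothetical names, used only to show the idea of skipping the newest uberblocks until one is found whose state is consistent. The txg values are borrowed from the label output for flavour; the thread does not say which txg was actually consistent.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Uberblock:
    txg: int            # transaction group that wrote this uberblock
    checksum_ok: bool   # the uberblock itself verifies
    tree_intact: bool   # everything it points to really reached the disks
                        # (assumed to be knowable here; in reality this is
                        # the hard part)


def newest_valid(candidates: List[Uberblock]) -> Optional[Uberblock]:
    """What a normal import uses: the highest-txg uberblock that verifies."""
    usable = [ub for ub in candidates if ub.checksum_ok]
    return max(usable, key=lambda ub: ub.txg, default=None)


def newest_consistent(candidates: List[Uberblock]) -> Optional[Uberblock]:
    """Walk back from the newest uberblock until one has an intact tree."""
    for ub in sorted(candidates, key=lambda ub: ub.txg, reverse=True):
        if ub.checksum_ok and ub.tree_intact:
            return ub
    return None


# The two newest transaction groups were acknowledged but never fully written.
ring = [
    Uberblock(txg=207159, checksum_ok=True, tree_intact=True),
    Uberblock(txg=207160, checksum_ok=True, tree_intact=False),
    Uberblock(txg=207161, checksum_ok=True, tree_intact=False),
]
```

With this ring, newest_valid picks txg 207161, whose tree is broken, while newest_consistent falls back to txg 207159. Anything written after the chosen txg is given up in exchange for a pool that opens.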
|
|
|
|
Darren J Moffat
darrenm@opensolaris....
|
|
|
|
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted:
Oct 3, 2008 7:50 AM
in response to: gmvasile
|
|
Vasile Dumitrescu wrote:
> Hi folks,
>
> I just wanted to share the end of my "adventure" here and especially take the time to thank Victor for helping me out of this mess.
>
> I will let him explain the technical details (I am out of my depth here) but bottom line he spent a couple of hours with me on the machine and sorted me out. His explanation: he invalidated the incorrect uberblocks and forced zfs to revert to an earlier state that was consistent.
>
> The machine is now in the process of doing a full scrub and the first order of business tomorrow will be to do a full backup :-)
>
> According to his explanation, the reason for the troubles I had was that Solaris was running in a VM on my Debian server and it was not shut down properly when the Debian server did a controlled shutdown following a UPS event.
Which VM solution was this? VMware, VirtualBox, Xen, other? How were the "disks" presented to the guest? What are the "disks" in the host: real disks, files, something else?
--
Darren J Moffat
_______________________________________________
zfs-discuss mailing list
zfs-discuss at opensolaris dot org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
|
|
|
|
Posts:
6
From:
Registered:
10/1/08
|
|
|
|
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted:
Oct 3, 2008 8:37 AM
in response to: Darren J Moffat
To: Communities » zfs » discuss
|
|
> Which VM solution was this ? VMware, VirtualBox, Xen, other ? How were
> the "disks" presented to the guest ? What are the "disks" in the host,
> real disks, files, something else ?
VMWare 6.0.4 running on Debian unstable, Linux bigsrv 2.6.26-1-amd64 #1 SMP Wed Sep 24 13:59:41 UTC 2008 x86_64 GNU/Linux
Solaris is vanilla snv_90 installed with no GUI.
Here is the content of the .vmx file in question:
================================================
#!/usr/bin/vmware
config.version = "8"
virtualHW.version = "6"
scsi0.present = "TRUE"
scsi0.virtualDev = "lsilogic"

memsize = "4096"
MemAllowAutoScaleDown = "FALSE"
MemTrimRate = "0"
sched.mem.pshare.enable = "FALSE"
sched.mem.minsize = "3062"
sched.mem.max = "7000"
sched.mem.maxmemctl = "0"
sched.mem.shares = "100000"

scsi0:0.present = "TRUE"
scsi0:0.fileName = "/home/vasile/vmware/solsrv/OpenSolaris64.vmdk"
ide1:0.present = "TRUE"
ide1:0.autodetect = "TRUE"
ide1:0.deviceType = "cdrom-image"
floppy0.startConnected = "FALSE"
floppy0.autodetect = "TRUE"
ethernet0.present = "TRUE"
ethernet0.virtualDev = "e1000"
ethernet0.wakeOnPcktRcv = "TRUE"
sound.present = "FALSE"
sound.fileName = "-1"
sound.autodetect = "TRUE"
svga.autodetect = "FALSE"
pciBridge0.present = "TRUE"
displayName = "zfssrv"
guestOS = "solaris10-64"
nvram = "Solaris 10 64-bit.nvram"
deploymentPlatform = "windows"
virtualHW.productCompatibility = "hosted"
RemoteDisplay.vnc.port = "0"
tools.upgrade.policy = "useGlobal"

floppy0.fileName = "/dev/fd0"
extendedConfigFile = "Solaris 10 64-bit.vmxf"

ide1:0.fileName = ""
floppy0.present = "FALSE"
gui.powerOnAtStartup = "TRUE"

ide1:0.startConnected = "TRUE"
ethernet0.addressType = "generated"
uuid.location = "56 4d da 02 a4 a0 78 74-2e 09 90 62 45 bb c4 94"
uuid.bios = "56 4d da 02 a4 a0 78 74-2e 09 90 62 45 bb c4 94"
scsi0:0.redo = ""
pciBridge0.pciSlotNumber = "17"
scsi0.pciSlotNumber = "16"
ethernet0.pciSlotNumber = "32"
sound.pciSlotNumber = "-1"
ethernet0.generatedAddress = "00:0c:29:bb:c4:94"
ethernet0.generatedAddressOffset = "0"
tools.syncTime = "FALSE"

svga.maxWidth = "1024"
svga.maxHeight = "768"
svga.vramSize = "3145728"

scsi0:1.present = "TRUE"
scsi0:1.fileName = "ztank-sda.vmdk"
scsi0:1.mode = "independent-persistent"
scsi0:1.deviceType = "rawDisk"
scsi0:2.present = "TRUE"
scsi0:2.fileName = "ztank-sdb.vmdk"
scsi0:2.mode = "independent-persistent"
scsi0:2.deviceType = "rawDisk"
scsi0:3.present = "TRUE"
scsi0:3.fileName = "ztank-sdc.vmdk"
scsi0:3.mode = "independent-persistent"
scsi0:3.deviceType = "rawDisk"
scsi0:4.present = "TRUE"
scsi0:4.fileName = "ztank-sdd.vmdk"
scsi0:4.mode = "independent-persistent"
scsi0:4.deviceType = "rawDisk"
scsi0:5.present = "TRUE"
scsi0:5.fileName = "ztank-sde.vmdk"
scsi0:5.mode = "independent-persistent"
scsi0:5.deviceType = "rawDisk"
scsi0:6.present = "TRUE"
scsi0:6.fileName = "ztank-sdf.vmdk"
scsi0:6.mode = "independent-persistent"
scsi0:6.deviceType = "rawDisk"

scsi0:1.redo = ""
scsi0:2.redo = ""
scsi0:3.redo = ""
scsi0:4.redo = ""
scsi0:5.redo = ""
scsi0:6.redo = ""

isolation.tools.dnd.disable = "TRUE"
snapshot.disabled = "TRUE"

scsi0:0.mode = "independent-persistent"

isolation.tools.copy.disable = "FALSE"
isolation.tools.paste.disable = "FALSE"

tools.remindInstall = "TRUE"
================================================
in summary: physical disks, assigned 100% to the VM
HTH
kind regards Vasile
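To see at a glance which virtual disks in a configuration like this are passed through as raw devices, the .vmx settings can be read as plain key = "value" pairs. Below is a minimal Python sketch that assumes that simple one-setting-per-line format. parse_vmx and raw_disks are hypothetical helper names, not VMware tooling, and the sample repeats only a few of the settings quoted above.

```python
import re

VMX_SAMPLE = '''
#!/usr/bin/vmware
scsi0:0.present = "TRUE"
scsi0:0.fileName = "/home/vasile/vmware/solsrv/OpenSolaris64.vmdk"
scsi0:1.present = "TRUE"
scsi0:1.fileName = "ztank-sda.vmdk"
scsi0:1.mode = "independent-persistent"
scsi0:1.deviceType = "rawDisk"
scsi0:2.present = "TRUE"
scsi0:2.fileName = "ztank-sdb.vmdk"
scsi0:2.mode = "independent-persistent"
scsi0:2.deviceType = "rawDisk"
'''

# One setting per line: key = "value". Lines starting with '#' never match.
SETTING = re.compile(r'^\s*([^#=\s]+)\s*=\s*"(.*)"\s*$')


def parse_vmx(text):
    """Return the settings as a dict of key -> value strings."""
    settings = {}
    for line in text.splitlines():
        match = SETTING.match(line)
        if match:
            settings[match.group(1)] = match.group(2)
    return settings


def raw_disks(settings):
    """Map each device marked deviceType = "rawDisk" to its fileName."""
    disks = {}
    suffix = ".deviceType"
    for key, value in settings.items():
        if key.endswith(suffix) and value == "rawDisk":
            device = key[: -len(suffix)]
            disks[device] = settings.get(device + ".fileName")
    return disks
```

Run against the full file above, this would list scsi0:1 through scsi0:6 as raw disks and leave scsi0:0, the file-backed system disk, out.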
|
|
|
|
Posts:
226
From:
Registered:
5/14/08
|
|
|
|
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted:
Oct 4, 2008 12:19 AM
in response to: gmvasile
|
|
On Fri, Oct 3, 2008 at 10:37 PM, Vasile Dumitrescu <vasiledumitrescu at gmail dot com> wrote:
> VMWare 6.0.4 running on Debian unstable,
> Linux bigsrv 2.6.26-1-amd64 #1 SMP Wed Sep 24 13:59:41 UTC 2008 x86_64 GNU/Linux
>
> Solaris is vanilla snv_90 installed with no GUI.
>
> in summary: physical disks, assigned 100% to the VM
That's weird. I thought one of the points of using physical disks instead of files was to avoid problems caused by caching on the host/dom0?
|
|
|
|
Darren J Moffat
darrenm@opensolaris....
|
|
|
|
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted:
Oct 6, 2008 2:39 AM
in response to: fajar
|
|
Fajar A. Nugraha wrote:
> On Fri, Oct 3, 2008 at 10:37 PM, Vasile Dumitrescu
> <vasiledumitrescu at gmail dot com> wrote:
>
>> VMWare 6.0.4 running on Debian unstable,
>> Linux bigsrv 2.6.26-1-amd64 #1 SMP Wed Sep 24 13:59:41 UTC 2008 x86_64 GNU/Linux
>>
>> Solaris is vanilla snv_90 installed with no GUI.
>
>
>> in summary: physical disks, assigned 100% to the VM
>
> That's weird. I thought one of the point of using physical disks
> instead of files was to avoid problems caused by caching on host/dom0?
The data still flows through the host/dom0 device drivers and is thus at the mercy of the commands they issue to the physical devices.
--
Darren J Moffat
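Darren's point can be illustrated with a toy model of a layer that acknowledges writes before they reach stable storage. This is a deliberately simplified Python sketch, not a description of how VMware or the Linux block layer actually behaves. CachingLayer, its methods and commit_txg are invented for the illustration, and the "data, flush, uberblock, flush" ordering is only a rough picture of what a guest file system expects.

```python
class CachingLayer:
    """A host-side layer that acknowledges writes before persisting them."""

    def __init__(self, honor_flush=True):
        self.honor_flush = honor_flush
        self.cache = {}   # acknowledged, but still volatile
        self.disk = {}    # what is really on stable storage

    def write(self, block, data):
        self.cache[block] = data
        return True       # the guest sees "success" immediately

    def flush(self):
        if self.honor_flush:
            self.disk.update(self.cache)
            self.cache.clear()
        return True       # reported as done either way

    def writeback(self, block):
        """Persist one cached block, in an order of the layer's own choosing."""
        if block in self.cache:
            self.disk[block] = self.cache.pop(block)

    def power_loss(self):
        self.cache.clear()  # volatile contents vanish


def commit_txg(layer, txg):
    """Guest-side ordering: data first, flush, then the uberblock, flush."""
    layer.write("data-%d" % txg, "payload")
    layer.flush()
    layer.write("uberblock", txg)
    layer.flush()


# A layer that honours flushes: after a crash, data and uberblock match.
honest = CachingLayer(honor_flush=True)
commit_txg(honest, 1)
honest.power_loss()

# A layer that ignores flushes and happens to persist the uberblock first.
lying = CachingLayer(honor_flush=False)
commit_txg(lying, 1)
lying.writeback("uberblock")
lying.power_loss()
```

After the simulated power loss, honest.disk holds both data-1 and the uberblock, while lying.disk holds an uberblock for txg 1 that points at data which never reached the disk. Redundancy inside the guest does not help in that case, because every copy went through the same layer.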
|
|
|
|
Posts:
2
From:
Registered:
4/28/07
|
|
|
|
Re: Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted:
Oct 9, 2008 2:53 AM
in response to: gmvasile
To: Communities » zfs » discuss
|
|
> His explanation: he invalidated the incorrect > uberblocks and forced zfs to revert to an earlier > state that was consistent.
Would someone be willing to document the steps required in order to do this please?
I have a disk in a similar state:
# zpool import
  pool: tank
    id: 13234439337856002730
 state: FAULTED
status: The pool metadata is corrupted.
action: The pool cannot be imported due to damaged devices or data.
        The pool may be active on another system, but can be imported using
        the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-72
config:

        tank        FAULTED   corrupted data
          c7d0      ONLINE
This happened after I foolishly began trusting zfs-fuse with some large but relatively unimportant data on a big, empty single disk zpool in my home machine and then suffered a power cut before I got around to backing it up.
OpenSolaris can't import the pool either, so the drive is sat on a shelf waiting till a method for fixing it is published.
While it's clearly my own fault for taking the risks I did, it's still pretty frustrating knowing that all my data is likely still intact and nicely checksummed on the disk but that none of it is accessible due to some tiny filesystem inconsistency. With pretty much any other FS I think I could get most of it back.
Clearly such a small number of occurrences in what were admittedly precarious configurations aren't going to be particularly convincing motivators to provide a general solution, but I'd feel a whole lot better about using ZFS if I knew that there were some documented steps or a tool (zfsck? ;) that could help to recover from this kind of metadata corruption in the unlikely event of it happening.
cheers,
Rob
|
|
|
|
Posts:
1,361
From:
US
Registered:
8/5/05
|
|
|
|
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted:
Oct 9, 2008 4:37 AM
in response to: rwarner2
|
|
On Thu, Oct 9, 2008 at 4:53 AM, . <osl at boymonkey dot com> wrote:
> While it's clearly my own fault for taking the risks I did, it's
> still pretty frustrating knowing that all my data is likely still
> intact and nicely checksummed on the disk but that none of it is
> accessible due to some tiny filesystem inconsistency.  With pretty
> much any other FS I think I could get most of it back.
>
> Clearly such a small number of occurrences in what were admittedly
> precarious configurations aren't going to be particularly convincing
> motivators to provide a general solution, but I'd feel a whole lot
> better about using ZFS if I knew that there were some documented
> steps or a tool (zfsck? ;) that could help to recover from this kind
> of metadata corruption in the unlikely event of it happening.
Well said. You have hit on my #1 concern with deploying ZFS.
FWIW, I believe that I have hit the same type of bug as the OP in the following combinations:
- T2000, LDoms 1.0, various builds of Nevada in control and guest domains.
- Laptop, VirtualBox 1.6.2, Windows XP SP2 host, OpenSolaris 2008.05 @ build 97 guest
In the past year I've lost more ZFS file systems than I have any other type of file system in the past 5 years. With other file systems I can almost always get some data back. With ZFS I can't get any back.
--
Mike Gerdts
http://mgerdts.blogspot.com/
|
|
|
|
Wilkinson, Alex
alex.wilkinson@dsto....
|
|
|
|
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted:
Oct 9, 2008 4:46 AM
in response to: mgerdts
|
|
On Thu, Oct 09, 2008 at 06:37:23AM -0500, Mike Gerdts wrote:
>FWIW, I belive that I have hit the same type of bug as the OP in the
>following combinations:
>
>- T2000, LDoms 1.0, various builds of Nevada in control and guest
>  domains.
>- Laptop, VirtualBox 1.6.2, Windows XP SP2 host, OpenSolaris 2008.05 @
>  build 97 guest
>
>In the past year I've lost more ZFS file systems than I have any other
>type of file system in the past 5 years. With other file systems I
>can almost always get some data back. With ZFS I can't get any back.
That's scary to hear!
-aW
|
|
|
|
Ahmed Kamal
email.ahmedkamal@goo...
|
|
|
|
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted:
Oct 9, 2008 5:44 AM
in response to: Wilkinson, Alex
|
|
>
>In the past year I've lost more ZFS file systems than I have any other
>type of file system in the past 5 years. With other file systems I
>can almost always get some data back. With ZFS I can't get any back.
I am really scared now! I was the one trying to quantify ZFS reliability, and that is surely bad to hear!
|
|
|
|
Posts:
1,361
From:
US
Registered:
8/5/05
|
|
|
|
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted:
Oct 9, 2008 6:22 AM
in response to: Ahmed Kamal
|
|
On Thu, Oct 9, 2008 at 7:44 AM, Ahmed Kamal <email dot ahmedkamal at googlemail dot com> wrote:
>
> >
> >In the past year I've lost more ZFS file systems than I have any other
> >type of file system in the past 5 years. With other file systems I
> >can almost always get some data back. With ZFS I can't get any back.
>
>> Thats scary to hear!
>>
>
> I am really scared now! I was the one trying to quantify ZFS reliability,
> and that is surely bad to hear!
The circumstances where I have lost data have been when ZFS has not handled a layer of redundancy. However, I am not terribly optimistic of the prospects of ZFS on any device that hasn't committed writes that ZFS thinks are committed. Mirrors and raidz would also be vulnerable to such failures.
I also have run into other failures that have gone unanswered on the lists. It makes me wary about using zfs without a support contract that allows me to escalate to engineering. Patching only support won't help.
http://mail.opensolaris.org/pipermail/zfs-discuss/2007-December/044984.html
Hang only after I mirrored the zpool, no response on the list

http://mail.opensolaris.org/pipermail/zfs-discuss/2008-June/048255.html
I think this is fixed around snv_98, but the zfs-discuss list was surprisingly silent on acknowledging it as a problem - I had no idea that it was being worked until I saw the commit. The panic seemed to be caused by dtrace - core developers of dtrace were quite interested in the kernel crash dump.

http://mail.opensolaris.org/pipermail/zfs-discuss/2008-September/051109.html
Panic during ON build. Pool was lost, no response from list.

--
Mike Gerdts
http://mgerdts.blogspot.com/
|
|
|
|
Timh Bergström
timh.bergstrom@diino...
|
|
|
|
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted:
Oct 9, 2008 7:50 AM
in response to: mgerdts
|
|
Unfortunately I can only agree with the doubts about running ZFS in production environments. I've lost ditto blocks, I've had corrupted pools, and I've seen a bunch of other failures, even in mirror/raidz/raidz2 setups with or without hardware mirrors/RAID 5/6. On top of that there is the worry that a sudden crash or reboot will corrupt or even destroy the pools, with "restore from backup" as the only advice. I've been lucky so far in getting my pools back, thanks to people like Victor.
What is needed is a proper fsck for ZFS that can resolve "minor" data corruption. Tools for rebuilding, resizing and moving data about on pools are also needed, as is recovery of data from faulted pools, like there is for ext2/3/UFS/NTFS.
All in all, a great FS, but not production ready until the tools are in place or it gets really, really resilient to minor failures and/or crashes in both software and hardware. For now I'll stick to XFS/UFS and sw/hw RAID and live with the restrictions of those file systems.
//T
--
Timh Bergström
System Administrator
Diino AB - www.diino.com
:wq
|
|
|
|
Posts:
102
From:
Louisville, CO
Registered:
3/12/06
|
|
|
|
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted:
Oct 9, 2008 8:10 AM
in response to: Timh Bergström
|
|
Perhaps I misunderstand, but the issues below are all based on Nevada, not Solaris 10.
Nevada isn't production code. For real ZFS testing, you must use a production release, currently Solaris 10 (update 5, soon to be update 6).
In the last 2 years, I've stored everything in my environment (home directory, builds, etc.) on ZFS on multiple types of storage subsystems without issues. All of this has been on Solaris 10, however.
Btw, I completely agree on the panic issue. If I have a large DB server with many pools, and one inconsequential pool fails, I lose the entire DB server. I'd really like to see an option at the zpool level directing what to do in a panic for a particular pool. Perhaps this is in the latest bits; if so, sorry, I'm running old stuff. :-)
I also run ZFS on my Mac. While it is not production quality, some of the panics involving external (FireWire, USB, eSATA) devices are very irritating. A hiccup due to a jostled cable, and the entire box panics. That's frustrating.
Timh Bergström wrote:
> Unfortunely I can only agree to the doubts about running ZFS in
> production environments, i've lost ditto-blocks, i''ve gotten
> corrupted pools and a bunch of other failures even in
> mirror/raidz/raidz2 setups with or without hardware mirrors/raid5/6.
> Plus the insecurity of a sudden crash/reboot will corrupt or even
> destroy the pools with "restore from backup" as the only advice. I've
> been lucky so far about getting my pools back thanks to people like
> Victor.
>
> What would be needed is a proper fsck for ZFS which can resolv "minor"
> data corruptions, tools for rebuilding, resizing and moving the data
> about on pools is also needed, even recover of data from faulted
> pools, like there is for ext2/3/ufs/ntfs.
>
> All in all, great FS but not production ready until the tools are in
> place or it gets really really resillient to minor failures and/or
> crashes in both software and hardware. For now i'll stick to XFS/UFS
> and sw/hw-raid and live with the restrictions of such fs.
>
> //T
>
> 2008/10/9 Mike Gerdts <mgerdts at gmail dot com>:
>
>> On Thu, Oct 9, 2008 at 7:44 AM, Ahmed Kamal
>> <email dot ahmedkamal at googlemail dot com> wrote:
>>
>>> >
>>> >In the past year I've lost more ZFS file systems than I have any other
>>> >type of file system in the past 5 years. With other file systems I
>>> >can almost always get some data back. With ZFS I can't get any back.
>>>
>>>
>>>> Thats scary to hear!
>>>>
>>>>
>>> I am really scared now! I was the one trying to quantify ZFS reliability,
>>> and that is surely bad to hear!
>>>
>> The circumstances where I have lost data have been when ZFS has not
>> handled a layer of redundancy. However, I am not terribly optimistic
>> of the prospects of ZFS on any device that hasn't committed writes
>> that ZFS thinks are committed. Mirrors and raidz would also be
>> vulnerable to such failures.
>>
>> I also have run into other failures that have gone unanswered on the
>> lists. It makes me wary about using zfs without a support contract
>> that allows me to escalate to engineering. Patching only support
>> won't help.
>>
>> http://mail.opensolaris.org/pipermail/zfs-discuss/2007-December/044984.html
>> Hang only after I mirrored the zpool, no response on the list
>>
>> http://mail.opensolaris.org/pipermail/zfs-discuss/2008-June/048255.html
>> I think this is fixed around snv_98, but the zfs-discuss list was
>> surprisingly silent on acknowledging it as a problem - I had no
>> idea that it was being worked until I saw the commit. The panic
>> seemed to be caused by dtrace - core developers of dtrace
>> were quite interested in the kernel crash dump.
>>
>> http://mail.opensolaris.org/pipermail/zfs-discuss/2008-September/051109.html
>> Panic during ON build. Pool was lost, no response from list.
>>
>> --
>> Mike Gerdts
>> http://mgerdts.blogspot.com/
|
|
|
|
Posts:
1,361
From:
US
Registered:
8/5/05
|
|
|
|
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted:
Oct 9, 2008 8:18 AM
in response to: shawga
|
|
On Thu, Oct 9, 2008 at 10:10 AM, Greg Shaw <Greg dot Shaw at sun dot com> wrote:
> Nevada isn't production code. For real ZFS testing, you must use a
> production release, currently Solaris 10 (update 5, soon to be update 6).
I misstated before in my LDoms case. The corrupted pool was on Solaris 10, with LDoms 1.0. The control domain was SX*E, but the zpool there showed no problems. I got into a panic loop with dangling dbufs. My understanding is that this was caused by a bug in the LDoms manager 1.0 code that has been fixed in a later release. It was a supported configuration, I pushed for and got a fix. However, that pool was still lost.
--
Mike Gerdts
http://mgerdts.blogspot.com/
|
|
|
|
Posts:
1,361
From:
US
Registered:
8/5/05
|
|
|
|
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted:
Oct 9, 2008 8:33 PM
in response to: mgerdts
|
|
On Thu, Oct 9, 2008 at 10:18 AM, Mike Gerdts <mgerdts at gmail dot com> wrote:
> On Thu, Oct 9, 2008 at 10:10 AM, Greg Shaw <Greg dot Shaw at sun dot com> wrote:
>> Nevada isn't production code. For real ZFS testing, you must use a
>> production release, currently Solaris 10 (update 5, soon to be update 6).
>
> I misstated before in my LDoms case. The corrupted pool was on
> Solaris 10, with LDoms 1.0. The control domain was SX*E, but the
> zpool there showed no problems. I got into a panic loop with dangling
> dbufs. My understanding is that this was caused by a bug in the LDoms
> manager 1.0 code that has been fixed in a later release. It was a
> supported configuration, I pushed for and got a fix. However, that
> pool was still lost.
Or maybe it wasn't fixed yet. I see that this was committed just today.
6684721 file backed virtual i/o should be synchronous
http://hg.genunix.org/onnv-gate.hg/rev/eb40ff0c92ec
--
Mike Gerdts
http://mgerdts.blogspot.com/
|
|
|
|
Posts:
1,361
From:
US
Registered:
8/5/05
|
|
|
|
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted:
Oct 13, 2008 9:58 AM
in response to: mgerdts
|
|
On Thu, Oct 9, 2008 at 10:33 PM, Mike Gerdts <mgerdts at gmail dot com> wrote:
> On Thu, Oct 9, 2008 at 10:18 AM, Mike Gerdts <mgerdts at gmail dot com> wrote:
>> On Thu, Oct 9, 2008 at 10:10 AM, Greg Shaw <Greg dot Shaw at sun dot com> wrote:
>>> Nevada isn't production code. For real ZFS testing, you must use a
>>> production release, currently Solaris 10 (update 5, soon to be update 6).
>>
>> I misstated before in my LDoms case. The corrupted pool was on
>> Solaris 10, with LDoms 1.0. The control domain was SX*E, but the
>> zpool there showed no problems. I got into a panic loop with dangling
>> dbufs. My understanding is that this was caused by a bug in the LDoms
>> manager 1.0 code that has been fixed in a later release. It was a
>> supported configuration, I pushed for and got a fix. However, that
>> pool was still lost.
>
> Or maybe it wasn't fixed yet. I see that this was committed just today.
>
> 6684721 file backed virtual i/o should be synchronous
>
> http://hg.genunix.org/onnv-gate.hg/rev/eb40ff0c92ec
The related information from the LDoms Manager 1.1 Early Access release notes (820-4914-10):
Data Might Not Be Written Immediately to the Virtual Disk Backend If Virtual I/O Is Backed by a File or Volume
Bug ID 6684721: When a file or volume is exported as a virtual disk, then the service domain exporting that file or volume is acting as a storage cache for the virtual disk. In that case, data written to the virtual disk might get cached into the service domain memory instead of being immediately written to the virtual disk backend. Data are not cached if the virtual disk backend is a physical disk or slice, or if it is a volume device exported as a single-slice disk.
Workaround: If the virtual disk backend is a file or a volume device exported as a full disk, then you can prevent data from being cached into the service domain memory and have data written immediately to the virtual disk backend by adding the following line to the /etc/system file on the service domain.
set vds:vd_file_write_flags = 0
Note – Setting this tunable flag does have an impact on performance when writing to a virtual disk, but it does ensure that data are written immediately to the virtual disk backend.
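A minimal sketch for checking whether the workaround above is in place, assuming the service domain's /etc/system has been read into a string. The function name is hypothetical, the only setting it knows about is the one quoted from the release notes, and it relies on /etc/system comment lines starting with `*`.

```python
import re

# Matches an uncommented "set vds:vd_file_write_flags = 0" line.
# Comment lines in /etc/system start with "*", so they never match.
TUNABLE = re.compile(r"^\s*set\s+vds:vd_file_write_flags\s*=\s*0\s*$")


def has_sync_write_workaround(system_file_text):
    """Return True if the text contains the active tunable line (hypothetical helper)."""
    return any(TUNABLE.match(line) for line in system_file_text.splitlines())
```

Changes to /etc/system only take effect after a reboot, so a positive result says the line is present, not that the running kernel has picked it up.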
--
Mike Gerdts
http://mgerdts.blogspot.com/
|
|
|
|
Miles Nordin
carton@Ivy.NET
|
|
|
|
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted:
Oct 9, 2008 11:38 AM
in response to: shawga
|
|
|
|
>>>>> "gs" == Greg Shaw <Greg dot Shaw at Sun dot COM> writes:
gs> Nevada isn't production code. For real ZFS testing, you must
gs> use a production release, currently Solaris 10 (update 5, soon
gs> to be update 6).

Based on list feedback, my impression is that the results of a ``test'' confined to s10, particularly s10u4 (the latest available during most of Mike's experience), would be worse than Nevada experience over the same period, but I doubt either matches UFS+SVM or ext3+LVM2. The on-disk format with ``ditto blocks'' and ``always consistent'' may be fantastic, but the code for reading it is not.
Maybe the code is stellar, and the problem really is underlying storage stacks that fail to respect write barriers. If so, ZFS needs to include a storage stack qualification tool. For me it doesn't strain credibility to believe these problems might be rampant in VM stacks and SANs, nor do I find it unacceptable if ZFS is vastly more sensitive to them than any other filesystem. If this speculation turns out to really be the case, I imagine the two going together: the problems are rampant because they don't bother other filesystems too catastrophically. If this is really the situation, then ZFS needs to give the sysadmin a way to isolate and fix the problems deterministically before filling the pool with data, not just blame the sysadmin based on nebulous speculatory hindsight gremlins.
And if it's NOT the case, the ZFS problems need to be acknowledged and fixed.
To my view, the above is *IN ADDITION* to developing a recovery/forensic/``fsck'' tool, not either/or. The pools should not be getting corrupt in the first place, and pulling the cord should not mean you have to settle for best-effort. None of the modern filesystems demand an fsck after unclean shutdown.
The current procedure for qualifying a platform seems to be: (1) subject it to heavy write activity, (2) pull the cord, (3) repeat. Ahmed, maybe you should use that test to ``quantify'' filesystem reliability. You can try it with ZFS, then reinstall the machine with CentOS and try the same test with ext3+LVM2 or xfs+areca. The numbers you get are how many times you can pull the cord before you lose something, and how much you lose. Here's a really old test of that sort comparing Linux filesystems, which is something like what I have in mind:
https://www.redhat.com/archives/fedora-list/2004-July/msg00418.html
So you see he got two sets of numbers: number of reboots and amount of corruption. For reiserfs and JFS he lost their equivalent of ``the whole pool'', and for ext3 and XFS he got corruption but never lost the pool. It's not clear to me that the filesystems ever claimed to prevent corruption in his test scenario (was he calling fsync() after each log write? syslog does that sometimes, and if so, they do claim it, but if he's just writing with some silly script they don't), but they definitely all claim you won't lose the whole pool in a power outage, and only two out of four delivered on that. I base my choice of Linux filesystem on this test, and wish I'd done such a test before converting things to ZFS.
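The pull-the-cord test described above could be approximated with a small append-only log, sketched here in Python. This is only an illustration under stated assumptions, not the script from the linked post: the record layout and the names append_records and verify are made up. The idea is to run append_records() on the filesystem under test, cut power mid-run, then run verify() after reboot and compare the last intact sequence number with the last one the writer had synced.

```python
import os
import struct
import zlib

# Hypothetical record layout: 8-byte sequence number + CRC32 of those 8 bytes.
RECORD = struct.Struct("<QI")


def append_records(path, start, count, sync=True):
    """Append `count` records starting at sequence `start`; fsync after each if asked."""
    with open(path, "ab") as f:
        for seq in range(start, start + count):
            payload = struct.pack("<Q", seq)
            f.write(RECORD.pack(seq, zlib.crc32(payload)))
            if sync:
                f.flush()
                os.fsync(f.fileno())  # the record counts as committed only after this returns


def verify(path):
    """Return (intact_records, damaged_records, last_intact_seq) for the log file."""
    good, bad, last = 0, 0, None
    with open(path, "rb") as f:
        while True:
            chunk = f.read(RECORD.size)
            if not chunk:
                break
            if len(chunk) < RECORD.size:
                bad += 1  # torn record at the tail of the file
                break
            seq, crc = RECORD.unpack(chunk)
            if zlib.crc32(struct.pack("<Q", seq)) == crc:
                good += 1
                last = seq
            else:
                bad += 1
    return good, bad, last
```

With sync=True, a record missing after the crash is one the storage stack had acknowledged, which is the case a filesystem does claim to protect. With sync=False, losing the tail is allowed and only damage to older records is interesting.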
|
|
|
|
Posts:
874
From:
US
Registered:
8/19/08
|
|
|
|
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted:
Oct 9, 2008 12:06 PM
in response to: Miles Nordin
|
|
On Thu, 9 Oct 2008, Miles Nordin wrote:
>
> catastrophically. If this is really the situation, then ZFS needs to
> give the sysadmin a way to isolate and fix the problems
> deterministically before filling the pool with data, not just blame
> the sysadmin based on nebulous speculatory hindsight gremlins.
>
> And if it's NOT the case, the ZFS problems need to be acknowledged and
> fixed.
Can you provide any supportive evidence that ZFS is as fragile as you describe?
From recent opinions expressed here, properly-designed ZFS pools must be inexplicably permanently cratering each and every day.
Bob
======================================
Bob Friesenhahn
bfriesen at simple dot dallas dot tx dot us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
|
|
|
|
Timh Bergström
timh.bergstrom@diino...
|
|
|
|
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted:
Oct 10, 2008 12:38 AM
in response to: bfriesen
|
|
2008/10/9 Bob Friesenhahn <bfriesen at simple dot dallas dot tx dot us>:
> On Thu, 9 Oct 2008, Miles Nordin wrote:
>>
>> catastrophically. If this is really the situation, then ZFS needs to
>> give the sysadmin a way to isolate and fix the problems
>> deterministically before filling the pool with data, not just blame
>> the sysadmin based on nebulous speculatory hindsight gremlins.
>>
>> And if it's NOT the case, the ZFS problems need to be acknowledged and
>> fixed.
>
> Can you provide any supportive evidence that ZFS is as fragile as you
> describe?
The hundreds of sysadmins seeing their pools go byebye after normal operations in a production environment is evidence enough. And the number of times people like Victor have saved our asses.
> From recent opinions expressed here, properly-designed ZFS pools must
> be inexplicably permanently cratering each and every day.
>
> Bob
> ======================================
> Bob Friesenhahn
> bfriesen at simple dot dallas dot tx dot us, http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/

--
Timh Bergström
System Administrator
Diino AB - www.diino.com
:wq
|
|
|
|
Posts:
22
From:
Registered:
2/1/08
|
|
|
|
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun /
Posted:
Oct 11, 2008 11:06 AM
in response to: Timh Bergström
|
|
"Timh Bergström" <timh dot bergstrom at diino dot net> writes:
> Unfortunately I can only agree to the doubts about running ZFS in
> production environments, I've lost ditto-blocks, I've gotten
> corrupted pools and a bunch of other failures even in
> mirror/raidz/raidz2 setups with or without hardware mirrors/raid5/6.
> Plus the insecurity of a sudden crash/reboot will corrupt or even
> destroy the pools with "restore from backup" as the only advice. I've
> been lucky so far about getting my pools back thanks to people like
> Victor.
With which release was that? Solaris 10 or OpenSolaris?
Regards, Juergen.
|
|
|
|
Posts:
129
From:
US
Registered:
3/9/05
|
|
|
|
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted:
Oct 10, 2008 1:26 AM
in response to: mgerdts
|
|
> The circumstances where I have lost data have been when ZFS has not
> handled a layer of redundancy. However, I am not terribly optimistic
> of the prospects of ZFS on any device that hasn't committed writes
> that ZFS thinks are committed.
FYI, I'm working on a workaround for broken devices. As you note, some disks flat-out lie: you issue the synchronize-cache command, they say "got it, boss", yet the data is still not on stable storage. Why do they do this? Because "it performs better". Well, duh -- you can make stuff *really* fast if it doesn't have to be correct.
Before I explain how ZFS can fix this, I need to get something off my chest: people who knowingly make such disks should be in federal prison. It is *fraud* to win benchmarks this way. Doing so causes real harm to real people. Same goes for NFS implementations that ignore sync. We have specifications for a reason. People assume that you honor them, and build higher-level systems on top of them. Change the mass of the proton by a few percent, and the stars explode. It is impossible to build a functioning civil society in a culture that tolerates lies. We need a little more Code of Hammurabi in the storage industry.
Now:
The uberblock ring buffer in ZFS gives us a way to cope with this, as long as we don't reuse freed blocks for a few transaction groups. The basic idea: if we can't read the pool starting from the most recent uberblock, then we should be able to use the one before it, or the one before that, etc, as long as we haven't yet reused any blocks that were freed in those earlier txgs. This allows us to use the normal load on the pool, plus the passage of time, as a displacement flush for disk caches that ignore the sync command.
If we go back far enough in (txg) time, we will eventually find an uberblock all of whose dependent data blocks have made it to disk. I'll run tests with known-broken disks to determine how far back we need to go in practice -- I'll bet one txg is almost always enough.
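A toy model of the rollback idea above, in Python. It is not ZFS code: the Uberblock class, the dict standing in for the disk, and SHA-256 as the checksum are all assumptions made for illustration. It walks the uberblocks from newest to oldest and returns the first one whose referenced blocks all verify.

```python
import hashlib
from dataclasses import dataclass


def checksum(data: bytes) -> str:
    """Stand-in block checksum (assumption: SHA-256 hex digest)."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class Uberblock:
    txg: int
    refs: tuple  # (address, expected checksum) pairs this txg depends on


def newest_consistent(uberblocks, disk):
    """Return the highest-txg uberblock whose referenced blocks all verify, else None."""
    for ub in sorted(uberblocks, key=lambda u: u.txg, reverse=True):
        if all(addr in disk and checksum(disk[addr]) == want for addr, want in ub.refs):
            return ub
    return None
```

In this model, once a block that an older uberblock points at has been overwritten, that uberblock stops verifying too, which is why the scheme depends on not reusing freed blocks for a few transaction groups.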
Jeff
|
|
|
|
Posts:
883
From:
GB
Registered:
10/24/07
|
|
|
|
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted:
Oct 10, 2008 2:29 AM
in response to: bonwick
To: Communities » zfs » discuss
|
|
That sounds like a great idea for a tool, Jeff. Would it be possible to build that in as a "zpool recover" command?
Being able to run a tool like that and see just how bad the corruption is, while knowing it's possible to recover an older version, would be great. Is there any chance of outputting details so the sysadmin can know roughly how much was lost?
My thoughts are going to be very rough (I don't know much about zfs internals), but I'm wondering if something like this would work, where all bad blocks are reported, along with the latest 3 good ones:
*************************************
# zpool recover <pool>
......... pool details ...........

Finding and testing uberblocks...
1. block a  date/time: xxxxx/xxxx  CORRUPTED
2. block b  date/time: yyyyy/yyyy  CORRUPTED
3. block c  date/time: zzzzz/zzzz  Appears OK
4. block d  date/time: zzzzz/zzzz  Appears OK
5. block e  date/time: zzzzz/zzzz  Appears OK
*************************************
Victor was talking in another thread about using zdb to check the pool before doing an import of a damaged pool. Might it be possible for the next stage of the recovery process to give the user an option of testing or importing the pool for any particular uberblock?
It does sound like testing can take a long time, so this would need to be something that can be cancelled, and you would also need a way to mark uberblocks as bad should problems be found with either the test or import.
This would be a great addition to ZFS though, and would hopefully save Victor a bit of time ;-)
Ross
|
|
|
|
Ricardo M. Corr...
Ricardo.M.Correia@Su...
|
|
|
|
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted:
Oct 10, 2008 2:48 AM
in response to: bonwick
|
|
Hi Jeff,
On Fri, 2008-10-10 at 01:26 -0700, Jeff Bonwick wrote:
> > The circumstances where I have lost data have been when ZFS has not
> > handled a layer of redundancy. However, I am not terribly optimistic
> > of the prospects of ZFS on any device that hasn't committed writes
> > that ZFS thinks are committed.
>
> FYI, I'm working on a workaround for broken devices. As you note,
> some disks flat-out lie: you issue the synchronize-cache command,
> they say "got it, boss", yet the data is still not on stable storage.
It's not just about ignoring the synchronize-cache command, there's also another weak spot.
ZFS is quite resilient against so-called phantom writes, provided that they occur sporadically - let's say, if the disk decides to _randomly_ ignore writes 10% of the time, ZFS could probably survive that pretty well even on single-vdev pools, due to ditto blocks.
However, it is not so resilient when the storage system suffers hiccups which cause phantom writes to occur continuously, even if for a small period of time (say less than 10 seconds), and then return to normal. This could happen for several reasons, including network problems, bugs in software or even firmware, etc.
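The contrast between sporadic and continuous phantom writes can be made concrete with a toy model (Python; the function name and the layout in which the copies of a block are written back to back are assumptions for illustration, not how ZFS actually places ditto blocks). If each of three copies were dropped independently with probability 0.1, all three would be lost with probability 0.1^3 = 0.001. A burst that swallows the same share of writes in one contiguous run takes every copy of the blocks written during it.

```python
def surviving_blocks(n_blocks, copies, dropped):
    """Count blocks that still have at least one copy on disk.

    Writes are numbered 0, 1, 2, ... in issue order, with `copies`
    consecutive writes per block (a simplifying assumption).
    `dropped(i)` says whether the device silently discarded write i.
    """
    survived = 0
    for block in range(n_blocks):
        writes = range(block * copies, (block + 1) * copies)
        if not all(dropped(i) for i in writes):
            survived += 1
    return survived
```

For 1000 blocks with three copies each, surviving_blocks(1000, 3, lambda i: i % 10 == 0) keeps all 1000 blocks (every tenth write dropped, used here as a deterministic stand-in for sporadic loss), while surviving_blocks(1000, 3, lambda i: 300 <= i < 600) keeps only 900, even though both patterns drop 10% of the writes.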
I think in this case, going back to a previous uberblock could also be enough to recover from such a scenario most of the time, unless perhaps the error occurred too long ago, and the unwritten metadata got flushed out of the ARC and didn't have a chance to get rewritten.
In any case, a more generic solution to repair all kinds of metadata corruption, such as (e.g.) space map corruption, would be very desirable, as I think everyone can agree.
Best regards, Ricardo
|
|
|
|
Posts:
417
From:
BR
Registered:
7/18/06
|
|
|
|
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted:
Oct 10, 2008 6:15 AM
in response to: Ricardo M. Corr...
To: Communities » zfs » discuss
|
|
Hello all,
I think the problem here is ZFS's capacity to recover from a failure. Forgive me, but in thinking about creating code "without failures", maybe the hackers forgot that other people can make mistakes (even if they can't).
- "ZFS does not need fsck". OK, that's a great statement, but I think ZFS needs one. It really does. And in my opinion an enhanced zdb would be the solution. Flexibility. Options.
- "I have 90% of something I think is your filesystem, do you want it?"
I think software is only as good as its ability to recover from failures. And I don't want to know who failed; I'm not going to send anyone to jail, I'm not a lawyer. I agree with Jeff, I really do, but that is "another" problem... The solution Jeff is working on is, I think, really great, since it is NOT "all or nothing" again... I don't know about you, but A LOT of times I was saved by the "Lost and Found" directory! All the beauty of a UNIX system is to "rm /etc/passwd" after having edited it, and get the whole file back by doing a "cat /dev/mem". ;-)
I think there are a lot of parts of the ZFS design that remind me of when you see something left on the floor at home, so you ask your son why he did not pick it up, and he says "it was not me".
peace.
Leal.
|
|
|
|
Posts:
804
From:
Menlo Park, CA
Registered:
3/9/05
|
|
|
|
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted:
Oct 10, 2008 11:23 AM
in response to: byleal
|
|
On Fri, Oct 10, 2008 at 06:15:16AM -0700, Marcelo Leal wrote:
> - "ZFS does not need fsck".
> Ok, that's a great statement, but I think ZFS needs one. Really does.
> And in my opinion an enhanced zdb would be the solution. Flexibility.
> Options.
About 99% of the problems reported as "I need ZFS fsck" can be summed up by two ZFS bugs:
1. If a toplevel vdev fails to open, we should be able to pull information from necessary ditto blocks to open the pool and make what progress we can. Right now, the root vdev code assumes "can't open = faulted pool," which results in failure scenarios that are perfectly recoverable most of the time. This needs to be fixed so that pool failure is only determined by the ability to read critical metadata (such as the root of the DSL).
2. If an uberblock ends up with an inconsistent view of the world (due to failure of DKIOCFLUSHWRITECACHE, for example), we should be able to go back to previous uberblocks to find a good view of our pool. This is the failure mode described by Jeff.
These are both bugs in ZFS and will be fixed. The other 1% of the complaints are usually of the form "I created my pool on top of my old one" or "I imported a LUN on two different systems at the same time". It's unclear what a 'fsck' tool could do in this scenario, if anything. Due to a variety of reasons (hierarchical nature of ZFS, variable block sizes, RAID-Z, compression, etc), it's difficult to even *identify* a ZFS block, let alone determine its validity and associate it in some larger construct.
There are some interesting possibilities for limited forensic tools - in particular, I like the idea of a mdb backend for reading and writing ZFS pools[1]. But I haven't actually heard a reasonable proposal for what a fsck-like tool (i.e. one that could "repair" things automatically) would actually *do*, let alone how it would work in the variety of situations it needs to (compressed RAID-Z?) where the standard ZFS infrastructure fails.
- Eric
[1] http://mbruning.blogspot.com/2008/08/recovering-removed-file-on-zfs-disk.html
--
Eric Schrock, Fishworks http://blogs.sun.com/eschrock
|
|
|
|
Posts:
221
From:
RU
Registered:
3/27/06
|
|
|
|
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted:
Oct 10, 2008 12:48 PM
in response to: eschrock
|
|
Eric Schrock wrote:
> On Fri, Oct 10, 2008 at 06:15:16AM -0700, Marcelo Leal wrote:
>> - "ZFS does not need fsck".
>> Ok, that's a great statement, but I think ZFS needs one. Really does.
>> And in my opinion an enhanced zdb would be the solution. Flexibility.
>> Options.
>
> About 99% of the problems reported as "I need ZFS fsck" can be summed up
> by two ZFS bugs:
>
> 1. If a toplevel vdev fails to open, we should be able to pull
>    information from necessary ditto blocks to open the pool and make
>    what progress we can. Right now, the root vdev code assumes "can't
>    open = faulted pool," which results in failure scenarios that are
>    perfectly recoverable most of the time. This needs to be fixed
>    so that pool failure is only determined by the ability to read
>    critical metadata (such as the root of the DSL).
>
> 2. If an uberblock ends up with an inconsistent view of the world (due
>    to failure of DKIOCFLUSHWRITECACHE, for example), we should be able
>    to go back to previous uberblocks to find a good view of our pool.
>    This is the failure mode described by Jeff.

I've mostly seen (2), because despite all the best practices out there, single vdev pools are quite common. In all such cases that I had my hands on, it was possible to recover the pool by going back one or two txgs.
> These are both bugs in ZFS and will be fixed. The other 1% of the
> complaints are usually of the form "I created my pool on top of my old
> one" or "I imported a LUN on two different systems at the same time".

Of these two, the former is not easy, because it requires searching through the entire disk space for root block candidates and trying each of them. The latter is not catastrophic if there was little to no activity from one of the systems. In this case one of the first things to suffer is the pool config object, and corruption of it prevents the pool from opening.
Fortunately enough, after putback of
6733970 assertion failure in dbuf_dirty() via spa_sync_nvlist()
in build 99, a corrupted pool config object is written during open in such a way that it prevents reading in the old corrupted copy, and in most cases this allows the pool to be imported and most of the data saved. zdb is useful for understanding how much is corrupted and how much is recovered. If nothing else is corrupted, the pool may be available for further use without recreation. Again, in every case I had my hands on, it was possible to either recover the pool completely or at least save most of the data.
> It's unclear what a 'fsck' tool could do in this scenario, if anything.
> Due to a variety of reasons (hierarchical nature of ZFS, variable block
> sizes, RAID-Z, compression, etc), it's difficult to even *identify* a
> ZFS block, let alone determine its validity and associate it in some
> larger construct.

Indeed. In the "more ZFS recovery" case, involving a 42TB pool with about 8TB used, zdb -bv alone took several hours to walk the block tree and verify the consistency of block pointers, and zdb -bcv took a couple of days to verify all user data blocks as well. Different checksums and gang blocks, in addition to all the other dynamic features mentioned, complicate the task of identifying ZFS blocks and linking those blocks into a tree, and make it really time (and space) consuming.
> There are some interesting possibilities for limited forensic tools - in
> particular, I like the idea of a mdb backend for reading and writing ZFS
> pools[1]. But I haven't actually heard a reasonable proposal for what a
> fsck-like tool (i.e. one that could "repair" things automatically) would
> actually *do*, let alone how it would work in the variety of situations
> it needs to (compressed RAID-Z?) where the standard ZFS infrastructure
> fails.

There are a number of bugs and RFEs to improve usefulness of zdb for field use, e.g.

6720637 want zdb -l option to dump uberblock arrays as well
6709782 issues running zdb with -p and -e options
6736356 zdb -R needs to work with exported pools
6720907 zdb should handle errors while dumping datasets and objects
6746101 zdb command to search for ZFS labels in a device
6757444 want zdb -R to support decompression, checksumming and raid-z
6757430 want an option for zdb to disable space map loading and leak tracking
Hth, Victor
> - Eric
>
> [1] http://mbruning.blogspot.com/2008/08/recovering-removed-file-on-zfs-disk.html
>
> --
> Eric Schrock, Fishworks http://blogs.sun.com/eschrock
|
|
|
|
David Magda
dmagda@ee.ryerson.ca
|
|
|
|
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted:
Oct 10, 2008 6:55 PM
in response to: iktorn
|
|
On Oct 10, 2008, at 15:48, Victor Latushkin wrote:
> I've mostly seen (2), because despite all the best practices out there,
> single vdev pools are quite common. In all such cases that I had my
> hands on it was possible to recover pool by going back by one or two
> txgs.
For better or worse this is the case where I work.
Most of our storage is on SANs (EMC and NetApp), and so if we need more space we ask for it and we get a giant LUN given to us (usually multi-pathed). We also have a lot of Veritas VxVM and VxFS for Oracle, and so even if we're running Solaris 10, we're not using ZFS in that case.
SAN space is also allocated to Windows and VMware ESX machines as well, so it's not like we can ask for the disks in the SAN to be exported raw, as that would mess up managing of things with the other OSes. (We have a very small global storage / back up team, and I really don't want to add more to their workload.)
If someone finds themselves in this position, what advice can be followed to minimize risks?
For example, is having checksums enabled a good idea? If you have no redundancy and an error occurs, the system will panic by default (configurable in newer builds of OpenSolaris, but not in Solaris 'proper' yet). But if the system is ignoring checksums, you're no worse off than most other file systems (but still get all the other features of ZFS).
Or is there a way to mitigate a checksum error on non-redundant zpool?
|
|
|
|
Posts:
129
From:
US
Registered:
3/9/05
|
|
|
|
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted:
Oct 10, 2008 7:14 PM
in response to: David Magda
|
|
> Or is there a way to mitigate a checksum error on non-redundant zpool?
It's just like the difference between non-parity, parity, and ECC memory. Most filesystems don't have checksums (non-parity), so they don't even know when they're returning corrupt data. ZFS without any replication can detect errors, but can't fix them (like parity memory). ZFS with mirroring or RAID-Z can both detect and correct (like ECC memory).
Note: even in a single-device pool, ZFS metadata is replicated via ditto blocks at two or three different places on the device, so that a localized media failure can be both detected and corrected. If you have two or more devices, even without any mirroring or RAID-Z, ZFS metadata is mirrored (again via ditto blocks) across those devices.
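The detect-versus-correct distinction above can be sketched as follows (Python; CRC32 stands in for the ZFS block checksum and read_block is a hypothetical helper, not a ZFS interface). With one copy a bad checksum can only be reported; with a second copy the read can fall through to a good one.

```python
import zlib


def read_block(copies, expected_crc):
    """Return (data, status) for a block stored as one or more copies.

    status is "ok" when the first copy verifies, "corrected" when a bad
    copy was skipped and a later one verified, and "detected" when every
    copy failed its checksum (the error is known but cannot be fixed).
    """
    saw_bad = False
    for data in copies:
        if zlib.crc32(data) == expected_crc:
            return data, ("corrected" if saw_bad else "ok")
        saw_bad = True
    return None, "detected"
```

A filesystem without checksums corresponds to returning the first copy unconditionally, which is the non-parity case in the analogy.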
Jeff
|
|
|
|
Posts:
1,361
From:
US
Registered:
8/5/05
|
|
|
|
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted:
Oct 10, 2008 8:59 PM
in response to: bonwick
|
|
On Fri, Oct 10, 2008 at 9:14 PM, Jeff Bonwick <Jeff dot Bonwick at sun dot com> wrote:
> Note: even in a single-device pool, ZFS metadata is replicated via
> ditto blocks at two or three different places on the device, so that
> a localized media failure can be both detected and corrected.
> If you have two or more devices, even without any mirroring
> or RAID-Z, ZFS metadata is mirrored (again via ditto blocks)
> across those devices.
And in the event that you have a pool that is mostly not very important but some of it is important, you can have data mirrored on a per dataset level via copies=n.
If we can avoid losing an entire pool by rolling back a txg or two, the biggest source of data loss and frustration is taken care of. Ditto blocks for metadata should take care of most other cases that would result in widespread loss. Normal bit rot that causes you to lose blocks here and there is somewhat likely to take out a small minority of files and spit warnings along the way. If there are some files that are more important to you than others (e.g. losing files in rpool/home may have more impact than rpool/ROOT) copies=2 can help there.
And for those places where losing a txg or two is a mortal sin, don't use flaky hardware and allow zfs to handle a layer of redundancy.
This gets me thinking that it may be worthwhile to have a small (<100 MB x 2) rescue boot environment with copies=2 (as well as rpool/b4 GB) boot environment from booting.
-- Mike Gerdts http://mgerdts.blogspot.com/
Miles Nordin (carton@Ivy.NET)
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted: Oct 13, 2008 10:50 AM, in response to: bonwick
>>>>> "dm" == David Magda <dmagda at ee dot ryerson dot ca> writes:
>>>>> "jb" == Jeff Bonwick <Jeff dot Bonwick at sun dot com> writes:
>>>>> "mg" == Mike Gerdts <mgerdts at gmail dot com> writes:

dm> If you have no redundancy and an error occurs, the system will
dm> panic by default (configurable in newer builds of OpenSolaris,
dm> but not in Solaris 'proper' yet). But if the system is
dm> ignoring checksums, you're no worse off than most other file
dm> systems
It's not safe to assume the checksum errors are silent corruption. Most or all of the checksum errors I've seen on my system come from ZFS failing to fully resilver a temporarily-broken mirror.
It's not safe to assume failmode=<!panic> will stop your box from freezing. Problems with one zpool can cause problems with other unaffected pools. Problems at the storage driver level can cause one bad disk to freeze other good disks. Problems with the user interface generally make it impossible to offline a known-bad device because the user interface is frozen, or you get some catchall error like ``no valid replicas'' because who-knows-what, or ``I/O error'' because the user interface can't mark the failed drive as offline in the copy of the label stored on the failed drive---if metastat behaved that way?!
I've also had problems with iscsiadm and format pausing for minutes because a discovery-address is not responding, which could turn into hours if I had a hundred iSCSI targets---if I could just edit a damned text file like on a real Unix, I wouldn't have to put up with these needlessly-complex state machines and multiplicative timeouts. NFS can freeze entirely if any exported filesystem has problems.
Yes, some of the panics reported may come from failmode, but if you look through bugs.opensolaris.org and the list you'll see many different kinds of assertion-failure panics that aren't controlled by the failmode knob, usually panic-on-import or freeze-on-import, but sometimes other kinds.
To my view, the good news for ZFS is that most other things suck almost as much, so there is only a little catching-up to do before it's competitive. OTOH it looks like an unworkable disaster w.r.t. the promised future environment where pools have hundreds of disks, always some of them failing. The exception handling is a mess, the timers are attached to accidental hodge-podge ``layered'' state machines for which no one will accept ultimate responsibility, and the locking of various user interfaces and subsystems is coarse because it's built either for correctness/simplicity/deadlines, or for a mistaken, outdated goal: high-performance, assuming-a-fully-working-system, otherwise-fix-your-hardware.
jb> ditto blocks
mg> copies=n.
neither of which applies to the situations Victor helped recover from. It's possible ditto blocks are quietly helping people, but I've not read on the list of one scenario where something bad happened and the resolution was ``you should have used copies=n''.
The OP is asking about best practices that mitigate known problems, not a repeat of the standard list of bullet point features and their hypothetical virtues.
mg> And for those places where losing a txg or two is a mortal
mg> sin, don't use flaky hardware and allow zfs to handle a layer
mg> of redundancy.

It is a mortal sin for a filesystem in all places. It's just much less bad than losing the entire pool. To be a safe backing-store for databases or email, ZFS needs to have implementable best-practices that stop this from happening, not just recover from it. Whatever recovery there is, certainly should not be silent and maybe should not be automatic.
Posts: 121, Registered: 4/27/05
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted: Oct 11, 2008 7:36 PM, in response to: David Magda
On Oct 10, 2008, at 7:55 PM 10/10/, David Magda wrote:
> If someone finds themselves in this position, what advice can be
> followed to minimize risks?
Can you ask for two LUNs on different physical SAN devices and have an expectation of getting it?
>
-- Keith H. Bierman khbkhb at gmail dot com | AIM kbiermank 5430 Nassau Circle East | Cherry Hills Village, CO 80113 | 303-997-2749 <speaking for myself*> Copyright 2008
Posts: 125, From: MPLS, Registered: 1/5/07
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted: Oct 13, 2008 8:46 AM, in response to: khb
zfs-discuss-bounces at opensolaris dot org wrote on 10/11/2008 09:36:02 PM:
> On Oct 10, 2008, at 7:55 PM 10/10/, David Magda wrote:
>
> > If someone finds themselves in this position, what advice can be
> > followed to minimize risks?
>
> Can you ask for two LUNs on different physical SAN devices and have
> an expectation of getting it?
Better yet also ask for multiple paths over different SAN infrastructure to each. Then again, I would hope you don't need to ask your SAN folks for that?
-Wade
Posts: 417, From: BR, Registered: 7/18/06
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted: Oct 10, 2008 1:29 PM, in response to: eschrock
> On Fri, Oct 10, 2008 at 06:15:16AM -0700, Marcelo Leal wrote:
> > - "ZFS does not need fsck".
> > Ok, that's a great statement, but i think ZFS needs one. Really does.
> > And in my opinion a enhanced zdb would be the solution. Flexibility.
> > Options.
>
> About 99% of the problems reported as "I need ZFS fsck" can be summed up
> by two ZFS bugs:
>
> 1. If a toplevel vdev fails to open, we should be able to pull
>    information from necessary ditto blocks to open the pool and make
>    what progress we can. Right now, the root vdev code assumes "can't
>    open = faulted pool," which results in failure scenarios that are
>    perfectly recoverable most of the time. This needs to be fixed
>    so that pool failure is only determined by the ability to read
>    critical metadata (such as the root of the DSL).
>
> 2. If an uberblock ends up with an inconsistent view of the world (due
>    to failure of DKIOCFLUSHWRITECACHE, for example), we should be able
>    to go back to previous uberblocks to find a good view of our pool.
>    This is the failure mode described by Jeff.
>
> These are both bugs in ZFS and will be fixed.
That's it! It's 100% for me! ;-) One is the "all-or-nothing" problem, and the other is about guilty... ;-))
> There are some interesting possibilities for limited forensic tools - in
> particular, I like the idea of a mdb backend for reading and writing ZFS
> pools[1].
In my opinion it would be great to have the whole functionality in zdb. It's simple, and the concepts are clear in the tool. mdb is a debugger, and I think it needs concepts that are different from those of a tool for reading/fixing filesystems. Just an opinion... Which does not mean we cannot have both. Like I said, flexibility, options... ;-)
> But I haven't actually heard a reasonable proposal for what a
> fsck-like tool
I think we must NOT get stuck on the word "fsck"; I used it just as an example (Lost and Found), and I think other users used it just as an example too. The important thing is the two points you have described very *well*.
> (i.e. one that could "repair" things automatically) would
> actually *do*, let alone how it would work in the variety of situations
> it needs to (compressed RAID-Z?) where the standard ZFS infrastructure
> fails.
>
> - Eric
>
> [1] http://mbruning.blogspot.com/2008/08/recovering-removed-file-on-zfs-disk.html
>
> --
> Eric Schrock, Fishworks
> http://blogs.sun.com/eschrock
Many thanks for your answer! Leal.
Ricardo M. Corr... (Ricardo.M.Correia@Su...)
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted: Oct 10, 2008 1:42 PM, in response to: eschrock
On Fri, 2008-10-10 at 11:23 -0700, Eric Schrock wrote:
> But I haven't actually heard a reasonable proposal for what a
> fsck-like tool (i.e. one that could "repair" things automatically) would
> actually *do*, let alone how it would work in the variety of situations
> it needs to (compressed RAID-Z?) where the standard ZFS infrastructure
> fails.

I'd say an fsck-like tool for ZFS should not worry much about compression, checksums, RAID-Z and whatnot. In essence, it would try to do what an fsck tool does for a typical filesystem, and so would be mostly oblivious to the layout or encoding of the blocks, perhaps treating blocks with failed checksums as blocks full of zeros.
Here's how it could work (of course, this is all easier said than done):
1) Open all the devices specified by the user. Optionally, take just a pool name/guid and scan for the right devices in /dev/[r]dsk.
2) Verify if the pool configuration read from the devices is sane -- if not, try to generate a consistent configuration. Some elements of the pool configuration, such as the correct pool version, could be checked in later steps, depending on features that were found.
3) Starting from the last uberblock, fully traverse a few levels down the tree. If less than 100% of the blocks could be read without errors, do the same for previous uberblocks and offer the user the choice of which uberblock to use, or if running non-interactively, choose the one with the best success rate.
4) Traverse the list/tree of filesystems, snapshots and clones. Make sure that they are well-connected. For each filesystem, try to replay the ZILs, clean them out.
5) Now fully traverse the pool. Compute the space maps and FS space usage on-the-go, as blocks are read.
6) For each metadata block read, check whether the fields are sane, fix them/zero them out if they're not. Basically we're assuming here that we may have corrupted metadata with correct checksums.
If some metadata block can not be read due to a failed checksum, assume the block is full of zeros, and fix it.
By the way, this includes every field of every kind of metadata block, including ZAPs, ACLs, FID maps, znode fields, everything.
For fields that reference other objects, make sure that the object they reference is of the correct type and that the object itself is correct.
For objects that are missing, create empty ones if necessary.
7) Check that every object is referenced somewhere and link unreferenced objects to /lost+found/object-type/, or similar.
8) Probably do other things that I'm forgetting.
9) In the end, check if the space maps are consistent with the ones computed, write correct ones if not. Check that space usage/reservations/quotas are correct.
Essentially, the goal is that at the end of this process, the pool should contain consistent information, should have as much data as could be recovered and should never cause any further errors in ZFS due to invalid metadata/fields; either when importing it, reading from it or writing/modifying it (except that it would still return EIO errors when trying to read corrupted file data blocks, of course).
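Step 3 of the list above (try the newest uberblock, fall back to older ones, and pick the best when running non-interactively) could be sketched roughly as follows. Everything here is an illustrative assumption: the `Uberblock` record, its `readable`/`total` counters, and the txg numbers are made up, and no real ZFS or zdb interface is being described.

```python
from dataclasses import dataclass


@dataclass
class Uberblock:
    txg: int        # transaction group this uberblock belongs to
    readable: int   # blocks read without error in the partial traversal
    total: int      # blocks visited in the partial traversal

    @property
    def success_rate(self) -> float:
        return self.readable / self.total if self.total else 0.0


def pick_uberblock(candidates):
    """Non-interactive choice: the newest uberblock that traverses 100%
    cleanly, otherwise the one with the best success rate (ties go to
    the newer txg)."""
    newest_first = sorted(candidates, key=lambda u: u.txg, reverse=True)
    for ub in newest_first:
        if ub.total and ub.readable == ub.total:
            return ub
    # max() returns the first maximal element, i.e. the newest on a tie.
    return max(newest_first, key=lambda u: u.success_rate)


# Hypothetical traversal results: the latest txg is damaged.
candidates = [
    Uberblock(txg=103, readable=70, total=100),
    Uberblock(txg=102, readable=100, total=100),
    Uberblock(txg=101, readable=100, total=100),
]
chosen = pick_uberblock(candidates)
print(chosen.txg)  # prints 102
```

An interactive tool would instead print the per-uberblock success rates and let the user choose, as the step describes.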
Now, a problem with fsck-like tools, and perhaps especially with ZFS, is that some of these steps may either require lots of memory or multiple filesystem/pool traversals.
I'd say having such a tool, even if it required additional temporary storage for operation (hopefully not a very large fraction of the pool size), would be *very* useful and would clear up any worries that people currently have.
Kind regards, Ricardo
Posts: 168, Registered: 7/20/06
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted: Nov 29, 2008 3:49 AM, in response to: eschrock
> About 99% of the problems reported as "I need ZFS fsck" can be summed up
> by two ZFS bugs:
>
> 1. If a toplevel vdev fails to open, we should be able to pull
>    information from necessary ditto blocks to open the pool and make
>    what progress we can. Right now, the root vdev code assumes "can't
>    open = faulted pool," which results in failure scenarios that are
>    perfectly recoverable most of the time. This needs to be fixed
>    so that pool failure is only determined by the ability to read
>    critical metadata (such as the root of the DSL).
>
> 2. If an uberblock ends up with an inconsistent view of the world (due
>    to failure of DKIOCFLUSHWRITECACHE, for example), we should be able
>    to go back to previous uberblocks to find a good view of our pool.
>    This is the failure mode described by Jeff.
>
> *These are both bugs in ZFS and will be fixed.*

I totally agree that these cover most of the corruptions we have had in the past. Any news about those bugs in recent Nevada releases?
Can anyone provide us with a detailed procedure to "go back to previous uberblocks to find a good view of our pool" as described by Jeff?
Thanks gino
Miles Nordin (carton@Ivy.NET)
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted: Oct 10, 2008 10:58 AM, in response to: Ricardo M. Corr...
>>>>> "jb" == Jeff Bonwick <Jeff dot Bonwick at sun dot com> writes:
>>>>> "rmc" == Ricardo M Correia <Ricardo dot M dot Correia at Sun dot COM> writes:

jb> We need a little more Code of Hammurabi in the storage
jb> industry.

It seems like most of the work people have to do now is cleaning up after the sloppiness of others. At least it takes the longest.
You could always mention which disks you found ignoring the command---wouldn't that help the overall problem? I understand there's a pervasive ``i don' wan' any trouble, mistah'' attitude, but I don't understand where it comes from.
http://www.ferris.edu/news/jimcrow/tom/
jb> displacement flush for disk caches that ignore the sync
jb> command.
Sounds like a good idea but:
(1) won't this break the NFS guarantees you were just saying should never be broken?
I get it, someone else is breaking a standard so how can ZFS be expected to yadda yadda yadda. But I fear it will just push ``blame the sysadmin'' one step further out. ex., Q. ``with ZFS all my NFS clients become unstable after the server reboots,'' or ``I'm getting silent corruption with NFS''. A. ``your drives might have gremlins in them, no way to know,'' and ``well what do you expect without a single integrity domain and TCP's weak checksums. / no i'm using a crossover cable, and FCS is not weak. / ZFS managing a layer of redundancy it is probably your RAM or corruption on the uh, between the Ethernet MAC chip and the PCI slot''
(1a) I'm concerned about how it'll be reported when it happens.
(a) if it's not reported at all, then ZFS is hiding the fact that fsync() is not working. Also, other journaling filesystems sometimes report when they find ``unexpected'' corruption, which is useful for finding both hardware and software problems.
I'm already concerned ZFS is not reporting enough, like when it says a vdev component is ONLINE, but 'zpool offline pool <component>' says 'no valid replicas', then after a scrub there is no change to zpool status, but zpool offline works again.
ZFS should not ``simplify'' the user interface to the point that it's hiding problems with itself and its environment to the ends of avoiding discussion.
(b) if it is reported, then whenever the reporter-blob raises its hand it will have the effect of exonerating ZFS in most people's minds, like the stupid CKSUM column does right now. ``ZFS-FEED-B33F error? oh yeah that's the new ueberblock search code. that means your disks are ignoring the SYNCHRONIZE CACHE command. thank GOD you have ZFS with ANY OTHER FILESYSTEM all bets would be totally off. lucky you. / I have tried ten different models from all four brands. / yeah sucks don't it? flagrant violation of the standard, industry wide. / my linux testing tool says they're obeying the command fine / linux is **** / i added a patch to solaris to block the SYNC CACHE command and the disks got faster so I think it's not being ignored / well the stack is complicated and flushing happens at many levels, like think about controller performance, and that's completely unsupported you are doing something REALLY UNSAFE there you should NOT DO THAT it is STUPID'' and so on, stalling the actual fix literally for years.
The right way to exonerate ZFS is to make a diagnosis tool for the disks which proves they're broken, and then don't buy those disks. not to make a new class of ZFS fault report that could potentially capture all kinds of problems, then hazily assign blame to an untestable quantity.
(2) disks are probably not the only thing dropping the write barriers. So far, we're also suspecting (unproven!) iSCSI targets/initiators, particularly around a TCP reconnection event or target reboot. and VM stacks, both VirtualBox and the HVM in UltraSPARC T1. probably other stuff.
I'm concerned that assumptions you'll find safe to make about disks after you get started, like nothing is more than 1s stale, or send a CDB to size the on-disk cache and imagine it's a FIFO and it'll be no worse than that, or ``you can get an fsync by pausing reads for 500ms'' or whatever, will add robustness for current and future broken disks but won't apply to other types of broken storage layer.
rmc> However, it is not so resilient when the storage system
rmc> suffers hiccups which cause phantom writes to occur
rmc> continuously, even if for a small period of time (say less
rmc> than 10 seconds), and then return to normal.
ha! that is a great idea. temporal ditto blocks: Important writes should be written, aged in RAM for 1 minute, then rewritten. :) This will help with latent sector errors caused by powersag/vibration too. but...Even I will admit at some point you have to give up and let the filesystem get corrupted.
actually I'm more in the camp of making ZFS fragile to incorrect storage stacks, and offering an offline recovery tool that treats the corrupt pool as read-only and copies it into a new filesystem (so you need a second same-size empty pool to use the tool). I like this painful way better than fsck-like things, and much better than silent workarounds. but i'm probably in the wrong camp on this one.
My reasoning is, we will not be ultimately happy with a filesystem where fsync() is broken, and that's the best you can do. To compete with Netapp, we need to bang on this thing until it's actually working. So far I think sysadmins are receptive to the idea they need to fix <...> about their setup, or make purchases with extreme care, or do testing before production. We are not lazy and do not expect an appliance-on-a-CD.
it's just that pass-the-buck won't ever deliver something useful. When ext3 was corrupting filesystems on laptops, ext3 got blamed, and ext3 was not at the root of the problem. But no one _accepted_ that ext3 was correctly-coded until the overall problem was fixed. (IIRC it was: you need to send drives a stop-unit command before sending the ACPI powerdown, because even if they ignore synchronize-cache they do still flush when told to stop-unit)
It's proper to have a strict separation between ``unclean shutdown'' and ``recovery from corruption''. UFS does have the separation between log-rolling and fsck-ing, but ZFS could detect the difference between unclean shutdown and corruption a lot better than UFS, and that's good. Currently ZFS seems to detect it by telling you ``pool's corrupt. <shrug>, destroy it.''---the fact that the recovery tool is entirely absent isn't good, but keeping recovery actions like this ueberblock-search strictly separate makes delivering something truly correct on the ``unclean shutdown'' front more likely.
I think, if iSCSI target/initiator combinations are silently discarding 10sec worth of writes (ex., when they drop and reconnect their TCP session), then this needs to be proven and their implementation can be and needs to be corrected, not speculated on and then worked around.
And I bet this same beefing-up performance numbers by discarding cache flushes is as rampant in the virtualization game as in the hard disk game.
Posts: 130, Registered: 12/20/07
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted: Nov 30, 2008 8:22 AM, in response to: bonwick
It would be extremely helpful to know what brands/models of disks lie and which don't. This information could be provided diplomatically simply as threads documenting problems you are working on, stating the facts. Use of a specific string of words would make searching for it easy. There should be no liability, since you are simply documenting compatibility with zfs.
Or perhaps if the lawyers let you, you could simply publish a compatibility/incompatibility list. These ARE facts.
If there is a way to make a detection tool, that would be very useful too, although after the purchase is made, it could be hard to send it back. However that info could be fed into the database as that drive/model being incompatible with zfs.
As Solaris / zfs gains ground, this could become a strong driver in the industry.
Re: I'll run tests with known-broken disks to determine how far back we need to go in practice -- I'll bet one txg is almost always enough.
So go back three - we are using zfs because we want absolute reliability (or at least as close as we can get).
--Ray
Posts: 168, Registered: 7/20/06
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Posted: Feb 7, 2009 5:54 AM, in response to: bonwick
> FYI, I'm working on a workaround for broken devices. As you note,
> some disks flat-out lie: you issue the synchronize-cache command,
> they say "got it, boss", yet the data is still not on stable storage.
> Why do they do this? Because "it performs better". Well, duh --
> you can make stuff *really* fast if it doesn't have to be correct.
>
> The uberblock ring buffer in ZFS gives us a way to cope with this,
> as long as we don't reuse freed blocks for a few transaction groups.
> The basic idea: if we can't read the pool starting from the most
> recent uberblock, then we should be able to use the one before it,
> or the one before that, etc, as long as we haven't yet reused any
> blocks that were freed in those earlier txgs. This allows us to
> use the normal load on the pool, plus the passage of time, as a
> displacement flush for disk caches that ignore the sync command.
>
> If we go back far enough in (txg) time, we will eventually find an
> uberblock all of whose dependent data blocks have made it to disk.
> I'll run tests with known-broken disks to determine how far back we
> need to go in practice -- I'll bet one txg is almost always enough.
>
> Jeff
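The condition in the quoted text (an older uberblock stays usable only while no block freed after it has been reused) can be expressed as a small toy model. This is a sketch under loud assumptions: `safe_rollback_targets`, the `freed_in` map, the `reused` set, and the ring size are all hypothetical names and numbers invented for illustration; they do not describe how ZFS actually tracks frees.

```python
def safe_rollback_targets(current_txg, ring_size, freed_in, reused):
    """Return the txgs (newest first) whose uberblock could still be used.

    freed_in: dict mapping block id -> txg in which the block was freed
    reused:   set of block ids that have since been overwritten

    The tree rooted at txg T still references every block freed in a
    later txg, so T is only safe if none of those blocks was reused.
    """
    targets = []
    for txg in range(current_txg, current_txg - ring_size, -1):
        still_needed = {blk for blk, freed in freed_in.items() if freed > txg}
        if still_needed & reused:
            break  # older txgs need a superset of these blocks, so stop here
        targets.append(txg)
    return targets


# Hypothetical history: blocks a, b, c were freed in txgs 100, 99, 98,
# and only c has already been handed out again.
freed_in = {"a": 100, "b": 99, "c": 98}
reused = {"c"}
print(safe_rollback_targets(100, 4, freed_in, reused))  # prints [100, 99, 98]
```

Deferring reuse of freed blocks for a few txgs keeps `reused` empty for recent frees, which is what makes the older uberblocks valid fallbacks in this model.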
Hi Jeff, we just lost 2 pools on snv91. Any news about your workaround for recovering pools by discarding the last txg?
thanks gino
Miles Nordin (carton@Ivy.NET)
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted: Dec 12, 2008 12:10 PM, in response to: mgerdts
>>>>> "tt" == Toby Thain <toby at telegraphics dot com dot au> writes:
>>>>> "mg" == Mike Gerdts <mgerdts at gmail dot com> writes:

tt> I think we have to assume Anton was joking - otherwise his
tt> measure is uselessly unscientific.
I think it's rude to talk about someone who's present in the third person, especially when you're trying to minimize his view. Were you joking, Anton? :)
0. The reports I read were not useless in the way some have stated, because for example Mike sampled his own observations:
mg> In the past year I've lost more ZFS file systems than I have
mg> any other type of file system in the past 5 years. With other
mg> file systems I can almost always get some data back. With ZFS
mg> I can't get any back.
It's not just bloggers and pundits sampling mailing list traffic. I thought there was at least one other post like this but could not find it.
1. I don't think your impressions nor Anton's and mine are ``useless''
2. I don't think your positive impression is any more scientific than his and my skeptical one.
3. I'm in general troubled by reports of corruption that aren't well-investigated, because this will stop young, fragile filesystems from becoming old and robust. BUT....
4. I'm less troubled by (3) because a few of the corruption reports were well-investigated by Victor, and he recovered them manually and posted a summary here:
http://mail.opensolaris.org/pipermail/zfs-discuss/2008-October/051643.html
and how the experience might inform ZFS improvements:
http://mail.opensolaris.org/pipermail/zfs-discuss/2008-October/051667.html
5. I'm more troubled again because everyone seems to have forgotten (4). Mike, Victor, and others can't necessarily repeat themselves every time this thread's resurrected. If yapping mailing list monkeys like me don't remember this experience, invested-wishing and marketing white papers will drown out the experience we're getting.
I've pointed straight at an unfixed corruption problem that's biting ZFS users, and the discussion about where to place the blame and how to fix it. It is not fixed now, yet pundits on-list and all over the Interweb like here:
http://www.kev009.com/wp/2008/11/on-file-systems/
talk about corruption bugs hazily and say ``most of all that's been fixed'' when it's not so hazy and hasn't been, then focus on theoretical unrealized capabilities of the on-disk format and minimize this clear experience into ghostly distant-past rumor.
I don't see when the single-LUN SAN corruption problems were fixed. I think the supposed ``silent FC bit flipping'' basis for the ``use multiple SAN LUN's'' best-practice is revoltingly dishonest, that we _know_ better. I'm not saying devices aren't guilty---Sun's sun4v IO virtualizer was documented as guilty of ignoring cache flushes to inflate performance just like the loomingly-unnamed models of lying SATA drives:
http://mail.opensolaris.org/pipermail/zfs-discuss/2008-October/051735.html
Is a storage-stack-related version of this problem the cause of lost single-LUN SAN pools? maybe, maybe not, but either way we need an end-to-end solution. I don't currently see an end-to-end solution to this pervasive blame-the-device mantra every time a pool goes bad.
I keep digging through the archives to post messages like this because I feel like everyone only wants to have happy memories, and that it's going to bring about a sad end.
Posts: 208, From: Cape Town, South Africa, Registered: 11/18/06
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted: Dec 12, 2008 12:38 PM, in response to: Miles Nordin
On Fri, Dec 12, 2008 at 10:10 PM, Miles Nordin <carton at ivy dot net> wrote:
> 0. The reports I read were not useless in the way some have stated,
> because for example Mike sampled his own observations:
[snip]
> I don't see when the single-LUN SAN corruption problems were fixed. I
> think the supposed ``silent FC bit flipping'' basis for the ``use
> multiple SAN LUN's'' best-practice is revoltingly dishonest, that we
> _know_ better. I'm not saying devices aren't guilty---Sun's sun4v IO
> virtualizer was documented as guilty of ignoring cache flushes to
> inflate performance just like the loomingly-unnamed models of lying
> SATA drives:
> http://mail.opensolaris.org/pipermail/zfs-discuss/2008-October/051735.html
> Is a storage-stack-related version this problem the cause of lost
> single-LUN SAN pools? maybe, maybe not, but either way we need an
> end-to-end solution. I don't currently see an end-to-end solution to
> this pervasive blame-the-device mantra every time a pool goes bad.
> I keep digging through the archives to post messages like this because
> I feel like everyone only wants to have happy memories, and that it's
> going to bring about a sad end.
Thank you.
There are so many unsupported claims and so much noise on both sides that everybody is sounding like a bunch of fanboys.
The only bit that I understand about why HW RAID "might" be bad is that if ZFS had access to the disks behind a HW RAID LUN, then _IF_ ZFS were to encounter corrupted data in a read, it would probably be able to reconstruct that data. This is at the cost of doing the parity calculations on a general-purpose CPU, and then sending that parity data, as well as the data to write, across the wire. Some of that cost may be offset against RAID-Z's optimizations over RAID-5 in some situations, but all of this is pretty much if-then-maybe territory.
I also understand that HW RAID arrays have some vulnerabilities and weaknesses, but those seem to be offset against ZFS's notorious instability during error conditions. I say notorious because of all the open bug reports and reports on the list of I/O hanging and/or systems panicking while waiting for ZFS to realize that something has gone wrong.
I think if this last point can be addressed (make ZFS respond MUCH faster to failures), it will go a long way toward making ZFS more readily adopted.
-- Any sufficiently advanced technology is indistinguishable from magic.
Arthur C. Clarke My blog: http://initialprogramload.blogspot.com
Posts: 126, From: CA, Registered: 1/19/06
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted: Dec 12, 2008 12:51 PM, in response to: hartz
On 12-Dec-08, at 3:38 PM, Johan Hartzenberg wrote:
> ...
> The only bit that I understand about why HW raid "might" be bad is that if it had access to the disks behind a HW RAID LUN, then _IF_ zfs were to encounter corrupted data in a read, it will probably be able to re-construct that data. This is at the cost of doing the parity calculations on a general purpose CPU,
Except that it's not just parity - ZFS checksums where RAID-N does not (although I've heard that some RAID systems checksum "somewhere" - not end-to-end of course).
Call me a fanboy if you will, but ZFS is different from hw RAID. I am not an "automatic denier" of ZFS bugs or flaws, but I do acknowledge it's more revolution than evolution. It's software. We only need be patient while it matures. :)
--Toby
|
|
|
|
Posts:
874
From:
US
Registered:
8/19/08
|
|
|
|
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted:
Dec 12, 2008 1:16 PM
in response to: qu1j0t3
|
|
On Fri, 12 Dec 2008, Toby Thain wrote: > > Except that it's not just parity - ZFS checksums where RAID-N does not > (although I've heard that some RAID systems checksum "somewhere" - not > end-to-end of course).
It will soon be quite easy to build a RAID system like this using OpenSolaris and a sub-project known as COMSTAR. The checksums will be done using a storage technology called ZFS.
Bob ====================================== Bob Friesenhahn bfriesen at simple dot dallas dot tx dot us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
|
|
|
|
Posts:
649
From:
US
Registered:
8/21/06
|
|
|
|
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted:
Dec 12, 2008 1:30 PM
in response to: qu1j0t3
|
|
On Fri, Dec 12, 2008 at 2:51 PM, Toby Thain <toby at telegraphics dot com dot au> wrote:
On 12-Dec-08, at 3:38 PM, Johan Hartzenberg wrote: ... The only bit that I understand about why HW raid "might" be bad is that if it had access to the disks behind a HW RAID LUN, then _IF_ zfs were to encounter corrupted data in a read, it will probably be able to re-construct that data. This is at the cost of doing the parity calculations on a general purpose CPU,
Except that it's not just parity - ZFS checksums where RAID-N does not (although I've heard that some RAID systems checksum "somewhere" - not end-to-end of course).
Call me a fanboy if you will, but ZFS is different from hw RAID. I am not an "automatic denier" of ZFS bugs or flaws, but I do acknowledge it's more revolution than evolution. It's software. We only need be patient while it matures. :)
--Toby
I'm going to pitch in here as devil's advocate and say this is hardly revolution. 99% of what zfs is attempting to do is something NetApp and WAFL have been doing for 15 years+. Regardless of the merits of their patents and prior art, etc., this is not something revolutionarily new. It may be "revolution" in the sense that it's the first time it's come to open source software and been given away, but it's hardly "revolutionary" in file systems as a whole.
--Tim
|
|
|
|
Posts:
1,760
From:
NZ
Registered:
4/27/05
|
|
|
|
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted:
Dec 12, 2008 1:36 PM
in response to: tcook
|
|
Tim wrote: > > > On Fri, Dec 12, 2008 at 2:51 PM, Toby Thain <toby at telegraphics dot com dot au > <mailto:toby at telegraphics dot com dot au>> wrote: > > > On 12-Dec-08, at 3:38 PM, Johan Hartzenberg wrote: > >> ... >> The only bit that I understand about why HW raid "might" be bad >> is that if it had access to the disks behind a HW RAID LUN, then >> _IF_ zfs were to encounter corrupted data in a read, it will >> probably be able to re-construct that data. This is at the cost >> of doing the parity calculations on a general purpose CPU, > > Except that it's /not just parity/ - ZFS checksums where RAID-N > does not (although I've heard that some RAID systems checksum > "somewhere" - not end-to-end of course). > > Call me a fanboy if you will, but ZFS is different from hw RAID. I > am not an "automatic denier" of ZFS bugs or flaws, but I do > acknowledge it's more /revolution/ than evolution. It's software. > We only need be patient while it matures. :) > > --Toby > > > I'm going to pitch in here as devil's advocate and say this is hardly > revolution. 99% of what zfs is attempting to do is something NetApp > and WAFL have been doing for 15 years+.
The ideas aren't new, but the combination of the ideas is. NetApp is still a box at the end of a bit of wire that the OS has to blindly trust.
-- Ian.
|
|
|
|
Posts:
649
From:
US
Registered:
8/21/06
|
|
|
|
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted:
Dec 12, 2008 2:00 PM
in response to: ian
|
|
On Fri, Dec 12, 2008 at 3:36 PM, Ian Collins <ian at ianshome dot com> wrote:
The ideas aren't new, but the combination of the ideas is. NetApp is
still a box at the end of a bit of wire that the OS has to blindly trust.
--
Ian.
I'm not aware of many, if any large shops that are moving to a model of "all internal disk with applications running on them". The sun box will just be "a box at the end of the wire", a-la storage 7000 when it's an nfs/cifs/iscsi target. Centralized storage is a *good thing*.
--Tim
|
|
|
|
Posts:
1,760
From:
NZ
Registered:
4/27/05
|
|
|
|
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted:
Dec 12, 2008 2:11 PM
in response to: tcook
|
|
Tim wrote: > > > On Fri, Dec 12, 2008 at 3:36 PM, Ian Collins <ian at ianshome dot com > <mailto:ian at ianshome dot com>> wrote: > > > > The ideas aren't new, but the combination of the ideas is. NetApp is > still a box at the end of a bit of wire that the OS has to blindly > trust. > > -- > Ian. > > > > I'm not aware of many, if any large shops that are moving to a model > of "all internal disk with applications running on them". The sun box > will just be "a box at the end of the wire", a-la storage 7000 when > it's an nfs/cifs/iscsi target. Centralized storage is a *good thing*. >
Maybe, but I'm sure that will change as the performance of the storage subsystems continues to exceed the performance of the bit of wire.
That's where the revolution bit comes in; applications can now coexist with NetApp quality storage management.
-- Ian.
|
|
|
|
Posts:
129
From:
US
Registered:
3/9/05
|
|
|
|
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted:
Dec 12, 2008 6:16 PM
in response to: tcook
|
|
> I'm going to pitch in here as devil's advocate and say this is hardly > revolution. 99% of what zfs is attempting to do is something NetApp and > WAFL have been doing for 15 years+. Regardless of the merits of their > patents and prior art, etc., this is not something revolutionarily new. It > may be "revolution" in the sense that it's the first time it's come to open > source software and been given away, but it's hardly "revolutionary" in file > systems as a whole.
"99% of what ZFS is attempting to do?" Hmm, OK -- let's make a list:
end-to-end checksums
unlimited snapshots and clones
O(1) snapshot creation
O(delta) snapshot deletion
O(delta) incremental generation
transactionally safe RAID without NVRAM
variable blocksize
block-level compression
dynamic striping
intelligent prefetch with automatic length and stride detection
ditto blocks to increase metadata replication
delegated administration
scalability to many cores
scalability to huge datasets
hybrid storage pools (flash/disk mix) that optimize price/performance
How many of those does NetApp have? I believe the correct answer is 0%.
Jeff
|
|
|
|
Posts:
649
From:
US
Registered:
8/21/06
|
|
|
|
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted:
Dec 13, 2008 1:15 AM
in response to: bonwick
|
|
On Fri, Dec 12, 2008 at 8:16 PM, Jeff Bonwick <Jeff dot Bonwick at sun dot com> wrote:
> I'm going to pitch in here as devil's advocate and say this is hardly
> revolution. 99% of what zfs is attempting to do is something NetApp and
> WAFL have been doing for 15 years+. Regardless of the merits of their
> patents and prior art, etc., this is not something revolutionarily new. It
> may be "revolution" in the sense that it's the first time it's come to open
> source software and been given away, but it's hardly "revolutionary" in file
> systems as a whole.
"99% of what ZFS is attempting to do?" Hmm, OK -- let's make a list:
end-to-end checksums
unlimited snapshots and clones
O(1) snapshot creation
O(delta) snapshot deletion
O(delta) incremental generation
transactionally safe RAID without NVRAM
variable blocksize
block-level compression
dynamic striping
intelligent prefetch with automatic length and stride detection
ditto blocks to increase metadata replication
delegated administration
scalability to many cores
scalability to huge datasets
hybrid storage pools (flash/disk mix) that optimize price/performance
How many of those does NetApp have? I believe the correct answer is 0%.
Jeff
Seriously? Do you know anything about the NetApp platform? I'm hoping this is a genuine question...
Off the top of my head nearly all of them. Some of them have artificial limitations because they learned the hard way that if you give customers enough rope they'll hang themselves. For instance "unlimited snapshots". Do I even need to begin to tell you what a horrible, HORRIBLE idea that is? "Why can't I get my space back?" Oh, just do a snapshot list and figure out which one is still holding the data. What? Your console locks up for 8 hours when you try to list out the snapshots? Huh... that's weird.
It's sort of like that whole "unlimited filesystems" thing. Just don't ever reboot your server, right? Or "you can have 40pb in one pool!!!". How do you back it up? Oh, just mirror it to another system? And when you hit a bug that toasts both of them you can just start restoring from tape for the next 8 years, right? Or if by some luck we get a zfsiron, you can walk the metadata for the next 5 years.
NVRAM has been replaced by flash drives in a ZFS world to get any kind of performance... so you're trading one high priced storage for another. Your snapshot creation and deletion is identical. Your incremental generations is identical. End-to-end checksums? Yup.
Let's see... they don't have block-level compression, they chose dedup instead which nets better results. "Hybrid storage pool" is achieved through PAM modules. Outside of that... I don't see ANYTHING in your list they didn't do first.
--Tim
|
|
|
|
Posts:
129
From:
US
Registered:
3/9/05
|
|
|
|
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted:
Dec 13, 2008 2:01 AM
in response to: tcook
|
|
> Off the top of my head nearly all of them. Some of them have artificial > limitations because they learned the hard way that if you give customers > enough rope they'll hang themselves. For instance "unlimited snapshots".
Oh, that's precious! It's not an arbitrary limit, it's a safety feature!
> Outside of that... I don't see ANYTHING in your list they didn't do first.
Then you don't know ANYTHING about either platform. Constant-time snapshots, for example. ZFS has them; NetApp's are O(N), where N is the total number of blocks, because that's how big their bitmaps are. If you think O(1) is not a revolutionary improvement over O(N), then not only do you not know much about either snapshot algorithm, you don't know much about computing.
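To make the complexity argument above concrete, here is a minimal toy sketch in Python. It is not either vendor's implementation: `BitmapVolume` and `BirthTimeVolume` are hypothetical names, and the only thing the sketch takes from the post is the claim that a snapshot which copies a per-block bitmap does work proportional to the number of blocks, while a copy-on-write design that only records the current root does constant work.

```python
class BitmapVolume:
    """Toy volume whose snapshot copies a flag for every block (assumed model)."""

    def __init__(self, total_blocks):
        self.in_use = [False] * total_blocks
        self.snapshots = []

    def snapshot(self):
        # One entry is copied per block, so the work grows with volume size: O(N).
        self.snapshots.append(list(self.in_use))
        return len(self.in_use)  # number of entries touched


class BirthTimeVolume:
    """Toy copy-on-write volume: every write creates a new root, old data is kept."""

    def __init__(self):
        self.txg = 0          # transaction counter (hypothetical name)
        self.root = None      # (birth txg, data, previous root)
        self.snapshots = []

    def write(self, data):
        self.txg += 1
        # The old chain is never modified, so earlier roots stay valid.
        self.root = (self.txg, data, self.root)

    def snapshot(self):
        # Remembering the current root is a single append: O(1).
        self.snapshots.append((self.txg, self.root))
        return 1  # number of entries touched
```

In this model `BitmapVolume(1000).snapshot()` touches 1000 entries, while `BirthTimeVolume.snapshot()` touches one entry no matter how much has been written. The sketch says nothing about the cost of listing, deleting, or booting with snapshots.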
Sorry, everyone else, for feeding the troll. Chum the water all you like, I'm done with this thread.
Jeff
|
|
|
|
Bryan Cantrill
bmc@eng.sun.com
|
|
|
|
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted:
Dec 13, 2008 7:54 AM
in response to: tcook
|
|
> Seriously? Do you know anything about the NetApp platform? I'm hoping this > is a genuine question... > > Off the top of my head nearly all of them. Some of them have artificial > limitations because they learned the hard way that if you give customers > enough rope they'll hang themselves. For instance "unlimited snapshots". > Do I even need to begin to tell you what a horrible, HORRIBLE idea that is? > "Why can't I get my space back?" Oh, just do a snapshot list and figure out > which one is still holding the data. What? Your console locks up for 8 > hours when you try to list out the snapshots? Huh... that's weird. > > It's sort of like that whole "unlimited filesystems" thing. Just don't ever > reboot your server, right? Or "you can have 40pb in one pool!!!". How do > you back it up? Oh, just mirror it to another system? And when you hit a > bug that toasts both of them you can just start restoring from tape for the > next 8 years, right? Or if by some luck we get a zfsiron, you can walk the > metadata for the next 5 years. > > NVRAM has been replaced by flash drives in a ZFS world to get any kind of > performance... so you're trading one high priced storage for another. Your > snapshot creation and deletion is identical. Your incremental generations > is identical. End-to-end checksums? Yup. > > Let's see... they don't have block-level compression, they chose dedup > instead which nets better results. "Hybrid storage pool" is achieved > through PAM modules. Outside of that... I don't see ANYTHING in your list > they didn't do first.
Wow -- I've spoken to many NetApp partisans over the years, but you might just take the cake. Of course, most of the people I talk to are actually _using_ NetApp's technology, a practice that tends to leave even the most stalwart proponents realistic about the (many) limitations of NetApp's technology...
For example, take the PAM. Do you actually have one of these, or are you basing your thoughts on reading whitepapers? I ask because (1) they are horrifically expensive (2) they don't perform that well (especially considering that they're DRAM!) (3) they're grossly undersized (a 6000 series can still only max out at a paltry 96G -- and that's with virtually no slots left for I/O) and (4) they're not selling well. So if you actually bought a PAM, that already puts you in a razor-thin minority of NetApp customers (most of whom see through the PAM and recognize it for the kludge that it is); if you bought a PAM and think that it's somehow a replacement for the ZFS hybrid storage pool (which has an order of magnitude more cache), then I'm sure NetApp loves you: you must be the dumbest, richest customer that ever fell in their lap!
- Bryan
-------------------------------------------------------------------------- Bryan Cantrill, Sun Microsystems Fishworks. http://blogs.sun.com/bmc
|
|
|
|
Posts:
874
From:
US
Registered:
8/19/08
|
|
|
|
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted:
Dec 13, 2008 8:03 AM
in response to: tcook
|
|
On Sat, 13 Dec 2008, Tim wrote: > > Seriously? Do you know anything about the NetApp platform? I'm hoping this > is a genuine question...
I believe that esteemed Sun engineers like Jeff are quite familiar with the NetApp platform. Besides NetApp being one of the primary storage competitors, it is a virtual minefield out there and one must take great care not to step on other company's patents.
> Off the top of my head nearly all of them. Some of them have artificial > limitations because they learned the hard way that if you give customers > enough rope they'll hang themselves. For instance "unlimited snapshots". > Do I even need to begin to tell you what a horrible, HORRIBLE idea that is? > "Why can't I get my space back?" Oh, just do a snapshot list and figure out > which one is still holding the data. What? Your console locks up for 8 > hours when you try to list out the snapshots? Huh... that's weird.
I suggest that you retire to the safety of the rubber room while the rest of us enjoy these zfs features. By the same measures, you would advocate that people should never be allowed to go outside due to the wide open spaces. Perhaps people will wander outside their homes and forget how to make it back. Or perhaps there will be gravity failure and some of the people outside will be lost in space.
There is some activity off the starboard bow, perhaps you should check it out ...
Bob ====================================== Bob Friesenhahn bfriesen at simple dot dallas dot tx dot us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
|
|
|
|
Joseph Zhou
jz@excelsioritsoluti...
|
|
|
|
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted:
Dec 13, 2008 3:14 PM
in response to: bfriesen
|
|
Hi Bob, Tim, Jeff, you are all my friends, and you all know what you are talking about. As a friend, and trusting your personal integrity, I ask you, please, don't get mad, enjoy the open discussion.
(ok, ok, O(1) is revolutionary in tech thinking, just not revolutionary in end customer value. And safety features are important in risk management for enterprises.)
I have friends at NetApp, and there are people there that I don't give a ****.
I am an enterprise architect, I don't care about the little environments that can be fulfilled most effectively by any one operating environment's applications. They are not enterprises and are risky in that business model in economic downturns.
In that spirit, and looking at the NetApp virtual server support architecture, I would say -- as much as the ONTAP/WAFL thing (even with GX integration) is elegant, it would make more sense to utilize the file system capabilities with kernel integration to hypervisors, in virtual server deployments, instead of promoting a storage-device-based file system and data management solution (more proprietary at the solution level).
So, in my position, NetApp PiT is not as good as ZFS PiT, because it is too far from the hypervisor. You can support me or attack me with more technical details (if you know NetApp is developing an API for all server hypervisors, I don't). And don't worry, I have the biggest eagle, but so far, no one has been able to hurt that. ;-)
Best, z
|
|
|
|
Posts:
874
From:
US
Registered:
8/19/08
|
|
|
|
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted:
Dec 13, 2008 9:45 PM
in response to: Joseph Zhou
|
|
On Sat, 13 Dec 2008, Joseph Zhou wrote: > > In that spirit, and looking at the NetApp virtual server support > architecture, I would say -- > as much as the ONTAP/WAFL thing (even with GX integration) is elegant, it > would make more sense to utilize the file system capabilities with kernal > integration to hypervisors, in virtual server deployments, instead of > promoting a storage-device-based file system and data management solution > (more proprietary at the solution level).
I am not an enterprise architect but I do agree that when multiple client OSs are involved it is still useful if storage looks like a legacy disk drive. Luckily Solaris already offers iSCSI in Solaris 10 and OpenSolaris is now able to offer high performance fiber channel target and fiber channel over ethernet layers on top of reliable ZFS. The full benefit of ZFS is not provided, but the storage is successfully divorced from the client with a higher degree of data reliability and performance than is available from current firmware based RAID arrays.
Bob ====================================== Bob Friesenhahn bfriesen at simple dot dallas dot tx dot us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
|
|
|
|
Miles Nordin
carton@Ivy.NET
|
|
|
|
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted:
Dec 15, 2008 2:04 PM
in response to: Joseph Zhou
|
|
|
|
>>>>> "bc" == Bryan Cantrill <bmc at eng dot sun dot com> writes: >>>>> "jz" == Joseph Zhou <jz at excelsioritsolutions dot com> writes:
bc> most of the people I talk to are actually _using_ NetApp's bc> technology, a practice that tends to leave even the most bc> stalwart proponents realistic about the (many) limitations of bc> NetApp's
same applies to ZFS pundits!
As Tim said, the one-filesystem-per-user thing is not working out. O(1) for number of filesystems would be great but isn't there.
Maybe the format allows unlimited O(1) snapshots, but it's at best O(1) to take them. All over the place it's probably O(n) or worse to _have_ them: to boot with them, to scrub with them.
I think the winning snapshot architecture is more like source code revision control: take infinitely-granular snapshots, a continuous line, and run a cron service to trim the line into a series of points.
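The "trim the line into a series of points" idea can be sketched as a thinning policy. The following Python is only an illustration under assumptions that are not in the post: the function name `thin`, the retention tiers (everything from the last hour, one snapshot per hour for a day, one per day after that), and the use of plain integer timestamps are all hypothetical.

```python
def thin(timestamps, now, hour=3600, day=86400):
    """Return the snapshot timestamps to keep; everything else may be destroyed.

    Assumed policy: keep all snapshots younger than an hour, the newest
    snapshot in each clock hour for the last day, and the newest snapshot
    in each day before that.
    """
    keep = {}
    for ts in sorted(timestamps):
        age = now - ts
        if age < hour:
            bucket = ("all", ts)           # every recent snapshot is its own bucket
        elif age < day:
            bucket = ("hour", ts // hour)  # one survivor per hour
        else:
            bucket = ("day", ts // day)    # one survivor per day
        keep[bucket] = ts                  # iterating in order, the newest wins
    return sorted(keep.values())
```

A cron job would call something like this periodically and destroy whatever is not in the returned list. How well that works in practice depends on the cost of listing and destroying snapshots, which is exactly what is disputed in this thread.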
The management can be delegated, but inspection commands are not safe and can lock the whole filesystem, and 'zfs recv'ing certain streams panics the whole box so backup cannot really be safely delegated either. The panic-on-import problems are bad for delegation because you can't safely let users mount things, which to my view is where delegated administration begins. It's too unstable to think of delegating anything---it's all just UI baloney until the panics are fixed and failures are contained within one pool.
The scalability to multiple cores goals are admirable, but only certain things are parallelized. You can only replace one device at a time, which some day will not be enough to keep up with natural failure rates. I think 'zfs send' does not use multiple cores well, right? AIUI people are getting non-scaling performance in send/recv while the ordinary filesystem performance does scale, and thus getting painted into a corner.
Yeah there's compression, but as Tim said people are getting more savings from dedup, which goes naturally with writeable clones too. Also, the NetApp dedup is a background thread while the ZFS compression is synchronous with writing, does not scale to multiple cores, and seems to have some bugs in the gzip version.
Yeah there is some hierarchical storage in it, but after half a year still a slog cannot be removed?
In general I think ZFS pundits compliment the architecture and not the implementation.
The big compliment I have for it is just that the ZFS piece is free software, even though large chunks of OpenSolaris aren't. That's a gigantic advantage, especially over NetApp, which probably has about as much long-term future as Lisp.
jz> As a friend, and trusting your personal integrity, I ask you, jz> please, don't get mad, enjoy the open discussion.
Joseph, I don't see the problem and think it's fine to get excited so long as actual information comes out. There's nothing ad-hominem in the discussion yet, and being ordered not to get mad will make any normal person furious, especially if you make the order based on ``trust'' and ``personal integrity''---why bring up such things at all? I almost feel like you're baiting them! I know it's normal for sysadmins to be dry and menial, but it's still a technical discussion, so I hope it doesn't upset anyone because it's not boring.
|
|
|
|
Posts:
3,468
From:
US
Registered:
6/15/05
|
|
|
|
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted:
Dec 15, 2008 2:12 PM
in response to: Miles Nordin
|
|
On Mon, Dec 15, 2008 at 05:04:03PM -0500, Miles Nordin wrote: > As Tim said, the one-filesystem-per-user thing is not working out.
For NFSv3 clients that truncate MOUNT protocol answers (and v4 clients that still rely on the MOUNT protocol), yes, one-filesystem-per-user is a problem. For NFSv4 clients that support mirror mounts it's not a problem at all. You're not required to go with one-filesystem-per-user though! That's only if you want to approximate quotas.
> O(1) for number of filesystems would be great but isn't there.
It is O(1) for filesystems (parts of the system could be parallelized more, but the on-disk data format is O(1) for filesystem creation and mounting, just like it is for snapshots and clones).
> Maybe the format allows unlimited O(1) snapshots, but it's at best > O(1) to take them. All over the place it's probably O(n) or worse to > _have_ them. to boot with them, to scrub with them.
It's NOT O(N) to boot because of snapshots, nor to scrub. Scrub and resilver are O(N) where N is the amount used (as opposed to O(N) where N is the size of the volume, for HW RAID and the like).
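The distinction drawn above, between work proportional to the data in use and work proportional to the size of the volume, can be shown with a toy model. This Python sketch is not the ZFS scrub code; the function names and the dictionary-based block tree are assumptions made for illustration, including the assumption that a block shared between the live filesystem and a snapshot is checked only once.

```python
def scrub_used(roots, children):
    """Visit every block reachable from the given roots exactly once.

    `roots` holds the root block of the live filesystem and of each snapshot;
    `children` maps a block id to the ids it points at. The cost tracks the
    number of allocated blocks, not the size of the device.
    """
    seen = set()
    stack = list(roots)
    while stack:
        blk = stack.pop()
        if blk in seen:
            continue  # shared between snapshots: already verified
        seen.add(blk)
        stack.extend(children.get(blk, ()))
    return len(seen)


def resync_whole_volume(total_blocks):
    """A layer with no knowledge of allocation has to touch every block."""
    touched = 0
    for _ in range(total_blocks):
        touched += 1
    return touched
```

With `children = {"root": ["a", "b"], "snap": ["a", "c"]}`, scrubbing both roots visits five blocks however large the underlying device is, whereas `resync_whole_volume(1_000_000)` always touches a million. Whether the shipped code behaves like this with thousands of snapshots is the empirical question raised in the replies.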
Nico
|
|
|
|
Miles Nordin
carton@Ivy.NET
|
|
|
|
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted:
Dec 16, 2008 11:00 AM
in response to: nico
|
|
|
|
>>>>> "nw" == Nicolas Williams <Nicolas dot Williams at sun dot com> writes:
nw> For NFSv4 clients that support mirror mounts its not a problem nw> at all.
No, 3000 - 10,000 users is common for a large campus, and according to posters here, sometimes that many users actually can fit into the bandwidth of a single pool. But ZFS is not usable with that many filesystems: booting, 'zfs create', and 'zfs list' all take hours. See the list archives.
If the on-disk format is theoretically capable of achieving O(1) for number of filesystems, that's nice! It's just not an advantage over NetApp when it's not working yet. And, with any project, sometimes the last 5% of the work never gets done.
so I'm making a desperate call to start basing punditry on experience rather than white papers and optimistic architecture documents. OpenSolaris could have an advantage here---it's much easier to get experience with Solaris than NetApp because it's not (a) expensive and (b) locked behind a bunch of licenses, agreements and contracts, unshareable documentation, private censored web forums (NOW site), u.s.w., so OpenSolaris punditry could one day become a lot more trustworthy than NetApp punditry.
nw> You're not required to go with one-filesystem-per-user though!
It was pitched as an architectural advantage, but never fully delivered, and worse, used to justify removing traditional Unix quotas. Consequently, quota-wise, ZFS becomes a regression w.r.t. UFS rather than an evolution, because of over-focusing on the virtues of the architecture rather than the delivered implementation.
I don't use quotas and don't care, but it's a good example of broken advocacy.
nw> It's NOT O(N) to boot because of snapshots, nor to scrub.
I think it is. try it and see. :/
That was Tim's point as I read it. Jeff claimed ``unlimited snapshots and clones'' as a ZFS advantage over NetApp, and Tim said open bugs or subtle limitations make the supposed advantage a fantasy, even a liability:
``"unlimited snapshots". Do I even need to begin to tell you what a horrible, HORRIBLE idea that is? "Why can't I get my space back?" Oh, just do a snapshot list and figure out which one is still holding the data. What? Your console locks up for 8 hours when you try to list out the snapshots? Huh... that's weird.''
...and to add to that, the snapshot list in ZFS does a better job of showing which one's using the space if there are fewer snapshots. with hundreds of snapshots 'zfs list' shows a USED column full of zeroes, correctly, because you won't save any space by deleting just one---you have to delete a range of snapshots to get some space back. Of course that's not the same thing as being O(N), that's just annoying.
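Why a column of zeroes can be correct is easier to see with a small model. The Python below is a sketch, not how 'zfs list' computes USED: snapshots are modelled as plain sets of block ids, and `space_freed` is a hypothetical helper that counts the blocks released when a given group of snapshots is destroyed.

```python
def space_freed(snapshots, live, doomed):
    """Count blocks released if the snapshots named in `doomed` are destroyed.

    `snapshots` maps a snapshot name to the set of block ids it references;
    `live` is the set of blocks the active filesystem still references.
    A block is freed only if nothing that survives still points at it.
    """
    kept = set(live)
    for name, blocks in snapshots.items():
        if name not in doomed:
            kept |= blocks
    candidates = set()
    for name in doomed:
        candidates |= snapshots[name]
    return len(candidates - kept)
```

Take `snapshots = {"s1": {1, 3}, "s2": {1, 3}, "s3": {3}}` and `live = {3}`, where block 1 was deleted from the filesystem after `s2`. Destroying `s1` alone frees nothing, and so does destroying `s2` alone, because the other one still holds block 1. Destroying both frees one block, which matches the observation that a range of snapshots has to go before any space comes back.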
and I don't know that it's really O(N)---it could be better or worse than O(N). It's not O(1) though, to boot, list, or scrub snapshots.
and if it's not O(1) because of some unnecessary high-level ioctl accidentally called in some obscure, abstract library by the ``simple'' user interface, it's still not O(1)! For practical users, that library could remain suboptimal for the next two years, and I don't want to spend those two years enduring a bunch of blogging about nonexistent O(1) snapshots just because the on-disk format theoretically doesn't impede delivering them.
|
|
|
|
Posts:
11
From:
CA
Registered:
11/17/05
|
|
|
|
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted:
Dec 16, 2008 11:22 AM
in response to: Miles Nordin
|
|
Miles Nordin wrote: >>>>>> "nw" == Nicolas Williams <Nicolas dot Williams at sun dot com> writes: > > > nw> You're not required to go with one-filesystem-per-user though! > > It was pitched as an architectural advantage, but never fully > delivered, and worse, used to justify removing traditional Unix > quotas. Consequently, quota-wise, ZFS becomes a regression w.r.t. UFS > rather than an evolution, because of over-focusing on the virtues of > the architecture rather than the delivered implementation. > >
Precisely.
The problem with per-user quotas in ZFS was pointed out several years ago at FAST, when some of the Sun folks showed up to discuss ZFS in a late evening meeting. A file-system-per-user approach is not very viable when you have tens of thousands of users.
It was my hope that Sun would have gotten that message by now, as I consider it one of the major problems with ZFS.
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted:
Dec 15, 2008 5:43 PM
in response to: Miles Nordin
> Maybe the format allows unlimited O(1) snapshots, but it's at best
> O(1) to take them. All over the place it's probably O(n) or worse to
> _have_ them. to boot with them, to scrub with them.
Why would a scrub be O(n snapshots)?
The O(n filesystems) effects reported from time to time in OpenSolaris seem due to code that iterates over them. The new ability to create huge numbers of them puts stress on assumptions valid in more traditional UNIX configurations, right?
--Toby
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted:
Dec 13, 2008 10:20 PM
in response to: qu1j0t3
Some RAID systems compare checksums on reads, though this is usually only for RAID-4 configurations (e.g. DataDirect) because of the performance hit otherwise.
End-to-end checksums are not yet common. The SCSI committee recently ratified T10 DIF, which allows either an operating system or application to supply checksums and have them stored and retrieved with data. Oracle has been working to add support for this to Linux, and several array and drive vendors have committed to implementing it. So one could say that ZFS is ahead of the curve here.
ZFS is not particularly revolutionary: software RAID has been around since the invention of the term; end-to-end checksums to disk have been used since the 1960s (though more often in databases, tape, and optical media); WAFL-like file structures may pre-date NetApp. It does put these together for the first time in a widely available system, though, which is certainly innovative and useful. It will be more useful when it has a more complete disaster recovery model than 'restore from backup.'
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted:
Dec 13, 2008 11:11 PM
in response to: rang
Anton B. Rang wrote: > Some RAID systems compare checksums on reads, though this is usually only for RAID-4 configurations (e.g. DataDirect) because of the performance hit otherwise. >
For the record, Solaris had a (mirrored) RAID system which would compare data from both sides of the mirror upon read. It never achieved significant market penetration and was subsequently scrapped. Many of the reasons that the market did not accept it are solved by the method used by ZFS, which is far superior.
> End-to-end checksums are not yet common. The SCSI committee recently ratified T10 DIF, which allows either an operating system or application to supply checksums and have them stored and retrieved with data. Oracle has been working to add support for this to Linux, and several array and drive vendors have committed to implementing it. So one could say that ZFS is ahead of the curve here. >
Oracle also has data checksumming enabled by default for later releases. I look forward to any field data analysis they may publish :-)
> ZFS is not particularly revolutionary: software RAID has been around since the invention of the term; end-to-end checksums to disk have been used since the 1960s (though more often in databases, tape, and optical media); WAFL-like file structures may pre-date NetApp. It does put these together for the first time in a widely available system, though, which is certainly innovative and useful. It will be more useful when it has a more complete disaster recovery model than 'restore from backup.' >
If you wish to implement a disaster recovery model, then you should look far beyond what ZFS (or any file system) can provide. Effective disaster recovery requires significant attention to process. -- richard
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted:
Dec 12, 2008 1:30 PM
in response to: hartz
Johan Hartzenberg wrote:
> There is so much unsupported claims and noise on both sides that
> everybody is sounding like a bunch of fanboys.
I don't think there are two sides. Anyone who has been around computing for any length of time has lost data due to various failures. The question isn't about losing data, it is about how to proceed when your data is damaged.
> The only bit that I understand about why HW raid "might" be bad is
> that if it had access to the disks behind a HW RAID LUN, then _IF_ zfs
> were to encounter corrupted data in a read, it will probably be able
> to re-construct that data. This is at the cost of doing the parity
> calculations on a general purpose CPU, and then sending that parity
> data, as well as the data to write, across the wire. Some of that
> cost may be offset against Raid-Z's optimizations over raid-5 in some
> situations, but all of this is pretty much if-then-maybe type situations.
OK, repeat after me: there is no such thing as hardware RAID, there is no such thing as hardware RAID, there is no such thing as hardware RAID. There is only software RAID. If you believe any software is infallible, then you will be hurt. Even beyond RAID, there is quite sophisticated software on your disks, and anyone who has had to upgrade disk firmware will attest that disk firmware is not infallible.
> I also understand that HW raid arrays have some vulnerabilities and
> weaknesses, but those seem to be offset against ZFS' notorious
> instability during error conditions. I say notorious, because of all
> the open bug reports and reports on the list of I/O hanging and/or
> systems panicing while waiting for ZFS to realize that something has
> gone wrong.
>
> I think if this last point can be addressed - make ZFS respond MUCH
> faster to failures, then it will go a long way to make ZFS be more
> readily adopted.
However, you can't respond too fast -- something which seems to get lost in these conversations. If you declare a disk dead too fast, then you get caught in a bind by things like Seagate disks which "freeze" for a few seconds. It may be much better to ride through such things than initiate a reconfiguration action (as described in the article below). http://blogs.zdnet.com/storage/?p=369&tag=nl.e539
Note: as of b97, it is now possible to set per-device retries in the sd and ssd drivers. This is a good start towards satisfying those who are fed up with the default sd/ssd retry logic. See sd(7d) http://opensolaris.org/os/community/arc/caselog/2007/505/
-- richard
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted:
Dec 12, 2008 12:44 PM
in response to: Miles Nordin
On 12-Dec-08, at 3:10 PM, Miles Nordin wrote:
>>>>>> "tt" == Toby Thain <toby at telegraphics dot com dot au> writes:
>>>>>> "mg" == Mike Gerdts <mgerdts at gmail dot com> writes:
>
> tt> I think we have to assume Anton was joking - otherwise his
> tt> measure is uselessly unscientific.
>
> I think it's rude to talk about someone who's present in the third
> person, especially when you're trying to minimize his view. Were you
> joking, Anton? :)
> ....
>
> 1. I don't think your impressions nor Anton's and mine are ``useless''
Alright, I agree I should retract the 'useless' but I would keep the 'unscientific'.
--Toby
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted:
Dec 12, 2008 1:11 PM
in response to: qu1j0t3
On Fri, 12 Dec 2008, Toby Thain wrote:
>>
>> 1. I don't think your impressions nor Anton's and mine are ``useless''
>
> Alright, I agree I should retract the 'useless' but I would keep the
> 'unscientific'.
There is no need to retract the 'useless'. By the same useless measure, George Bush Jr has done a fantastic job at dealing with world terror since there has not been a serious attack on US soil by islamic terrorists since 2002. One might think that this impression is significant yet it is not since the previous attack on US soil was in 1993, which was about 9 years and we have only gone about 6 thus far. By statistical measures, George Bush Jr could have done absolutely nothing and it is likely that nothing bad would have happened at all. There is insufficient evidence to suggest one conclusion vs another.
This example shows the dangers of using illogical thinking to presumably reach a logical conclusion. It is particularly dangerous to exhibit illogical thinking in public where everyone can see.
Bob ====================================== Bob Friesenhahn bfriesen at simple dot dallas dot tx dot us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted:
Dec 13, 2008 10:07 PM
in response to: Miles Nordin
I wasn't joking, though as is well known, the plural of anecdote is not data.
Both UFS and ZFS, in common with all file systems, have design flaws and bugs.
To lose an entire UFS file system (barring the loss of the entire underlying storage) requires a great deal of corruption; there are multiple copies of the superblock, cylinder headers and their inodes are stored in a regular pattern and easily found by recovery tools, and the UFS file system check utility, while not perfect, can repair almost any corruption. There are third party tools which can perform much more analysis and recovery in a worst-case scenario. A single bad bloc
To lose an entire ZFS pool requires that the most recent uberblock, or one of the top-level blocks to which it points, be damaged. There are currently no recovery tools (at least, none of which I am aware).
I find it naïve to imagine that Sun customers "expect" their UFS (or other) file systems to be unrecoverable. Any case where fsck failed quickly became an escalation to the sustaining engineering organization. Restoring from backup is almost never a satisfactory answer for a commercial enterprise.
As usual, the disclaimer; I now work for another storage company, and while I've been on the teams developing and maintaining a number of commercial file systems (including two of Sun's), ZFS has not been one of them.
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted:
Dec 13, 2008 11:02 PM
in response to: rang
Anton B. Rang wrote: > I find it naïve to imagine that Sun customers "expect" their UFS (or other) file systems to be unrecoverable.
OK, I'll bite. If we believe the disk vendors who rate their disks as having an unrecoverable error rate of 1 bit per 10^14 bits read, and knowing that UFS has absolutely no data protection of its data, why would you think that it is naive to think that a disk system with UFS cannot lose data? Rather, I would say it has a distinctly calculable probability. Similarly, for ZFS, the checksum is not perfect, so there is a calculable probability that the ZFS checksum will not detect an unrecoverable (read) error. The difference is that the probability that ZFS will not detect an error is considerably smaller than that of UFS (or FAT, or HSFS, or ...).
> Any case where fsck failed quickly became an escalation to the sustaining engineering organization. Restoring from backup is almost never a satisfactory answer for a commercial enterprise.
I agree. However, I've personally experienced well over 100 fsck failures over the years, and while I was always unsatisfied, I didn't always lose data[1]. When I did lose data, perhaps it was data I could live without, but that was my call. Would you rather that ZFS should simply say, "hey you lost some data, but we won't tell you where... ?"
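The "distinctly calculable probability" for a disk rated at 1 unrecoverable bit per 10^14 bits read can be sketched as follows. This is a rough back-of-the-envelope model that assumes bit errors are independent and that the vendor rating is exact, neither of which is guaranteed in practice; the function name and the 6 TB figure (roughly the raw size of the pool in the original post) are chosen only for illustration.

```python
import math

def p_unrecoverable_read(bytes_read, ber=1e-14):
    """Probability of at least one unrecoverable bit error while reading
    `bytes_read` bytes, assuming independent errors at rate `ber` per bit."""
    bits = bytes_read * 8
    # 1 - (1 - ber)**bits, computed with log1p/expm1 to avoid rounding loss
    return -math.expm1(bits * math.log1p(-ber))

# Reading 6 TB (6 * 10**12 bytes) end to end
print(f"{p_unrecoverable_read(6 * 10**12):.2f}")
```

Under these assumptions a single full read of 6 TB has roughly a 38% chance of hitting at least one unrecoverable error, which is why detecting (and not just surviving) such errors matters.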
[1] once upon a time, I used a [vendor-name-elided] disk for a 2,300 user e-mail message store. I upgraded the OS, which implemented some new SCSI options. The disk's firmware didn't handle those options properly and would wait about 7 hours before corrupting the UFS file system containing the message store, requiring a full restore. So, how many shifts do you think it took to fail, recover, and ultimately resolve the disk firmware issue? Hint: the firmware rev arrived via UPS.
Personally, I'm very glad that a file system has come along that verifies data... and that feature seems to be catching, as other file systems seem to be doing the same. Hopefully, in a few years silent data corruption will be a footnote on the lore of computing. -- richard
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted:
Dec 15, 2008 2:13 AM
in response to: relling
I think the problem for me is not that there's a risk of data loss if a pool becomes corrupt, but that there are no recovery tools available. With UFS, people expect that if the worst happens, fsck will be able to recover their data in most cases.
With ZFS you have no such tools, yet Victor has on at least two occasions shown that it's quite possible to recover pools that were completely unusable (I believe by making use of old / backup copies of the uberblock).
My concern is that ZFS has all this information on disk, it has the ability to know exactly what is and isn't corrupted, and it should (at least for a system with snapshots) have many, many potential uberblocks to try. It should be far, far better than UFS at recovering from these things, but for a certain class of faults, when it hits a problem it just stops dead.
That's what frustrates me - knowing that there's potential to have all my data there, stored safely away, but having it completely inaccessible due to a lack of recovery tools.
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted:
Dec 15, 2008 2:30 AM
in response to: myxiplx
>I think the problem for me is not that there's a risk of data loss if
>a pool becomes corrupt, but that there are no recovery tools
>available. With UFS, people expect that if the worst happens, fsck
>will be able to recover their data in most cases.
Except, of course, that fsck lies. It "fixes" the metadata and the quality of the rest is unknown.
Anyone using UFS knows that UFS file corruption is common; specifically, when using a "UFS root" and the system panics while trying to install a device driver, there's a good chance that some files in /etc are corrupt. Some of these were application problems (some code used fsync(fileno(fp)); fclose(fp); which doesn't guarantee anything).
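The fsync-before-fclose pattern mentioned above fails, presumably, because the stdio buffer is still sitting in user space when fsync runs, so there may be nothing in the kernel to sync yet. A minimal sketch of the safer ordering, translated to Python here (the original code was C stdio); the function name `durable_write` is hypothetical.

```python
import os

def durable_write(path, data: bytes):
    """Write data and ask the OS to push it to stable storage before closing."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()              # 1. drain the user-space buffer into the kernel
        os.fsync(f.fileno())   # 2. only now ask the kernel to flush to the device
    # 3. closing (end of the with-block) happens last and has nothing left to write
```

Even this ordering only helps if the device honours the flush request. Making a newly created file durable usually also requires syncing the containing directory, which is left out of this sketch.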
>With ZFS you have no such tools, yet Victor has on at least two occasions
>shown that it's quite possible to recover pools that were completely unusable
>(I believe by making use of old / backup copies of the uberblock).
True; and certainly ZFS should be able to backtrack. But it's much more likely to happen "automatically" than by using a recovery tool.
See, fsck could only be written because specific kinds of corruption, and the patterns they leave, are known. With ZFS, you can only back up to a certain uberblock and the pattern will be a surprise.
Casper
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted:
Dec 15, 2008 2:59 AM
in response to: casper
Forgive me for not understanding the details, but couldn't you also work backwards through the blocks with ZFS and attempt to recreate the uberblock?
So if you lost the uberblock, could you (memory and time allowing) start scanning the disk, looking for orphan blocks that aren't referenced anywhere else and piece together the top of the tree?
Or roll back to a previous uberblock (or a snapshot uberblock), and then look to see what blocks are on the disk but not referenced anywhere. Is there any way to intelligently work out where those blocks would be linked by looking at how they interact with the known data?
Of course, rolling back to a previous uberblock would still be a massive step forward, and something I think would do much to improve the perception of ZFS as a tool to reliably store data.
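The "roll back to a previous uberblock" idea can be sketched as a simple selection loop. This is a hypothetical illustration, not ZFS code: the `Uberblock` record, the `checksum_ok` flag and the `tree_is_readable` callback are invented names standing in for whatever checks a real import would perform.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

@dataclass(frozen=True)
class Uberblock:
    txg: int            # transaction group this uberblock was written in
    checksum_ok: bool   # did the uberblock itself pass its checksum?

def pick_uberblock(candidates: Iterable[Uberblock],
                   tree_is_readable: Callable[[Uberblock], bool]) -> Optional[Uberblock]:
    """Return the newest uberblock whose own checksum and block tree both verify."""
    for ub in sorted(candidates, key=lambda u: u.txg, reverse=True):
        if ub.checksum_ok and tree_is_readable(ub):
            return ub
    return None  # nothing usable: report that, rather than just "corrupted data"

# The newest txg (207161, the one shown in the labels of the original post)
# is assumed here to point at a damaged tree.
candidates = [Uberblock(207161, True), Uberblock(207160, True), Uberblock(207159, False)]
damaged = {207161}
chosen = pick_uberblock(candidates, lambda ub: ub.txg not in damaged)
print("would import at txg", chosen.txg if chosen else None)
```

The cost of such a rollback is that anything written after the chosen txg is lost, so a tool doing this would want to say how far back it had to go.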
You cannot overstate the difference to the end user between a file system that on boot says: "Sorry, can't read your data pool."
and one that says: "Whoops, the uberblock, and all the backups are borked. Would you like to roll back to a backup uberblock, or leave the filesystem offline to repair manually?"
As much as anything else, a simple statement explaining *why* a pool is inaccessible, and saying just how badly things have gone wrong helps tons. Being able to recover anything after that is just the icing on the cake, especially if it can be done automatically.
Ross
PS. Sorry for the duplicate Casper, I forgot to cc the list.
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted:
Dec 15, 2008 10:34 AM
in response to: myxiplx
On Mon, 15 Dec 2008, Ross wrote:
> My concern is that ZFS has all this information on disk, it has the
> ability to know exactly what is and isn't corrupted, and it should
> (at least for a system with snapshots) have many, many potential
> uberblocks to try. It should be far, far better than UFS at
> recovering from these things, but for a certain class of faults,
> when it hits a problem it just stops dead.
While ZFS knows if a data block is retrieved correctly from disk, a correctly retrieved data block does not indicate that the pool isn't "corrupted". A block written in the wrong order is a form of corruption.
Bob ====================================== Bob Friesenhahn bfriesen at simple dot dallas dot tx dot us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted:
Dec 15, 2008 11:19 AM
in response to: bfriesen
I'm not sure I follow how that can happen, I thought ZFS writes were designed to be atomic? They either commit properly on disk or they don't?
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted:
Dec 15, 2008 11:36 AM
in response to: myxiplx
On Mon, 15 Dec 2008, Ross Smith wrote:
> I'm not sure I follow how that can happen, I thought ZFS writes were
> designed to be atomic? They either commit properly on disk or they
> don't?
Yes, this is true. One reason why people complain about corrupted ZFS pools is because they have hardware which writes data in a different order than what was requested. Some hardware claims to have written the data but instead it has been secretly cached for later (or perhaps for never) and data blocks get written in some other order. It seems that ZFS is capable of working reliably with "cheap" hardware but not with wrongly designed hardware.
Bob ====================================== Bob Friesenhahn bfriesen at simple dot dallas dot tx dot us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] Split responsibility for data with ZFS
Posted:
Dec 15, 2008 11:46 AM
in response to: bfriesen
On Mon, Dec 15, 2008 at 01:36:46PM -0600, Bob Friesenhahn wrote:
> On Mon, 15 Dec 2008, Ross Smith wrote:
>
> > I'm not sure I follow how that can happen, I thought ZFS writes were
> > designed to be atomic? They either commit properly on disk or they
> > don't?
>
> Yes, this is true. One reason why people complain about corrupted ZFS
> pools is because they have hardware which writes data in a different
> order than what was requested. Some hardware claims to have written
> the data but instead it has been secretly cached for later (or perhaps
> for never) and data blocks get written in some other order. It seems
> that ZFS is capable of working reliably with "cheap" hardware but not
> with wrongly designed hardware.
Order of writes matters between transactions, not inside transactions, and at the boundary is a cache flush. Thus what matters really isn't write order as much as whether the devices lie about cache flushes.
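Why a lying cache flush is the real danger can be shown with a small simulation. This is a toy model under stated assumptions, not a description of any real drive or of the ZFS commit path: the `Disk` class, the `commit` sequence (blocks, flush, uberblock, flush) and the `lucky` parameter (writes that happened to reach the platter before power was lost) are all invented for illustration.

```python
class Disk:
    """Toy disk with a volatile write cache and stable platters."""
    def __init__(self, honest_flush=True):
        self.honest_flush = honest_flush
        self.cache = {}      # lost on power failure
        self.platter = {}    # survives power failure

    def write(self, addr, value):
        self.cache[addr] = value

    def flush(self):
        if self.honest_flush:
            self.platter.update(self.cache)
            self.cache.clear()
        # a lying device acknowledges the flush but keeps the data in cache

    def power_loss(self, lucky=()):
        # only the cached writes named in `lucky` had been destaged in time
        for addr in lucky:
            if addr in self.cache:
                self.platter[addr] = self.cache[addr]
        self.cache.clear()

def commit(disk, txg, blocks):
    """One transaction: write the new blocks, flush, then publish the uberblock."""
    for addr, value in blocks.items():
        disk.write(addr, value)      # order among these writes does not matter
    disk.flush()                     # boundary: blocks must be stable first
    disk.write("uberblock", {"txg": txg, "blocks": sorted(blocks)})
    disk.flush()

def pool_is_consistent(disk):
    """Every block the on-disk uberblock points at must also be on disk."""
    ub = disk.platter.get("uberblock")
    return ub is None or all(addr in disk.platter for addr in ub["blocks"])

new_blocks = {"blk-a": b"data", "blk-b": b"more data"}

honest = Disk(honest_flush=True)
commit(honest, 1, new_blocks)
honest.power_loss(lucky=["uberblock"])

liar = Disk(honest_flush=False)
commit(liar, 1, new_blocks)
liar.power_loss(lucky=["uberblock"])   # uberblock landed, its blocks did not

print("honest device consistent:", pool_is_consistent(honest))
print("lying device consistent: ", pool_is_consistent(liar))
```

With an honest flush the uberblock can never reach the platter ahead of the blocks it references; with a lying flush it can, and the pool's newest uberblock then points at data that was never written.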