cascardo/linux.git
8 years agoNFS: nfs_mark_for_revalidate should always set NFS_INO_REVAL_PAGECACHE
Trond Myklebust [Sun, 5 Jul 2015 16:36:34 +0000 (12:36 -0400)]
NFS: nfs_mark_for_revalidate should always set NFS_INO_REVAL_PAGECACHE

I'm not aware of any existing bugs around this, but the expectation is
that nfs_mark_for_revalidate() should always force a revalidation of
the cached metadata.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
8 years agoNFS: Remove the "NFS_CAP_CHANGE_ATTR" capability
Trond Myklebust [Sun, 5 Jul 2015 15:12:07 +0000 (11:12 -0400)]
NFS: Remove the "NFS_CAP_CHANGE_ATTR" capability

Setting the change attribute has been mandatory for all NFS versions, since
commit 3a1556e8662c ("NFSv2/v3: Simulate the change attribute"). We should
therefore not have anything be conditional on it being set/unset.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
8 years agoNFS: Set NFS_INO_REVAL_PAGECACHE if the change attribute is uninitialised
Trond Myklebust [Sun, 5 Jul 2015 20:07:04 +0000 (16:07 -0400)]
NFS: Set NFS_INO_REVAL_PAGECACHE if the change attribute is uninitialised

We can't allow caching of data until the change attribute has been
initialised correctly.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
8 years agoNFS: Don't revalidate the mapping if both size and change attr are up to date
Trond Myklebust [Sun, 5 Jul 2015 15:02:53 +0000 (11:02 -0400)]
NFS: Don't revalidate the mapping if both size and change attr are up to date

If we've ensured that the size and the change attribute are both correct,
then there is no point in marking those attributes as needing revalidation
again. Only do so if we know the size is incorrect and was not updated.

Fixes: f2467b6f64da ("NFS: Clear NFS_INO_REVAL_PAGECACHE when...")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
8 years agoNFSv4/pnfs: Ensure we don't miss a file extension
Trond Myklebust [Mon, 6 Jul 2015 00:06:38 +0000 (20:06 -0400)]
NFSv4/pnfs: Ensure we don't miss a file extension

pNFS writes don't return attributes, however that doesn't mean that we
should ignore the fact that they may be extending the file. This patch
ensures that if a write is seen to extend the file, then we always set
an attribute barrier, and update the cached file size.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
8 years agoNFSv4: We must set NFS_OPEN_STATE flag in nfs_resync_open_stateid_locked
Trond Myklebust [Wed, 22 Jul 2015 17:46:13 +0000 (13:46 -0400)]
NFSv4: We must set NFS_OPEN_STATE flag in nfs_resync_open_stateid_locked

Otherwise, nfs4_select_rw_stateid() will always return the zero stateid
instead of the correct open stateid.

Fixes: f95549cf24660 ("NFSv4: More CLOSE/OPEN races")
Cc: stable@vger.kernel.org # 4.0+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
8 years agoSUNRPC: xprt_complete_bc_request must also decrement the free slot count
Trond Myklebust [Wed, 22 Jul 2015 21:05:32 +0000 (17:05 -0400)]
SUNRPC: xprt_complete_bc_request must also decrement the free slot count

Calling xprt_complete_bc_request() effectively causes the slot to be allocated,
so it needs to decrement the backchannel free slot count as well.

Fixes: 0d2a970d0ae5 ("SUNRPC: Fix a backchannel race")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
8 years agoSUNRPC: Fix a backchannel deadlock
Trond Myklebust [Wed, 22 Jul 2015 20:31:17 +0000 (16:31 -0400)]
SUNRPC: Fix a backchannel deadlock

xprt_alloc_bc_request() cannot call xprt_free_bc_request() without
deadlocking, since it already holds the xprt->bc_pa_lock.

Reported-by: Chuck Lever <chuck.lever@oracle.com>
Fixes: 0d2a970d0ae55 ("SUNRPC: Fix a backchannel race")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
8 years agopNFS: Don't throw out valid layout segments
Trond Myklebust [Thu, 9 Jul 2015 16:24:07 +0000 (18:24 +0200)]
pNFS: Don't throw out valid layout segments

It is OK for layout segments to remain hashed even if no-one holds any
references to them, provided that the segments are still valid.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
8 years agopNFS: pnfs_roc_drain() fix a race with open
Trond Myklebust [Thu, 9 Jul 2015 21:54:09 +0000 (23:54 +0200)]
pNFS: pnfs_roc_drain() fix a race with open

If a process reopens the file before we can send off the CLOSE/DELEGRETURN,
then pnfs_roc_drain() may end up waiting for a new set of layout segments
that are marked as return-on-close, but haven't yet been returned.

Fix this by only waiting for those layout segments that were invalidated in
pnfs_roc().

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
8 years agopNFS: Fix races between return-on-close and layoutreturn.
Trond Myklebust [Thu, 9 Jul 2015 16:40:01 +0000 (18:40 +0200)]
pNFS: Fix races between return-on-close and layoutreturn.

If one or more of the layout segments reports an error during I/O, then
we may have to send a layoutreturn to report the error back to the NFS
metadata server.
This patch ensures that the return-on-close code can detect the
outstanding layoutreturn, and not preempt it.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
8 years agopNFS: pnfs_roc_drain should return 'true' when sleeping
Trond Myklebust [Thu, 9 Jul 2015 21:38:11 +0000 (23:38 +0200)]
pNFS: pnfs_roc_drain should return 'true' when sleeping

Also clean up the case where we don't find a return-on-close layout segment.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
8 years agopNFS: Layoutreturn must invalidate all existing layout segments.
Trond Myklebust [Thu, 9 Jul 2015 15:58:39 +0000 (17:58 +0200)]
pNFS: Layoutreturn must invalidate all existing layout segments.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
8 years agoNFSv4.2/flexfiles: Fix a typo in the flexfiles layoutstats code
Trond Myklebust [Wed, 8 Jul 2015 18:19:23 +0000 (20:19 +0200)]
NFSv4.2/flexfiles: Fix a typo in the flexfiles layoutstats code

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
8 years agoNFSv4: Leases are renewed in sequence_done when we have sessions
Trond Myklebust [Sun, 5 Jul 2015 18:50:46 +0000 (14:50 -0400)]
NFSv4: Leases are renewed in sequence_done when we have sessions

Ensure that the calls to renew_lease() in open_done() etc. only apply
to session-less versions of NFSv4.x (i.e. NFSv4.0).

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
8 years agoNFSv4.1: nfs41_sequence_done should handle sequence flag errors
Trond Myklebust [Sun, 5 Jul 2015 19:01:36 +0000 (15:01 -0400)]
NFSv4.1: nfs41_sequence_done should handle sequence flag errors

Instead of just kicking off lease recovery, we should look into the
sequence flag errors and handle them.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
8 years agoNFSv4.1: Handle SEQ4_STATUS_BACKCHANNEL_FAULT correctly
Trond Myklebust [Sun, 5 Jul 2015 19:26:58 +0000 (15:26 -0400)]
NFSv4.1: Handle SEQ4_STATUS_BACKCHANNEL_FAULT correctly

RFC5661 states:

      The server has encountered an unrecoverable fault with the
      backchannel (e.g., it has lost track of the sequence ID for a slot
      in the backchannel).  The client MUST stop sending more requests
      on the session's fore channel, wait for all outstanding requests
      to complete on the fore and back channel, and then destroy the
      session.

Ensure we do so...

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
8 years agoNFSv4.1: Handle SEQ4_STATUS_RECALLABLE_STATE_REVOKED status bit correctly
Trond Myklebust [Sun, 5 Jul 2015 19:20:53 +0000 (15:20 -0400)]
NFSv4.1: Handle SEQ4_STATUS_RECALLABLE_STATE_REVOKED status bit correctly

Try to handle this for now by invalidating all outstanding layouts for this
server and then testing all the open+lock+delegation stateids.
At some later stage, we may want to optimise by separating out the testing of
delegation stateids only, and adding testing of layout stateids.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
8 years agoNFSv4.1: Handle SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED status bit correctly.
Trond Myklebust [Sun, 5 Jul 2015 19:12:06 +0000 (15:12 -0400)]
NFSv4.1: Handle SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED status bit correctly.

If the server tells us that only some state has been revoked, then we
need to run the full TEST_STATEID dog and pony show in order to discover
which locks and delegations are still OK. Currently we blow away all
state, which means that we lose all locks!

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
8 years agoSUNRPC: Don't confuse ENOBUFS with a write_space issue
Trond Myklebust [Fri, 3 Jul 2015 13:32:23 +0000 (09:32 -0400)]
SUNRPC: Don't confuse ENOBUFS with a write_space issue

ENOBUFS means that memory allocations are failing due to an actual
low memory situation. It should not be confused with being out of
socket buffer space.

Handle the problem by just punting to the delay in call_status.

Reported-by: Neil Brown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
8 years agoSUNRPC: Don't reencode message if transmission failed with ENOBUFS
Trond Myklebust [Fri, 3 Jul 2015 13:28:09 +0000 (09:28 -0400)]
SUNRPC: Don't reencode message if transmission failed with ENOBUFS

If we're running out of buffer memory when transmitting data, then
we want to just delay for a moment, and then continue transmitting
the remainder of the message.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
8 years agonfs: Remove invalid tk_pid from debug message
Kinglong Mee [Wed, 1 Jul 2015 04:00:13 +0000 (12:00 +0800)]
nfs: Remove invalid tk_pid from debug message

Before rpc_run_task(), tk_pid is uninitiated as 0 always.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
8 years agonfs: Remove invalid NFS_ATTR_FATTR_V4_REFERRAL checking in nfs4_get_rootfh
Kinglong Mee [Wed, 1 Jul 2015 03:59:42 +0000 (11:59 +0800)]
nfs: Remove invalid NFS_ATTR_FATTR_V4_REFERRAL checking in nfs4_get_rootfh

NFS_ATTR_FATTR_V4_REFERRAL is only set in nfs4_proc_lookup_common.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
8 years agonfs: Drop bad comment in nfs41_walk_client_list()
Kinglong Mee [Wed, 1 Jul 2015 03:59:04 +0000 (11:59 +0800)]
nfs: Drop bad comment in nfs41_walk_client_list()

Commit 7b1f1fd184 "NFSv4/4.1: Fix bugs in nfs4[01]_walk_client_list"
have change the logical of the list_for_each_entry().

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
8 years agonfs: Remove unneeded micro checking of CONFIG_PROC_FS
Kinglong Mee [Wed, 1 Jul 2015 03:58:31 +0000 (11:58 +0800)]
nfs: Remove unneeded micro checking of CONFIG_PROC_FS

Have checking CONFIG_PROC_FS in include/linux/sunrpc/stats.h.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
8 years agonfs: Don't setting FILE_CREATED flags always
Kinglong Mee [Wed, 1 Jul 2015 03:57:49 +0000 (11:57 +0800)]
nfs: Don't setting FILE_CREATED flags always

Commit 5bc2afc2b5 "NFSv4: Honour the 'opened' parameter in the atomic_open()
 filesystem method" have support the opened arguments now.

Also,
Commit 03da633aa7 "atomic_open: take care of EEXIST in no-open case with
 O_CREAT|O_EXCL in fs/namei.c" have change vfs's logical.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
8 years agonfs: Use remove_proc_subtree() instead remove_proc_entry()
Kinglong Mee [Wed, 1 Jul 2015 15:00:29 +0000 (23:00 +0800)]
nfs: Use remove_proc_subtree() instead remove_proc_entry()

Thanks for Al Viro's comments of killing proc_fs_nfs completely.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
8 years agonfs: Remove unused argument in nfs_server_set_fsinfo()
Kinglong Mee [Wed, 1 Jul 2015 03:55:50 +0000 (11:55 +0800)]
nfs: Remove unused argument in nfs_server_set_fsinfo()

Commit e38eb6506f "NFS: set_pnfs_layoutdriver() from nfs4_proc_fsinfo()"
have remove the using of mntfh from nfs_server_set_fsinfo().

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
8 years agonfs: Fix a memory leak when meeting an unsupported state protect
Kinglong Mee [Wed, 1 Jul 2015 03:54:53 +0000 (11:54 +0800)]
nfs: Fix a memory leak when meeting an unsupported state protect

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
8 years agonfs: take extra reference to fl->fl_file when running a LOCKU operation
Jeff Layton [Tue, 30 Jun 2015 18:12:30 +0000 (14:12 -0400)]
nfs: take extra reference to fl->fl_file when running a LOCKU operation

Jean reported another crash, similar to the one fixed by feaff8e5b2cf:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000148
    IP: [<ffffffff8124ef7f>] locks_get_lock_context+0xf/0xa0
    PGD 0
    Oops: 0000 [#1] SMP
    Modules linked in: nfsv3 nfs_layout_flexfiles rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache vmw_vsock_vmci_transport vsock cfg80211 rfkill coretemp crct10dif_pclmul ppdev vmw_balloon crc32_pclmul crc32c_intel ghash_clmulni_intel pcspkr vmxnet3 parport_pc i2c_piix4 microcode serio_raw parport nfsd floppy vmw_vmci acpi_cpufreq auth_rpcgss shpchp nfs_acl lockd grace sunrpc vmwgfx drm_kms_helper ttm drm mptspi scsi_transport_spi mptscsih ata_generic mptbase i2c_core pata_acpi
    CPU: 0 PID: 329 Comm: kworker/0:1H Not tainted 4.1.0-rc7+ #2
    Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/30/2013
    Workqueue: rpciod rpc_async_schedule [sunrpc]
    30ec000
    RIP: 0010:[<ffffffff8124ef7f>]  [<ffffffff8124ef7f>] locks_get_lock_context+0xf/0xa0
    RSP: 0018:ffff8802330efc08  EFLAGS: 00010296
    RAX: ffff8802330efc58 RBX: ffff880097187c80 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000000
    RBP: ffff8802330efc18 R08: ffff88023fc173d8 R09: 3038b7bf00000000
    R10: 00002f1a02000000 R11: 3038b7bf00000000 R12: 0000000000000000
    R13: 0000000000000000 R14: ffff8802337a2300 R15: 0000000000000020
    FS:  0000000000000000(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000148 CR3: 000000003680f000 CR4: 00000000001407f0
    Stack:
     ffff880097187c80 ffff880097187cd8 ffff8802330efc98 ffffffff81250281
     ffff8802330efc68 ffffffffa013e7df ffff8802330efc98 0000000000000246
     ffff8801f6901c00 ffff880233d2b8d8 ffff8802330efc58 ffff8802330efc58
    Call Trace:
     [<ffffffff81250281>] __posix_lock_file+0x31/0x5e0
     [<ffffffffa013e7df>] ? rpc_wake_up_task_queue_locked.part.35+0xcf/0x240 [sunrpc]
     [<ffffffff8125088b>] posix_lock_file_wait+0x3b/0xd0
     [<ffffffffa03890b2>] ? nfs41_wake_and_assign_slot+0x32/0x40 [nfsv4]
     [<ffffffffa0365808>] ? nfs41_sequence_done+0xd8/0x300 [nfsv4]
     [<ffffffffa0367525>] do_vfs_lock+0x35/0x40 [nfsv4]
     [<ffffffffa03690c1>] nfs4_locku_done+0x81/0x120 [nfsv4]
     [<ffffffffa013e310>] ? rpc_destroy_wait_queue+0x20/0x20 [sunrpc]
     [<ffffffffa013e310>] ? rpc_destroy_wait_queue+0x20/0x20 [sunrpc]
     [<ffffffffa013e33c>] rpc_exit_task+0x2c/0x90 [sunrpc]
     [<ffffffffa0134400>] ? call_refreshresult+0x170/0x170 [sunrpc]
     [<ffffffffa013ece4>] __rpc_execute+0x84/0x410 [sunrpc]
     [<ffffffffa013f085>] rpc_async_schedule+0x15/0x20 [sunrpc]
     [<ffffffff810add67>] process_one_work+0x147/0x400
     [<ffffffff810ae42b>] worker_thread+0x11b/0x460
     [<ffffffff810ae310>] ? rescuer_thread+0x2f0/0x2f0
     [<ffffffff810b35d9>] kthread+0xc9/0xe0
     [<ffffffff81010000>] ? perf_trace_xen_mmu_set_pmd+0xa0/0x160
     [<ffffffff810b3510>] ? kthread_create_on_node+0x170/0x170
     [<ffffffff8173c222>] ret_from_fork+0x42/0x70
     [<ffffffff810b3510>] ? kthread_create_on_node+0x170/0x170
    Code: a5 81 e8 85 75 e4 ff c6 05 31 ee aa 00 01 eb 98 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 54 49 89 fc 53 <48> 8b 9f 48 01 00 00 48 85 db 74 08 48 89 d8 5b 41 5c 5d c3 83
    RIP  [<ffffffff8124ef7f>] locks_get_lock_context+0xf/0xa0
     RSP <ffff8802330efc08>
    CR2: 0000000000000148
    ---[ end trace 64484f16250de7ef ]---

The problem is almost exactly the same as the one fixed by feaff8e5b2cf.
We must take a reference to the struct file when running the LOCKU
compound to prevent the final fput from running until the operation is
complete.

Reported-by: Jean Spector <jean@primarydata.com>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
8 years agoNFSv4: When returning a delegation, don't reclaim an incompatible open mode.
NeilBrown [Mon, 29 Jun 2015 04:28:54 +0000 (14:28 +1000)]
NFSv4: When returning a delegation, don't reclaim an incompatible open mode.

It is possible to have an active open with one mode, and a delegation
for the same file with a different mode.
In particular, a WR_ONLY open and an RD_ONLY delegation.
This happens if a WR_ONLY open is followed by a RD_ONLY open which
provides a delegation, but is then close.

When returning the delegation, we currently try to claim opens for
every open type (n_rdwr, n_rdonly, n_wronly).  As there is no harm
in claiming an open for a mode that we already have, this is often
simplest.

However if the delegation only provides a subset of the modes that we
currently have open, this will produce an error from the server.

So when claiming open modes prior to returning a delegation, skip the
open request if the mode is not covered by the delegation - the open_stateid
must already cover that mode, so there is nothing to do.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
8 years agoNFSv4.2: LAYOUTSTATS is optional to implement
Trond Myklebust [Sat, 27 Jun 2015 15:45:46 +0000 (11:45 -0400)]
NFSv4.2: LAYOUTSTATS is optional to implement

Make it so, by checking the return value for NFS4ERR_MOTSUPP and
caching the information as a server capability.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
8 years agoNFSv4.2: Fix up a decoding error in layoutstats
Trond Myklebust [Sat, 27 Jun 2015 15:30:57 +0000 (11:30 -0400)]
NFSv4.2: Fix up a decoding error in layoutstats

According to the spec, the server is only returning the status,
which we decode in the op header.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
8 years agopNFS/flexfiles: Fix the reset of struct pgio_header when resending
Trond Myklebust [Fri, 26 Jun 2015 19:37:58 +0000 (15:37 -0400)]
pNFS/flexfiles: Fix the reset of struct pgio_header when resending

hdr->good_bytes needs to be set to the length of the request, not
zero.

Cc: stable@vger.kernel.org # 4.0+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
8 years agopNFS/flexfiles: Turn off layoutcommit for servers that don't need it
Trond Myklebust [Fri, 26 Jun 2015 18:51:32 +0000 (14:51 -0400)]
pNFS/flexfiles: Turn off layoutcommit for servers that don't need it

This patch ensures that we record the value of 'ffl_flags' from
the layout, and then checks for the presence of the
FF_FLAGS_NO_LAYOUTCOMMIT flag before deciding whether or not to
call pnfs_set_layoutcommit().

The effect is that servers now can decide whether or not they want
the client to call layoutcommit before returning a writeable layout.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
8 years agoMerge branch 'layoutstats'
Trond Myklebust [Fri, 26 Jun 2015 18:01:59 +0000 (14:01 -0400)]
Merge branch 'layoutstats'

* layoutstats:
  pnfs/flexfiles: protect ktime manipulation with mirror lock
  nfs: provide pnfs_report_layoutstat when NFS42 is disabled
  pnfs/flexfiles: report layoutstat regularly
  nfs42: serialize LAYOUTSTATS calls of the same file
  pnfs/flexfiles: encode LAYOUTSTATS flexfiles specific data
  pnfs/flexfiles: add ff_layout_prepare_layoutstats
  pNFS/flexfiles: track when layout is first used
  pNFS/flexfiles: add layoutstats tracking
  pNFS/flexfiles: Remove unused struct members user_name, group_name
  pnfs: add pnfs_report_layoutstat helper function
  pNFS: fill in nfs42_layoutstat_ops
  NFSv.2/pnfs Add a LAYOUTSTATS rpc function

8 years agopnfs/flexfiles: protect ktime manipulation with mirror lock
Peng Tao [Fri, 26 Jun 2015 01:45:49 +0000 (09:45 +0800)]
pnfs/flexfiles: protect ktime manipulation with mirror lock

It looks as if xchg() and cmpxchg() are not available for 64-bit integers on sparc32:

> New breakage seen in linux-next today:
>
> ERROR: "__xchg_called_with_bad_pointer" [fs/nfs/flexfilelayout/nfs_layout_flexfiles.ko] undefined!
> ERROR: "__cmpxchg_called_with_bad_pointer" [fs/nfs/flexfilelayout/nfs_layout_flexfiles.ko] undefined!
> make[2]: *** [__modpost] Error 1
> make[1]: *** [modules] Error 2

Given that mirror ktime manipulation is already under mirror->lock, let's make use of the fact.

Reported-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
8 years agonfs: provide pnfs_report_layoutstat when NFS42 is disabled
Peng Tao [Thu, 25 Jun 2015 10:19:32 +0000 (18:19 +0800)]
nfs: provide pnfs_report_layoutstat when NFS42 is disabled

kbuild test robot reported:
   fs/built-in.o: In function `pnfs_report_layoutstat':
>> (.text+0x151a1c): undefined reference to `nfs42_proc_layoutstats_generic'

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
8 years agonfs: verify open flags before allowing open
Benjamin Coddington [Thu, 25 Jun 2015 13:25:50 +0000 (09:25 -0400)]
nfs: verify open flags before allowing open

Commit 9597c13b forbade opens with O_APPEND|O_DIRECT for NFSv4:

    nfs: verify open flags before allowing an atomic open

    Currently, you can open a NFSv4 file with O_APPEND|O_DIRECT, but cannot
    fcntl(F_SETFL,...) with those flags. This flag combination is explicitly
    forbidden on NFSv3 opens, and it seems like it should also be on NFSv4.

However, you can still open a file with O_DIRECT|O_APPEND if there exists a
cached dentry for the file because nfs4_file_open() is used instead of
nfs_atomic_open() and the check is bypassed.  Add the check in
nfs4_file_open() as well.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
8 years agonfs: always update creds in mirror, even when we have an already connected ds
Jeff Layton [Wed, 24 Jun 2015 16:10:24 +0000 (12:10 -0400)]
nfs: always update creds in mirror, even when we have an already connected ds

A ds can be associated with more than one mirror, but we currently skip
setting a mirror's credentials if we find that it's already set up with
a connected client.

The upshot is that we can end up sending DS writes with MDS credentials
instead of properly setting them up. Fix nfs4_ff_layout_prepare_ds to
always verify that the mirror's credentials are set up, even when we
have a DS that's already connected.

Reported-by: Tom Haynes <thomas.haynes@primarydata.com>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Cc: stable@vger.kernel.org # 4.0+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
8 years agonfs: fix potential credential leak in ff_layout_update_mirror_cred
Jeff Layton [Wed, 24 Jun 2015 16:10:23 +0000 (12:10 -0400)]
nfs: fix potential credential leak in ff_layout_update_mirror_cred

If we have two tasks racing to update a mirror's credentials, then they
can end up leaking one (or more) sets of credentials. The first task
will set mirror->cred and then the second task will just overwrite it.

Use a cmpxchg to ensure that the creds are only set once. If we get to
the point where we would set mirror->cred and find that they're already
set, then we just release the creds that were just found.

Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Cc: stable@vger.kernel.org # 4.0+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agopnfs/flexfiles: report layoutstat regularly
Peng Tao [Tue, 23 Jun 2015 11:52:04 +0000 (19:52 +0800)]
pnfs/flexfiles: report layoutstat regularly

As a simple scheme, report every minute if IO is still going on.

Reviewed-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agonfs42: serialize LAYOUTSTATS calls of the same file
Peng Tao [Tue, 23 Jun 2015 11:52:03 +0000 (19:52 +0800)]
nfs42: serialize LAYOUTSTATS calls of the same file

There is no need to report concurrently.

Reviewed-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agopnfs/flexfiles: encode LAYOUTSTATS flexfiles specific data
Peng Tao [Tue, 23 Jun 2015 11:52:02 +0000 (19:52 +0800)]
pnfs/flexfiles: encode LAYOUTSTATS flexfiles specific data

Reviewed-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agopnfs/flexfiles: add ff_layout_prepare_layoutstats
Peng Tao [Tue, 23 Jun 2015 11:52:01 +0000 (19:52 +0800)]
pnfs/flexfiles: add ff_layout_prepare_layoutstats

It fills in the generic part of LAYOUTSTATS call. One thing to note
is that we don't really track if IO is continuous or not. So just fake
to use the completed bytes for it.

Still missing flexfiles specific part, which will be included in the next patch.

Reviewed-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agopNFS/flexfiles: track when layout is first used
Peng Tao [Tue, 23 Jun 2015 11:52:00 +0000 (19:52 +0800)]
pNFS/flexfiles: track when layout is first used

So that we can report cumulative time since the beginning
of statistics collection of the layout.

Reviewed-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agopNFS/flexfiles: add layoutstats tracking
Trond Myklebust [Tue, 23 Jun 2015 11:51:59 +0000 (19:51 +0800)]
pNFS/flexfiles: add layoutstats tracking

Reviewed-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agopNFS/flexfiles: Remove unused struct members user_name, group_name
Trond Myklebust [Tue, 23 Jun 2015 11:51:58 +0000 (19:51 +0800)]
pNFS/flexfiles: Remove unused struct members user_name, group_name

Reviewed-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agopnfs: add pnfs_report_layoutstat helper function
Peng Tao [Tue, 23 Jun 2015 11:51:57 +0000 (19:51 +0800)]
pnfs: add pnfs_report_layoutstat helper function

Reviewed-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agopNFS: fill in nfs42_layoutstat_ops
Peng Tao [Tue, 23 Jun 2015 11:51:56 +0000 (19:51 +0800)]
pNFS: fill in nfs42_layoutstat_ops

Reviewed-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoNFSv.2/pnfs Add a LAYOUTSTATS rpc function
Trond Myklebust [Tue, 23 Jun 2015 11:51:55 +0000 (19:51 +0800)]
NFSv.2/pnfs Add a LAYOUTSTATS rpc function

Reviewed-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoMerge branch 'bugfixes'
Trond Myklebust [Mon, 22 Jun 2015 13:55:08 +0000 (09:55 -0400)]
Merge branch 'bugfixes'

* bugfixes:
  NFS: Ensure we set NFS_CONTEXT_RESEND_WRITES when requeuing writes
  pNFS: Fix a memory leak when attempted pnfs fails
  NFS: Ensure that we update the sequence id under the slot table lock
  nfs: Initialize cb_sequenceres information before validate_seqid()
  nfs: Only update callback sequnce id when CB_SEQUENCE success
  NFSv4: nfs4_handle_delegation_recall_error should ignore EAGAIN

9 years agoSUNRPC: Set the TCP user timeout option on client sockets
Trond Myklebust [Sat, 20 Jun 2015 19:31:54 +0000 (15:31 -0400)]
SUNRPC: Set the TCP user timeout option on client sockets

Use the TCP_USER_TIMEOUT socket option to advertise to the server
how long we will keep the connection open if there is unacknowledged
data. See RFC5482.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoSUNRPC: Ensure we release the TCP socket once it has been closed
Trond Myklebust [Fri, 19 Jun 2015 20:17:57 +0000 (16:17 -0400)]
SUNRPC: Ensure we release the TCP socket once it has been closed

This fixes a regression introduced by commit caf4ccd4e88cf2 ("SUNRPC:
Make xs_tcp_close() do a socket shutdown rather than a sock_release").
Prior to that commit, the autoclose feature would ensure that an
idle connection would result in the socket being both disconnected and
released, whereas now only gets disconnected.

While the current behaviour is harmless, it does leave the port bound
until either RPC traffic resumes or the RPC client is shut down.

Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoSUNRPC: Handle connection issues correctly on the back channel
Trond Myklebust [Fri, 19 Jun 2015 17:04:13 +0000 (13:04 -0400)]
SUNRPC: Handle connection issues correctly on the back channel

If the back channel is disconnected, we can and should just fail the
transmission. The expectation is that the NFSv4.1 server will always
retransmit any outstanding callbacks once the connection is
re-established.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agonfs: Fix comment for nfs_pageio_init() and nfs_pageio_complete_mirror()
Yijing Wang [Thu, 18 Jun 2015 11:37:13 +0000 (19:37 +0800)]
nfs: Fix comment for nfs_pageio_init() and nfs_pageio_complete_mirror()

Signed-off-by: Yijing Wang <wangyijing@huawei.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agosunrpc: use sg_init_one() in krb5_rc4_setup_enc/seq_key()
Fabian Frederick [Tue, 16 Jun 2015 19:57:52 +0000 (21:57 +0200)]
sunrpc: use sg_init_one() in krb5_rc4_setup_enc/seq_key()

Don't opencode sg_init_one()

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoNFS: Ensure we set NFS_CONTEXT_RESEND_WRITES when requeuing writes
Trond Myklebust [Wed, 17 Jun 2015 23:56:22 +0000 (19:56 -0400)]
NFS: Ensure we set NFS_CONTEXT_RESEND_WRITES when requeuing writes

If a write attempt fails, and the write is queued up for resending to
the server, as opposed to being dropped, then we need to set the
appropriate flag so that nfs_file_fsync() does the right thing.

Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agopNFS: Fix a memory leak when attempted pnfs fails
Trond Myklebust [Wed, 17 Jun 2015 23:41:51 +0000 (19:41 -0400)]
pNFS: Fix a memory leak when attempted pnfs fails

pnfs_do_write() expects the call to pnfs_write_through_mds() to free the
pgio header and to release the layout segment before exiting. The problem
is that nfs_pgio_data_destroy() doesn't actually do this; it only frees
the memory allocated by nfs_generic_pgio().

Ditto for pnfs_do_read()...

Fix in both cases is to add a call to hdr->release(hdr).

Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoMerge tag 'nfs-rdma-for-4.2' of git://git.linux-nfs.org/projects/anna/nfs-rdma
Trond Myklebust [Tue, 16 Jun 2015 15:36:47 +0000 (11:36 -0400)]
Merge tag 'nfs-rdma-for-4.2' of git://git.linux-nfs.org/projects/anna/nfs-rdma

NFS: NFSoRDMA Client Changes

These patches continue to build up for improving the rsize and wsize that the
NFS client uses when talking over RDMA.  In addition, these patches also add
in scalability enhancements and other bugfixes.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* tag 'nfs-rdma-for-4.2' of git://git.linux-nfs.org/projects/anna/nfs-rdma: (142 commits)
  xprtrdma: Reduce per-transport MR allocation
  xprtrdma: Stack relief in fmr_op_map()
  xprtrdma: Split rb_lock
  xprtrdma: Remove rpcrdma_ia::ri_memreg_strategy
  xprtrdma: Remove ->ro_reset
  xprtrdma: Remove unused LOCAL_INV recovery logic
  xprtrdma: Acquire MRs in rpcrdma_register_external()
  xprtrdma: Introduce an FRMR recovery workqueue
  xprtrdma: Acquire FMRs in rpcrdma_fmr_register_external()
  xprtrdma: Introduce helpers for allocating MWs
  xprtrdma: Use ib_device pointer safely
  xprtrdma: Remove rr_func
  xprtrdma: Replace rpcrdma_rep::rr_buffer with rr_rxprt
  xprtrdma: Warn when there are orphaned IB objects
  ...

9 years agoNFSv4: Fix stateid recovery on revoked delegations
Trond Myklebust [Tue, 16 Jun 2015 15:26:35 +0000 (11:26 -0400)]
NFSv4: Fix stateid recovery on revoked delegations

Ensure that we fix the non-NULL stateid case as well.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoRecover from stateid-type error on SETATTR
Olga Kornievskaia [Fri, 12 Jun 2015 20:53:30 +0000 (16:53 -0400)]
Recover from stateid-type error on SETATTR

Client can receives stateid-type error (eg., BAD_STATEID) on SETATTR when
delegation stateid was used. When no open state exists, in case of application
calling truncate() on the file, client has no state to recover and fails with
EIO.

Instead, upon such error, return the bad delegation and then resend the
SETATTR with a zero stateid.

Signed-off: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agonfs: Fix showing truncated fsid/dev in, /proc/net/nfsfs/volumes
Kinglong Mee [Sat, 13 Jun 2015 02:07:00 +0000 (10:07 +0800)]
nfs: Fix showing truncated fsid/dev in, /proc/net/nfsfs/volumes

A truncated fsid showing from /proc/fs/nfsfs/volumes as,
NV SERVER   PORT DEV     FSID              FSC
v4 c0a80881  801 0:43    34931f044c2a439b  no

It should be as,
NV SERVER   PORT DEV          FSID                              FSC
v4 c0a80881  801 0:43         34931f044c2a439b:954c5d830fa4be8c no

The max buffer length for storing "%llx:%llx" format should be
 16 + 1 + 16 + 1 = 34 (16 for %llx, 1 for ':', 1 for '\0').

Also, for storing "%u:%u" of MAJOR() and MINOR() should be
 8 + 1 + 3 + 1 = 13 (8 for 2^24, 1 for ':', 3 for 2^8, 1 for '\0').

v2, add comments for dev/fsid buffer and use sizeof in snprintf.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agonfs: make nfs4_init_uniform_client_string use a dynamically allocated buffer
Jeff Layton [Tue, 9 Jun 2015 23:44:00 +0000 (19:44 -0400)]
nfs: make nfs4_init_uniform_client_string use a dynamically allocated buffer

Change the uniform client string generator to dynamically allocate the
NFSv4 client name string buffer. With this patch, we can eliminate the
buffers that are embedded within the "args" structs and simply use the
name string that is hanging off the client.

This uniform string case is a little simpler than the nonuniform since
we don't need to deal with RCU, but we do have two different cases,
depending on whether there is a uniquifier or not.

Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agonfs: make nfs4_init_nonuniform_client_string use a dynamically allocated buffer
Jeff Layton [Tue, 9 Jun 2015 23:43:59 +0000 (19:43 -0400)]
nfs: make nfs4_init_nonuniform_client_string use a dynamically allocated buffer

The way the *_client_string functions work is a little goofy. They build
the string in an on-stack buffer and then use kstrdup to copy it. This
is not only stack-heavy but artificially limits the size of the client
name string. Change it so that we determine the length of the string,
allocate it and then scnprintf into it.

Since the contents of the nonuniform string depend on rcu-managed data
structures, it's possible that they'll change between when we allocate
the string and when we go to fill it. If that happens, free the string,
recalculate the length and try again. If it the mismatch isn't resolved
on the second try then just give up and return -EINVAL.

Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agonfs: update maxsz values for SETCLIENTID and EXCHANGE_ID
Jeff Layton [Tue, 9 Jun 2015 23:43:58 +0000 (19:43 -0400)]
nfs: update maxsz values for SETCLIENTID and EXCHANGE_ID

The spec allows for up to NFS4_OPAQUE_LIMIT (1k). While we'll almost
certainly never use that much, these ops are generally the only ones
in the compound so we might as well allow for them to be that large.

Also, the existing code didn't add in a word for the opaque length
field for either name string. Fix that while we're in there.

Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agonfs: convert setclientid and exchange_id encoders to use clp->cl_owner_id
Jeff Layton [Tue, 9 Jun 2015 23:43:57 +0000 (19:43 -0400)]
nfs: convert setclientid and exchange_id encoders to use clp->cl_owner_id

...instead of buffers that are part of their arg structs. We already
hold a reference to the client, so we might as well use the allocated
buffer. In the event that we can't allocate the clp->cl_owner_id, then
just return -ENOMEM.

Note too that we switch from a GFP_KERNEL allocation here to GFP_NOFS.
It's possible we could end up trying to do a SETCLIENTID or EXCHANGE_ID
in order to reclaim some memory, and the GFP_KERNEL allocations in the
existing code could cause recursion back into NFS reclaim.

Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agonfs: increase size of EXCHANGE_ID name string buffer
Jeff Layton [Tue, 9 Jun 2015 23:43:56 +0000 (19:43 -0400)]
nfs: increase size of EXCHANGE_ID name string buffer

The current buffer is much too small if you have a relatively long
hostname. Bring it up to the size of the one that SETCLIENTID has.

Cc: <stable@vger.kernel.org>
Reported-by: Michael Skralivetsky <michael.skralivetsky@primarydata.com>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agopnfs/flexfiles: use swap() in ff_layout_sort_mirrors()
Fabian Frederick [Fri, 12 Jun 2015 16:58:50 +0000 (18:58 +0200)]
pnfs/flexfiles: use swap() in ff_layout_sort_mirrors()

Use kernel.h macro definition.

Thanks to Julia Lawall for Coccinelle scripting support.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoSUNRPC: never enqueue a ->rq_cong request on ->sending
Neil Brown [Mon, 15 Jun 2015 05:55:30 +0000 (15:55 +1000)]
SUNRPC: never enqueue a ->rq_cong request on ->sending

If the sending queue has a task without ->rq_cong set at the front,
and then a number of tasks with ->rq_cong set such that they use
the entire congestion window, then the queue deadlocks.  The first
entry cannot be processed until later entries complete.

This scenario has been seen with a client using UDP to access a server,
and the network connection breaking for a period of time - it doesn't
recover.

It never really makes sense for an ->rq_cong request to be on the ->sending
queue, but it can happen when a request is being retried, and finds
the transport if locked (XPRT_LOCKED).  In this case we simple call
__xprt_put_cong() and the deadlock goes away.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoxprtrdma: Reduce per-transport MR allocation
Chuck Lever [Tue, 26 May 2015 15:53:33 +0000 (11:53 -0400)]
xprtrdma: Reduce per-transport MR allocation

Reduce resource consumption per-transport to make way for increasing
the credit limit and maximum r/wsize. Pre-allocate fewer MRs.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
9 years agoxprtrdma: Stack relief in fmr_op_map()
Chuck Lever [Tue, 26 May 2015 15:53:23 +0000 (11:53 -0400)]
xprtrdma: Stack relief in fmr_op_map()

fmr_op_map() declares a 64 element array of u64 in automatic
storage. This is 512 bytes (8 * 64) on the stack.

Instead, when FMR memory registration is in use, pre-allocate a
physaddr array for each rpcrdma_mw.

This is a pre-requisite for increasing the r/wsize maximum for
FMR on platforms with 4KB pages.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
9 years agoxprtrdma: Split rb_lock
Chuck Lever [Tue, 26 May 2015 15:53:13 +0000 (11:53 -0400)]
xprtrdma: Split rb_lock

/proc/lock_stat showed contention between rpcrdma_buffer_get/put
and the MR allocation functions during I/O intensive workloads.

Now that MRs are no longer allocated in rpcrdma_buffer_get(),
there's no reason the rb_mws list has to be managed using the
same lock as the send/receive buffers. Split that lock. The
new lock does not need to disable interrupts because buffer
get/put is never called in an interrupt context.

struct rpcrdma_buffer is re-arranged to ensure rb_mwlock and rb_mws
are always in a different cacheline than rb_lock and the buffer
pointers.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
9 years agoxprtrdma: Remove rpcrdma_ia::ri_memreg_strategy
Chuck Lever [Tue, 26 May 2015 15:53:04 +0000 (11:53 -0400)]
xprtrdma: Remove rpcrdma_ia::ri_memreg_strategy

Clean up: This field is no longer used.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
9 years agoxprtrdma: Remove ->ro_reset
Chuck Lever [Tue, 26 May 2015 15:52:54 +0000 (11:52 -0400)]
xprtrdma: Remove ->ro_reset

An RPC can exit at any time. When it does so, xprt_rdma_free() is
called, and it calls ->op_unmap().

If ->ro_reset() is running due to a transport disconnect, the two
methods can race while processing the same rpcrdma_mw. The results
are unpredictable.

Because of this, in previous patches I've altered ->ro_map() to
handle MR reset. ->ro_reset() is no longer needed and can be
removed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
9 years agoxprtrdma: Remove unused LOCAL_INV recovery logic
Chuck Lever [Tue, 26 May 2015 15:52:44 +0000 (11:52 -0400)]
xprtrdma: Remove unused LOCAL_INV recovery logic

Clean up: Remove functions no longer used to recover broken FRMRs.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
9 years agoxprtrdma: Acquire MRs in rpcrdma_register_external()
Chuck Lever [Tue, 26 May 2015 15:52:35 +0000 (11:52 -0400)]
xprtrdma: Acquire MRs in rpcrdma_register_external()

Acquiring 64 MRs in rpcrdma_buffer_get() while holding the buffer
pool lock is expensive, and unnecessary because most modern adapters
can transfer 100s of KBs of payload using just a single MR.

Instead, acquire MRs one-at-a-time as chunks are registered, and
return them to rb_mws immediately during deregistration.

Note: commit 539431a437d2 ("xprtrdma: Don't invalidate FRMRs if
registration fails") is reverted: There is now a valid case where
registration can fail (with -ENOMEM) but the QP is still in RTS.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
9 years agoxprtrdma: Introduce an FRMR recovery workqueue
Chuck Lever [Tue, 26 May 2015 15:52:25 +0000 (11:52 -0400)]
xprtrdma: Introduce an FRMR recovery workqueue

After a transport disconnect, FRMRs can be left in an undetermined
state. In particular, the MR's rkey is no good.

Currently, FRMRs are fixed up by the transport connect worker, but
that can race with ->ro_unmap if an RPC happens to exit while the
transport connect worker is running.

A better way of dealing with broken FRMRs is to detect them before
they are re-used by ->ro_map. Such FRMRs are either already invalid
or are owned by the sending RPC, and thus no race with ->ro_unmap
is possible.

Introduce a mechanism for handing broken FRMRs to a workqueue to be
reset in a context that is appropriate for allocating resources
(ie. an ib_alloc_fast_reg_mr() API call).

This mechanism is not yet used, but will be in subsequent patches.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-By: Devesh Sharma <devesh.sharma@avagotech.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
9 years agoxprtrdma: Acquire FMRs in rpcrdma_fmr_register_external()
Chuck Lever [Tue, 26 May 2015 15:52:16 +0000 (11:52 -0400)]
xprtrdma: Acquire FMRs in rpcrdma_fmr_register_external()

Acquiring 64 FMRs in rpcrdma_buffer_get() while holding the buffer
pool lock is expensive, and unnecessary because FMR mode can
transfer up to a 1MB payload using just a single ib_fmr.

Instead, acquire ib_fmrs one-at-a-time as chunks are registered, and
return them to rb_mws immediately during deregistration.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
9 years agoxprtrdma: Introduce helpers for allocating MWs
Chuck Lever [Tue, 26 May 2015 15:52:06 +0000 (11:52 -0400)]
xprtrdma: Introduce helpers for allocating MWs

We eventually want to handle allocating MWs one at a time, as
needed, instead of grabbing 64 and throwing them at each RPC in the
pipeline.

Add a helper for grabbing an MW off rb_mws, and a helper for
returning an MW to rb_mws. These will be used in a subsequent patch.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
9 years agoxprtrdma: Use ib_device pointer safely
Chuck Lever [Tue, 26 May 2015 15:51:56 +0000 (11:51 -0400)]
xprtrdma: Use ib_device pointer safely

The connect worker can replace ri_id, but prevents ri_id->device
from changing during the lifetime of a transport instance. The old
ID is kept around until a new ID is created and the ->device is
confirmed to be the same.

Cache a copy of ri_id->device in rpcrdma_ia and in rpcrdma_rep.
The cached copy can be used safely in code that does not serialize
with the connect worker.

Other code can use it to save an extra address generation (one
pointer dereference instead of two).

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
9 years agoxprtrdma: Remove rr_func
Chuck Lever [Tue, 26 May 2015 15:51:46 +0000 (11:51 -0400)]
xprtrdma: Remove rr_func

A posted rpcrdma_rep never has rr_func set to anything but
rpcrdma_reply_handler.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
9 years agoxprtrdma: Replace rpcrdma_rep::rr_buffer with rr_rxprt
Chuck Lever [Tue, 26 May 2015 15:51:37 +0000 (11:51 -0400)]
xprtrdma: Replace rpcrdma_rep::rr_buffer with rr_rxprt

Clean up: Instead of carrying a pointer to the buffer pool and
the rpc_xprt, carry a pointer to the controlling rpcrdma_xprt.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
9 years agoxprtrdma: Warn when there are orphaned IB objects
Chuck Lever [Tue, 26 May 2015 15:51:27 +0000 (11:51 -0400)]
xprtrdma: Warn when there are orphaned IB objects

WARN during transport destruction if ib_dealloc_pd() fails. This is
a sign that xprtrdma orphaned one or more RDMA API objects at some
point, which can pin lower layer kernel modules and cause shutdown
to hang.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com>
Tested-By: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
9 years agoNFS: Ensure that we update the sequence id under the slot table lock
Trond Myklebust [Fri, 12 Jun 2015 01:13:58 +0000 (21:13 -0400)]
NFS: Ensure that we update the sequence id under the slot table lock

Fix a callback slot table regression.

Fixes: e937ee714b2d ("nfs: Only update callback sequnce id when CB_SEQUENCE success")
Cc: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agonfs: Initialize cb_sequenceres information before validate_seqid()
Kinglong Mee [Tue, 2 Jun 2015 10:59:01 +0000 (18:59 +0800)]
nfs: Initialize cb_sequenceres information before validate_seqid()

For a cb_layoutrecall replay, nfsd got CB_SEQUENCE status of zero,
but all informations of cb_sequenceres are zero too !!!

validate_seqid() return NFS4ERR_RETRY_UNCACHED_REP for a replay,
and skip the initlize cb_sequenceres.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agonfs: deny backchannel RPCs with an incorrect authflavor instead of dropping them
Jeff Layton [Thu, 4 Jun 2015 22:40:13 +0000 (18:40 -0400)]
nfs: deny backchannel RPCs with an incorrect authflavor instead of dropping them

A drop should really only be done when the frame is malformed or we have
reason to think that there is some sort of DoS going on. When we get an
RPC with bad auth, we should send back an error instead.

Cc: Andy Adamson <William.Adamson@netapp.com>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoSUNRPC: Address kbuild warning in net/sunrpc/debugfs.c
Chuck Lever [Thu, 11 Jun 2015 17:47:10 +0000 (13:47 -0400)]
SUNRPC: Address kbuild warning in net/sunrpc/debugfs.c

Cross-compile test on ARCH=mn10300:

In file included from include/linux/list.h:8:0,
                 from include/linux/wait.h:6,
                 from include/linux/fs.h:6,
                 from include/linux/debugfs.h:18,
                 from net/sunrpc/debugfs.c:7:
net/sunrpc/debugfs.c: In function 'fault_disconnect_write':
include/linux/kernel.h:723:17: warning: comparison of distinct pointer
types lacks a cast
    (void) (&_min1 == &_min2);  \
                   ^
>> net/sunrpc/debugfs.c:307:8: note: in expansion of macro 'min'
    len = min(len, sizeof(buffer) - 1);

Fixes: ('SUNRPC: Transport fault injection')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agonfs: Only update callback sequnce id when CB_SEQUENCE success
Kinglong Mee [Tue, 2 Jun 2015 10:58:46 +0000 (18:58 +0800)]
nfs: Only update callback sequnce id when CB_SEQUENCE success

When testing pnfs layout, nfsd got error NFS4ERR_SEQ_MISORDERED.
It is caused by nfs return NFS4ERR_DELAY before validate_seqid(),
don't update the sequnce id, but nfsd updates the sequnce id !!!

According to RFC5661 20.9.3,
" If CB_SEQUENCE returns an error, then the state of the slot
  (sequence ID, cached reply) MUST NOT change. "

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoNFS: Convert use of __constant_htonl to htonl
Vaishali Thakkar [Wed, 10 Jun 2015 14:45:29 +0000 (20:15 +0530)]
NFS: Convert use of __constant_htonl to htonl

In little endian cases, the macro htonl unfolds to __swab32 which
provides special case for constants. In big endian cases,
__constant_htonl and htonl expand directly to the same expression.
So, replace __constant_htonl with htonl with the goal of getting
rid of the definition of __constant_htonl completely.

The semantic patch that performs this transformation is as follows:

@@expression x;@@

- __constant_htonl(x)
+ htonl(x)

Signed-off-by: Vaishali Thakkar <vthakkar1994@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoSUNRPC: Transport fault injection
Chuck Lever [Mon, 11 May 2015 18:02:25 +0000 (14:02 -0400)]
SUNRPC: Transport fault injection

It has been exceptionally useful to exercise the logic that handles
local immediate errors and RDMA connection loss.  To enable
developers to test this regularly and repeatably, add logic to
simulate connection loss every so often.

Fault injection is disabled by default. It is enabled with

  $ sudo echo xxx > /sys/kernel/debug/sunrpc/inject_fault/disconnect

where "xxx" is a large positive number of transport method calls
before a disconnect. A value of several thousand is usually a good
number that allows reasonable forward progress while still causing a
lot of connection drops.

These hooks are disabled when SUNRPC_DEBUG is turned off.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoNFS: Remove unused nfs_rw_ops->rw_release() function
Anna Schumaker [Wed, 10 Jun 2015 20:54:28 +0000 (16:54 -0400)]
NFS: Remove unused nfs_rw_ops->rw_release() function

This was only ever set to nfs_writeback_release_common(), a function
which is completely empty.  Let's just drop this function pointer and
simplify the code a bit.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoNFSv4: handle nfs4_get_referral failure
Dominique Martinet [Thu, 4 Jun 2015 15:04:17 +0000 (17:04 +0200)]
NFSv4: handle nfs4_get_referral failure

nfs4_proc_lookup_common is supposed to return a posix error, we have to
handle any error returned that isn't errno

Reported-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Frank S. Filz <ffilzlnx@mindspring.com>
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agosunrpc: turn swapper_enable/disable functions into rpc_xprt_ops
Jeff Layton [Wed, 3 Jun 2015 20:14:29 +0000 (16:14 -0400)]
sunrpc: turn swapper_enable/disable functions into rpc_xprt_ops

RDMA xprts don't have a sock_xprt, but an rdma_xprt, so the
xs_swapper_enable/disable functions will likely oops when fed an RDMA
xprt. Turn these functions into rpc_xprt_ops so that that doesn't
occur. For now the RDMA versions are no-ops that just return -EINVAL
on an attempt to swapon.

Cc: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agosunrpc: lock xprt before trying to set memalloc on the sockets
Jeff Layton [Wed, 3 Jun 2015 20:14:28 +0000 (16:14 -0400)]
sunrpc: lock xprt before trying to set memalloc on the sockets

It's possible that we could race with a call to xs_reset_transport, in
which case the xprt->inet pointer could be zeroed out while we're
accessing it. Lock the xprt before we try to set memalloc on it.

Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agosunrpc: if we're closing down a socket, clear memalloc on it first
Jeff Layton [Wed, 3 Jun 2015 20:14:27 +0000 (16:14 -0400)]
sunrpc: if we're closing down a socket, clear memalloc on it first

We currently increment the memalloc_socks counter if we have a xprt that
is associated with a swapfile. That socket can be replaced however
during a reconnect event, and the memalloc_socks counter is never
decremented if that occurs.

When tearing down a xprt socket, check to see if the xprt is set up for
swapping and sk_clear_memalloc before releasing the socket if so.

Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agosunrpc: make xprt->swapper an atomic_t
Jeff Layton [Wed, 3 Jun 2015 20:14:26 +0000 (16:14 -0400)]
sunrpc: make xprt->swapper an atomic_t

Split xs_swapper into enable/disable functions and eliminate the
"enable" flag.

Currently, it's racy if you have multiple swapon/swapoff operations
running in parallel over the same xprt. Also fix it so that we only
set it to a memalloc socket on a 0->1 transition and only clear it
on a 1->0 transition.

Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agosunrpc: keep a count of swapfiles associated with the rpc_clnt
Jeff Layton [Wed, 3 Jun 2015 20:14:25 +0000 (16:14 -0400)]
sunrpc: keep a count of swapfiles associated with the rpc_clnt

Jerome reported seeing a warning pop when working with a swapfile on
NFS. The nfs_swap_activate can end up calling sk_set_memalloc while
holding the rcu_read_lock and that function can sleep.

To fix that, we need to take a reference to the xprt while holding the
rcu_read_lock, set the socket up for swapping and then drop that
reference. But, xprt_put is not exported and having NFS deal with the
underlying xprt is a bit of layering violation anyway.

Fix this by adding a set of activate/deactivate functions that take a
rpc_clnt pointer instead of an rpc_xprt, and have nfs_swap_activate and
nfs_swap_deactivate call those.

Also, add a per-rpc_clnt atomic counter to keep track of the number of
active swapfiles associated with it. When the counter does a 0->1
transition, we enable swapping on the xprt, when we do a 1->0 transition
we disable swapping on it.

This also allows us to be a bit more selective with the RPC_TASK_SWAPPER
flag. If non-swapper and swapper clnts are sharing a xprt, then we only
need to flag the tasks from the swapper clnt with that flag.

Acked-by: Mel Gorman <mgorman@suse.de>
Reported-by: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
9 years agoLinux 4.1-rc7 v4.1-rc7
Linus Torvalds [Mon, 8 Jun 2015 03:23:50 +0000 (20:23 -0700)]
Linux 4.1-rc7

9 years agoMerge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus
Linus Torvalds [Sun, 7 Jun 2015 23:56:10 +0000 (16:56 -0700)]
Merge branch 'upstream' of git://git.linux-mips.org/ralf/upstream-linus

Pull MIPS updates from Ralf Baechle:
 "Eight fixes across arch/mips.  Nothing stands particuarly out nor is
  complicated but fixes keep coming in at a higher than comfortable
  rate"

* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus:
  MIPS: KVM: Do not sign extend on unsigned MMIO load
  MIPS: BPF: Fix stack pointer allocation
  MIPS: Loongson-3: Fix a cpu-hotplug issue in loongson3_ipi_interrupt()
  MIPS: Fix enabling of DEBUG_STACKOVERFLOW
  MIPS: c-r4k: Fix typo in probe_scache()
  MIPS: Avoid an FPE exception in FCSR mask probing
  MIPS: ath79: Add a missing new line in log message
  MIPS: ralink: Fix clearing the illegal access interrupt