cascardo/linux.git
10 years agopowerpc: split hugepage when using subpage protection
Aneesh Kumar K.V [Thu, 20 Jun 2013 09:00:25 +0000 (14:30 +0530)]
powerpc: split hugepage when using subpage protection

We find all the overlapping vma and mark them such that we don't allocate
hugepage in that range. Also we split existing huge page so that the
normal page hash can be invalidated and new page faulted in with new
protection bits.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc: disable assert_pte_locked for collapse_huge_page
Aneesh Kumar K.V [Thu, 20 Jun 2013 09:00:24 +0000 (14:30 +0530)]
powerpc: disable assert_pte_locked for collapse_huge_page

With THP we set pmd to none, before we do pte_clear. Hence we can't
walk page table to get the pte lock ptr and verify whether it is locked.
THP do take pte lock before calling pte_clear. So we don't change the locking
rules here. It is that we can't use page table walking to check whether
pte locks are held with THP.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc: Prevent gcc to re-read the pagetables
Aneesh Kumar K.V [Thu, 20 Jun 2013 09:00:23 +0000 (14:30 +0530)]
powerpc: Prevent gcc to re-read the pagetables

GCC is very likely to read the pagetables just once and cache them in
the local stack or in a register, but it is can also decide to re-read
the pagetables. The problem is that the pagetable in those places can
change from under gcc.

With THP/hugetlbfs the pmd (and pgd for hugetlbfs giga pages) can
change under gup_fast. The pages won't be freed untill we finish
gup fast because we have irq disabled and we free these pages via
rcu callback.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc: Make linux pagetable walk safe with THP enabled
Aneesh Kumar K.V [Thu, 20 Jun 2013 09:00:22 +0000 (14:30 +0530)]
powerpc: Make linux pagetable walk safe with THP enabled

We need to have irqs disabled to handle all the possible parallel update for
linux page table without holding locks.

Events that we are intersted in while walking page tables are
1) Page fault
2) umap
3) THP split
4) THP collapse

A) local_irq_disabled:
------------------------
1) page fault:
A none to valid transition via page fault is not an issue because we
would either see a none or valid. If it is none, we would error out
the page table walk. We may need to use on stack values when checking for
type of page table elements, because if we do

if (!is_hugepd()) {
    if (!pmd_none() {
       if (pmd_bad() {

We could take that bad condition because the pmd got converted to a hugepd
after the !is_hugepd check via a hugetlb fault.

The right way would be to check for pmd_none higher up or use on stack value.

2) A valid to none conversion via unmap:
We can safely walk the upper level table, because we don't remove the the
page table entries until rcu grace period. So even if we followed a
wrong pointer we still have the pointer valid till the grace period.

A PTE pointer returned need to be atomically checked for _PAGE_PRESENT and
 _PAGE_BUSY. A valid pointer returned could becoming none later. To prevent
pte_clear we take _PAGE_BUSY.

3) THP split:
A valid transparent hugepage is converted to nomal page. Before we split we
do pmd_splitting_flush, which sets the hugepage PTE to _PAGE_SPLITTING
So when walking page table we need to check for pmd_trans_splitting and
handle that. The pte returned should also need to be checked for
_PAGE_SPLITTING before setting _PAGE_BUSY similar to _PAGE_PRESENT. We save
the value of PTE on stack and check for the flag in the local pte value.
If we don't have the value set we can safely operate on the local pte value
and we atomicaly set _PAGE_BUSY.

4) THP collapse:
A normal page gets converted to hugepage. In the collapse path, we
mark the pmd none early (pmdp_clear_flush). With irq disabled, if we
are aleady walking page table we would see the pmd_none and won't continue.
If we see a valid PMD, we should still check for _PAGE_PRESENT before
setting _PAGE_BUSY, to make sure we didn't collapse the PTE to a Huge PTE.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/THP: Add code to handle HPTE faults for hugepages
Aneesh Kumar K.V [Thu, 20 Jun 2013 09:00:21 +0000 (14:30 +0530)]
powerpc/THP: Add code to handle HPTE faults for hugepages

The deposted PTE page in the second half of the PMD table is used to
track the state on hash PTEs. After updating the HPTE, we mark the
coresponding slot in the deposted PTE page valid.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc: Update gup_pmd_range to handle transparent hugepages
Aneesh Kumar K.V [Thu, 20 Jun 2013 09:00:20 +0000 (14:30 +0530)]
powerpc: Update gup_pmd_range to handle transparent hugepages

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/kvm: Handle transparent hugepage in KVM
Aneesh Kumar K.V [Thu, 20 Jun 2013 09:00:19 +0000 (14:30 +0530)]
powerpc/kvm: Handle transparent hugepage in KVM

We can find pte that are splitting while walking page tables. Return
None pte in that case.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc: Replace find_linux_pte with find_linux_pte_or_hugepte
Aneesh Kumar K.V [Thu, 20 Jun 2013 09:00:18 +0000 (14:30 +0530)]
powerpc: Replace find_linux_pte with find_linux_pte_or_hugepte

Replace find_linux_pte with find_linux_pte_or_hugepte and explicitly
document why we don't need to handle transparent hugepages at callsites.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc: Update find_linux_pte_or_hugepte to handle transparent hugepages
Aneesh Kumar K.V [Thu, 20 Jun 2013 09:00:17 +0000 (14:30 +0530)]
powerpc: Update find_linux_pte_or_hugepte to handle transparent hugepages

Reviewed-by: David Gibson <dwg@au1.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc: move find_linux_pte_or_hugepte and gup_hugepte to common code
Aneesh Kumar K.V [Thu, 20 Jun 2013 09:00:16 +0000 (14:30 +0530)]
powerpc: move find_linux_pte_or_hugepte and gup_hugepte to common code

We will use this in the later patch for handling THP pages

Reviewed-by: David Gibson <dwg@au1.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/THP: Implement transparent hugepages for ppc64
Aneesh Kumar K.V [Thu, 20 Jun 2013 09:00:15 +0000 (14:30 +0530)]
powerpc/THP: Implement transparent hugepages for ppc64

We now have pmd entries covering 16MB range and the PMD table double its original size.
We use the second half of the PMD table to deposit the pgtable (PTE page).
The depoisted PTE page is further used to track the HPTE information. The information
include [ secondary group | 3 bit hidx | valid ]. We use one byte per each HPTE entry.
With 16MB hugepage and 64K HPTE we need 256 entries and with 4K HPTE we need
4096 entries. Both will fit in a 4K PTE page. On hugepage invalidate we need to walk
the PTE page and invalidate all valid HPTEs.

This patch implements necessary arch specific functions for THP support and also
hugepage invalidate logic. These PMD related functions are intentionally kept
similar to their PTE counter-part.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/THP: Double the PMD table size for THP
Aneesh Kumar K.V [Thu, 20 Jun 2013 09:00:14 +0000 (14:30 +0530)]
powerpc/THP: Double the PMD table size for THP

THP code does PTE page allocation along with large page request and deposit them
for later use. This is to ensure that we won't have any failures when we split
hugepages to regular pages.

On powerpc we want to use the deposited PTE page for storing hash pte slot and
secondary bit information for the HPTEs. We use the second half
of the pmd table to save the deposted PTE page.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/mm: handle hugepage size correctly when invalidating hpte entries
Aneesh Kumar K.V [Thu, 20 Jun 2013 09:00:13 +0000 (14:30 +0530)]
powerpc/mm: handle hugepage size correctly when invalidating hpte entries

If a hash bucket gets full, we "evict" a more/less random entry from it.
When we do that we don't invalidate the TLB (hpte_remove) because we assume
the old translation is still technically "valid". This implies that when
we are invalidating or updating pte, even if HPTE entry is not valid
we should do a tlb invalidate. With hugepages, we need to pass the correct
actual page size value for tlb invalidation.

This change update the patch 0608d692463598c1d6e826d9dd7283381b4f246c
"powerpc/mm: Always invalidate tlb on hpte invalidate and update" to handle
transparent hugepages correctly.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/eeh: Debugfs for error injection
Gavin Shan [Thu, 20 Jun 2013 10:13:26 +0000 (18:13 +0800)]
powerpc/eeh: Debugfs for error injection

The patch creates debugfs entries (powerpc/PCIxxxx/err_injct) for
injecting EEH errors for testing purpose.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/powernv: Debugfs directory for PHB
Gavin Shan [Thu, 20 Jun 2013 10:13:25 +0000 (18:13 +0800)]
powerpc/powernv: Debugfs directory for PHB

The patch creates one debugfs directory ("powerpc/PCIxxxx") for
each PHB so that we can hook EEH error injection debugfs entry
there in proceeding patch.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/eeh: Register OPAL notifier for PCI error
Gavin Shan [Thu, 20 Jun 2013 10:13:24 +0000 (18:13 +0800)]
powerpc/eeh: Register OPAL notifier for PCI error

The patch registers OPAL event notifier and process the PCI errors
from firmware. If we have pending PCI errors, special EEH event
(without binding PE) will be sent to EEH core for processing.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowernv/opal: Disable OPAL notifier upon poweroff
Gavin Shan [Thu, 20 Jun 2013 10:13:23 +0000 (18:13 +0800)]
powernv/opal: Disable OPAL notifier upon poweroff

While we're restarting or powering off the system, we needn't
the OPAL notifier any more. So just to disable that.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowernv/opal: Notifier for OPAL events
Gavin Shan [Thu, 20 Jun 2013 10:13:22 +0000 (18:13 +0800)]
powernv/opal: Notifier for OPAL events

This patch implements a notifier to receive a notification on OPAL
event mask changes. The notifier is only called as a result of an OPAL
interrupt, which will happen upon reception of FSP messages or PCI errors.
Any event mask change detected as a result of opal_poll_events() will not
result in a notifier call.

[benh: changelog]
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/eeh: Allow to check fenced PHB proactively
Gavin Shan [Thu, 20 Jun 2013 05:21:16 +0000 (13:21 +0800)]
powerpc/eeh: Allow to check fenced PHB proactively

It's meaningless to handle frozen PE if we already had fenced PHB.
The patch intends to check the PHB state before checking PE. If the
PHB has been put into fenced state, we need take care of that firstly.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/eeh: Enable EEH check for config access
Gavin Shan [Thu, 20 Jun 2013 05:21:15 +0000 (13:21 +0800)]
powerpc/eeh: Enable EEH check for config access

The patch enables EEH check and let EEH core to process the EEH
errors for PowerNV platform while accessing config space. Originally,
the implementation already had mechanism to check EEH errors and
tried to recover from them. However, we never let EEH core to handle
the EEH errors.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/eeh: Initialization for PowerNV
Gavin Shan [Thu, 20 Jun 2013 05:21:14 +0000 (13:21 +0800)]
powerpc/eeh: Initialization for PowerNV

The patch initializes EEH for PowerNV platform. Because the OPAL
APIs requires HUB ID, we need trace that through struct pnv_phb.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/eeh: PowerNV EEH backends
Gavin Shan [Thu, 20 Jun 2013 05:21:13 +0000 (13:21 +0800)]
powerpc/eeh: PowerNV EEH backends

The patch adds EEH backends for PowerNV platform. It's notable that
part of those EEH backends call to the I/O chip dependent backends.

[Removed pointless change to eeh_pseries.c -- BenH]

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/eeh: I/O chip next error
Gavin Shan [Thu, 20 Jun 2013 05:21:12 +0000 (13:21 +0800)]
powerpc/eeh: I/O chip next error

The patch implements the backend for EEH core to retrieve next
EEH error to handle. For the informational errors, we won't bother
the EEH core. Otherwise, the EEH should take appropriate actions
depending on the return value:

0 - No further errors detected
1 - Frozen PE
2 - Fenced PHB
3 - Dead PHB
4 - Dead IOC

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/eeh: I/O chip PE log and bridge setup
Gavin Shan [Thu, 20 Jun 2013 05:21:11 +0000 (13:21 +0800)]
powerpc/eeh: I/O chip PE log and bridge setup

The patch adds backends to retrieve error log and configure p2p
bridges for the indicated PE.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/eeh: I/O chip PE reset
Gavin Shan [Thu, 20 Jun 2013 05:21:10 +0000 (13:21 +0800)]
powerpc/eeh: I/O chip PE reset

The patch adds the I/O chip backend to do PE reset. For now, we
focus on PCI bus dependent PE. If PHB PE has been put into error
state, the PHB will take complete reset. Besides, the root bridge
will take fundamental or hot reset accordingly if the indicated
PE locates at the toppest of PCI hierarchy tree. Otherwise, the
upstream p2p bridge will take hot reset.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/eeh: I/O chip EEH state retrieval
Gavin Shan [Thu, 20 Jun 2013 05:21:09 +0000 (13:21 +0800)]
powerpc/eeh: I/O chip EEH state retrieval

The patch adds I/O chip backend to retrieve the state for the
indicated PE. While the PE state is temperarily unavailable,
the upper layer (powernv platform) should return default delay
(1 second).

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/eeh: I/O chip EEH enable option
Gavin Shan [Thu, 20 Jun 2013 05:21:08 +0000 (13:21 +0800)]
powerpc/eeh: I/O chip EEH enable option

The patch adds the backend to enable or disable EEH functionality
for the specified PE. The backend is also used to enable MMIO or
DMA path for the problematic PE. It's notable that all PEs on
PowerNV platform support EEH functionality by default, and we
disallow to disable EEH for the specific PE.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/eeh: I/O chip post initialization
Gavin Shan [Thu, 20 Jun 2013 05:21:07 +0000 (13:21 +0800)]
powerpc/eeh: I/O chip post initialization

The post initialization (struct eeh_ops::post_init) is called after
the EEH probe is done. On the other hand, the EEH core post
initialization is designed to call platform and then I/O chip backend
on PowerNV platform.

The patch adds the backend for I/O chip to notify the platform
that the specific PHB is ready to supply EEH service.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/eeh: EEH backend for P7IOC
Gavin Shan [Thu, 20 Jun 2013 05:21:06 +0000 (13:21 +0800)]
powerpc/eeh: EEH backend for P7IOC

For EEH on PowerNV platform, the overall architecture is different
from that on pSeries platform. In order to support multiple I/O chips
in future, we split EEH to 3 layers for PowerNV platform: EEH core,
platform layer, I/O layer. It would give EEH implementation on PowerNV
platform much more flexibility in future.

The patch adds the EEH backend for P7IOC.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/eeh: Sync OPAL API with firmware
Gavin Shan [Thu, 20 Jun 2013 05:21:05 +0000 (13:21 +0800)]
powerpc/eeh: Sync OPAL API with firmware

The patch synchronizes OPAL APIs between kernel and firmware. Also,
we starts to replace opal_pci_get_phb_diag_data() with the similar
opal_pci_get_phb_diag_data2() and the former OPAL API would return
OPAL_UNSUPPORTED from now on.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/eeh: EEH core to handle special event
Gavin Shan [Thu, 20 Jun 2013 05:21:04 +0000 (13:21 +0800)]
powerpc/eeh: EEH core to handle special event

On PowerNV platform, the EEH event caused by interrupt won't have
binding PE. The patch enables EEH core to handle the special event.
To avoid the current logic we have, The eeh_handle_event() is renamed
to eeh_handle_normal_event(), and the eeh_handle_special_event() is
introduced. The function eeh_handle_event() dispatches to above two
functions according to the input parameter. Besides, new backend
"next_error" added to eeh_ops and it's expected to have following
return values:

        4 - Dead IOC           3 - Dead PHB
        2 - Fenced PHB         1 - Frozen PE
        0 - No error found

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/eeh: Export confirm_error_lock
Gavin Shan [Thu, 20 Jun 2013 05:21:03 +0000 (13:21 +0800)]
powerpc/eeh: Export confirm_error_lock

An EEH event is created and queued to the event queue for each
ingress EEH error. When there're mutiple EEH errors, we need serialize
the process to keep consistent PE state (flags). The spinlock
"confirm_error_lock" was introduced for the purpose. We'll inject
EEH event upon error reporting interrupts on PowerNV platform. So
we export the spinlock for that to use for consistent PE state.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/eeh: Allow to purge EEH events
Gavin Shan [Thu, 20 Jun 2013 05:21:02 +0000 (13:21 +0800)]
powerpc/eeh: Allow to purge EEH events

On PowerNV platform, we might run into the situation where subsequent
events are duplicated events of former one, which is being processed.
For the case, we need the function implemented by the patch to purge
EEH events accordingly.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/eeh: Trace time on first error for PE
Gavin Shan [Thu, 20 Jun 2013 05:21:01 +0000 (13:21 +0800)]
powerpc/eeh: Trace time on first error for PE

We're not expecting that one specific PE got frozen for over 5
times in last hour. Otherwise, the PE will be removed from the
system upon newly coming EEH errors. The patch introduces time
stamp to trace the first error on specific PE in last hour and
function to update that accordingly. Besides, the time stamp
is recovered during PE hotplug path as we did for frozen count.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/eeh: Single kthread to handle events
Gavin Shan [Thu, 20 Jun 2013 05:21:00 +0000 (13:21 +0800)]
powerpc/eeh: Single kthread to handle events

We possiblly have multiple kthreads running for multiple EEH errors
(events) and use one spinlock to make the process of handling those
EEH events serialized. That's unnecessary and the patch creates only
one kthread, which is started during EEH core initialization time in
eeh_init(). A new semaphore introduced to count the number of existing
EEH events in the queue and the kthread waiting on the semaphore.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/eeh: Delay EEH probe during hotplug
Gavin Shan [Thu, 20 Jun 2013 05:20:59 +0000 (13:20 +0800)]
powerpc/eeh: Delay EEH probe during hotplug

While doing EEH recovery, the PCI devices of the problematic PE
should be removed and then added to the system again. During the
so-called hotplug event, the PCI devices of the problematic PE
will be probed through early/late phase. We would delay EEH probe
on late point for PowerNV platform since the PCI device isn't
available in early phase.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/eeh: Refactor eeh_reset_pe_once()
Gavin Shan [Thu, 20 Jun 2013 05:20:58 +0000 (13:20 +0800)]
powerpc/eeh: Refactor eeh_reset_pe_once()

We shouldn't check that the returned PE status is exactly equal to
(EEH_STATE_MMIO_ACTIVE | EEH_STATE_DMA_ACTIVE) but instead only check
that they are both set.

[benh: changelog]
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/eeh: EEH post initialization operation
Gavin Shan [Thu, 20 Jun 2013 05:20:57 +0000 (13:20 +0800)]
powerpc/eeh: EEH post initialization operation

The patch adds new EEH operation post_init. It's used to notify
the platform that EEH core has completed the EEH probe. By that,
PowerNV platform starts to use the services supplied by EEH
functionality.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/eeh: Make eeh_init() public
Gavin Shan [Thu, 20 Jun 2013 05:20:56 +0000 (13:20 +0800)]
powerpc/eeh: Make eeh_init() public

For EEH on PowerNV platform, we will do EEH probe based on the
real PCI devices. The PCI devices are available after PCI probe.
So we have to call eeh_init() explicitly on PowerNV platform
after PCI probe. The patch also does EEH probe for PowerNV platform
in eeh_init().

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/eeh: Trace PCI bus from PE
Gavin Shan [Thu, 20 Jun 2013 05:20:55 +0000 (13:20 +0800)]
powerpc/eeh: Trace PCI bus from PE

There're several types of PEs can be supported for now: PHB, Bus
and Device dependent PE. For PCI bus dependent PE, tracing the
corresponding PCI bus from PE (struct eeh_pe) would make the code
more efficient. The patch also enables the retrieval of PCI bus based
on the PCI bus dependent PE.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/eeh: Make eeh_pe_get() public
Gavin Shan [Thu, 20 Jun 2013 05:20:54 +0000 (13:20 +0800)]
powerpc/eeh: Make eeh_pe_get() public

While processing EEH event interrupt from P7IOC, we need function
to retrieve the PE according to the indicated EEH device. The patch
makes function eeh_pe_get() public so that other source files can call
it for that purpose. Also, the patch fixes referring to wrong BDF
(Bus/Device/Function) address while searching PE in function
__eeh_pe_get().

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/eeh: Make eeh_phb_pe_get() public
Gavin Shan [Thu, 20 Jun 2013 05:20:53 +0000 (13:20 +0800)]
powerpc/eeh: Make eeh_phb_pe_get() public

One of the possible cases indicated by P7IOC interrupt is fenced
PHB. For that case, we need fetch the PE corresponding to the PHB
and disable the PHB and all subordinate PCI buses/devices, recover
from the fenced state and eventually enable the whole PHB. We need
one function to fetch the PHB PE outside eeh_pe.c and the patch is
going to make eeh_phb_pe_get() public for that purpose.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/eeh: Move common part to kernel directory
Gavin Shan [Thu, 20 Jun 2013 05:20:52 +0000 (13:20 +0800)]
powerpc/eeh: Move common part to kernel directory

The patch moves the common part of EEH core into arch/powerpc/kernel
directory so that we needn't PPC_PSERIES while compiling POWERNV
platform:

        * Move the EEH common part into arch/powerpc/kernel
        * Move the functions for PCI hotplug from pSeries platform to
          arch/powerpc/kernel/pci-hotplug.c
        * Move CONFIG_EEH from arch/powerpc/platforms/pseries/Kconfig to
          arch/powerpc/platforms/Kconfig
        * Adjust makefile accordingly

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/eeh: Cleanup for EEH core
Gavin Shan [Thu, 20 Jun 2013 05:20:51 +0000 (13:20 +0800)]
powerpc/eeh: Cleanup for EEH core

Cleanup on EEH core to remove unnecessary whitespaces.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/tm: Fix return of active 64bit signals
Michael Neuling [Sun, 9 Jun 2013 11:23:19 +0000 (21:23 +1000)]
powerpc/tm: Fix return of active 64bit signals

Currently we only restore signals which are transactionally suspended but it's
possible that the transaction can be restored even when it's active.  Most
likely this will result in a transactional rollback by the hardware as the
transaction will have been doomed by an earlier treclaim.

The current code is a legacy of earlier kernel implementations which did
software rollback of active transactions in the kernel.  That code has now gone
but we didn't correctly fix up this part of the signals code which still makes
assumptions based on having software rollback.

This changes the signal return code to always restore both contexts on 64 bit
signal return.  It also ensures that the MSR TM bits are properly restored from
the signal context which they are not currently.

Signed-off-by: Michael Neuling <mikey@neuling.org>
cc: stable@vger.kernel.org (v3.9+)
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/tm: Fix return of 32bit rt signals to active transactions
Michael Neuling [Sun, 9 Jun 2013 11:23:18 +0000 (21:23 +1000)]
powerpc/tm: Fix return of 32bit rt signals to active transactions

Currently we only restore signals which are transactionally suspended but it's
possible that the transaction can be restored even when it's active.  Most
likely this will result in a transactional rollback by the hardware as the
transaction will have been doomed by an earlier treclaim.

The current code is a legacy of earlier kernel implementations which did
software rollback of active transactions in the kernel.  That code has now gone
but we didn't correctly fix up this part of the signals code which still makes
assumptions based on having software rollback.

This changes the signal return code to always restore both contexts on 32 bit
rt signal return.

Signed-off-by: Michael Neuling <mikey@neuling.org>
cc: stable@vger.kernel.org (v3.9+)
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/tm: Fix restoration of MSR on 32bit signal return
Michael Neuling [Sun, 9 Jun 2013 11:23:17 +0000 (21:23 +1000)]
powerpc/tm: Fix restoration of MSR on 32bit signal return

Currently we clear out the MSR TM bits on signal return assuming that the
signal should never return to an active transaction.

This is bogus as the user may do this.  It's most likely the transaction will
be doomed due to a treclaim but that's a problem for the HW not the kernel.

The current code is a legacy of earlier kernel implementations which did
software rollback of active transactions in the kernel.  That code has now gone
but we didn't correctly fix up this part of the signals code which still makes
the assumption that it must be returning to a suspended transaction.

This pulls out both MSR TM bits from the user supplied context rather than just
setting TM suspend.  We pull out only the bits needed to ensure the user can't
do anything dangerous to the MSR.

Signed-off-by: Michael Neuling <mikey@neuling.org>
cc: stable@vger.kernel.org (v3.9+)
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/tm: Fix 32 bit non-rt signals
Michael Neuling [Sun, 9 Jun 2013 11:23:16 +0000 (21:23 +1000)]
powerpc/tm: Fix 32 bit non-rt signals

Currently sys_sigreturn() is TM unaware.  Therefore, if we take a 32 bit signal
without SIGINFO (non RT) inside a transaction, on signal return we don't
restore the signal frame correctly.

This checks if the signal frame being restoring is an active transaction, and
if so, it copies the additional state to ptregs so it can be restored.

Signed-off-by: Michael Neuling <mikey@neuling.org>
cc: stable@vger.kernel.org (v3.9+)
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/tm: Fix writing top half of MSR on 32 bit signals
Michael Neuling [Sun, 9 Jun 2013 11:23:15 +0000 (21:23 +1000)]
powerpc/tm: Fix writing top half of MSR on 32 bit signals

The MSR TM controls are in the top 32 bits of the MSR hence on 32 bit signals,
we stick the top half of the MSR in the checkpointed signal context so that the
user can access it.

Unfortunately, we don't currently write anything to the checkpointed signal
context when coming in a from a non transactional process and hence the top MSR
bits can contain junk.

This updates the 32 bit signal handling code to always write something to the
top MSR bits so that users know if the process is transactional or not and the
kernel can use it on signal return.

Signed-off-by: Michael Neuling <mikey@neuling.org>
cc: stable@vger.kernel.org (v3.9+)
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/8xx: Remove 8xx specific "minimal FPU emulation"
Benjamin Herrenschmidt [Sun, 9 Jun 2013 07:04:58 +0000 (17:04 +1000)]
powerpc/8xx: Remove 8xx specific "minimal FPU emulation"

This is duplicated code from math-emu and implements such a small
subset of the FPU (load/stores/fmr) that it's essentially pointless
nowdays.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/math-emu: Allow math-emu to be used for HW FPU
Benjamin Herrenschmidt [Sun, 9 Jun 2013 07:01:24 +0000 (17:01 +1000)]
powerpc/math-emu: Allow math-emu to be used for HW FPU

(Including 64-bit ones)

This allow SW emulation by the kernel of optional instructions
such as fsqrt which aren't implemented on some processors, and
thus fixes some Fedora 19 issues such as Anaconda since the
compiler is set to generate those by default on 64-bit.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/math-emu: Fix decoding of some instructions
Benjamin Herrenschmidt [Sun, 9 Jun 2013 07:00:42 +0000 (17:00 +1000)]
powerpc/math-emu: Fix decoding of some instructions

The decoding of some instructions such as fsqrt{s} was incorrect,
using the wrong registers, and thus could not work.

This fixes it and also adds a couple of place holders for missing
instructions.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/pseries: Read common partition via pstore
Aruna Balakrishnaiah [Wed, 5 Jun 2013 18:52:20 +0000 (00:22 +0530)]
powerpc/pseries: Read common partition via pstore

This patch exploits pstore subsystem to read details of common partition
in NVRAM to a separate file in /dev/pstore. For instance, common partition
details will be stored in a file named [common-nvram-6].

Signed-off-by: Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/pseries: Read of-config partition via pstore
Aruna Balakrishnaiah [Wed, 5 Jun 2013 18:52:10 +0000 (00:22 +0530)]
powerpc/pseries: Read of-config partition via pstore

This patch set exploits the pstore subsystem to read details of
of-config partition in NVRAM to a separate file in /dev/pstore.
For instance, of-config partition details will be stored in a
file named [of-nvram-5].

Signed-off-by: Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/pseries: Distinguish between a os-partition and non-os partition
Aruna Balakrishnaiah [Wed, 5 Jun 2013 18:51:59 +0000 (00:21 +0530)]
powerpc/pseries: Distinguish between a os-partition and non-os partition

Introduce os_partition member in nvram_os_partition structure to identify
if the partition is an os partition or not. This will be useful to handle
non-os partitions of-config and common.

Signed-off-by: Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/pseries: Read rtas partition via pstore
Aruna Balakrishnaiah [Wed, 5 Jun 2013 18:51:44 +0000 (00:21 +0530)]
powerpc/pseries: Read rtas partition via pstore

This patch set exploits the pstore subsystem to read details of rtas partition
in NVRAM to a separate file in /dev/pstore. For instance, rtas details will be
stored in a file named [rtas-nvram-4].

Signed-off-by: Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/pseries: Read/Write oops nvram partition via pstore
Aruna Balakrishnaiah [Wed, 5 Jun 2013 18:51:32 +0000 (00:21 +0530)]
powerpc/pseries: Read/Write oops nvram partition via pstore

IBM's p series machines provide persistent storage for LPARs through NVRAM.
NVRAM's lnx,oops-log partition is used to log oops messages.
Currently the kernel provides the contents of p-series NVRAM only as a
simple stream of bytes via /dev/nvram, which must be interpreted in user
space by the nvram command in the powerpc-utils package.

This patch set exploits the pstore subsystem to expose oops partition in
NVRAM as a separate file in /dev/pstore. For instance, Oops messages will be
stored in a file named [dmesg-nvram-2]. In case pstore registration fails it
will fall back to kmsg_dump mechanism.

This patch will read/write the oops messages from/to this partition via pstore.

Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/pseries: Introduce generic read function to read nvram-partitions
Aruna Balakrishnaiah [Wed, 5 Jun 2013 18:51:16 +0000 (00:21 +0530)]
powerpc/pseries: Introduce generic read function to read nvram-partitions

Introduce generic read function to read nvram partitions other than rtas.
nvram_read_error_log will be retained which is used to read rtas partition
from rtasd. nvram_read_partition is the generic read function to read from
any nvram partition.

Signed-off-by: Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/pseries: Add version and timestamp to oops header
Aruna Balakrishnaiah [Wed, 5 Jun 2013 18:51:05 +0000 (00:21 +0530)]
powerpc/pseries: Add version and timestamp to oops header

Introduce version and timestamp information in the oops header.
oops_log_info (oops header) holds version (to distinguish between old
and new format oops header), length of the oops text
(compressed or uncompressed) and timestamp.

The version field will sit in the same place as the length in old
headers. version is assigned 5000 (greater than oops partition size)
so that existing tools will refuse to dump new style partitions as
the length is too large. The updated tools will work with both
old and new format headers.

Signed-off-by: Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/pseries: Remove syslog prefix in uncompressed oops text
Aruna Balakrishnaiah [Wed, 5 Jun 2013 18:50:55 +0000 (00:20 +0530)]
powerpc/pseries: Remove syslog prefix in uncompressed oops text

Removal of syslog prefix in the uncompressed oops text will
help in capturing more oops data.

Signed-off-by: Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/eeh: Enhance converting EEH dev
Gavin Shan [Wed, 5 Jun 2013 07:34:03 +0000 (15:34 +0800)]
powerpc/eeh: Enhance converting EEH dev

Under some special circumstances, the EEH device doesn't have the
associated device tree node or PCI device. The patch enhances those
functions converting EEH device to device tree node or PCI device
accordingly to avoid unnecessary system crash.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/eeh: Fix fetching bus for single-dev-PE
Gavin Shan [Wed, 5 Jun 2013 07:34:02 +0000 (15:34 +0800)]
powerpc/eeh: Fix fetching bus for single-dev-PE

While running Linux as guest on top of phyp, we possiblly have
PE that includes single PCI device. However, we didn't return
its PCI bus correctly and it leads to failure on recovery from
EEH errors for single-dev-PE. The patch fixes the issue.

Cc: <stable@vger.kernel.org> # v3.7+
Cc: Steve Best <sbest@us.ibm.com>
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc: Align thread->fpr to 16 bytes
Anton Blanchard [Wed, 5 Jun 2013 03:02:26 +0000 (13:02 +1000)]
powerpc: Align thread->fpr to 16 bytes

On newer CPUs we use VSX loads and stores to the thread->fpr array.
For best performance we need to ensure 16 byte alignment.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/pseries: Use 'true' instead of '1' for orderly_poweroff
liguang [Thu, 30 May 2013 07:20:33 +0000 (15:20 +0800)]
powerpc/pseries: Use 'true' instead of '1' for orderly_poweroff

orderly_poweroff is expecting a bool parameter, so
use 'true' instead '1'

Signed-off-by: liguang <lig.fnst@cn.fujitsu.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/smp: Use '==' instead of '<' for system_state
liguang [Thu, 30 May 2013 06:47:53 +0000 (14:47 +0800)]
powerpc/smp: Use '==' instead of '<' for system_state

'system_state < SYSTEM_RUNNING' will have same effect
with 'system_state == SYSTEM_BOOTING', but the later
one is more clearer.

Signed-off-by: liguang <lig.fnst@cn.fujitsu.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc: Restore dbcr0 on user space exit
Bharat Bhushan [Wed, 22 May 2013 04:20:59 +0000 (09:50 +0530)]
powerpc: Restore dbcr0 on user space exit

On BookE (Branch taken + Single Step) is as same as Branch Taken
on BookS and in Linux we simulate BookS behavior for BookE as well.
When doing so, in Branch taken handling we want to set DBCR0_IC but
we update the current->thread->dbcr0 and not DBCR0.

Now on 64bit the current->thread.dbcr0 (and other debug registers)
is synchronized ONLY on context switch flow. But after handling
Branch taken in debug exception if we return back to user space
without context switch then single stepping change (DBCR0_ICMP)
does not get written in h/w DBCR0 and Instruction Complete exception
does not happen.

This fixes using ptrace reliably on BookE-PowerPC

lmbench latency test (lat_syscall) Results are (they varies a little
on each run)

1) ./lat_syscall <action> /dev/shm/uImage

action: Open read write stat fstat null
Before: 3.8618 0.2017 0.2851 1.6789 0.2256 0.0856
After: 3.8580 0.2017 0.2851 1.6955 0.2255 0.0856

1) ./lat_syscall -P 2 -N 10 <action> /dev/shm/uImage
action: Open read write stat fstat null
Before: 4.1388 0.2238 0.3066 1.7106 0.2256 0.0856
After: 4.1413 0.2236 0.3062 1.7107 0.2256 0.0856

[ Slightly modified to avoid extra branch in the fast path
  on Book3S and fix build on all non-BookE 64-bit -- BenH
]

Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc: Debug control and status registers are 32bit
Bharat Bhushan [Wed, 22 May 2013 04:20:58 +0000 (09:50 +0530)]
powerpc: Debug control and status registers are 32bit

Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/vfio: Enable on pSeries platform
Alexey Kardashevskiy [Tue, 21 May 2013 03:33:11 +0000 (13:33 +1000)]
powerpc/vfio: Enable on pSeries platform

The enables VFIO on the pSeries platform, enabling user space
programs to access PCI devices directly.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/vfio: Implement IOMMU driver for VFIO
Alexey Kardashevskiy [Tue, 21 May 2013 03:33:10 +0000 (13:33 +1000)]
powerpc/vfio: Implement IOMMU driver for VFIO

VFIO implements platform independent stuff such as
a PCI driver, BAR access (via read/write on a file descriptor
or direct mapping when possible) and IRQ signaling.

The platform dependent part includes IOMMU initialization
and handling.  This implements an IOMMU driver for VFIO
which does mapping/unmapping pages for the guest IO and
provides information about DMA window (required by a POWER
guest).

Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/vfio: Enable on PowerNV platform
Alexey Kardashevskiy [Tue, 21 May 2013 03:33:09 +0000 (13:33 +1000)]
powerpc/vfio: Enable on PowerNV platform

This initializes IOMMU groups based on the IOMMU configuration
discovered during the PCI scan on POWERNV (POWER non virtualized)
platform.  The IOMMU groups are to be used later by the VFIO driver,
which is used for PCI pass through.

It also implements an API for mapping/unmapping pages for
guest PCI drivers and providing DMA window properties.
This API is going to be used later by QEMU-VFIO to handle
h_put_tce hypercalls from the KVM guest.

The iommu_put_tce_user_mode() does only a single page mapping
as an API for adding many mappings at once is going to be
added later.

Although this driver has been tested only on the POWERNV
platform, it should work on any platform which supports
TCE tables.  As h_put_tce hypercall is received by the host
kernel and processed by the QEMU (what involves calling
the host kernel again), performance is not the best -
circa 220MB/s on 10Gb ethernet network.

To enable VFIO on POWER, enable SPAPR_TCE_IOMMU config
option and configure VFIO as required.

Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc: Update currituck pci/usb fixup for new board revision
Alistair Popple [Thu, 9 May 2013 00:42:13 +0000 (10:42 +1000)]
powerpc: Update currituck pci/usb fixup for new board revision

The currituck board uses a different IRQ for the pci usb host
controller depending on the board revision. This patch adds support
for newer board revisions by retrieving the board revision from the
FPGA and mapping the appropriate IRQ.

Signed-off-by: Alistair Popple <alistair@popple.id.au>
Acked-by: Tony Breeds <tony@bakeyournoodle.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc: Fix single step emulation of 32bit overflowed branches
Michael Neuling [Mon, 6 May 2013 11:32:40 +0000 (21:32 +1000)]
powerpc: Fix single step emulation of 32bit overflowed branches

Check truncate_if_32bit() on final write to nip.

Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc: Update default configurations
Alistair Popple [Mon, 29 Apr 2013 03:42:44 +0000 (13:42 +1000)]
powerpc: Update default configurations

Update default configurations for systems with CONFIG_BOOTX_TEXT
selected so that they continue to print early debug messages as is
currently the case.

Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc: Add a configuration option for early BootX/OpenFirmware debug
Alistair Popple [Mon, 29 Apr 2013 03:42:43 +0000 (13:42 +1000)]
powerpc: Add a configuration option for early BootX/OpenFirmware debug

Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/prom: Scan reserved-ranges node for memory reservations
Jeremy Kerr [Wed, 24 Apr 2013 06:26:30 +0000 (14:26 +0800)]
powerpc/prom: Scan reserved-ranges node for memory reservations

Based on benh's proposal at
https://lists.ozlabs.org/pipermail/linuxppc-dev/2012-September/101237.html,
this change provides support for reserving memory from the
reserved-ranges node at the root of the device tree.

We just call memblock_reserve on these ranges for now.

Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/mm: Make mmap_64.c compile on 32bit powerpc
Daniel Walker [Wed, 24 Apr 2013 00:50:33 +0000 (17:50 -0700)]
powerpc/mm: Make mmap_64.c compile on 32bit powerpc

There appears to be no good reason to keep this as 64bit only. It works
on 32bit also, and has checks so that it can work correctly with 32bit
binaries on 64bit hardware which is why I think this works.

I tested this on qemu using the virtex-ml507 machine type.

Before,

/bin2 # ./test & cat /proc/${!}/maps
00100000-00103000 r-xp 00000000 00:00 0          [vdso]
10000000-10007000 r-xp 00000000 00:01 454        /bin2/test
10017000-10018000 rw-p 00007000 00:01 454        /bin2/test
48000000-48020000 r-xp 00000000 00:01 224        /lib/ld-2.11.3.so
48021000-48023000 rw-p 00021000 00:01 224        /lib/ld-2.11.3.so
bfd03000-bfd24000 rw-p 00000000 00:00 0          [stack]
/bin2 # ./test & cat /proc/${!}/maps
00100000-00103000 r-xp 00000000 00:00 0          [vdso]
0fe6e000-0ffd8000 r-xp 00000000 00:01 214        /lib/libc-2.11.3.so
0ffd8000-0ffe8000 ---p 0016a000 00:01 214        /lib/libc-2.11.3.so
0ffe8000-0ffed000 rw-p 0016a000 00:01 214        /lib/libc-2.11.3.so
0ffed000-0fff0000 rw-p 00000000 00:00 0
10000000-10007000 r-xp 00000000 00:01 454        /bin2/test
10017000-10018000 rw-p 00007000 00:01 454        /bin2/test
48000000-48020000 r-xp 00000000 00:01 224        /lib/ld-2.11.3.so
48020000-48021000 rw-p 00000000 00:00 0
48021000-48023000 rw-p 00021000 00:01 224        /lib/ld-2.11.3.so
bf98a000-bf9ab000 rw-p 00000000 00:00 0          [stack]
/bin2 # ./test & cat /proc/${!}/maps
00100000-00103000 r-xp 00000000 00:00 0          [vdso]
0fe6e000-0ffd8000 r-xp 00000000 00:01 214        /lib/libc-2.11.3.so
0ffd8000-0ffe8000 ---p 0016a000 00:01 214        /lib/libc-2.11.3.so
0ffe8000-0ffed000 rw-p 0016a000 00:01 214        /lib/libc-2.11.3.so
0ffed000-0fff0000 rw-p 00000000 00:00 0
10000000-10007000 r-xp 00000000 00:01 454        /bin2/test
10017000-10018000 rw-p 00007000 00:01 454        /bin2/test
48000000-48020000 r-xp 00000000 00:01 224        /lib/ld-2.11.3.so
48020000-48021000 rw-p 00000000 00:00 0
48021000-48023000 rw-p 00021000 00:01 224        /lib/ld-2.11.3.so
bfa54000-bfa75000 rw-p 00000000 00:00 0          [stack]

After,

bash-4.1# ./test & cat /proc/${!}/maps
[7] 803
00100000-00103000 r-xp 00000000 00:00 0          [vdso]
10000000-10007000 r-xp 00000000 00:01 454        /bin2/test
10017000-10018000 rw-p 00007000 00:01 454        /bin2/test
b7eb0000-b7ed0000 r-xp 00000000 00:01 224        /lib/ld-2.11.3.so
b7ed1000-b7ed3000 rw-p 00021000 00:01 224        /lib/ld-2.11.3.so
bfbc0000-bfbe1000 rw-p 00000000 00:00 0          [stack]
bash-4.1# ./test & cat /proc/${!}/maps
[8] 805
00100000-00103000 r-xp 00000000 00:00 0          [vdso]
10000000-10007000 r-xp 00000000 00:01 454        /bin2/test
10017000-10018000 rw-p 00007000 00:01 454        /bin2/test
b7b03000-b7b23000 r-xp 00000000 00:01 224        /lib/ld-2.11.3.so
b7b24000-b7b26000 rw-p 00021000 00:01 224        /lib/ld-2.11.3.so
bfc27000-bfc48000 rw-p 00000000 00:00 0          [stack]
bash-4.1# ./test & cat /proc/${!}/maps
[9] 807
00100000-00103000 r-xp 00000000 00:00 0          [vdso]
10000000-10007000 r-xp 00000000 00:01 454        /bin2/test
10017000-10018000 rw-p 00007000 00:01 454        /bin2/test
b7f37000-b7f57000 r-xp 00000000 00:01 224        /lib/ld-2.11.3.so
b7f58000-b7f5a000 rw-p 00021000 00:01 224        /lib/ld-2.11.3.so
bff96000-bffb7000 rw-p 00000000 00:00 0          [stack]

Signed-off-by: Daniel Walker <dwalker@fifo90.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc: Remove the unneeded trigger of decrementer interrupt in decrementer_check_ov...
Kevin Hao [Wed, 17 Apr 2013 09:50:35 +0000 (17:50 +0800)]
powerpc: Remove the unneeded trigger of decrementer interrupt in decrementer_check_overflow

Previously in order to handle the edge sensitive decrementers,
we choose to set the decrementer to 1 to trigger a decrementer
interrupt when re-enabling interrupts. But with the rework of the
lazy EE, we would replay the decrementer interrupt when re-enabling
interrupts if a decrementer interrupt occurs with irq soft-disabled.
So there is no need to trigger a decrementer interrupt in this case
any more.

Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/mm/nohash: Ignore NULL stale_map entries
Scott Wood [Thu, 21 Mar 2013 00:06:12 +0000 (19:06 -0500)]
powerpc/mm/nohash: Ignore NULL stale_map entries

This happens with threads that are offline due to CPU hotplug
(including threads that were never "plugged in" to begin with because
SMT is disabled).

Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc: Move the single step enable code to a generic path
Suzuki K. Poulose [Mon, 3 Dec 2012 15:08:37 +0000 (20:38 +0530)]
powerpc: Move the single step enable code to a generic path

This patch moves the single step enable code used by kprobe to a generic
routine header so that, it can be re-used by other code, in this case,
uprobes. No functional changes.

Signed-off-by: Suzuki K. Poulose <suzuki@in.ibm.com>
Cc: Ananth N Mavinakaynahalli <ananth@in.ibm.com>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: linuxppc-dev@ozlabs.org
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc/kprobes: Do not disable External interrupts during single step
Suzuki K. Poulose [Mon, 3 Dec 2012 15:07:42 +0000 (20:37 +0530)]
powerpc/kprobes: Do not disable External interrupts during single step

External/Decrement exceptions have lower priority than the Debug Exception.
So, we don't have to disable the External interrupts before a single step.
However, on BookE, Critical Input Exception(CE) has higher priority than a
Debug Exception. Hence we mask them.

Signed-off-by: Suzuki K. Poulose <suzuki@in.ibm.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Ananth N Mavinakaynahalli <ananth@in.ibm.com>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: linuxppc-dev@ozlabs.org
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agopowerpc: Mark low level irq handlers NO_THREAD
Thomas Gleixner [Wed, 13 Feb 2013 22:38:51 +0000 (23:38 +0100)]
powerpc: Mark low level irq handlers NO_THREAD

These low level handlers cannot be threaded. Mark them NO_THREAD

Reported-by: leroy christophe <christophe.leroy@c-s.fr>
Tested-by: leroy christophe <christophe.leroy@c-s.fr>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agomm/THP: deposit the transpare huge pgtable before set_pmd
Aneesh Kumar K.V [Thu, 6 Jun 2013 00:14:06 +0000 (17:14 -0700)]
mm/THP: deposit the transpare huge pgtable before set_pmd

Architectures like powerpc use the deposited pgtable to store hash index
values.  We need to make the deposted pgtable is visible to other cpus
before we are ready to take a hash fault.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agomm/THP: don't use HPAGE_SHIFT in transparent hugepage code
Aneesh Kumar K.V [Thu, 6 Jun 2013 00:14:05 +0000 (17:14 -0700)]
mm/THP: don't use HPAGE_SHIFT in transparent hugepage code

For architectures like powerpc that support multiple explicit hugepage
sizes, HPAGE_SHIFT indicate the default explicit hugepage shift.  For THP
to work the hugepage size should be same as PMD_SIZE.  So use PMD_SHIFT
directly.  So move the define outside CONFIG_TRANSPARENT_HUGEPAGE #ifdef
because we want to use these defines in generic code with if
(pmd_trans_huge()) conditional.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agomm/THP: withdraw the pgtable after pmdp related operations
Aneesh Kumar K.V [Thu, 6 Jun 2013 00:14:04 +0000 (17:14 -0700)]
mm/THP: withdraw the pgtable after pmdp related operations

For architectures like ppc64 we look at deposited pgtable when calling
pmdp_get_and_clear.  So do the pgtable_trans_huge_withdraw after finishing
pmdp related operations.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agomm/THP: add pmd args to pgtable deposit and withdraw APIs
Aneesh Kumar K.V [Thu, 6 Jun 2013 00:14:02 +0000 (17:14 -0700)]
mm/THP: add pmd args to pgtable deposit and withdraw APIs

This will be later used by powerpc THP support.  In powerpc we want to use
pgtable for storing the hash index values.  So instead of adding them to
mm_context list, we would like to store them in the second half of pmd

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agomm/thp: use the correct function when updating access flags
Aneesh Kumar K.V [Thu, 6 Jun 2013 07:20:34 +0000 (00:20 -0700)]
mm/thp: use the correct function when updating access flags

We should use pmdp_set_access_flags to update access flags.  Archs like
powerpc use extra checks(_PAGE_BUSY) when updating a hugepage PTE.  A
set_pmd_at doesn't do those checks.  We should use set_pmd_at only when
updating a none hugepage PTE.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>a
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
10 years agoMerge tag 'acpi-3.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael...
Linus Torvalds [Thu, 13 Jun 2013 20:09:50 +0000 (13:09 -0700)]
Merge tag 'acpi-3.10-rc6' of git://git./linux/kernel/git/rafael/linux-pm

Pull ACPI fix from Rafael Wysocki:
 "This is an alternative fix for the regression introduced in 3.9 whose
  previous fix had to be reverted right before 3.10-rc5, because it
  broke one of the Tony's machines.

  In this one the check is confined to the ACPI video driver (which is
  the only one causing the problem to happen in the first place) and the
  Tony's box shouldn't even notice it.

   - ACPI fix for an issue causing ACPI video driver to attempt to bind
     to devices it shouldn't touch from Rafael J Wysocki."

* tag 'acpi-3.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI / video: Do not bind to device objects with a scan handler

10 years agoMerge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Thu, 13 Jun 2013 20:08:51 +0000 (13:08 -0700)]
Merge branch 'x86-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Pull x86 fixes from Peter Anvin:
 "Another set of fixes, the biggest bit of this is yet another tweak to
  the UEFI anti-bricking code; apparently we finally got some feedback
  from Samsung as to what makes at least their systems fail.  This set
  should actually fix the boot regressions that some other systems (e.g.
  SGI) have exhibited.

  Other than that, there is a patch to avoid a panic with particularly
  unhappy memory layouts and two minor protocol fixes which may or may
  not be manifest bugs"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86: Fix typo in kexec register clearing
  x86, relocs: Move __vvar_page from S_ABS to S_REL
  Modify UEFI anti-bricking code
  x86: Fix adjust_range_size_mask calling position

10 years agoMerge branch 'rcu/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck...
Linus Torvalds [Thu, 13 Jun 2013 19:36:42 +0000 (12:36 -0700)]
Merge branch 'rcu/urgent' of git://git./linux/kernel/git/paulmck/linux-rcu

Pull RCU fixes from Paul McKenney:
 "I must confess that this past merge window was not RCU's best showing.
  This series contains three more fixes for RCU regressions:

   1.   A fix to __DECLARE_TRACE_RCU() that causes it to act as an
        interrupt from idle rather than as a task switch from idle.
        This change is needed due to the recent use of _rcuidle()
        tracepoints that can be invoked from interrupt handlers as well
        as from idle.  Without this fix, invoking _rcuidle() tracepoints
        from interrupt handlers results in splats and (more seriously)
        confusion on RCU's part as to whether a given CPU is idle or not.
        This confusion can in turn result in too-short grace periods and
        therefore random memory corruption.

   2.   A fix to a subtle deadlock that could result due to RCU doing
        a wakeup while holding one of its rcu_node structure's locks.
        Although the probability of occurrence is low, it really
        does happen.  The fix, courtesy of Steven Rostedt, uses
        irq_work_queue() to avoid the deadlock.

   3.   A fix to a silent deadlock (invisible to lockdep) due to the
        interaction of timeouts posted by RCU debug code enabled by
        CONFIG_PROVE_RCU_DELAY=y, grace-period initialization, and CPU
        hotplug operations.  This will not occur in production kernels,
        but really does occur in randconfig testing.  Diagnosis courtesy
        of Steven Rostedt"

* 'rcu/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
  rcu: Fix deadlock with CPU hotplug, RCU GP init, and timer migration
  rcu: Don't call wakeup() with rcu_node structure ->lock held
  trace: Allow idle-safe tracepoints to be called from irq

10 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Linus Torvalds [Thu, 13 Jun 2013 18:02:31 +0000 (11:02 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/s390/linux

Pull s390 fixes from Martin Schwidefsky:
 "Three kvm related memory management fixes, a fix for show_trace, a fix
  for early console output and a patch from Ben to help prevent compile
  errors in regard to irq functions (or our lack thereof)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
  s390/pci: Implement IRQ functions if !PCI
  s390/sclp: fix new line detection
  s390/pgtable: make pgste lock an explicit barrier
  s390/pgtable: Save pgste during modify_prot_start/commit
  s390/dumpstack: fix address ranges for asynchronous and panic stack
  s390/pgtable: Fix guest overindication for change bit

10 years agoMerge tag 'asoc-v3.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie...
Linus Torvalds [Thu, 13 Jun 2013 17:18:33 +0000 (10:18 -0700)]
Merge tag 'asoc-v3.10-rc5' of git://git./linux/kernel/git/broonie/sound

Pull ASoC sound updates from Mark Brown:
 "Takashi is travelling at the minute and it'd be good to get the
  MAINTAINERS update in here merged so sending directly.

  As well as the usual driver specifics we've got a couple of core fixes
  here, one fixing capabilities for unidirectional streams and the other
  fixing suspend while audio streams are active.

  The suspend fix is a little involved but mostly as a result of
  removing some special casing that was doing the wrong thing."

* tag 'asoc-v3.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound:
  ASoC: tlv320aic3x: Remove deadlock from snd_soc_dapm_put_volsw_aic3x()
  ASoC: dapm: Treat DAI widgets like AIF widgets for power
  ASoC: arizona: Correct AEC loopback enable
  ASoC: pcm: Require both CODEC and CPU support when declaring stream caps
  MAINTAINERS: Remove myself from Wolfson maintainers
  ASoC: wm8994: Ensure microphone detection state is reset on removal
  ASoC: wm8994: Avoid leaking pm_runtime reference on removed jack race
  ASoC: cs42l52: fix hp_gain_enum shift value.
  ASoC: cs42l52: use correct PCM mixer TLV dB scale to match datasheet.

10 years agoMerge tag 'md-3.10-fixes' of git://neil.brown.name/md
Linus Torvalds [Thu, 13 Jun 2013 17:13:29 +0000 (10:13 -0700)]
Merge tag 'md-3.10-fixes' of git://neil.brown.name/md

Pull md bugfixes from Neil Brown:
 "A few bugfixes for md

  Some tagged for -stable"

* tag 'md-3.10-fixes' of git://neil.brown.name/md:
  md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in place
  md/raid1,raid10: use freeze_array in place of raise_barrier in various places.
  md/raid1: consider WRITE as successful only if at least one non-Faulty and non-rebuilding drive completed it.
  md: md_stop_writes() should always freeze recovery.

10 years agoturbostat: Increase output buffer size to accommodate C8-C10
Josh Triplett [Thu, 13 Jun 2013 00:26:37 +0000 (17:26 -0700)]
turbostat: Increase output buffer size to accommodate C8-C10

On platforms with C8-C10 support, the additional C-states cause
turbostat to overrun its output buffer of 128 bytes per CPU.  Increase
this to 256 bytes per CPU.

[ As a bugfix, this should go into 3.10; however, since the C8-C10
  support didn't go in until after 3.9, this need not go into any stable
  kernel. ]

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
10 years agoMerge tag 'efi-urgent' into x86/urgent
H. Peter Anvin [Thu, 13 Jun 2013 15:59:23 +0000 (08:59 -0700)]
Merge tag 'efi-urgent' into x86/urgent

 * More tweaking to the EFI variable anti-bricking algorithm. Quite a
   few users were reporting boot regressions in v3.9. This has now been
   fixed with a more accurate "minimum storage requirement to avoid
   bricking" value from Samsung (5K instead of 50%) and code to trigger
   garbage collection when we near our limit - Matthew Garrett.

Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
10 years agomd/raid1,5,10: Disable WRITE SAME until a recovery strategy is in place
H. Peter Anvin [Wed, 12 Jun 2013 14:37:43 +0000 (07:37 -0700)]
md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in place

There are cases where the kernel will believe that the WRITE SAME
command is supported by a block device which does not, in fact,
support WRITE SAME.  This currently happens for SATA drivers behind a
SAS controller, but there are probably a hundred other ways that can
happen, including drive firmware bugs.

After receiving an error for WRITE SAME the block layer will retry the
request as a plain write of zeroes, but mdraid will consider the
failure as fatal and consider the drive failed.  This has the effect
that all the mirrors containing a specific set of data are each
offlined in very rapid succession resulting in data loss.

However, just bouncing the request back up to the block layer isn't
ideal either, because the whole initial request-retry sequence should
be inside the write bitmap fence, which probably means that md needs
to do its own conversion of WRITE SAME to write zero.

Until the failure scenario has been sorted out, disable WRITE SAME for
raid1, raid5, and raid10.

[neilb: added raid5]

This patch is appropriate for any -stable since 3.7 when write_same
support was added.

Cc: stable@vger.kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
10 years agomd/raid1,raid10: use freeze_array in place of raise_barrier in various places.
NeilBrown [Wed, 12 Jun 2013 01:01:22 +0000 (11:01 +1000)]
md/raid1,raid10: use freeze_array in place of raise_barrier in various places.

Various places in raid1 and raid10 are calling raise_barrier when they
really should call freeze_array.
The former is only intended to be called from "make_request".
The later has extra checks for 'nr_queued' and makes a call to
flush_pending_writes(), so it is safe to call it from within the
management thread.

Using raise_barrier will sometimes deadlock.  Using freeze_array
should not.

As 'freeze_array' currently expects one request to be pending (in
handle_read_error - the only previous caller), we need to pass
it the number of pending requests (extra) to ignore.

The deadlock was made particularly noticeable by commits
050b66152f87c7 (raid10) and 6b740b8d79252f13 (raid1) which
appeared in 3.4, so the fix is appropriate for any -stable
kernel since then.

This patch probably won't apply directly to some early kernels and
will need to be applied by hand.

Cc: stable@vger.kernel.org
Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
10 years agomd/raid1: consider WRITE as successful only if at least one non-Faulty and non-rebuil...
Alex Lyakas [Tue, 4 Jun 2013 17:42:21 +0000 (20:42 +0300)]
md/raid1: consider WRITE as successful only if at least one non-Faulty and non-rebuilding drive completed it.

Without that fix, the following scenario could happen:

- RAID1 with drives A and B; drive B was freshly-added and is rebuilding
- Drive A fails
- WRITE request arrives to the array. It is failed by drive A, so
r1_bio is marked as R1BIO_WriteError, but the rebuilding drive B
succeeds in writing it, so the same r1_bio is marked as
R1BIO_Uptodate.
- r1_bio arrives to handle_write_finished, badblocks are disabled,
md_error()->error() does nothing because we don't fail the last drive
of raid1
- raid_end_bio_io()  calls call_bio_endio()
- As a result, in call_bio_endio():
        if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
                clear_bit(BIO_UPTODATE, &bio->bi_flags);
this code doesn't clear the BIO_UPTODATE flag, and the whole master
WRITE succeeds, back to the upper layer.

So we returned success to the upper layer, even though we had written
the data onto the rebuilding drive only. But when we want to read the
data back, we would not read from the rebuilding drive, so this data
is lost.

[neilb - applied identical change to raid10 as well]

This bug can result in lost data, so it is suitable for any
-stable kernel.

Cc: stable@vger.kernel.org
Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
Signed-off-by: NeilBrown <neilb@suse.de>
10 years agomd: md_stop_writes() should always freeze recovery.
NeilBrown [Wed, 8 May 2013 23:48:30 +0000 (09:48 +1000)]
md: md_stop_writes() should always freeze recovery.

__md_stop_writes() will currently sometimes freeze recovery.
So any caller must be ready for that to happen, and indeed they are.

However if __md_stop_writes() doesn't freeze_recovery, then
a recovery could start before mddev_suspend() is called, which
could be awkward.  This can particularly cause problems or dm-raid.

So change __md_stop_writes() to always freeze recovery.  This is safe
and more predicatable.

Reported-by: Brassow Jonathan <jbrassow@redhat.com>
Tested-by: Brassow Jonathan <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
10 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Linus Torvalds [Thu, 13 Jun 2013 00:18:29 +0000 (17:18 -0700)]
Merge git://git./linux/kernel/git/davem/net

Pull networking update from David Miller:

 1) Fix dump iterator in nfnl_acct_dump() and ctnl_timeout_dump() to
    dump all objects properly, from Pablo Neira Ayuso.

 2) xt_TCPMSS must use the default MSS of 536 when no MSS TCP option is
    present.  Fix from Phil Oester.

 3) qdisc_get_rtab() looks for an existing matching rate table and uses
    that instead of creating a new one.  However, it's key matching is
    incomplete, it fails to check to make sure the ->data[] array is
    identical too.  Fix from Eric Dumazet.

 4) ip_vs_dest_entry isn't fully initialized before copying back to
    userspace, fix from Dan Carpenter.

 5) Fix ubuf reference counting regression in vhost_net, from Jason
    Wang.

 6) When sock_diag dumps a socket filter back to userspace, we have to
    translate it out of the kernel's internal representation first.
    From Nicolas Dichtel.

 7) davinci_mdio holds a spinlock while calling pm_runtime, which
    sleeps.  Fix from Sebastian Siewior.

 8) Timeout check in sh_eth_check_reset is off by one, from Sergei
    Shtylyov.

 9) If sctp socket init fails, we can NULL deref during cleanup.  Fix
    from Daniel Borkmann.

10) netlink_mmap() does not propagate errors properly, from Patrick
    McHardy.

11) Disable powersave and use minstrel by default in ath9k.  From Sujith
    Manoharan.

12) Fix a regression in that SOCK_ZEROCOPY is not set on tuntap sockets
    which prevents vhost from being able to use zerocopy.  From Jason
    Wang.

13) Fix race between port lookup and TX path in team driver, from Jiri
    Pirko.

14) Missing length checks in bluetooth L2CAP packet parsing, from Johan
    Hedberg.

15) rtlwifi fails to connect to networking using any encryption method
    other than WPA2.  Fix from Larry Finger.

16) Fix iwlegacy build due to incorrect CONFIG_* ifdeffing for power
    management stuff.  From Yijing Wang.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (35 commits)
  b43: stop format string leaking into error msgs
  ath9k: Use minstrel rate control by default
  Revert "ath9k_hw: Update rx gain initval to improve rx sensitivity"
  ath9k: Disable PowerSave by default
  net: wireless: iwlegacy: fix build error for il_pm_ops
  rtlwifi: Fix a false leak indication for PCI devices
  wl12xx/wl18xx: scan all 5ghz channels
  wl12xx: increase minimum singlerole firmware version required
  wl12xx: fix minimum required firmware version for wl127x multirole
  rtlwifi: rtl8192cu: Fix problem in connecting to WEP or WPA(1) networks
  mwifiex: debugfs: Fix out of bounds array access
  Bluetooth: Fix mgmt handling of power on failures
  Bluetooth: Fix missing length checks for L2CAP signalling PDUs
  Bluetooth: btmrvl: support Marvell Bluetooth device SD8897
  Bluetooth: Fix checks for LE support on LE-only controllers
  team: fix checks in team_get_first_port_txable_rcu()
  team: move add to port list before port enablement
  team: check return value of team_get_port_by_index_rcu() for NULL
  tuntap: set SOCK_ZEROCOPY flag during open
  netlink: fix error propagation in netlink_mmap()
  ...

10 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid
Linus Torvalds [Thu, 13 Jun 2013 00:08:49 +0000 (17:08 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/jikos/hid

Pull input layer bugfix from Jiri Kosina:
 "Memory leak regression fix from Benjamin Tissoires"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid:
  HID: multitouch: prevent memleak with the allocated name