ovn: Implement basic logical L3 routing.
author Ben Pfaff <blp@nicira.com>
Sat, 17 Oct 2015 06:43:58 +0000 (23:43 -0700)
committer Ben Pfaff <blp@nicira.com>
Sat, 17 Oct 2015 06:52:41 +0000 (23:52 -0700)
This implements basic logical L3 routing.  It has a lot of caveats,
including the following regarding testing:

   * Only single-router hops have been tested.  Chains or trees of
     logical routers may work but definitely need testing and may
     need a little extra code.

   * No testing of logical router ARP replies.

   * Not enough testing in general.

ovn/TODO describes a lot of other caveats in terms of the work needed
to fix them.

Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Justin Pettit <jpettit@nicira.com>
ovn/TODO
ovn/northd/ovn-northd.8.xml
ovn/northd/ovn-northd.c
ovn/ovn-sb.xml
tests/ovn.at

index 3119169..10c3adf 100644
--- a/ovn/TODO
+++ b/ovn/TODO
@@ -47,12 +47,6 @@ various ways to ensure it could be implemented, e.g. the same as for
 OpenFlow by allowing the logical inport to be zeroed, or by
 introducing a new action that ignores the inport.
 
-** ovn-northd
-
-*** What flows should it generate?
-
-See description in ovn-northd(8).
-
 ** New OVN logical actions
 
 *** arp
@@ -166,13 +160,7 @@ userspace-only and no one has complained yet.)
 
 *** ICMPv6
 
-** IP to MAC binding
-
-Somehow it has to be possible for an L3 logical router to map from an
-IP address to an Ethernet address.  This can happen statically or
-dynamically.  Probably both cases need to be supported eventually.
-
-*** Dynamic IP to MAC bindings
+** Dynamic IP to MAC bindings
 
 Some bindings from IP address to MAC will undoubtedly need to be
 discovered dynamically through ARP requests.  It's straightforward
@@ -193,32 +181,32 @@ place for this in the OVN_Southbound database.
 
 Details need to be worked out, including:
 
-**** OVN_Southbound schema changes.
+*** OVN_Southbound schema changes.
 
 Possibly bindings could be added to the Port_Binding table by adding
 or modifying columns.  Another possibility is that another table
 should be added.
 
-**** Logical_Flow representation
+*** Logical_Flow representation
 
 It would be really nice to maintain the general-purpose nature of
 logical flows, but these bindings might have to include some
 hard-coded special cases, especially when it comes to the relationship
 with populating the bindings into the OVN_Southbound table.
 
-**** Tracking queries
+*** Tracking queries
 
 It's probably best to only record in the database responses to queries
 actually issued by an L3 logical router, so somehow they have to be
 tracked, probably by putting a tentative binding without a MAC address
 into the database.
 
-**** Renewal and expiration.
+*** Renewal and expiration.
 
 Something needs to make sure that bindings remain valid and expire
 those that become stale.
 
-*** MTU handling (fragmentation on output)
+** MTU handling (fragmentation on output)
 
 ** Ratelimiting.
 
index f51852e..c5760a5 100644
--- a/ovn/northd/ovn-northd.8.xml
+++ b/ovn/northd/ovn-northd.8.xml
       One of the main purposes of <code>ovn-northd</code> is to populate the
       <code>Logical_Flow</code> table in the <code>OVN_Southbound</code>
       database.  This section describes how <code>ovn-northd</code> does this
-      for logical datapaths.
+      for switch and router logical datapaths.
     </p>
 
-    <h2>Ingress Table 0: Admission Control and Ingress Port Security</h2>
+    <h2>Logical Switch Datapaths</h2>
+
+    <h3>Ingress Table 0: Admission Control and Ingress Port Security</h3>
 
     <p>
       Ingress table 0 contains these logical flows:
       be dropped.
     </p>
 
-    <h2>Ingress Table 1: <code>from-lport</code> Pre-ACLs</h2>
+    <h3>Ingress Table 1: <code>from-lport</code> Pre-ACLs</h3>
 
     <p>
       Ingress table 1 prepares flows for possible stateful ACL processing
       the connection tracker before advancing to table 2.
     </p>
 
-    <h2>Ingress table 2: <code>from-lport</code> ACLs</h2>
+    <h3>Ingress table 2: <code>from-lport</code> ACLs</h3>
 
     <p>
       Logical flows in this table closely reproduce those in the
       </li>
     </ul>
 
-    <h2>Ingress Table 3: Destination Lookup</h2>
+    <h3>Ingress Table 3: Destination Lookup</h3>
 
     <p>
       This table implements switching behavior.  It contains these logical
       </li>
     </ul>
 
-    <h2>Egress Table 0: <code>to-lport</code> Pre-ACLs</h2>
+    <h3>Egress Table 0: <code>to-lport</code> Pre-ACLs</h3>
 
     <p>
       This is similar to ingress table 1 except for <code>to-lport</code>
       traffic.
     </p>
 
-    <h2>Egress Table 1: <code>to-lport</code> ACLs</h2>
+    <h3>Egress Table 1: <code>to-lport</code> ACLs</h3>
 
     <p>
       This is similar to ingress table 2 except for <code>to-lport</code> ACLs.
     </p>
 
-    <h2>Egress Table 2: Egress Port Security</h2>
+    <h3>Egress Table 2: Egress Port Security</h3>
 
     <p>
       This is similar to the ingress port security logic in ingress table 0,
       <code>eth.src</code>.  Second, packets directed to broadcast or multicast
       <code>eth.dst</code> are always accepted instead of being subject to the
       port security rules; this is implemented through a priority-100 flow that
-      matches on <code>eth.dst[40]</code> with action <code>output;</code>.
+      matches on <code>eth.mcast</code> with action <code>output;</code>.
       Finally, to ensure that even broadcast and multicast packets are not
       delivered to disabled logical ports, a priority-150 flow for each
       disabled logical <code>outport</code> overrides the priority-100 flow
       with a <code>drop;</code> action.
     </p>
+
+    <h2>Logical Router Datapaths</h2>
+
+    <h3>Ingress Table 0: L2 Admission Control</h3>
+
+    <p>
+      This table drops packets that the router shouldn't see at all based on
+      their Ethernet headers.  It contains the following flows:
+    </p>
+
+    <ul>
+      <li>
+        Priority-100 flows to drop packets with VLAN tags or multicast Ethernet
+        source addresses.
+      </li>
+
+      <li>
+        For each enabled router port <var>P</var> with Ethernet address
+        <var>E</var>, a priority-50 flow that matches <code>inport ==
+        <var>P</var> &amp;&amp; (eth.mcast || eth.dst ==
+        <var>E</var>)</code>, with action <code>next;</code>.
+      </li>
+    </ul>
+
+    <p>
+      Other packets are implicitly dropped.
+    </p>
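
For a concrete picture of how this table is populated, here is a condensed sketch mirroring the generator code this patch adds to build_lrouter_flows() in ovn-northd.c; ovn_lflow_add(), struct ovn_port, and the ETH_ADDR_* macros are the patch's own helpers, while the function name is only illustrative:

    /* Sketch: one priority-50 admission flow per enabled router port P with
     * Ethernet address E, as documented above. */
    static void
    sketch_l2_admission_flow(struct hmap *lflows, struct ovn_port *op)
    {
        char *match = xasprintf(
            "(eth.mcast || eth.dst == "ETH_ADDR_FMT") && inport == %s",
            ETH_ADDR_ARGS(op->mac), op->json_key);
        ovn_lflow_add(lflows, op->od, S_ROUTER_IN_ADMISSION, 50,
                      match, "next;");
        free(match);
    }
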
+
+    <h3>Ingress Table 1: IP Input</h3>
+
+    <p>
+      This table is the core of the logical router datapath functionality.  It
+      contains the following flows to implement very basic IP host
+      functionality.
+    </p>
+
+    <ul>
+      <li>
+        <p>
+          L3 admission control: A priority-100 flow drops packets that match
+          any of the following:
+        </p>
+
+        <ul>
+          <li>
+            <code>ip4.src[28..31] == 0xe</code> (multicast source)
+          </li>
+          <li>
+            <code>ip4.src == 255.255.255.255</code> (broadcast source)
+          </li>
+          <li>
+            <code>ip4.src == 127.0.0.0/8 || ip4.dst == 127.0.0.0/8</code>
+            (localhost source or destination)
+          </li>
+          <li>
+            <code>ip4.src == 0.0.0.0/8 || ip4.dst == 0.0.0.0/8</code> (zero
+            network source or destination)
+          </li>
+          <li>
+            <code>ip4.src</code> is any IP address owned by the router.
+          </li>
+          <li>
+            <code>ip4.src</code> is the broadcast address of any IP network
+            known to the router.
+          </li>
+        </ul>
+      </li>
+
+      <li>
+        <p>
+          ICMP echo reply.  These flows reply to ICMP echo requests received
+          for the router's IP address.  Let <var>A</var> be an IP address or
+          broadcast address owned by a router port.  Then, for each
+          <var>A</var>, a priority-90 flow matches on <code>ip4.dst ==
+          <var>A</var></code> and <code>icmp4.type == 8 &amp;&amp; icmp4.code
+          == 0</code> (ICMP echo request).  These flows use the following
+          actions where, if <var>A</var> is unicast, then <var>S</var> is
+          <var>A</var>, and if <var>A</var> is broadcast, <var>S</var> is the
+          router's IP address in <var>A</var>'s network:
+        </p>
+
+        <pre>
+ip4.dst = ip4.src;
+ip4.src = <var>S</var>;
+ip4.ttl = 255;
+icmp4.type = 0;
+next;
+        </pre>
+
+        <p>
+          Similar flows match on <code>ip4.dst == 255.255.255.255</code> and
+          each individual <code>inport</code>, and use the same actions in
+          which <var>S</var> is a function of <code>inport</code>.
+        </p>
+
+        <p>
+          Not yet implemented; see the sketch after this list.
+        </p>
+      </li>
+
+      <li>
+        <p>
+          ARP reply.  These flows reply to ARP requests for the router's own IP
+          address.  For each router port <var>P</var> that owns IP address
+          <var>A</var> and Ethernet address <var>E</var>, a priority-90 flow
+          matches <code>inport == <var>P</var> &amp;&amp; arp.tpa ==
+          <var>A</var> &amp;&amp; arp.op == 1</code> (ARP request) with the
+          following actions:
+        </p>
+
+        <pre>
+eth.dst = eth.src;
+eth.src = <var>E</var>;
+arp.op = 2; /* ARP reply. */
+arp.tha = arp.sha;
+arp.sha = <var>E</var>;
+arp.tpa = arp.spa;
+arp.spa = <var>A</var>;
+outport = <var>P</var>;
+inport = 0; /* Allow sending out inport. */
+output;
+        </pre>
+      </li>
+
+      <li>
+        <p>
+          UDP port unreachable.  Priority-80 flows generate ICMP port
+          unreachable messages in reply to UDP datagrams directed to the
+          router's IP address.  The logical router doesn't accept any UDP
+          traffic so it always generates such a reply.
+        </p>
+
+        <p>
+          These flows should not match IP fragments with nonzero offset.
+        </p>
+
+        <p>
+          Details TBD.  Not yet implemented.
+        </p>
+      </li>
+
+      <li>
+        <p>
+          TCP reset.  Priority-80 flows generate TCP reset messages in reply to
+          TCP segments directed to the router's IP address.  The logical
+          router doesn't accept any TCP traffic so it always generates such a
+          reply.
+        </p>
+
+        <p>
+          These flows should not match IP fragments with nonzero offset.
+        </p>
+
+        <p>
+          Details TBD.  Not yet implemented.
+        </p>
+      </li>
+
+      <li>
+        <p>
+          Protocol unreachable.  Priority-70 flows generate ICMP protocol
+          unreachable messages in reply to packets directed to the router's IP
+          address on IP protocols other than UDP, TCP, and ICMP.
+        </p>
+
+        <p>
+          These flows should not match IP fragments with nonzero offset.
+        </p>
+
+        <p>
+          Details TBD.  Not yet implemented.
+        </p>
+      </li>
+
+      <li>
+        Drop other IP traffic to this router.  These flows drop any other
+        traffic destined to an IP address of this router that is not already
+        handled by one of the flows above, which amounts to ICMP (other than
+        echo requests) and fragments with nonzero offsets.  For each IP address
+        <var>A</var> owned by the router, a priority-60 flow matches
+        <code>ip4.dst == <var>A</var></code> and drops the traffic.
+      </li>
+    </ul>
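
The ICMP echo reply flows described above are not yet implemented.  As a rough sketch only, limited to a port's own unicast address and reusing the helpers this patch adds to ovn-northd.c (the function name is hypothetical), they might be generated along these lines:

    /* Hypothetical sketch: ICMP echo reply flow for a router port's own
     * unicast address; the broadcast and per-inport variants described in
     * the table above are omitted. */
    static void
    sketch_icmp_echo_reply(struct hmap *lflows, struct ovn_port *op)
    {
        char *match = xasprintf(
            "ip4.dst == "IP_FMT" && icmp4.type == 8 && icmp4.code == 0",
            IP_ARGS(op->ip));
        char *actions = xasprintf(
            "ip4.dst = ip4.src; "
            "ip4.src = "IP_FMT"; "
            "ip4.ttl = 255; "
            "icmp4.type = 0; "
            "next;",
            IP_ARGS(op->ip));
        ovn_lflow_add(lflows, op->od, S_ROUTER_IN_IP_INPUT, 90,
                      match, actions);
        free(actions);
        free(match);
    }
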
+
+    <p>
+      The flows above handle all of the traffic that might be directed to the
+      router itself.  The following flows (with lower priorities) handle the
+      remaining traffic, potentially for forwarding:
+    </p>
+
+    <ul>
+      <li>
+        Drop Ethernet local broadcast.  A priority-50 flow with match
+        <code>eth.bcast</code> drops traffic destined to the local Ethernet
+        broadcast address.  By definition this traffic should not be forwarded.
+      </li>
+
+      <li>
+        Drop IP multicast.  A priority-50 flow with match
+        <code>ip4.mcast</code> drops IP multicast traffic.
+      </li>
+
+      <li>
+        <p>
+          ICMP time exceeded.  For each router port <var>P</var>, whose IP
+          address is <var>A</var>, a priority-40 flow with match <code>inport
+          == <var>P</var> &amp;&amp; ip.ttl == {0, 1} &amp;&amp;
+          !ip.later_frag</code> matches packets whose TTL has expired, with the
+          following actions to send an ICMP time exceeded reply:
+        </p>
+
+        <pre>
+icmp4 {
+    icmp4.type = 11; /* Time exceeded. */
+    icmp4.code = 0;  /* TTL exceeded in transit. */
+    ip4.dst = ip4.src;
+    ip4.src = <var>A</var>;
+    ip4.ttl = 255;
+    next;
+};
+        </pre>
+
+        <p>
+          Not yet implemented; see the sketch after this list.
+        </p>
+      </li>
+
+      <li>
+        TTL discard.  A priority-30 flow with match <code>ip.ttl == {0,
+        1}</code> and action <code>drop;</code> drops other packets whose TTL
+        has expired and that should not receive an ICMP error reply (i.e.
+        fragments with nonzero offset).
+      </li>
+
+      <li>
+        Next table.  A priority-0 flow matches all packets that aren't already
+        handled and uses action <code>next;</code> to advance them to the next
+        table for routing.
+      </li>
+    </ul>
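
The ICMP time exceeded flows above are likewise unimplemented.  A hedged sketch of how ovn-northd might emit them, assuming the nested "icmp4 { ... }" action shown above becomes available (the helper name is hypothetical):

    /* Hypothetical sketch: ICMP time exceeded on TTL expiry, assuming the
     * nested "icmp4 { ... }" action exists. */
    static void
    sketch_icmp_time_exceeded(struct hmap *lflows, struct ovn_port *op)
    {
        char *match = xasprintf(
            "inport == %s && ip.ttl == {0, 1} && !ip.later_frag",
            op->json_key);
        char *actions = xasprintf(
            "icmp4 { "
            "icmp4.type = 11; "     /* Time exceeded. */
            "icmp4.code = 0; "      /* TTL exceeded in transit. */
            "ip4.dst = ip4.src; "
            "ip4.src = "IP_FMT"; "
            "ip4.ttl = 255; "
            "next; "
            "};",
            IP_ARGS(op->ip));
        ovn_lflow_add(lflows, op->od, S_ROUTER_IN_IP_INPUT, 40,
                      match, actions);
        free(actions);
        free(match);
    }
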
+
+    <h3>Ingress Table 2: IP Routing</h3>
+
+    <p>
+      A packet that arrives at this table is an IP packet that should be routed
+      to the address in <code>ip4.dst</code>.  This table implements IP
+      routing: it sets <code>reg0</code> to the next-hop IP address (leaving
+      <code>ip4.dst</code>, the packet's final destination, unchanged) and
+      advances to the next table for ARP resolution.
+    </p>
+
+    <p>
+      This table contains the following logical flows:
+    </p>
+
+    <ul>
+      <li>
+        <p>
+          Routing table.  For each route to IPv4 network <var>N</var> with
+          netmask <var>M</var>, a logical flow with match <code>ip4.dst ==
+          <var>N</var>/<var>M</var></code>, whose priority is the number of
+          1-bits in <var>M</var>, has the following actions:
+        </p>
+
+        <pre>
+ip4.ttl--;
+reg0 = <var>G</var>;
+next;
+        </pre>
+
+        <p>
+          (Ingress table 1 already verified that <code>ip4.ttl--;</code> will
+          not yield a TTL exceeded error.)
+        </p>
+
+        <p>
+          If the route has a gateway, <var>G</var> is the gateway IP address,
+          otherwise it is <code>ip4.dst</code>.
+        </p>
+      </li>
+
+      <li>
+        <p>
+          Destination unreachable.  For each router port <var>P</var>, which
+          owns IP address <var>A</var>, a priority-0 logical flow with match
+          <code>inport == <var>P</var> &amp;&amp; !ip.later_frag &amp;&amp;
+          !icmp</code> has the following actions:
+        </p>
+
+        <pre>
+icmp4 {
+    icmp4.type = 3; /* Destination unreachable. */
+    icmp4.code = 0; /* Network unreachable. */
+    ip4.dst = ip4.src;
+    ip4.src = <var>A</var>;
+    ip4.ttl = 255;
+    next(2);
+};
+        </pre>
+
+        <p>
+          (The <code>!icmp</code> check prevents recursion if the destination
+          unreachable message itself cannot be routed.)
+        </p>
+
+        <p>
+          These flows are omitted if the logical router has a default route,
+          that is, a route with netmask 0.0.0.0.
+        </p>
+      </li>
+    </ul>
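
The destination unreachable flows in the last item are not implemented by this patch either (ovn-northd.c carries an XXX for them).  A hedged sketch, which installs them only when the router has no default route, taking od->gateway from this patch as standing in for "has a default route" (the helper name is hypothetical):

    /* Hypothetical sketch: per-port destination unreachable, installed only
     * when the router has no default route. */
    static void
    sketch_dest_unreachable(struct hmap *lflows, struct ovn_port *op)
    {
        if (op->od->gateway) {
            return;             /* A default route will match instead. */
        }
        char *match = xasprintf("inport == %s && !ip.later_frag && !icmp",
                                op->json_key);
        char *actions = xasprintf(
            "icmp4 { "
            "icmp4.type = 3; "      /* Destination unreachable. */
            "icmp4.code = 0; "      /* Network unreachable. */
            "ip4.dst = ip4.src; "
            "ip4.src = "IP_FMT"; "
            "ip4.ttl = 255; "
            "next(2); "
            "};",
            IP_ARGS(op->ip));
        ovn_lflow_add(lflows, op->od, S_ROUTER_IN_IP_ROUTING, 0,
                      match, actions);
        free(actions);
        free(match);
    }
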
+
+    <h3>Ingress Table 3: ARP Resolution</h3>
+
+    <p>
+      Any packet that reaches this table is an IP packet whose next-hop IP
+      address is in <code>reg0</code>.  (<code>ip4.dst</code> is the final
+      destination.)  This table resolves the IP address in <code>reg0</code>
+      into an output port in <code>outport</code> and an Ethernet address in
+      <code>eth.dst</code>, using the following flows:
+    </p>
+
+    <ul>
+      <li>
+        <p>
+          Known MAC bindings.  For each IP address <var>A</var> whose host is
+          known to have Ethernet address <var>HE</var> and reside on router
+          port <var>P</var> with Ethernet address <var>PE</var>, a priority-200
+          flow with match <code>reg0 == <var>A</var></code> has the following
+          actions:
+        </p>
+
+        <pre>
+eth.src = <var>PE</var>;
+eth.dst = <var>HE</var>;
+outport = <var>P</var>;
+output;
+        </pre>
+
+        <p>
+          MAC bindings can be known statically based on data in the
+          <code>OVN_Northbound</code> database.  For router ports connected to
+          logical switches, MAC bindings can be known statically from the
+          <code>addresses</code> column in the <code>Logical_Port</code> table.
+          For router ports connected to other logical routers, MAC bindings can
+          be known statically from the <code>mac</code> and
+          <code>network</code> columns in the <code>Logical_Router_Port</code>
+          table.
+        </p>
+      </li>
+
+      <li>
+        <p>
+          Unknown MAC bindings.  For each non-gateway route to IPv4 network
+          <var>N</var> with netmask <var>M</var> on router port <var>P</var>
+          that owns IP address <var>A</var> and Ethernet address <var>E</var>,
+          a logical flow with match <code>ip4.dst ==
+          <var>N</var>/<var>M</var></code>, whose priority is the number of
+          1-bits in <var>M</var>, has the following actions:
+        </p>
+
+        <pre>
+arp {
+    eth.dst = ff:ff:ff:ff:ff:ff;
+    eth.src = <var>E</var>;
+    arp.sha = <var>E</var>;
+    arp.tha = 00:00:00:00:00:00;
+    arp.spa = <var>A</var>;
+    arp.tpa = ip4.dst;
+    arp.op = 1;  /* ARP request. */
+    outport = <var>P</var>;
+    output;
+};
+        </pre>
+
+        <p>
+          TBD: How to install MAC bindings when an ARP response comes back.
+          (Implement a "learn" action?)
+        </p>
+
+        <p>
+          Not yet implemented.
+        </p>
+      </li>
+    </ul>
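
Generating the ARP request itself is the unimplemented piece here.  A hedged sketch for a directly attached network, assuming the nested "arp { ... }" action above becomes available; how the resulting binding gets learned remains the open question noted above (the helper name is hypothetical):

    /* Hypothetical sketch: broadcast an ARP request for an unresolved
     * destination on a directly attached network, assuming a nested
     * "arp { ... }" action. */
    static void
    sketch_arp_request(struct hmap *lflows, struct ovn_port *op)
    {
        char *match = xasprintf("ip4.dst == "IP_FMT"/"IP_FMT,
                                IP_ARGS(op->network), IP_ARGS(op->mask));
        char *actions = xasprintf(
            "arp { "
            "eth.dst = ff:ff:ff:ff:ff:ff; "
            "eth.src = "ETH_ADDR_FMT"; "
            "arp.sha = "ETH_ADDR_FMT"; "
            "arp.tha = 00:00:00:00:00:00; "
            "arp.spa = "IP_FMT"; "
            "arp.tpa = ip4.dst; "
            "arp.op = 1; "          /* ARP request. */
            "outport = %s; "
            "output; "
            "};",
            ETH_ADDR_ARGS(op->mac), ETH_ADDR_ARGS(op->mac),
            IP_ARGS(op->ip), op->json_key);
        ovn_lflow_add(lflows, op->od, S_ROUTER_IN_ARP,
                      count_1bits(ntohl(op->mask)), match, actions);
        free(actions);
        free(match);
    }
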
+
+    <h3>Egress Table 0: Delivery</h3>
+
+    <p>
+      Packets that reach this table are ready for delivery.  It contains
+      priority-100 logical flows that match packets on each enabled logical
+      router port, with action <code>output;</code>.
+    </p>
+
 </manpage>
index 1ed3cc7..e6e9f3e 100644
--- a/ovn/northd/ovn-northd.c
+++ b/ovn/northd/ovn-northd.c
@@ -221,16 +221,24 @@ allocate_tnlid(struct hmap *set, const char *name, uint32_t max,
     return 0;
 }
 \f
-/* The 'key' comes from nb->header_.uuid or sb->external_ids:logical-switch. */
+/* The 'key' comes from nbs->header_.uuid or nbr->header_.uuid or
+ * sb->external_ids:logical-switch/logical-router. */
 struct ovn_datapath {
     struct hmap_node key_node;  /* Index on 'key'. */
-    struct uuid key;            /* nb->header_.uuid. */
+    struct uuid key;            /* (nbs/nbr)->header_.uuid. */
 
-    const struct nbrec_logical_switch *nb;   /* May be NULL. */
+    const struct nbrec_logical_switch *nbs;  /* May be NULL. */
+    const struct nbrec_logical_router *nbr;  /* May be NULL. */
     const struct sbrec_datapath_binding *sb; /* May be NULL. */
 
     struct ovs_list list;       /* In list of similar records. */
 
+    /* Logical router data (digested from nbr). */
+    ovs_be32 gateway;
+
+    /* Logical switch data. */
+    struct ovn_port *router_port;
+
     struct hmap port_tnlids;
     uint32_t port_key_hint;
 
@@ -239,13 +247,15 @@ struct ovn_datapath {
 
 static struct ovn_datapath *
 ovn_datapath_create(struct hmap *datapaths, const struct uuid *key,
-                    const struct nbrec_logical_switch *nb,
+                    const struct nbrec_logical_switch *nbs,
+                    const struct nbrec_logical_router *nbr,
                     const struct sbrec_datapath_binding *sb)
 {
     struct ovn_datapath *od = xzalloc(sizeof *od);
     od->key = *key;
     od->sb = sb;
-    od->nb = nb;
+    od->nbs = nbs;
+    od->nbr = nbr;
     hmap_init(&od->port_tnlids);
     od->port_key_hint = 0;
     hmap_insert(datapaths, &od->key_node, uuid_hash(&od->key));
@@ -284,7 +294,8 @@ ovn_datapath_from_sbrec(struct hmap *datapaths,
 {
     struct uuid key;
 
-    if (!smap_get_uuid(&sb->external_ids, "logical-switch", &key)) {
+    if (!smap_get_uuid(&sb->external_ids, "logical-switch", &key) &&
+        !smap_get_uuid(&sb->external_ids, "logical-router", &key)) {
         return NULL;
     }
     return ovn_datapath_find(datapaths, &key);
@@ -303,42 +314,85 @@ join_datapaths(struct northd_context *ctx, struct hmap *datapaths,
     const struct sbrec_datapath_binding *sb, *sb_next;
     SBREC_DATAPATH_BINDING_FOR_EACH_SAFE (sb, sb_next, ctx->ovnsb_idl) {
         struct uuid key;
-        if (!smap_get_uuid(&sb->external_ids, "logical-switch", &key)) {
-            ovsdb_idl_txn_add_comment(ctx->ovnsb_txn,
-                                      "deleting Datapath_Binding "UUID_FMT" that "
-                                      "lacks external-ids:logical-switch",
-                         UUID_ARGS(&sb->header_.uuid));
+        if (!smap_get_uuid(&sb->external_ids, "logical-switch", &key) &&
+            !smap_get_uuid(&sb->external_ids, "logical-router", &key)) {
+            ovsdb_idl_txn_add_comment(
+                ctx->ovnsb_txn,
+                "deleting Datapath_Binding "UUID_FMT" that lacks "
+                "external-ids:logical-switch and "
+                "external-ids:logical-router",
+                UUID_ARGS(&sb->header_.uuid));
             sbrec_datapath_binding_delete(sb);
             continue;
         }
 
         if (ovn_datapath_find(datapaths, &key)) {
             static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 1);
-            VLOG_INFO_RL(&rl, "deleting Datapath_Binding "UUID_FMT" with "
-                         "duplicate external-ids:logical-switch "UUID_FMT,
-                         UUID_ARGS(&sb->header_.uuid), UUID_ARGS(&key));
+            VLOG_INFO_RL(
+                &rl, "deleting Datapath_Binding "UUID_FMT" with "
+                "duplicate external-ids:logical-switch/router "UUID_FMT,
+                UUID_ARGS(&sb->header_.uuid), UUID_ARGS(&key));
             sbrec_datapath_binding_delete(sb);
             continue;
         }
 
         struct ovn_datapath *od = ovn_datapath_create(datapaths, &key,
-                                                      NULL, sb);
+                                                      NULL, NULL, sb);
         list_push_back(sb_only, &od->list);
     }
 
-    const struct nbrec_logical_switch *nb;
-    NBREC_LOGICAL_SWITCH_FOR_EACH (nb, ctx->ovnnb_idl) {
+    const struct nbrec_logical_switch *nbs;
+    NBREC_LOGICAL_SWITCH_FOR_EACH (nbs, ctx->ovnnb_idl) {
         struct ovn_datapath *od = ovn_datapath_find(datapaths,
-                                                    &nb->header_.uuid);
+                                                    &nbs->header_.uuid);
         if (od) {
-            od->nb = nb;
+            od->nbs = nbs;
             list_remove(&od->list);
             list_push_back(both, &od->list);
         } else {
-            od = ovn_datapath_create(datapaths, &nb->header_.uuid, nb, NULL);
+            od = ovn_datapath_create(datapaths, &nbs->header_.uuid,
+                                     nbs, NULL, NULL);
             list_push_back(nb_only, &od->list);
         }
     }
+
+    const struct nbrec_logical_router *nbr;
+    NBREC_LOGICAL_ROUTER_FOR_EACH (nbr, ctx->ovnnb_idl) {
+        struct ovn_datapath *od = ovn_datapath_find(datapaths,
+                                                    &nbr->header_.uuid);
+        if (od) {
+            if (!od->nbs) {
+                od->nbr = nbr;
+                list_remove(&od->list);
+                list_push_back(both, &od->list);
+            } else {
+                /* Can't happen! */
+                static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 1);
+                VLOG_WARN_RL(&rl,
+                             "duplicate UUID "UUID_FMT" in OVN_Northbound",
+                             UUID_ARGS(&nbr->header_.uuid));
+                continue;
+            }
+        } else {
+            od = ovn_datapath_create(datapaths, &nbr->header_.uuid,
+                                     NULL, nbr, NULL);
+            list_push_back(nb_only, &od->list);
+        }
+
+        od->gateway = 0;
+        if (nbr->default_gw) {
+            ovs_be32 ip, mask;
+            char *error = ip_parse_masked(nbr->default_gw, &ip, &mask);
+            if (error || !ip || mask != OVS_BE32_MAX) {
+                static struct vlog_rate_limit rl
+                    = VLOG_RATE_LIMIT_INIT(5, 1);
+                VLOG_WARN_RL(&rl, "bad 'gateway' %s", nbr->default_gw);
+                free(error);
+            } else {
+                od->gateway = ip;
+            }
+        }
+    }
 }
 
 static uint32_t
@@ -373,8 +427,9 @@ build_datapaths(struct northd_context *ctx, struct hmap *datapaths)
             od->sb = sbrec_datapath_binding_insert(ctx->ovnsb_txn);
 
             char uuid_s[UUID_LEN + 1];
-            sprintf(uuid_s, UUID_FMT, UUID_ARGS(&od->nb->header_.uuid));
-            const struct smap id = SMAP_CONST1(&id, "logical-switch", uuid_s);
+            sprintf(uuid_s, UUID_FMT, UUID_ARGS(&od->key));
+            const char *key = od->nbs ? "logical-switch" : "logical-router";
+            const struct smap id = SMAP_CONST1(&id, key, uuid_s);
             sbrec_datapath_binding_set_external_ids(od->sb, &id);
 
             sbrec_datapath_binding_set_tunnel_key(od->sb, tunnel_key);
@@ -393,10 +448,19 @@ build_datapaths(struct northd_context *ctx, struct hmap *datapaths)
 \f
 struct ovn_port {
     struct hmap_node key_node;  /* Index on 'key'. */
-    const char *key;            /* nb->name and sb->logical_port */
+    char *key;                  /* nbs->name, nbr->name, sb->logical_port. */
+    char *json_key;             /* 'key', quoted for use in JSON. */
 
-    const struct nbrec_logical_port *nb; /* May be NULL. */
-    const struct sbrec_port_binding *sb; /* May be NULL. */
+    const struct nbrec_logical_port *nbs;        /* May be NULL. */
+    const struct nbrec_logical_router_port *nbr; /* May be NULL. */
+    const struct sbrec_port_binding *sb;         /* May be NULL. */
+
+    /* Logical router port data. */
+    ovs_be32 ip, mask;          /* 192.168.10.123/24. */
+    ovs_be32 network;           /* 192.168.10.0. */
+    ovs_be32 bcast;             /* 192.168.10.255. */
+    struct eth_addr mac;
+    struct ovn_port *peer;
 
     struct ovn_datapath *od;
 
@@ -405,13 +469,20 @@ struct ovn_port {
 
 static struct ovn_port *
 ovn_port_create(struct hmap *ports, const char *key,
-                const struct nbrec_logical_port *nb,
+                const struct nbrec_logical_port *nbs,
+                const struct nbrec_logical_router_port *nbr,
                 const struct sbrec_port_binding *sb)
 {
     struct ovn_port *op = xzalloc(sizeof *op);
-    op->key = key;
+
+    struct ds json_key = DS_EMPTY_INITIALIZER;
+    json_string_escape(key, &json_key);
+    op->json_key = ds_steal_cstr(&json_key);
+
+    op->key = xstrdup(key);
     op->sb = sb;
-    op->nb = nb;
+    op->nbs = nbs;
+    op->nbr = nbr;
     hmap_insert(ports, &op->key_node, hash_string(op->key, 0));
     return op;
 }
@@ -424,6 +495,8 @@ ovn_port_destroy(struct hmap *ports, struct ovn_port *port)
          * private list and once we've exited that function it is not safe to
          * use it. */
         hmap_remove(ports, &port->key_node);
+        free(port->json_key);
+        free(port->key);
         free(port);
     }
 }
@@ -462,24 +535,111 @@ join_logical_ports(struct northd_context *ctx,
     const struct sbrec_port_binding *sb;
     SBREC_PORT_BINDING_FOR_EACH (sb, ctx->ovnsb_idl) {
         struct ovn_port *op = ovn_port_create(ports, sb->logical_port,
-                                              NULL, sb);
+                                              NULL, NULL, sb);
         list_push_back(sb_only, &op->list);
     }
 
     struct ovn_datapath *od;
     HMAP_FOR_EACH (od, key_node, datapaths) {
-        for (size_t i = 0; i < od->nb->n_ports; i++) {
-            const struct nbrec_logical_port *nb = od->nb->ports[i];
-            struct ovn_port *op = ovn_port_find(ports, nb->name);
-            if (op) {
-                op->nb = nb;
-                list_remove(&op->list);
-                list_push_back(both, &op->list);
-            } else {
-                op = ovn_port_create(ports, nb->name, nb, NULL);
-                list_push_back(nb_only, &op->list);
+        if (od->nbs) {
+            for (size_t i = 0; i < od->nbs->n_ports; i++) {
+                const struct nbrec_logical_port *nbs = od->nbs->ports[i];
+                struct ovn_port *op = ovn_port_find(ports, nbs->name);
+                if (op) {
+                    if (op->nbs || op->nbr) {
+                        static struct vlog_rate_limit rl
+                            = VLOG_RATE_LIMIT_INIT(5, 1);
+                        VLOG_WARN_RL(&rl, "duplicate logical port %s",
+                                     nbs->name);
+                        continue;
+                    }
+                    op->nbs = nbs;
+                    list_remove(&op->list);
+                    list_push_back(both, &op->list);
+                } else {
+                    op = ovn_port_create(ports, nbs->name, nbs, NULL, NULL);
+                    list_push_back(nb_only, &op->list);
+                }
+
+                op->od = od;
+            }
+        } else {
+            for (size_t i = 0; i < od->nbr->n_ports; i++) {
+                const struct nbrec_logical_router_port *nbr
+                    = od->nbr->ports[i];
+
+                struct eth_addr mac;
+                if (!eth_addr_from_string(nbr->mac, &mac)) {
+                    static struct vlog_rate_limit rl
+                        = VLOG_RATE_LIMIT_INIT(5, 1);
+                    VLOG_WARN_RL(&rl, "bad 'mac' %s", nbr->mac);
+                    continue;
+                }
+
+                ovs_be32 ip, mask;
+                char *error = ip_parse_masked(nbr->network, &ip, &mask);
+                if (error || mask == OVS_BE32_MAX || !ip_is_cidr(mask)) {
+                    static struct vlog_rate_limit rl
+                        = VLOG_RATE_LIMIT_INIT(5, 1);
+                    VLOG_WARN_RL(&rl, "bad 'network' %s", nbr->network);
+                    free(error);
+                    continue;
+                }
+
+                char name[UUID_LEN + 1];
+                snprintf(name, sizeof name, UUID_FMT,
+                         UUID_ARGS(&nbr->header_.uuid));
+                struct ovn_port *op = ovn_port_find(ports, name);
+                if (op) {
+                    if (op->nbs || op->nbr) {
+                        static struct vlog_rate_limit rl
+                            = VLOG_RATE_LIMIT_INIT(5, 1);
+                        VLOG_WARN_RL(&rl, "duplicate logical router port %s",
+                                     name);
+                        continue;
+                    }
+                    op->nbr = nbr;
+                    list_remove(&op->list);
+                    list_push_back(both, &op->list);
+                } else {
+                    op = ovn_port_create(ports, name, NULL, nbr, NULL);
+                    list_push_back(nb_only, &op->list);
+                }
+
+                op->ip = ip;
+                op->mask = mask;
+                op->network = ip & mask;
+                op->bcast = ip | ~mask;
+                op->mac = mac;
+
+                op->od = od;
             }
-            op->od = od;
+        }
+    }
+
+    /* Connect logical router ports, and logical switch ports of type "router",
+     * to their peers. */
+    struct ovn_port *op;
+    HMAP_FOR_EACH (op, key_node, ports) {
+        if (op->nbs && !strcmp(op->nbs->type, "router")) {
+            const char *peer_name = smap_get(&op->nbs->options, "router-port");
+            if (!peer_name) {
+                continue;
+            }
+
+            struct ovn_port *peer = ovn_port_find(ports, peer_name);
+            if (!peer || !peer->nbr) {
+                continue;
+            }
+
+            peer->peer = op;
+            op->peer = peer;
+            op->od->router_port = op;
+        } else if (op->nbr && op->nbr->peer) {
+            char peer_name[UUID_LEN + 1];
+            snprintf(peer_name, sizeof peer_name, UUID_FMT,
+                     UUID_ARGS(&op->nbr->peer->header_.uuid));
+            op->peer = ovn_port_find(ports, peer_name);
         }
     }
 }
@@ -487,13 +647,37 @@ join_logical_ports(struct northd_context *ctx,
 static void
 ovn_port_update_sbrec(const struct ovn_port *op)
 {
-    sbrec_port_binding_set_type(op->sb, op->nb->type);
-    sbrec_port_binding_set_options(op->sb, &op->nb->options);
     sbrec_port_binding_set_datapath(op->sb, op->od->sb);
-    sbrec_port_binding_set_parent_port(op->sb, op->nb->parent_name);
-    sbrec_port_binding_set_tag(op->sb, op->nb->tag, op->nb->n_tag);
-    sbrec_port_binding_set_mac(op->sb, (const char **) op->nb->addresses,
-                               op->nb->n_addresses);
+    if (op->nbr) {
+        sbrec_port_binding_set_type(op->sb, "patch");
+
+        const char *peer = op->peer ? op->peer->key : "<error>";
+        const struct smap ids = SMAP_CONST1(&ids, "peer", peer);
+        sbrec_port_binding_set_options(op->sb, &ids);
+
+        sbrec_port_binding_set_parent_port(op->sb, NULL);
+        sbrec_port_binding_set_tag(op->sb, NULL, 0);
+        sbrec_port_binding_set_mac(op->sb, NULL, 0);
+    } else {
+        if (strcmp(op->nbs->type, "router")) {
+            sbrec_port_binding_set_type(op->sb, op->nbs->type);
+            sbrec_port_binding_set_options(op->sb, &op->nbs->options);
+        } else {
+            sbrec_port_binding_set_type(op->sb, "patch");
+
+            const char *router_port = smap_get(&op->nbs->options,
+                                               "router-port");
+            if (!router_port) {
+                router_port = "<error>";
+            }
+            const struct smap ids = SMAP_CONST1(&ids, "peer", router_port);
+            sbrec_port_binding_set_options(op->sb, &ids);
+        }
+        sbrec_port_binding_set_parent_port(op->sb, op->nbs->parent_name);
+        sbrec_port_binding_set_tag(op->sb, op->nbs->tag, op->nbs->n_tag);
+        sbrec_port_binding_set_mac(op->sb, (const char **) op->nbs->addresses,
+                                   op->nbs->n_addresses);
+    }
 }
 
 static void
@@ -764,8 +948,8 @@ lport_is_enabled(const struct nbrec_logical_port *lport)
 static bool
 has_stateful_acl(struct ovn_datapath *od)
 {
-    for (size_t i = 0; i < od->nb->n_acls; i++) {
-        struct nbrec_acl *acl = od->nb->acls[i];
+    for (size_t i = 0; i < od->nbs->n_acls; i++) {
+        struct nbrec_acl *acl = od->nbs->acls[i];
         if (!strcmp(acl->action, "allow-related")) {
             return true;
         }
@@ -855,8 +1039,8 @@ build_acls(struct ovn_datapath *od, struct hmap *lflows)
     }
 
     /* Ingress or Egress ACL Table (Various priorities). */
-    for (size_t i = 0; i < od->nb->n_acls; i++) {
-        struct nbrec_acl *acl = od->nb->acls[i];
+    for (size_t i = 0; i < od->nbs->n_acls; i++) {
+        struct nbrec_acl *acl = od->nbs->acls[i];
         bool ingress = !strcmp(acl->direction, "from-lport") ? true :false;
         enum ovn_stage stage = ingress ? S_SWITCH_IN_ACL : S_SWITCH_OUT_ACL;
 
@@ -892,49 +1076,62 @@ build_acls(struct ovn_datapath *od, struct hmap *lflows)
     }
 }
 
-/* Updates the Logical_Flow and Multicast_Group tables in the OVN_SB database,
- * constructing their contents based on the OVN_NB database. */
 static void
-build_lflows(struct northd_context *ctx, struct hmap *datapaths,
-             struct hmap *ports)
+build_lswitch_flows(struct hmap *datapaths, struct hmap *ports,
+                    struct hmap *lflows, struct hmap *mcgroups)
 {
     /* This flow table structure is documented in ovn-northd(8), so please
      * update ovn-northd.8.xml if you change anything. */
 
-    struct hmap lflows = HMAP_INITIALIZER(&lflows);
-    struct hmap mcgroups = HMAP_INITIALIZER(&mcgroups);
-
-    /* Ingress table 0: Admission control framework (priorities 0 and 100). */
+    /* Build pre-ACL and ACL tables for both ingress and egress.
+     * Ingress tables 1 and 2.  Egress tables 0 and 1. */
     struct ovn_datapath *od;
     HMAP_FOR_EACH (od, key_node, datapaths) {
+        if (!od->nbs) {
+            continue;
+        }
+
+        build_acls(od, lflows);
+    }
+
+    /* Logical switch ingress table 0: Admission control framework (priority
+     * 100). */
+    HMAP_FOR_EACH (od, key_node, datapaths) {
+        if (!od->nbs) {
+            continue;
+        }
+
         /* Logical VLANs not supported. */
-        ovn_lflow_add(&lflows, od, S_SWITCH_IN_PORT_SEC, 100, "vlan.present",
+        ovn_lflow_add(lflows, od, S_SWITCH_IN_PORT_SEC, 100, "vlan.present",
                       "drop;");
 
         /* Broadcast/multicast source address is invalid. */
-        ovn_lflow_add(&lflows, od, S_SWITCH_IN_PORT_SEC, 100, "eth.src[40]",
+        ovn_lflow_add(lflows, od, S_SWITCH_IN_PORT_SEC, 100, "eth.src[40]",
                       "drop;");
 
         /* Port security flows have priority 50 (see below) and will continue
          * to the next table if packet source is acceptable. */
     }
 
-    /* Ingress table 0: Ingress port security (priority 50). */
+    /* Logical switch ingress table 0: Ingress port security (priority 50). */
     struct ovn_port *op;
     HMAP_FOR_EACH (op, key_node, ports) {
-        if (!lport_is_enabled(op->nb)) {
+        if (!op->nbs) {
+            continue;
+        }
+
+        if (!lport_is_enabled(op->nbs)) {
             /* Drop packets from disabled logical ports (since logical flow
              * tables are default-drop). */
             continue;
         }
 
         struct ds match = DS_EMPTY_INITIALIZER;
-        ds_put_cstr(&match, "inport == ");
-        json_string_escape(op->key, &match);
+        ds_put_format(&match, "inport == %s", op->json_key);
         build_port_security("eth.src",
-                            op->nb->port_security, op->nb->n_port_security,
+                            op->nbs->port_security, op->nbs->n_port_security,
                             &match);
-        ovn_lflow_add(&lflows, op->od, S_SWITCH_IN_PORT_SEC, 50,
+        ovn_lflow_add(lflows, op->od, S_SWITCH_IN_PORT_SEC, 50,
                       ds_cstr(&match), "next;");
         ds_destroy(&match);
     }
@@ -942,37 +1139,48 @@ build_lflows(struct northd_context *ctx, struct hmap *datapaths,
     /* Ingress table 2: Destination lookup, broadcast and multicast handling
      * (priority 100). */
     HMAP_FOR_EACH (op, key_node, ports) {
-        if (lport_is_enabled(op->nb)) {
-            ovn_multicast_add(&mcgroups, &mc_flood, op);
+        if (!op->nbs) {
+            continue;
+        }
+
+        if (lport_is_enabled(op->nbs)) {
+            ovn_multicast_add(mcgroups, &mc_flood, op);
         }
     }
     HMAP_FOR_EACH (od, key_node, datapaths) {
-        ovn_lflow_add(&lflows, od, S_SWITCH_IN_L2_LKUP, 100, "eth.mcast",
+        if (!od->nbs) {
+            continue;
+        }
+
+        ovn_lflow_add(lflows, od, S_SWITCH_IN_L2_LKUP, 100, "eth.mcast",
                       "outport = \""MC_FLOOD"\"; output;");
     }
 
     /* Ingress table 3: Destination lookup, unicast handling (priority 50), */
     HMAP_FOR_EACH (op, key_node, ports) {
-        for (size_t i = 0; i < op->nb->n_addresses; i++) {
+        if (!op->nbs) {
+            continue;
+        }
+
+        for (size_t i = 0; i < op->nbs->n_addresses; i++) {
             struct eth_addr mac;
 
-            if (eth_addr_from_string(op->nb->addresses[i], &mac)) {
+            if (eth_addr_from_string(op->nbs->addresses[i], &mac)) {
                 struct ds match, actions;
 
                 ds_init(&match);
-                ds_put_format(&match, "eth.dst == %s", op->nb->addresses[i]);
+                ds_put_format(&match, "eth.dst == "ETH_ADDR_FMT,
+                              ETH_ADDR_ARGS(mac));
 
                 ds_init(&actions);
-                ds_put_cstr(&actions, "outport = ");
-                json_string_escape(op->nb->name, &actions);
-                ds_put_cstr(&actions, "; output;");
-                ovn_lflow_add(&lflows, op->od, S_SWITCH_IN_L2_LKUP, 50,
+                ds_put_format(&actions, "outport = %s; output;", op->json_key);
+                ovn_lflow_add(lflows, op->od, S_SWITCH_IN_L2_LKUP, 50,
                               ds_cstr(&match), ds_cstr(&actions));
                 ds_destroy(&actions);
                 ds_destroy(&match);
-            } else if (!strcmp(op->nb->addresses[i], "unknown")) {
-                if (lport_is_enabled(op->nb)) {
-                    ovn_multicast_add(&mcgroups, &mc_unknown, op);
+            } else if (!strcmp(op->nbs->addresses[i], "unknown")) {
+                if (lport_is_enabled(op->nbs)) {
+                    ovn_multicast_add(mcgroups, &mc_unknown, op);
                     op->od->has_unknown = true;
                 }
             } else {
@@ -980,15 +1188,19 @@ build_lflows(struct northd_context *ctx, struct hmap *datapaths,
 
                 VLOG_INFO_RL(&rl,
                              "%s: invalid syntax '%s' in addresses column",
-                             op->nb->name, op->nb->addresses[i]);
+                             op->nbs->name, op->nbs->addresses[i]);
             }
         }
     }
 
     /* Ingress table 3: Destination lookup for unknown MACs (priority 0). */
     HMAP_FOR_EACH (od, key_node, datapaths) {
+        if (!od->nbs) {
+            continue;
+        }
+
         if (od->has_unknown) {
-            ovn_lflow_add(&lflows, od, S_SWITCH_IN_L2_LKUP, 0, "1",
+            ovn_lflow_add(lflows, od, S_SWITCH_IN_L2_LKUP, 0, "1",
                           "outport = \""MC_UNKNOWN"\"; output;");
         }
     }
@@ -996,7 +1208,11 @@ build_lflows(struct northd_context *ctx, struct hmap *datapaths,
     /* Egress table 2: Egress port security multicast/broadcast (priority
      * 100). */
     HMAP_FOR_EACH (od, key_node, datapaths) {
-        ovn_lflow_add(&lflows, od, S_SWITCH_OUT_PORT_SEC, 100, "eth.mcast",
+        if (!od->nbs) {
+            continue;
+        }
+
+        ovn_lflow_add(lflows, od, S_SWITCH_OUT_PORT_SEC, 100, "eth.mcast",
                       "output;");
     }
 
@@ -1007,31 +1223,283 @@ build_lflows(struct northd_context *ctx, struct hmap *datapaths,
      * Priority 150 rules drop packets to disabled logical ports, so that they
      * don't even receive multicast or broadcast packets. */
     HMAP_FOR_EACH (op, key_node, ports) {
-        struct ds match;
-
-        ds_init(&match);
-        ds_put_cstr(&match, "outport == ");
-        json_string_escape(op->key, &match);
-        if (lport_is_enabled(op->nb)) {
-            build_port_security("eth.dst",
-                                op->nb->port_security, op->nb->n_port_security,
-                                &match);
-            ovn_lflow_add(&lflows, op->od, S_SWITCH_OUT_PORT_SEC, 50,
+        if (!op->nbs) {
+            continue;
+        }
+
+        struct ds match = DS_EMPTY_INITIALIZER;
+        ds_put_format(&match, "outport == %s", op->json_key);
+        if (lport_is_enabled(op->nbs)) {
+            build_port_security("eth.dst", op->nbs->port_security,
+                                op->nbs->n_port_security, &match);
+            ovn_lflow_add(lflows, op->od, S_SWITCH_OUT_PORT_SEC, 50,
                           ds_cstr(&match), "output;");
         } else {
-            ovn_lflow_add(&lflows, op->od, S_SWITCH_OUT_PORT_SEC, 150,
+            ovn_lflow_add(lflows, op->od, S_SWITCH_OUT_PORT_SEC, 150,
                           ds_cstr(&match), "drop;");
         }
 
         ds_destroy(&match);
     }
+}
 
-    /* Build pre-ACL and ACL tables for both ingress and egress.
-     * Ingress tables 1 and 2.  Egress tables 0 and 1. */
+static bool
+lrport_is_enabled(const struct nbrec_logical_router_port *lrport)
+{
+    return !lrport->enabled || *lrport->enabled;
+}
+
+static void
+add_route(struct hmap *lflows, struct ovn_datapath *od,
+          ovs_be32 network, ovs_be32 mask, ovs_be32 gateway)
+{
+    char *match = xasprintf("ip4.dst == "IP_FMT"/"IP_FMT,
+                            IP_ARGS(network), IP_ARGS(mask));
+
+    struct ds actions = DS_EMPTY_INITIALIZER;
+    ds_put_cstr(&actions, "ip4.ttl--; reg0 = ");
+    if (gateway) {
+        ds_put_format(&actions, IP_FMT, IP_ARGS(gateway));
+    } else {
+        ds_put_cstr(&actions, "ip4.dst");
+    }
+    ds_put_cstr(&actions, "; next;");
+
+    /* The priority here is calculated to implement longest-prefix-match
+     * routing. */
+    ovn_lflow_add(lflows, od, S_ROUTER_IN_IP_ROUTING,
+                  count_1bits(ntohl(mask)), match, ds_cstr(&actions));
+    ds_destroy(&actions);
+    free(match);
+}
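/* Editorial illustration, not part of this patch: what add_route() above
 * produces for two hypothetical routes on the same logical router.
 *
 *   A directly connected 192.168.1.0/24 (gateway == 0):
 *     priority 24 (count_1bits of the mask),
 *     match   "ip4.dst == 192.168.1.0/255.255.255.0",
 *     actions "ip4.ttl--; reg0 = ip4.dst; next;"
 *
 *   A default route via 192.168.1.254 (network == mask == 0):
 *     priority 0,
 *     match   "ip4.dst == 0.0.0.0/0.0.0.0",
 *     actions "ip4.ttl--; reg0 = 192.168.1.254; next;"
 *
 * The more specific prefix wins purely through flow priority, which is how
 * this table implements longest-prefix-match routing. */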
+
+static void
+build_lrouter_flows(struct hmap *datapaths, struct hmap *ports,
+                    struct hmap *lflows)
+{
+    /* This flow table structure is documented in ovn-northd(8), so please
+     * update ovn-northd.8.xml if you change anything. */
+
+    /* XXX ICMP echo reply */
+
+    /* Logical router ingress table 0: Admission control framework. */
+    struct ovn_datapath *od;
+    HMAP_FOR_EACH (od, key_node, datapaths) {
+        if (!od->nbr) {
+            continue;
+        }
+
+        /* Logical VLANs not supported.
+         * Broadcast/multicast source address is invalid. */
+        ovn_lflow_add(lflows, od, S_ROUTER_IN_ADMISSION, 100,
+                      "vlan.present || eth.src[40]", "drop;");
+    }
+
+    /* Logical router ingress table 0: per-port admission (priority 50). */
+    struct ovn_port *op;
+    HMAP_FOR_EACH (op, key_node, ports) {
+        if (!op->nbr) {
+            continue;
+        }
+
+        if (!lrport_is_enabled(op->nbr)) {
+            /* Drop packets from disabled logical ports (since logical flow
+             * tables are default-drop). */
+            continue;
+        }
+
+        char *match = xasprintf(
+            "(eth.mcast || eth.dst == "ETH_ADDR_FMT") && inport == %s",
+            ETH_ADDR_ARGS(op->mac), op->json_key);
+        ovn_lflow_add(lflows, op->od, S_ROUTER_IN_ADMISSION, 50,
+                      match, "next;");
+        free(match);
+    }
+
+    /* Logical router ingress table 1: IP Input. */
     HMAP_FOR_EACH (od, key_node, datapaths) {
-        build_acls(od, &lflows);
+        if (!od->nbr) {
+            continue;
+        }
+
+        /* L3 admission control: drop multicast and broadcast source, localhost
+         * source or destination, and zero network source or destination
+         * (priority 100). */
+        ovn_lflow_add(lflows, od, S_ROUTER_IN_IP_INPUT, 100,
+                      "ip4.mcast || "
+                      "ip4.src == 255.255.255.255 || "
+                      "ip4.src == 127.0.0.0/8 || "
+                      "ip4.dst == 127.0.0.0/8 || "
+                      "ip4.src == 0.0.0.0/8 || "
+                      "ip4.dst == 0.0.0.0/8",
+                      "drop;");
+
+        /* Drop Ethernet local broadcast.  By definition this traffic should
+         * not be forwarded. */
+        ovn_lflow_add(lflows, od, S_ROUTER_IN_IP_INPUT, 50,
+                      "eth.bcast", "drop;");
+
+        /* Drop IP multicast. */
+        ovn_lflow_add(lflows, od, S_ROUTER_IN_IP_INPUT, 50,
+                      "ip4.mcast", "drop;");
+
+        /* TTL discard.
+         *
+         * XXX Need to send ICMP time exceeded if !ip.later_frag. */
+        char *match = xasprintf("ip4 && ip.ttl == {0, 1}");
+        ovn_lflow_add(lflows, od, S_ROUTER_IN_IP_INPUT, 30, match, "drop;");
+        free(match);
+
+        /* Pass other traffic not already handled to the next table for
+         * routing. */
+        ovn_lflow_add(lflows, od, S_ROUTER_IN_IP_INPUT, 0, "1", "next;");
     }
 
+    HMAP_FOR_EACH (op, key_node, ports) {
+        if (!op->nbr) {
+            continue;
+        }
+
+        /* L3 admission control: drop packets that originate from an IP address
+         * owned by the router or a broadcast address known to the router
+         * (priority 100). */
+        char *match = xasprintf("ip4.src == {"IP_FMT", "IP_FMT"}",
+                                IP_ARGS(op->ip), IP_ARGS(op->bcast));
+        ovn_lflow_add(lflows, op->od, S_ROUTER_IN_IP_INPUT, 100,
+                      match, "drop;");
+        free(match);
+
+        /* ARP reply.  These flows reply to ARP requests for the router's own
+         * IP address. */
+        match = xasprintf(
+            "inport == %s && arp.tpa == "IP_FMT" && arp.op == 1",
+            op->json_key, IP_ARGS(op->ip));
+        char *actions = xasprintf(
+            "eth.dst = eth.src; "
+            "eth.src = "ETH_ADDR_FMT"; "
+            "arp.op = 2; /* ARP reply */ "
+            "arp.tha = arp.sha; "
+            "arp.sha = "ETH_ADDR_FMT"; "
+            "arp.tpa = arp.spa; "
+            "arp.spa = "IP_FMT"; "
+            "outport = %s; "
+            "inport = \"\"; /* Allow sending out inport. */ "
+            "output;",
+            ETH_ADDR_ARGS(op->mac),
+            ETH_ADDR_ARGS(op->mac),
+            IP_ARGS(op->ip),
+            op->json_key);
+        ovn_lflow_add(lflows, op->od, S_ROUTER_IN_IP_INPUT, 90,
+                      match, actions);
+        free(actions);
+        free(match);
+
+        /* Drop IP traffic to this router. */
+        match = xasprintf("ip4.dst == "IP_FMT, IP_ARGS(op->ip));
+        ovn_lflow_add(lflows, op->od, S_ROUTER_IN_IP_INPUT, 60,
+                      match, "drop;");
+        free(match);
+    }
+
+    /* Logical router ingress table 2: IP Routing.
+     *
+     * A packet that arrives at this table is an IP packet that should be
+     * routed to the address in ip4.dst. This table sets reg0 to the next-hop
+     * IP address (leaving ip4.dst, the packet's final destination, unchanged)
+     * and advances to the next table for ARP resolution. */
+    HMAP_FOR_EACH (op, key_node, ports) {
+        if (!op->nbr) {
+            continue;
+        }
+
+        add_route(lflows, op->od, op->network, op->mask, 0);
+    }
+    HMAP_FOR_EACH (od, key_node, datapaths) {
+        if (!od->nbr) {
+            continue;
+        }
+
+        if (od->gateway) {
+            add_route(lflows, od, 0, 0, od->gateway);
+        }
+    }
+    /* XXX destination unreachable */
+
+    /* Logical router ingress table 3: ARP Resolution.
+     *
+     * Any packet that reaches this table is an IP packet whose next-hop IP
+     * address is in reg0. (ip4.dst is the final destination.) This table
+     * resolves the IP address in reg0 into an output port in outport and an
+     * Ethernet address in eth.dst. */
+    HMAP_FOR_EACH (op, key_node, ports) {
+        if (op->nbr) {
+            /* XXX ARP for neighboring router */
+        } else if (op->od->router_port) {
+            const char *peer_name = smap_get(
+                &op->od->router_port->nbs->options, "router-port");
+            if (!peer_name) {
+                continue;
+            }
+
+            struct ovn_port *peer = ovn_port_find(ports, peer_name);
+            if (!peer || !peer->nbr) {
+                continue;
+            }
+
+            for (size_t i = 0; i < op->nbs->n_addresses; i++) {
+                struct eth_addr ea;
+                ovs_be32 ip;
+
+                if (ovs_scan(op->nbs->addresses[i],
+                             ETH_ADDR_SCAN_FMT" "IP_SCAN_FMT,
+                             ETH_ADDR_SCAN_ARGS(ea), IP_SCAN_ARGS(&ip))) {
+                    char *match = xasprintf("reg0 == "IP_FMT, IP_ARGS(ip));
+                    char *actions = xasprintf("eth.src = "ETH_ADDR_FMT"; "
+                                              "eth.dst = "ETH_ADDR_FMT"; "
+                                              "outport = %s; "
+                                              "output;",
+                                              ETH_ADDR_ARGS(peer->mac),
+                                              ETH_ADDR_ARGS(ea),
+                                              peer->json_key);
+                    ovn_lflow_add(lflows, peer->od,
+                                  S_ROUTER_IN_ARP, 200, match, actions);
+                    free(actions);
+                    free(match);
+                }
+            }
+        }
+    }
+
+    /* Logical router egress table 0: Delivery (priority 100).
+     *
+     * Priority 100 rules deliver packets to enabled logical ports. */
+    HMAP_FOR_EACH (op, key_node, ports) {
+        if (!op->nbr) {
+            continue;
+        }
+
+        if (!lrport_is_enabled(op->nbr)) {
+            /* Drop packets to disabled logical ports (since logical flow
+             * tables are default-drop). */
+            continue;
+        }
+
+        char *match = xasprintf("outport == %s", op->json_key);
+        ovn_lflow_add(lflows, op->od, S_ROUTER_OUT_DELIVERY, 100,
+                      match, "output;");
+        free(match);
+    }
+}
+
+/* Updates the Logical_Flow and Multicast_Group tables in the OVN_SB database,
+ * constructing their contents based on the OVN_NB database. */
+static void
+build_lflows(struct northd_context *ctx, struct hmap *datapaths,
+             struct hmap *ports)
+{
+    struct hmap lflows = HMAP_INITIALIZER(&lflows);
+    struct hmap mcgroups = HMAP_INITIALIZER(&mcgroups);
+
+    build_lswitch_flows(datapaths, ports, &lflows, &mcgroups);
+    build_lrouter_flows(datapaths, ports, &lflows);
+
     /* Push changes to the Logical_Flow table to database. */
     const struct sbrec_logical_flow *sbflow, *next_sbflow;
     SBREC_LOGICAL_FLOW_FOR_EACH_SAFE (sbflow, next_sbflow, ctx->ovnsb_idl) {
@@ -1042,7 +1510,7 @@ build_lflows(struct northd_context *ctx, struct hmap *datapaths,
             continue;
         }
 
-        enum ovn_datapath_type dp_type = DP_SWITCH; /* XXX no routers yet. */
+        enum ovn_datapath_type dp_type = od->nbs ? DP_SWITCH : DP_ROUTER;
         enum ovn_pipeline pipeline
             = !strcmp(sbflow->pipeline, "ingress") ? P_IN : P_OUT;
         struct ovn_lflow *lflow = ovn_lflow_find(
@@ -1061,8 +1529,8 @@ build_lflows(struct northd_context *ctx, struct hmap *datapaths,
 
         sbflow = sbrec_logical_flow_insert(ctx->ovnsb_txn);
         sbrec_logical_flow_set_logical_datapath(sbflow, lflow->od->sb);
-        sbrec_logical_flow_set_pipeline(sbflow,
-                                        pipeline ? "ingress" : "egress");
+        sbrec_logical_flow_set_pipeline(
+            sbflow, pipeline == P_IN ? "ingress" : "egress");
         sbrec_logical_flow_set_table_id(sbflow, table);
         sbrec_logical_flow_set_priority(sbflow, lflow->priority);
         sbrec_logical_flow_set_match(sbflow, lflow->match);
index 55da9ee..1d9104e 100644
--- a/ovn/ovn-sb.xml
+++ b/ovn/ovn-sb.xml
@@ -1078,12 +1078,28 @@ tcp.flags = RST;
       constructed for each supported encapsulation.
     </column>
 
-    <column name="external_ids" key="logical-switch" type='{"type": "uuid"}'>
-      Each row in <ref table="Datapath_Binding"/> is associated with some
-      logical datapath.  <code>ovn-northd</code> uses this key to store the
-      UUID of the logical datapath <ref table="Logical_Switch"
-      db="OVN_Northbound"/> row in the <ref db="OVN_Northbound"/> database.
-    </column>
+    <group title="OVN_Northbound Relationship">
+      <p>
+        Each row in <ref table="Datapath_Binding"/> is associated with some
+        logical datapath.  <code>ovn-northd</code> uses these keys to track the
+        association of a logical datapath with concepts in the <ref
+        db="OVN_Northbound"/> database.
+      </p>
+
+      <column name="external_ids" key="logical-switch" type='{"type": "uuid"}'>
+        For a logical datapath that represents a logical switch,
+        <code>ovn-northd</code> stores in this key the UUID of the
+        corresponding <ref table="Logical_Switch" db="OVN_Northbound"/> row in
+        the <ref db="OVN_Northbound"/> database.
+      </column>
+
+      <column name="external_ids" key="logical-router" type='{"type": "uuid"}'>
+        For a logical datapath that represents a logical router,
+        <code>ovn-northd</code> stores in this key the UUID of the
+        corresponding <ref table="Logical_Router" db="OVN_Northbound"/> row in
+        the <ref db="OVN_Northbound"/> database.
+      </column>
+    </group>
 
     <group title="Common Columns">
       The overall purpose of these columns is described under <code>Common
index ce9148d..c76b5dc 100644
--- a/tests/ovn.at
+++ b/tests/ovn.at
@@ -512,12 +512,13 @@ AT_CLEANUP
 
 AT_BANNER([OVN end-to-end tests])
 
-AT_SETUP([ovn -- 3 HVs, 3 VIFs/HV, 1 logical switch])
+# 3 hypervisors, one logical switch, 3 logical ports per hypervisor
+AT_SETUP([ovn -- 3 HVs, 1 LS, 3 lports/HV])
 AT_SKIP_IF([test $HAVE_PYTHON = no])
 ovn_start
 
 # Create hypervisors hv[123].
-# Add vif1[123] to hv1, vif2[123] to hv2, vif3[123].
+# Add vif1[123] to hv1, vif2[123] to hv2, vif3[123] to hv3.
 # Add all of the vifs to a single logical switch lsw0.
 # Turn on port security on all the vifs except vif[123]1.
 # Make vif13, vif2[23], vif3[123] destinations for unknown MACs.
@@ -679,8 +680,7 @@ for i in 1 2 3; do
 done
 AT_CLEANUP
 
-
-AT_SETUP([ovn -- 3 HVs, 1 VIFs/HV, 1 gateway, 1 logical switch])
+AT_SETUP([ovn -- 3 HVs, 1 VIFs/HV, 1 GW, 1 LS])
 AT_SKIP_IF([test $HAVE_PYTHON = no])
 ovn_start
 
@@ -835,3 +835,166 @@ for i in 1 2 3; do
     echo
 done
 AT_CLEANUP
+
+# 3 hypervisors, 3 logical switches with 3 logical ports each, 1 logical router
+AT_SETUP([ovn -- 3 HVs, 3 LS, 3 lports/LS, 1 LR])
+AT_SKIP_IF([test $HAVE_PYTHON = no])
+ovn_start
+
+# Logical network:
+#
+# Three logical switches ls1, ls2, ls3.
+# Three VIFs on each: lp1[123], lp2[123], lp3[123].
+# One logical router lr0 connected to ls[123].
+ovn-nbctl \
+    -- create Logical_Router name=lr0 ports=@lrp1,@lrp2,@lrp3 \
+    -- --id=@lrp1 create Logical_Router_Port name=lrp1 \
+       network=192.168.1.254/24 mac='"00:00:00:00:ff:01"' \
+    -- --id=@lrp2 create Logical_Router_Port name=lrp2 \
+       network=192.168.2.254/24 mac='"00:00:00:00:ff:02"' \
+    -- --id=@lrp3 create Logical_Router_Port name=lrp3 \
+       network=192.168.3.254/24 mac='"00:00:00:00:ff:03"'
+for i in 1 2 3; do
+    lrp_uuid=`ovn-nbctl get Logical_Router_Port lrp$i _uuid`
+    ovn-nbctl \
+        -- lswitch-add ls$i \
+        -- lport-add ls$i lrp$i-attachment \
+        -- set Logical_Port lrp$i-attachment type=router \
+                                             options:router-port=$lrp_uuid \
+                                             addresses='"00:00:00:00:ff:0'$i'"'
+    for j in 1 2 3; do
+        ovn-nbctl \
+            -- lport-add ls$i lp$i$j \
+            -- lport-set-addresses lp$i$j "f0:00:00:00:00:$i$j 192.168.$i.$j"
+    done
+done
+
+# Physical network:
+#
+# Three hypervisors hv[123].
+# lp1[123] spread across hv[123]: lp11 on hv1, lp12 on hv2, lp13 on hv3.
+# lp2[123] spread across hv[23]: lp21 and lp22 on hv2, lp23 on hv3.
+# lp3[123] all on hv3.
+
+# Given the number of a logical port (e.g. 11 for lp11), prints the number of
+# the hypervisor on which it is located.
+vif_to_hv() {
+    case $1 in dnl (
+        11) echo 1 ;; dnl (
+        12 | 21 | 22) echo 2 ;; dnl (
+        13 | 23 | 3?) echo 3 ;;
+    esac
+}
+
+net_add n1
+for i in 1 2 3; do
+    sim_add hv$i
+    as hv$i
+    ovs-vsctl add-br br-phys
+    ovn_attach n1 br-phys 192.168.0.$i
+done
+for i in 1 2 3; do
+    for j in 1 2 3; do
+        hv=`vif_to_hv $i$j`
+        as hv$hv ovs-vsctl \
+            -- add-port br-int vif$i$j \
+            -- set Interface vif$i$j external-ids:iface-id=lp$i$j \
+                                     options:tx_pcap=hv$hv/vif$i$j-tx.pcap \
+                                     options:rxq_pcap=hv$hv/vif$i$j-rx.pcap \
+                                     ofport-request=$i$j
+    done
+done
+
+# Pre-populate the hypervisors' ARP tables so that we don't lose any
+# packets for ARP resolution (native tunneling doesn't queue packets
+# for ARP resolution).
+ovn_populate_arp
+
+# Allow some time for ovn-northd and ovn-controller to catch up.
+# XXX This should be more systematic.
+sleep 1
+
+# test_packet INPORT SRC_MAC DST_MAC SRC_IP DST_IP OUTPORT...
+#
+# This shell function causes an IPv4 UDP packet to be received on INPORT.
+# The packet's Ethernet header has destination DST_MAC and source SRC_MAC
+# (each exactly 12 hex digits) and its IP header has source SRC_IP and
+# destination DST_IP (each exactly 8 hex digits).  The OUTPORTs (zero or
+# more) list the VIFs on which the packet should be received.  INPORT and the
+# OUTPORTs are specified as lport numbers, e.g. 11 for vif11.
+trim_zeros() {
+    sed 's/\(00\)\{1,\}$//'
+}
+for i in 1 2 3; do
+    for j in 1 2 3; do
+        : > $i$j.expected
+    done
+done
+test_packet() {
+    # This packet has bad checksums but logical L3 routing doesn't check.
+    local inport=$1 src_mac=$2 dst_mac=$3 src_ip=$4 dst_ip=$5
+    local packet=$3$208004500001c0000000040110000$4$50035111100080000
+    shift; shift; shift; shift; shift
+    hv=hv`vif_to_hv $inport`
+    as $hv ovs-appctl netdev-dummy/receive vif$inport $packet
+    #as $hv ovs-appctl ofproto/trace br-int in_port=$inport $packet
+    for outport; do
+        ins=`echo $inport | sed 's/^\(.\).*/\1/'`
+        outs=`echo $outport | sed 's/^\(.\).*/\1/'`
+        if test $ins = $outs; then
+            # Ports on the same logical switch receive exactly the same packet.
+            echo $packet
+        else
+            # Routing decrements TTL and updates source and dest MAC
+            # (and checksum).
+            echo f000000000${outport}00000000ff0${outs}08004500001c00000000"3f1101"00${src_ip}${dst_ip}0035111100080000
+        fi | trim_zeros >> $outport.expected
+    done
+}
+
+as hv1 ovn-sbctl dump-flows
+as hv1 ovs-ofctl dump-flows br-int
+
+# Send packets between all pairs of source and destination ports:
+#
+# 1. Unicast IP packets are delivered to exactly one lport (except
+#    that packets destined to their input ports are dropped).
+#
+# 2. Broadcast IP packets are delivered to all lports except the input port.
+for is in 1 2 3; do
+    for js in 1 2 3; do
+        bcast=
+        s=$is$js
+        smac=f000000000$s
+        sip=c0a80${is}0${js}
+        for id in 1 2 3; do
+            for jd in 1 2 3; do
+                d=$id$jd
+                dip=c0a80${id}0${jd}
+                if test $is = $id; then dmac=f000000000$d; else dmac=00000000ff0$is; fi
+                if test $d != $s; then unicast=$d; else unicast=; fi
+
+                test_packet $s $smac $dmac $sip $dip $unicast #1
+
+                if test $id = $is && test $jd != $js; then bcast="$bcast $d"; fi
+            done
+        done
+        test_packet $s $smac ffffffffffff $sip ffffffff $bcast #2
+    done
+done
+
+# Allow some time for packet forwarding.
+# XXX This can be improved.
+sleep 1
+
+# Now check the packets actually received against the ones expected.
+for i in 1 2 3; do
+    for j in 1 2 3; do
+        file=hv`vif_to_hv $i$j`/vif$i$j-tx.pcap
+        echo $file
+        $PYTHON "$top_srcdir/utilities/ovs-pcap.in" $file | trim_zeros > $i$j.packets
+        cp $i$j.expected expout
+        AT_CHECK([cat $i$j.packets], [0], [expout])
+        echo
+    done
+done
+AT_CLEANUP