Turning Mathematics Problems into Games: Reinforcement Learning
and Gröbner bases together solve Integer Feasibility Problems

Yue Wu, Jesús A. De Loera Equal contribution
Abstract

Can agents be trained to answer difficult mathematical questions by playing a game? We consider the integer feasibility problem, a challenge of deciding whether a system of linear equations and inequalities has a solution with integer values. This is a famous NP-complete problem with applications in many areas of Mathematics and Computer Science. Our paper describes a novel algebraic reinforcement learning framework that allows an agent to play a game equivalent to the integer feasibility problem. We explain how to transform the integer feasibility problem into a game over a set of arrays with fixed margin sums. The game starts with an initial state (an array), and by applying a legal move that leaves the margins unchanged, we aim to eventually reach a winning state with zeros in specific positions. To win the game the player must find a path between the initial state and a final terminal winning state, if one exists. Finding such a winning state is equivalent to solving the integer feasibility problem. The key algebraic ingredient is a Gröbner basis of the toric ideal for the underlying axial transportation polyhedron. The Gröbner basis can be seen as a set of connecting moves (actions) of the game. We then propose a novel RL approach that trains an agent to predict moves in continuous space to cope with the large size of action space. The continuous move is then projected onto the set of legal moves so that the path always leads to valid states. As a proof of concept we demonstrate in experiments that our agent can play well the simplest version of our game for 2-way tables. Our work highlights the potential to train agents to solve non-trivial mathematical queries through contemporary machine learning methods used to train agents to play games.

\nocopyright\affiliations

University of California, Davis
yvwu@ucdavis.edu, deloera@math.ucdavis.edu

Introduction

Reinforcement learning has seen tremendous success in recent years, especially in playing games at levels that achieve superhuman performances Mnih et al. (2015); Silver et al. (2017); Guo et al. (2016); Vinyals et al. (2019)). The philosophical principle we introduce in this paper is to try to reformulate non-trivial mathematical problems as games and then try to adapt reinforcement learning techniques to play those games. By winning the game we solve the original mathematical problem. Of course this first requires (at least for now) a human creating the right game for the given mathematical problem. As a proof of concept, in this paper we use the integer feasibility problem (IFP). In its simplest form, the IFP is the decision question of whether a polyhedral system has an integer vector solution (note that any system of equations and inequalities can be reduced to this standard form). This famous NP-complete problem is very important in mathematical optimization, discrete mathematics, algebra, and other areas of mathematics. What we propose here is to turn it into a game on arrays.

As we explain below we transform every instance of the integer IFP into a positional game over arrays or tables with fixed margin sums belonging to the axial transportation we use an old polynomial-time algorithm from De Loera and Onn (2004a) that canonically rewrites an instance of the IFP as the face of a axial transportation polytope. Axial transportation polytopes are convex polytopes whose points are arrays or tables with fixed axial sums of entries Yemelichev et al. (1984). The second idea is that we know well the toric (polynomial ring) ideals of axial transportation polytopes have an explicit North-West Gröbner basis that connects all lattice points. The game starts with an initial state (an initial array), and by applying a legal move that leaves the margins unchanged, we aim to reach a winning state with zeros in specific positions. To win the game the player must find a path between the initial state and a final terminal winning state. Finding such a winning state is equivalent to solving the integer feasibility problem. As we see later the winning position, if one exists, is essentially a table with very specific zero entries. The Gröbner basis of the toric ideal for the underlying transportation polyhedron as a set generating all connecting moves that the agent can be trained to select the moves. Of course our games sometimes have no solution, then our method, at least for now, may have to exhaust all positions before concluding this. Reinforcement learning tools are then used to search the game space

Reinforcement learning on games has achieved great success and a key point of our paper is that the success can be extended to train artificial agents to play these games with a mathematical origin. We make a successful practical demonstration of these ideas with experiments in the simplest form of our games, the situation for 2D tables where we trained an agent to predict the moves. Our game can be viewed as a variant of the stochastic shortest path problem Bertsekas and Tsitsiklis (1991); Bertsekas and Yu (2013) where an agent takes a path to reach pre-defined goal states. However, unlike any of the games like Go, maze, etc, our table game has a very large and highly (algebraically) structured action space. In 2-way tables, the minimum Grobner basis contains moves of degree ( non-zero entries). Although these most basic moves are sufficient for solving the game, the progress they make at each step is limited, which can cause extremely long episodes. We instead consider much more advanced moves that are integer combinations of the minimum moves. These complicated moves will be very effective for solving the table games faster. New techniques are required to compute these advanced moves because they are not only very large in number but also difficult to construct. We treat it as a regression problem and make the agent predict continuous moves which are then projected onto its closest lattice point in a constraint set specified by an integer program. We make the whole framework differentiable and thus can be trained end-to-end.

We conduct experiments on 2-way table and the results are promising, which sheds light on the potential of Gröbner basis approaches for integer feasibility problems.

Basic notions: polyhedra and Gröbner bases

We begin describing how testing integer feasibility is actually equivalent to playing a certain game on finite set of arrays inside a polytope. We begin with some notation: Let , and be three positive integers. Let , , and be three rational vectors of lengths , and , respectively, with non-negative entries. We consider the 3-way transportation polytope with entries defined by 1-marginals,

The margins are sums of the certain entries of an 3-way array of numbers. We begin with stating the first algorithmic step of this paper:

Theorem 1.

Any rational convex polyhedron can be written as a face of some -way transportation polytope with -marginals supported by a hyperplane of the form , where is a finite subset of indices. The sizes , the set , and marginals can be computed explicitly in polynomial time on the size of the input. Specifically, the face is given by certain entries forced to be zero.

This theorem is a particular case of a much more general theory further developed in De Loera and Onn (2004a, b). Given a convex polytope (for which we wonder whether there is a lattice point), the very first step is to use Theorem 1 to transfer the points of as 3-way tables. We construct -marginals for a -way transportation polytope , and a set of triples , such that is the face of of those points with for all in .

Unlike general polytopes, for axial transportation polytopes with given 1-marginals it is trivial to decide integer feasibility and to find an integer vertex of . This can be done via the famous greedy north-west-up corner rule Hoffman (1963); Yemelichev et al. (1984). In fact, integral points of will exists as long as the polytope is non-empty and the 1-marginals are integral. To check if is nonempty over the reals (i.e., LP feasibility), all we have to do is check whether the sum of the entries in the three margins are equal.

The tables inside will be the state space for the game. Next, in order to detect whether we have an integral solution of , we can use another property of axial transportation polytopes, namely we prove that there exists an explicit Gröbner basis for the toric ideal of axial transportation polytopes which will give an integer point in if one exists. As explained in Sturmfels (1996) one can rely on the Gröbner basis elements as moves, which move us from one feasible integer point, a table, in to another. We stress that although Gröbner bases have been criticized as having too many elements and potentially many elements of large entries (hence being hard to compute), in our situation we have nicer structure that aids to practical performance. Unlike prior situations our Gröbner bases can be generated on the fly and their elements have entries only due to the special structure of axial transportation polytopes. We move from table to table that satisfy same margins. In other words, we get around the nasty structure of an arbitrary polyhedron via the canonical embedding as a face of an axial transportation polyhedron (which is a larger polyhedron, but it is simpler). By our results in Section Integer Feasibility Testing is a Game on Tables the pivots will find a feasible point in if and only if one exists.

Let us explain next more about Gröbner bases and the way to always find an initial table. While general transportation polytopes can be as nasty as arbitrary polytopes, axial transportation polytopes have several noteworthy computational advantages. First, checking real feasibility is trivial, the polytope is feasible if and only if the sum of the entries in each of the 1-margins equal the same value. More strongly, there will be an integer solution if the margins are integer.

A special case of axial transportation polytopes that is familiar to most readers, consists of 2-way transportation polytopes. They appear in most college-level optimization courses as bipartite network-flow problems. The Hitchcock transportation problem: , s.t. . An initial feasible solution can be obtained from a greedy procedure for certain sequences of the variables, the northwest-corner rule. This algorithm proceeds along a sequence of entries and maximizes each variable in turn with respect to the margins that bound it. Its running time is . In general the algorithm constructs only a feasible, not an optimal solution. Fortunately, Hoffman (1963) gave a sufficient and necessary condition on such that the algorithm always constructs an optimal solution for arbitrary demand and supply vectors and and cost vector . But for some cost vectors and special sequences the solution will be optimal. Sequences and cost matrices which fulfill the property above are called Monge sequences.

For this paper, we need to find an initial integer table efficiently for -way axial transportation polytopes. The good news is that, once again, a very similar greedy algorithm for the classical -way problem explained above can be applied to obtain a feasible solution for this more general case. In pseudocode this algorithm will read as follows in the case of a 3-way axial transportation polytope: Take an arbitrary order of the variables, say the sequence , and perform the subsequent greedy algorithm:

For to do

  1. Set

Again, given a sequence of triples of indices, this greedy algorithm maximizes each variable of in turn. When the algorithm ends it will give always a feasible solution, in fact an integer solution when the margins are integer. In Rudolf (1998) we have a necessary and sufficient condition on and the cost matrix which guarantees that the solution is in fact an optimal LP solution for costs associated to each entry:

Lemma 1.

The generalized north-west rule algorithm finds a feasible solution for the three-dimensional axial transportation problem for all right-hand-side vectors , , and whose sum of entries are equal. The solution is integer if the vectors are integer. Moreover if there is a cost matrix cost matrix , which is a three-dimensional Monge sequence in the sense of Rudolf (1998), then the solution found is an optimal linear programming solution for the minimization problem.

In the rest of this section, we will briefly introduce the notion of Gröbner bases relevant for our project problems. Recall a polynomial ideal is a set of polynomials in that satisfies two properties: (1) If are in then (2) If and then is in . a Gröbner basis of an ideal is a special finite generating set for with special computational properties. Their computational powers include the ability to answer membership questions for the ideal, computing intersections of ideals, computing projections of ideals, etc. Gröbner bases in general can be computed with the well-known Buchberger algorithm. We are only interested in special kinds of ideals, called toric ideals whose Gröbner bases are better behaved: Given a matrix with integer entries, the toric ideal is the ideal generated by the binomials of the form such that . Gröbner bases of toric ideals have been explored in the literature (see Sturmfels (1996); De Loera et al. (2013); Cox et al. (2005) and references therein). If we find a Gröbner basis for , it is well known that the vectors will have the following properties. Let be any polytope that could be defined by the matrix and by a choice of an integral right-hand-side vector . If we form a graph whose vertices are the lattice points of and we connect any pair of them by an edge if there is a vector in such that with , then the resulting graph is connected (Sturmfels, 1996, Chapter 4). Moreover if we orient the edges of this graph according to a term order we used to compute the Gröbner basis above (where the tail of an edge is bigger than its head) this directed graph will have a unique sink. Thus from any lattice point of there is an“augmenting” path to this sink. We will call the process of traversing such an augmenting path a reduction. Moreover, we will refer to the elements in as moves. It has been shown in Hoşten and Sullivant (2002) that while toric ideals of general transportation polytopes are as nasty as those for general polytopes, axial transportation polytopes enjoy a rich decomposable structure that essentially allow us to build Gröbner bases from the Gröbner bases of their slices. For instance, for 2-way tables we know everything about the Gröbner bases of their toric ideals

Theorem 2.

Let be the -matrix which is the matrix of the linear transformation that computes the row and column sums of a given -way table. Then is (totally) unimodular and hence its minimal universal Gröbner basis consists of its circuits. These circuits are -way tables whose row and column sums are zero and with entries in of minimal support.

For axial transportation polytope of size , can we find a similarly nice Gröbner basis for some term order. We explain several ways next.

Integer Feasibility Testing is a Game on Tables

We have seen that from Theorem 1 the (integer) tables with specified margins in the construction represent all the lattice solutions of the original IFP. Those will be the states of the game. In order to check the integer feasibility of we need to have a Gröbner basis (test set) of the axial transportation polyhedron such that the normal form of the North-West-corner rule integer initial solution is a feasible solution of (if such a solution exists). Of course, such a Gröbner basis exists right away: a Gröbner basis of the axial transportation arrays with respect to an elimination term order where all variables corresponding to the ”forbidden” entries of the arrays are bigger than the ”enabled” entries will do the job. In principle, we are done. However in the rest of the section we explain how to find a more efficient solution avoiding to use Buchberger algorithm.

We construct such a Gröbner basis building from the Gröbner bases of -way transportation problems slices of -way tables (see Theorem 2 and discussion there). Moreover, we will prove that we do not need to explicitly compute and store this Gröbner basis in advance (which is a very large set of vectors). It is enough to compute an element of the Gröbner basis ”on the fly” that will improve the current feasible solution. For our construction we will follow the ideas presented in Hoşten and Sullivant (2002). There a similar construction was given for any decomposable statistical problem. Fortunately, the -way axial problems are decomposable, so we can use those techniques. But we will do this in a slightly more general way. The first set of moves that will make up a Gröbner basis is obtained as follows. Let be, for a fixed index value , the 2-way transportation problem defined as

Let be different Gröbner bases of -way transportation problems. Note that if the table with entries is a Gröbner basis element, then for each fixed the -way table with entries and for is a valid move for the transportation problem . Now let be the set of moves obtained this way from all elements in for all values of . The following theorem is a modification of Theorem 4.13 in Hoşten and Sullivant (2002), we omit its proof here:

Theorem 3.

Let be Gröbner bases of the -way transportation problem. Let be the term order on the entries of the axial transportation problem where and the entries in the th horizontal slice are ordered with respect to the term order . Then is a Gröbner basis with respect to .

There is a second set of moves for the original transportation problem obtained from 2-way transportation problems. Now we will describe these moves. This time let be the -way planar transportation problem. Let be a Gröbner basis for this problem. Suppose is an element in where and are nonnegative tables with entries and . We can “lift” such an element to a move for the original problem as follows. First note that is homogeneous, i.e., . Second because we are in the setting of a -way transportation problem, for each we have . Hence we can represent as

Here we allow that some indices repeated if the corresponding entry (or ) in the table is bigger than one. Now given a sequence of indices (again repetitions are allowed) we get a move for the transportation problem :

We let to be the set of all moves obtained from all Gröbner basis elements in using all possible liftings. Similarly we can define . Now we claim we can put together , , and to get a Gröbner basis for the toric ideal of -way axial transportation problem. However, we first need to describe the appropriate term order.

Given a -way table we can compute “projection” tables (marginals) in any axial direction. For instance would be the -way table whose entry is .

Lemma 2.

Let and be term orders for and planar tables and let be term orders forthe planar tables. If is the term order for -way tables described in Theorem 3, then the relation on such -way tables given by X X’ if

is a term order.

The proof of Lemma 2 is in the Appendix.

Theorem 4.

Let and be Gröbner bases for the and planar transportation problems with respect to the term orders and , respectively. Also let be Gröbner bases for planar transportation problems with respect to the term orders . Let be the term order for the tables given in Theorem 3. Then the set

is a Gröbner basis for the -way axial transportation problems with respect to the term order of Lemma 2.

Now we are ready to construct a Gröbner basis for -way axial transportation problems with which we will solve the integer feasibility problem for any polytope after it has been encoded as a -way axial transportation polytope . In order to do this we will describe term orders , and and then use Theorem 4. Recall that by the construction of , the tables corresponding to the feasible solutions of will be a face given by forbidden entries which need to be set to zero. If is such a table where all the enabled entries are positive, then we get forbidden entries for tables , which are forced to be equal to zero. Also each horizontal slice will have its own forbidden entries. Let , and be these forbidden entries, and let , , and be their complements, namely the enabled entries. We let be the weight vector where if and if . We define and in a similar way. We define the term order to be any elimination term order where (and the entries and within themselves are ordered in an arbitrary but fixed way) which refines the ordering giving by . In other words, given two tables and , if the weight of is bigger than the weight of then we declare . In the case of equality, we resort to the elimination term order that breaks the tie. Note that if the support of is contained in and that of is not, then we immediately declare . We define and similarly in which the forbidden entries are eliminated.

Input: A rational polytope presented in its representation .

Output: YES or NO depending on whether contains an integer lattice point.

  1. Compute the encoding of as a face of a -way axial transportation polytope .

  2. Use the North-West-corner rule to find an initial table that is a feasible integer solution in .

  3. Use Gröbner basis elements with respect to constructed above to get the unique sink . If the weight of or of is not zero, output NO.

  4. Otherwise, using the Gröbner basis elements in and of weight zero (and their liftings) generate a set of tables such that and are the set of -way and tables with the same row and column sums as and , and with their support in and respectively.

  5. For each reduce just using the Gröbner basis elements in to obtain . If the weight of with respect to is zero for all , output YES. If no in gives rise to such an , output NO.

Algorithm 1 Integer Feasibility Testing Game
Theorem 5.

Algorithm 1 solves the integer feasibility of any rational polytope by first transforming it into an 3-way axial transportation polytope with specific 1-margins and finding a sequence of Gröbner basis moves from an initial 3-way array in to another 3-way array in with zeros in specified entries if one exists (equivalent to finding an integer solution for ).

Learning to Play Games on 2-way Tables

As a proof of concept, we now describe our method of playing table games using reinforcement learning on the polyhedral Gröbner bases systems we presented. Here we only consider the simpler special case of 2-way tables because we have an explicit list of all Gröbner basis moves already (See Theorem 2). Similar ideas can be extended to 3-way axial transportation polytopes. In this family, the state space consists of tables of a 2-way transportation polytope with fixed 1-margins,

The action space is the Gröbner basis generators that connect any two tables in . They are computed using Theorem 2. It is noteworthy that the action space can be defined with more flexibility because one can consider the minimum set of actions that connects all tables, or with more complicated moves that are linear combinations of the minimum moves. There is a trade-off between the action space and the efficiency of solving the game. We refine our action space as follows.

Note that any element of the action space is generated as integer combination of the Gröbner basis of Theorem 2, which suggests that for any two states , there exists a path connecting using moves from . In the most general form, our table games can be formulated as follows.

where is the set of valid moves for (table) :

We pre-defined the terminating state of the game by specifying a set of entries such that as the terminating condition which mimics the forbidden entries of the axial transportation polytope computed in Theorem 5. The reward function penalizes every step when the goal state is not reached:

(1)

The reward function encourages a policy that can reach a terminating state with the minimum number of steps.

Reinforcement Learning Framework

Our RL framework is based on Twin Delayed DDPG (TD3 Fujimoto et al. (2018); Lillicrap et al. (2016)). TD3 is an off-policy method which requires a replay buffer to collect a set of transitions . There are two critic networks and and an actor network . Both critic networks are trained by minimizing the mean squared Bellman error:

The target is defined as the minimum of two fixed target critic networks to prevent overestimation of the function.

Structured Action Prediction

The interaction between the actor and the environment. The actor network
Figure 1: The interaction between the actor and the environment. The actor network predicts a continuous move , which is then projected to a discrete in the set before being applied to the environment.

Due to the fact that our action space is discrete, one natural way is to predict the action as a classification problem at each time step. However, its feasibility is hindered by the large number of actions to compute and store in advance. We instead make the actor network predict a continuous action . Then, we obtain the integral solution by projecting to the feasible set of discrete actions encoded by the following integer program.

s.t.
(2)

and we obtain the projected discrete action by . We add constraint (2) to exclude the action with all zeros. An illustration of the process is shown in Figure 1. Splitting into two binary variables and not only speeds up the projection but also provides more flexibility on the constraints of the actions. For instance, we can bound the number of non-zero elements in the action by adding to the system.

To train the critic network, we consider the projection operator a deterministic part of the environment and directly learn . Similar strategies can be found in Dulac-Arnold et al. (2015). It is worthwhile to mention that one can also learn . The non-differentiable projection layer can be tackled by straight-through estimators Bengio et al. (2013); van den Oord et al. (2017). However, for every mini-batch update, calculating a target in the temporal difference learning requires the projection operations to obtain . The projection will become a bottleneck and significantly slows down the whole training process. We also observed in experiments that learning has no obvious performance gain. The actor network is updated for each mini-batch by gradient ascent

(3)

Learning from Demonstrations

Learning from demonstrations is an effective strategy to improve sample efficiency. Due the the large action space of our problem, we generate a set of demonstrations using a greedy algorithm to speed up the learning. We make sure that a fixed portion of samples in the mini-batch are drawn from the demonstration during each training step. Since the demonstrations generated by the greedy algorithm are not perfect, we adopt the Q-filter strategy Nair et al. (2017) to only enforce a supervised learning loss when the demonstrated action has a higher value than the actor’s action. Thus, the demonstration loss on mini-batch using demonstrated action can be summarized as:

The learning curve of our RL agent compared with the behavior cloning baseline in
(a)
The learning curve of our RL agent compared with the behavior cloning baseline in
(b)
The learning curve of our RL agent compared with the behavior cloning baseline in
(c)
Figure 2: The learning curve of our RL agent compared with the behavior cloning baseline in table games.
The learning curve of the RL agent compared to BC model on
(a)
The learning curve of the RL agent compared to BC model on
(b)
The learning curve of the RL agent compared to BC model on
(c)
Figure 3: The learning curve of the RL agent compared to BC model on tables.

Experiments

Data Collection

The goal entries in are selected randomly and remain fixed. The initial state is also randomly generated, and we randomly generate forbidden entries and fill in other entries with random numbers. To start an episode, we randomly initialize the starting table . To improve the training stability, we add lower and upper bounds on the 1-margins, i.e., . As a result, the 1-margins are fully specified by the initial state and remain the same throughout the episode. The moves of the demonstrating greedy algorithm are calculated state-by-state by selecting a valid move that maximizes the values of the entries in .

Network Architecture and Baseline

The actor and critics are parameterized by convolutional neural networks. networks consist of multiple convolutional blocks where each block has a convolution layer followed by batch normalization (BN) and ReLU. The number of blocks depends on the size of the board. All convolution layers maintain the same spatial dimensions. The final block of the actor network has a convolution layer followed by tanh. In the critic network, we apply a global average pooling layer before the final linear layer to predict the value.

We also train a baseline model on the demonstration set using behavior cloning (BC) Ross et al. (2010); Osa et al. (2018). The BC model is also a neural network with the same architecture as the actor network.

Training Details

We collect 100 demonstrations prior to the RL training and we sample of transitions from the demonstration buffer for every mini-batch update. We use Adam Kingma and Ba (2015) as the optimizer with a learning rate for both actor and critics. The discount factor is . We use a batch size of . The exploration of DDPG-styled algorithm is achieved by adding a Gaussian noise to the actor We use throughout the experiment.

The learning curve of the RL agent compared to BC model on
Figure 4: The learning curve of the RL agent compared to BC model on tables.

Results

To demonstrate the ability of our method to learn complex moves, we test our agent on 2-way table of various grid (table) sizes with increasing maximum elements. Larger maximum entries in the table would lead to longer horizons. The results on table games is shown in Figure 2. The mean negative episode length is shown in as the curve and the shaded region denotes the half standard deviation. The BC model can achieve relatively good performance but deteriorates as the upper bound on the entry increases. The RL model outperforms the BC model by a margin and the advantage becomes more significant with larger upper bounds. In Figure 3 we observe similar performance on games. We tested three 1-margin bounds . The BC model does not perform well even with . On the other hand, our RL agent is able to consistently reach the goal tables with relatively short paths. The results on tables are shown in Figure 4. The increase of the grid size also increases the action space exponentially. The RL agent can solve the problem whereas the BC model has a very low success rate and it takes longer steps to reach the goal state.

Generalization Test

Table 1: We test the trained model with margin bound on increased margins and compare the how the success rate varies. The success rates drop as we increase the bound, but the model is able to maintain a decent success rate which shows its generalization ability.

In this section we wonder if the model can solve games that are unseen in the training data. We set up the experiment by training the model on the fixed upper 1-margin bound with grid size and , and test its success rates on larger bounds with the same corresponding grid size. The results in Table 1 demonstrate that the trained model can solve games that it never experiences during the policy training step. This further suggests that the model can successfully recognize patterns in the table and produces corresponding moves regardless of their magnitudes.

Conclusions and Future Work

We proposed an algorithm that converts the integer feasibility problem into a game on tables. We formulated the table game as a reinforcement learning problem and developed novel techniques to tackle the algebraic structure of the action space. The experimental results on 2-way tables show the potential of solving integer feasibility problems using the Gröbner bases approach. Training on 3-way tables is the next stage of our work. Since IFP is an NP-complete problem, in some cases our game algorithm may have to visit all (exponentially many) tables before concluding there is no solution. It is an open problem to explore ways to cut the search as our games have no solution sometimes.

References

  • Y. Bengio, N. Léonard, and A. C. Courville (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. CoRR abs/1308.3432. Cited by: Structured Action Prediction.
  • D. P. Bertsekas and J. N. Tsitsiklis (1991) An analysis of stochastic shortest path problems. 16 (3), pp. 580–595. External Links: ISSN 0364-765X Cited by: Introduction.
  • D. P. Bertsekas and H. Yu (2013) Stochastic shortest path problems under weak conditions. Cited by: Introduction.
  • D. A. Cox, J. Little, and D. O’Shea (2005) Using algebraic geometry. Second edition, Graduate Texts in Mathematics, Vol. 185, Springer, New York. External Links: ISBN 0-387-20706-6, MathReview Entry Cited by: Basic notions: polyhedra and Gröbner bases.
  • J. A. De Loera, R. Hemmecke, and M. Köppe (2013) Algebraic and geometric ideas in the theory of discrete optimization. MOS-SIAM Series on Optimization, Vol. 14, Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA; Mathematical Optimization Society, Philadelphia, PA. External Links: ISBN 978-1-611972-43-6, MathReview (Alexander I. Barvinok) Cited by: Basic notions: polyhedra and Gröbner bases.
  • J. A. De Loera and S. Onn (2004a) All rational polytopes are transportation polytopes and all polytopal integer sets are contingency tables. In Integer Programming and Combinatorial Optimization, 10th International IPCO Conference, New York, NY, USA, June 7-11, 2004, Proceedings, G. L. Nemhauser and D. Bienstock (Eds.), Lecture Notes in Computer Science, Vol. 3064, pp. 338–351. External Links: Link, Document Cited by: Introduction, Basic notions: polyhedra and Gröbner bases.
  • J. A. De Loera and S. Onn (2004b) The complexity of three-way statistical tables. SIAM J. Comput. 33 (4), pp. 819–836. External Links: Link, Document Cited by: Basic notions: polyhedra and Gröbner bases.
  • G. Dulac-Arnold, R. Evans, P. Sunehag, and B. Coppin (2015) Reinforcement learning in large discrete action spaces. CoRR abs/1512.07679. Cited by: Structured Action Prediction.
  • S. Fujimoto, H. van Hoof, and D. Meger (2018) Addressing function approximation error in actor-critic methods. CoRR abs/1802.09477. Cited by: Reinforcement Learning Framework.
  • X. Guo, S. Singh, R. L. Lewis, and H. Lee (2016) Deep learning for reward design to improve monte carlo tree search in ATARI games. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9-15 July 2016, S. Kambhampati (Ed.), pp. 1519–1525. External Links: Link Cited by: Introduction.
  • A. J. Hoffman (1963) On simple linear programming problems. In Proc. Sympos. Pure Math., Vol. VII, pp. 317–327. External Links: MathReview (E. M. L. Beale) Cited by: Basic notions: polyhedra and Gröbner bases, Basic notions: polyhedra and Gröbner bases.
  • S. Hoşten and S. Sullivant (2002) Gröbner bases and polyhedral geometry of reducible and cyclic models. J. Combin. Theory Ser. A 100 (2), pp. 277–301. External Links: ISSN 0097-3165, Document, Link, MathReview (Joachim Krauth) Cited by: Basic notions: polyhedra and Gröbner bases, Integer Feasibility Testing is a Game on Tables, Integer Feasibility Testing is a Game on Tables.
  • D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, Cited by: Training Details.
  • T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2016) Continuous control with deep reinforcement learning.. In ICLR, Cited by: Reinforcement Learning Framework.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. A. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis (2015) Human-level control through deep reinforcement learning.. Nature 518 (7540), pp. 529–533. Cited by: Introduction.
  • A. Nair, B. McGrew, M. Andrychowicz, W. Zaremba, and P. Abbeel (2017) Overcoming exploration in reinforcement learning with demonstrations. CoRR abs/1709.10089. Cited by: Learning from Demonstrations.
  • T. Osa, J. Pajarinen, G. Neumann, J. A. Bagnell, P. Abbeel, and J. Peters (2018) An algorithmic perspective on imitation learning. CoRR abs/1811.06711. Cited by: Network Architecture and Baseline.
  • S. Ross, G. J. Gordon, and J. A. Bagnell (2010) No-regret reductions for imitation learning and structured prediction. CoRR abs/1011.0686. External Links: Link Cited by: Network Architecture and Baseline.
  • R. Rudolf (1998) On Monge sequences in -dimensional arrays. Linear Algebra Appl. 268, pp. 59–70. External Links: ISSN 0024-3795, Document, Link, MathReview Entry Cited by: Basic notions: polyhedra and Gröbner bases, Lemma 1.
  • D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. P. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis (2017) Mastering the game of go without human knowledge. Nat. 550 (7676), pp. 354–359. External Links: Link, Document Cited by: Introduction.
  • B. Sturmfels (1996) Gröbner bases and convex polytopes. University Lecture Series, Vol. 8, American Mathematical Society, Providence, RI. External Links: ISBN 0-8218-0487-1, Document, Link, MathReview (P. Schenzel) Cited by: Basic notions: polyhedra and Gröbner bases, Basic notions: polyhedra and Gröbner bases.
  • A. van den Oord, O. Vinyals, and k. kavukcuoglu (2017) Neural discrete representation learning. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Cited by: Structured Action Prediction.
  • O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, J. Oh, D. Horgan, M. Kroiss, I. Danihelka, A. Huang, L. Sifre, T. Cai, J. P. Agapiou, M. Jaderberg, A. S. Vezhnevets, R. Leblond, T. Pohlen, V. Dalibard, D. Budden, Y. Sulsky, J. Molloy, T. L. Paine, Ç. Gülçehre, Z. Wang, T. Pfaff, Y. Wu, R. Ring, D. Yogatama, D. Wünsch, K. McKinney, O. Smith, T. Schaul, T. P. Lillicrap, K. Kavukcuoglu, D. Hassabis, C. Apps, and D. Silver (2019) Grandmaster level in starcraft II using multi-agent reinforcement learning. Nat. 575 (7782), pp. 350–354. External Links: Link, Document Cited by: Introduction.
  • V. A. Yemelichev, M. M. Kovalëv, and M. K. Kravtsov (1984) Polytopes, graphs and optimisation. Cambridge University Press, Cambridge. Note: Translated from the Russian by G. H. Lawden External Links: ISBN 0-521-25597-X, MathReview Entry Cited by: Introduction, Basic notions: polyhedra and Gröbner bases.

Appendix

In this appendix we include technical proofs of some of the main theorems. Including an explanation of how any rational polyhedron can be rewritten as a face of a 3-way axial transportation polytopes with fixed 1-marginals.

Proofs of Lemma 2 and Theorem 5

Proof. (Lemma 2).

If it is clear that either or , and the relation is compatible with adding the same table to and . So we just need to show that is transitive. So let and . There is a total of nine possibilities to be checked depending on how the tables are aligned, so we give one of these for illustration. Suppose because but , and also suppose that because . Then . Hence . ∎

Proof.

(of Theorem 5) First we show that if is feasible then the -weight of and the -weight of must be zero. Suppose is a table corresponding to a feasible solution of . Then clearly the -weight of and the -weight of are zero. But then if one of these weights for were bigger than zero we would get a contradiction to the assumption that is the unique sink obtained by reduction of with respect to . Now because the -weight of both and are zero, and since is the unique sink of -way tables with row and column sums equal to those of , there is a sequence of Gröbner basis elements in which reduce to . These elements must have -weight zero. This means that we can reverse these moves and use their liftings to obtain a table such that . Note that , and we can use the same argument above to conclude that one can reach to a table using lifted elements of with -weight zero such that . Of course, we also have . The table will be an element of in Step 4 above, if we use the (reversed) elements of and of respective weights zero to generate all possible -way and tables with the same row and column sums as and . By this construction, for each the horizontal slices and have identical row and column sums. Since the -weight of is zero, if the same weight of is not zero, one can find a sequence of Gröbner basis elements in (and hence in ) to obtain where the -weight of is zero. Because and , the table corresponds to a feasible solution of . ∎

Preprocessing: Coefficient Reduction

Let where is an integer matrix and is an integer vector. We represent it as a polytope , in polynomial-time, with a -valued matrix of coefficients, as follows. Consider any variable and let be the maximum number of bits in the binary representation of the absolute value of any . We introduce variables , and relate them by the equations . The representing injection is defined by , embedding as . Consider any term of the original system. Using the binary expansion with all , we rewrite this term as . It is easy to see that this procedure provides a new representation, and we get the following.

Lemma 3.

Any rational polytope is polynomial-time representable as a polytope with -valued defining matrix .

Representing Polytopes as 3-way Transportation Polytopes with 1-marginals and forbidden entries

Let where is an integer matrix and is an integer vector: we assume that is bounded and hence a polytope, with an integer upper bound (which can be derived from the Cramer’s rule bound) on the value of any coordinate of any .

For each variable , let be the largest between the sum of the positive coefficients of over all equations and the sum of absolute values of the negative coefficients of over all equations,

Let , , and . We now describe how to construct vectors , and a set of triples - the “enabled”, non-“forbidden” entries - such that the polytope is represented as the corresponding transportation polytope of arrays with plane-sums and only entries indexed by enabled,

We also indicate the injection giving the desired embedding of the coordinates as the coordinates and the representation of as (see paragraph following Theorem 1).

Basically, each equation will be encoded in a “horizontal plane” (the last plane is included for consistency and its entries can be regarded as “slacks”); and each variable , will be encoded in a “vertical box” , where is the natural partition of with , namely with .

Now, all “vertical” plane-sums are set to the same value , that is, for . All entries not in the union of the variable boxes will be forbidden. We now describe the enabled entries in the boxes; for simplicity we discuss the box , the others being similar. We distinguish between the two cases and . In the first case, ; the box, which is just the single line , will have exactly two enabled entries for suitable , to be defined later. We set , namely embed . We define the complement of the variable to be (and likewise for the other variables). The vertical sums then force , so the complement of is also embedded. Next, consider the case . For each , the line (respectively, ) will contain one enabled entry (respectively, . All other entries of will be forbidden. Again, we set , namely embed ; it is then not hard to see that, again, the vertical sums force and for each . Therefore, both and are each embedded in distinct entries.

To clarify the above description it is helpful to visualize the matrix whose entries are the vertical line-sums .

Next we encode the equations by defining the horizontal plane-sums and the indices above as follows. For , consider the th equation . Define the index sets and , and set . The last coordinate of is set for consistency with to be . Now, with the complement of variable as above, the -th equation can be rewritten as

To encode this equation, we simply “pull down” to the corresponding th horizontal plane as many copies of each variable or by suitably setting or . By the choice of there are sufficiently many, possibly with a few redundant copies which are absorbed in the last hyperplane by setting or . For instance, if , the first variable has as above, its coefficient in the fourth equation is positive, its coefficient in the seventh equation is negative, and for , then we set (so embedding as ), , and .

This way, all equations are suitably encoded, and Theorem 1 follows from the construction outlined above and Lemma 3.

Theorem 1 (Restated).

Any rational polytope is polynomial-time representable as a plane-sum entry-forbidden -way transportation polytope