A WEIGHTED HP MODEL FOR PROTEIN FOLDING WITH DIAGONAL CONTACTS∗
Hans-Joachim Böckenhauer¹ and Dirk Bongartz²
Abstract. The HP model is one of the most popular discretized models for attacking the protein folding problem, i.e., for the computational prediction of the tertiary structure of a protein from its amino acid sequence. It is based on the assumption that interactions between hydrophobic amino acids are the main force in the folding process.
Therefore, it distinguishes between polar and hydrophobic amino acids only and tries to embed the amino acid sequence into a two- or three-dimensional grid lattice such as to maximize the number of contacts, i.e., of pairs of hydrophobic amino acids that are embedded into neighboring positions of the grid.
In this paper, we propose a new generalization of the HP model which overcomes one of the major drawbacks of the original HP model, namely the bipartiteness of the underlying grid structure which severely restricts the set of possible contacts. Moreover, we introduce the (biologically well-motivated) concept of weighted contacts, where each contact gets assigned a weight depending on the spatial distance between the embedded amino acids. We analyze the applicability of existing approximation algorithms for the original HP model to our new setting and design a new approximation algorithm for this generalized model.
Mathematics Subject Classification. 68W25, 92C40.
Keywords and phrases. Protein folding, HP model, approximation algorithms.
∗ This work was partially supported by SNF grant 200021-109252/1.
¹ Department of Computer Science, ETH Zurich, Switzerland; [email protected]
² Gymnasium St. Wolfhelm, Schwalmtal, Germany; [email protected]
© EDP Sciences 2007
Article published by EDP Sciences and available at http://www.rairo-ita.org or http://dx.doi.org/10.1051/ita:2007023
1. Introduction
Determining the three-dimensional structure of proteins is one of the most important and challenging problems in computational biology, with many applications, e.g., in the area of pharmaceutics and drug design.
Proteins are chains of smaller molecular entities, the so-called amino acids, and they are expected to fold in space according to the chemical characteristics of these amino acids [2, 3]. In particular, one distinguishes between hydrophobic and hydrophilic (polar) amino acids, and interactions between hydrophobic amino acids are assumed to be the driving force in the folding of proteins.
To study the process of protein folding incorporating these hydrophobic forces, Dill et al. [7, 8] introduced the HP model. In this model, one searches for an embedding of a protein, given in terms of a string over 0 and 1 (encoding polar and hydrophobic amino acids, respectively), into a grid. Counting the number of ones placed on adjacent positions in the grid (that are not already adjacent in the string) gives a measure of the stability of the folding. This setting was algorithmically analyzed by Hart and Istrail [9], who proposed a 4-approximation for this problem, which was further improved by Newman [10], who obtained a 3-approximation for two-dimensional grids. Further results on this topic include other lattice types, heuristic algorithms, and extended models. An overview of the existing literature is given, for instance, in the survey paper [5].
A major drawback of the original HP model is the bipartiteness of the underlying grid. One possible approach for overcoming this drawback was proposed in [1], where the embedding of protein sequences into a triangular grid lattice was studied.
Another approach was taken in [4], where the authors studied the HP problem on a rectangular grid with additional diagonals. While this removes the bipartiteness of the underlying grid, it introduces a new weakness to the model. In particular, 45°, 90°, 135°, and 180° become feasible angles between two consecutive amino acids in this model. Even though the angles in every previously considered underlying lattice for the HP problem do not perfectly fit the chemical reality, especially the 45° angles seem to be problematic.
On the other hand, binding forces are not restricted to possible folding angles of the molecule. Therefore, to avoid these sharp angles and the bipartiteness as well, we will now introduce a kind of two-level lattice, where the folding is restricted to the grid of the original HP model, but the binding forces are also effective along the diagonals in this grid structure.
Moreover, we want to account for the different distance between two embedded amino acids along a diagonal edge or along a horizontal/vertical edge of the grid.
More formally, for a given folding, we will evaluate this folding by weighting each adjacency of two hydrophobic amino acids along a horizontal or vertical edge of the underlying grid lattice by 1 and each adjacency of two hydrophobic amino acids along a diagonal edge by a parameter α, where 0 ≤ α ≤ 1. We refer to the resulting model as the α-DC-HP2d model in the following. The resulting maximization problem is studied in this paper.
We first give some basic notions and observations in Section 2 and present some upper bounds on the overall contact weight that might be achieved by any conformation in Section 3. After that, we turn our attention to approximation algorithms and investigate the performance of a classical algorithm for the HP model in the context of this model and of a specially designed algorithm in Sections 4.2 and 4.3, respectively. We conclude the paper with a discussion of the proposed algorithms and the presentation of some open problems in Section 5.
2. Preliminary notions and observations
Before we start with the formal description of the problem, we first define the two underlying lattice types.
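Definition 2.1. The two-dimensional grid lattice is the infinite graph $L_\square = (V, E_\square)$ with vertex set $V = \mathbb{Z}^2$ and edge set $E_\square = \{\{x, x'\} \mid x, x' \in \mathbb{Z}^2, |x - x'| = 1\}$.
The two-dimensional grid lattice with diagonals is the infinite graph $L_\boxtimes = (V, E_\boxtimes)$ with vertex set $V = \mathbb{Z}^2$ and edge set $E_\boxtimes = \{\{x, x'\} \mid x, x' \in \mathbb{Z}^2, |x - x'| \le \sqrt{2}\}$.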
We consider the input of our problem, namely a chain of amino acids, as a string over {0,1} encoding the hydrophobicity of each amino acid. We will use a 1 to denote a hydrophobic amino acid and a 0 to denote a polar (hydrophilic) amino acid.
Definition 2.2. Let $p = p_1 \dots p_m$ be a string of length $m$ over the alphabet {0,1}, and let $L = (V, E)$ be one of the two lattices from Definition 2.1, i.e., either the grid lattice $L_\square$ or the grid lattice with diagonals $L_\boxtimes$. A conformation of $p$ into $L$ is an injective function $\varphi : \{1, \dots, m\} \to V$ from the positions of the string to the vertices of the lattice that assigns adjacent positions in $p$ to adjacent vertices in $L$, i.e., $\{\varphi(i), \varphi(i+1)\} \in E$ for all $1 \le i \le m-1$. These edges $\{\varphi(i), \varphi(i+1)\} \in E$ for $1 \le i \le m-1$ are called binding edges.
An edge $\{x, x'\}$ of $L$ is called a contact edge if it is no binding edge, but there exist $i, j \in \{1, \dots, m\}$ such that $\varphi(i) = x$, $\varphi(j) = x'$, and $p_i = p_j = 1$.
Whenever a conformation of a string is given, we will call a vertex of the lattice to which there was assigned a one [zero] by this conformation a one-vertex [zero-vertex] or simply a one [zero]. The vertices of the lattice which are not occupied by a one or a zero are called unused. A binding edge connecting a one with a zero will be called an alternation edge, and a non-binding edge adjacent to a one that is no contact edge is called a loss edge.
A further refinement of our model concerns the intensity of the chemical forces along the different types of edges in the grid lattice with diagonals. As the adjacency of two amino acids via a diagonal edge implies a greater distance, and thus a smaller chemical binding force, than between two amino acids which are connected by a horizontal/vertical edge, we introduce an additional parameter α (0 ≤ α ≤ 1) to measure this loss of binding power relative to the binding power given by horizontal/vertical edges.
In particular, we will count a contact weight of 1 for each horizontal/vertical contact edge and a contact weight of α for each diagonal contact edge.
Now, the resulting problem is defined as follows.
Definition 2.3. The protein folding problem in the 2-dimensional α-DC-HP2d model, denoted as α-DC-HP2d problem, is the following optimization problem:
Input: A string $p = p_1 \dots p_m$ over the alphabet {0,1} and a parameter α with 0 ≤ α ≤ 1.
Feasible solutions: All conformations of $p$ in the grid lattice $L_\square$.
Costs: The cost of a conformation $\varphi$ is the overall contact weight of all contact edges of $\varphi$ in the grid lattice with diagonals $L_\boxtimes$, i.e.,
$$\mathrm{cost}(\varphi) = \sum_{\text{contact edges in } L_\square} 1 \; + \sum_{\text{contact edges in } L_\boxtimes \setminus L_\square} \alpha.$$
Goal: Maximization.
Note that the conformation is restricted to be an embedding in the grid lattice $L_\square$, but the overall contact weight is computed with respect to the grid lattice with diagonals $L_\boxtimes$, i.e., we also allow for diagonal contacts.
Using this definition, we have removed the possibility of sharp 45° angles by restricting the conformation to $L_\square$, which is the lattice also used in the original HP model, and furthermore we got around the weakness of bipartiteness of $L_\square$ by counting contacts in the lattice with diagonals $L_\boxtimes$.
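The cost function of Definition 2.3 can be illustrated by a minimal sketch. The function name `contact_weight`, the 0-based list `coords` of grid points, and the representation of the input as a Python string are assumptions made for this illustration only; they do not appear in the paper. The sketch counts every contact edge once, as in the definition.

```python
from itertools import combinations

def contact_weight(p, coords, alpha):
    """Overall contact weight of a conformation; each contact edge counts once.

    p      -- string over {'0', '1'} (hypothetical input encoding)
    coords -- list of (x, y) grid points, coords[i] is the image of position i
    alpha  -- weight of a diagonal contact edge, 0 <= alpha <= 1
    """
    if len(p) != len(coords):
        raise ValueError("one grid point per symbol is required")
    if len(set(coords)) != len(coords):
        raise ValueError("a conformation must be injective")
    for (x1, y1), (x2, y2) in zip(coords, coords[1:]):
        if abs(x1 - x2) + abs(y1 - y2) != 1:
            raise ValueError("consecutive symbols must be grid neighbors")

    total = 0.0
    ones = [i for i, c in enumerate(p) if c == "1"]
    for i, j in combinations(ones, 2):        # i < j
        dx = abs(coords[i][0] - coords[j][0])
        dy = abs(coords[i][1] - coords[j][1])
        if dx + dy == 1 and j - i > 1:
            total += 1.0                      # horizontal/vertical contact edge
        elif dx == 1 and dy == 1:
            total += alpha                    # diagonal contact edge
    return total
```

For example, folding `"1001"` around a unit square yields one vertical contact edge, while the L-shaped fold of `"101"` yields a single diagonal contact edge of weight α.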
Please note that, although we defined the α-DC-HP2d problem with respect to computing the overall contact weight of contact edges, we will for convenience locally count, for every one-vertex of the lattice, the number of incident contact edges, and we will call these incident contact edges the contacts of this one¹. By summing up the number of contacts over all ones, we will count every contact edge exactly twice. Since we will use this way of counting both for the contacts achieved by our algorithm and for the number of hypothetically possible contacts, this will not affect the approximation ratio.
For an input string $p$ over the alphabet {0,1}, we denote a maximal block of a single one as a singleton, a maximal block of two ones as a pair, and a maximal block of three ones as a triple in $p$.
We will investigate our algorithms with respect to asymptotic approximation only. Thus, we can in particular assume without loss of generality that input strings for the α-DC-HP2d problem will start and end with zeros.
Moreover, as we are looking for a folding in the grid lattice, also the bipartiteness of this grid structure will play a role. Therefore, for a string $p$ over {0,1}, let odds(p) and evens(p) denote the number of ones on odd and even positions in $p$, respectively.
Let µ = 2·min{odds(p), evens(p)}. Thus, µ denotes the maximal number of horizontal/vertical contact edges that could be established for $p$. Note that, due to the bipartiteness of the grid lattice, horizontal/vertical contacts can occur between ones with different parity only. Hence, at most 2·min{odds(p), evens(p)} contact edges can be horizontal/vertical. (This is the upper bound also provided by [9, 10].) Therefore, since we count every contact edge twice, the number of possible horizontal/vertical contacts is bounded by 2µ.
¹ Please note that the contact edges according to our definition are sometimes called contacts in the existing literature on the HP model.
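The quantities odds(p), evens(p), and µ are simple to compute. The following sketch assumes 1-based positions as in Definition 2.2; the function name `odds_evens_mu` is hypothetical.

```python
def odds_evens_mu(p):
    """Return (odds(p), evens(p), mu) for a string p over {'0', '1'}.

    Positions are counted from 1, so the first symbol is on an odd position.
    """
    odds = sum(1 for pos, c in enumerate(p, start=1) if c == "1" and pos % 2 == 1)
    evens = sum(1 for pos, c in enumerate(p, start=1) if c == "1" and pos % 2 == 0)
    return odds, evens, 2 * min(odds, evens)
```

A strictly alternating string such as `"01010"` has all its ones on even positions, so µ = 0 and no horizontal/vertical contact is possible at all.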
Clearly, as the α-DC-HP2d problem for α = 0 is exactly the original HP problem, the α-DC-HP2d problem inherits its hardness from the hardness result for the two-dimensional HP problem [6].
Theorem 2.4. The α-DC-HP2d problem is NP-hard.
We would like to point out that, concerning the analysis of loss edges, we do not need to distinguish between vertices of the grid which are labeled by zero and those which are completely unlabeled. Thus, for convenience, we can assume that unlabeled vertices are labeled with zero for our considerations.
3. Upper bounds on the overall contact weight
In this section we will establish some upper bounds on the overall contact weight that could be achieved for the α-DC-HP2d problem.
The first bound easily results from counting the number of possible neighbors.
Lemma 3.1. Let $p \in \{0,1\}^*$ be a string starting and ending with a zero, let $\varphi$ be a conformation of $p$ in the grid lattice $L_\square$, and let $n$ denote the number of ones in $p$. Then the overall contact weight is at most $2\mu + 4\alpha n$.
Proof. Each one has at most 4 horizontal/vertical neighbors in the grid. Two of these neighboring positions are occupied with the adjacent positions in the input string and thus cannot serve as contacts. Therefore, we achieve the same bound on the number of horizontal/vertical contacts as in the original HP model, which is 2µ as discussed above. Moreover, there are at most 4 possible diagonal contacts incident to any 1-vertex that each contribute α to the overall contact weight.
A more sophisticated upper bound is based on the observation that a zero, which is adjacent to a one in the input string, will prevent the ones in its neighborhood from being incident to the maximal number of six contact edges.
To establish this result, we consider the number (and the weight) of loss edges that are unavoidable in the neighborhood of an alternation edge. Here, we follow the similar argument from [1, 4]. Informally, the neighborhood of an edge $e$ consists of the intersection of the sets of vertices that are adjacent to both of the endpoints of $e$ and of those edges that are adjacent to $e$, as shown in Figure 1.
Definition 3.2. Let $e = \{x, y\}$ be any edge in the grid lattice with diagonals $L_\boxtimes = (V(L_\boxtimes), E(L_\boxtimes))$. We define the neighborhood of $e$ as the subgraph $N(e) = (V', E')$ of $L_\boxtimes$, where
$$V' = \{v \in V(L_\boxtimes) \mid \{v, x\}, \{v, y\} \in E(L_\boxtimes)\} \cup \{x, y\} \quad \text{and} \quad E' = \{\{x, v\}, \{y, v\} \in E(L_\boxtimes) \mid v \in V'\}.$$
Figure 1. The neighborhood N(e) of the edge e = {x, y}, if e is a horizontal/vertical edge (a), or if e is a diagonal edge (b). The edges belonging to the neighborhood are shown as dotted lines.
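The neighborhood of Definition 3.2 can be enumerated directly from the definition. The sketch below is an illustration only: the helper names `adjacent` and `neighborhood` are hypothetical, and adjacency in the lattice with diagonals is expressed as Chebyshev distance 1 between distinct integer points, which is equivalent to a Euclidean distance of at most √2.

```python
def adjacent(u, v):
    """Adjacency of two distinct grid points in the lattice with diagonals."""
    return u != v and max(abs(u[0] - v[0]), abs(u[1] - v[1])) == 1

def neighborhood(x, y):
    """Vertex and edge set of N(e) for the edge e = {x, y}."""
    if not adjacent(x, y):
        raise ValueError("{x, y} is not an edge of the lattice with diagonals")
    # Every common neighbor of x and y lies in the 3x3 block around x.
    block = {(x[0] + dx, x[1] + dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)}
    vertices = {v for v in block if adjacent(v, x) and adjacent(v, y)} | {x, y}
    edges = {frozenset((end, v))
             for end in (x, y) for v in vertices if adjacent(end, v)}
    return vertices, edges
```

Under this reading of the definition, a horizontal/vertical edge has four common neighbors of its endpoints and a diagonal edge has two, which matches the two panels of Figure 1; note that the edge e itself is contained in the edge set.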
With the following lemma, we will analyze the neighborhood of a loss edge.
Lemma 3.3. Let $p$ be an input string for the α-DC-HP2d problem, let $\varphi$ be any conformation of $p$, and let $l = \{x, y\}$ be any loss edge of $\varphi$, where $y$ denotes the embedded 1-vertex.
(i) If $l$ is a horizontal/vertical edge and $y$ is a singleton, there are at most four alternation edges inside $N(l)$.
(ii) If $l$ is a horizontal/vertical edge and $y$ is not a singleton, there are at most three alternation edges inside $N(l)$.
(iii) If $l$ is a diagonal edge, there are at most two alternation edges inside $N(l)$.
Proof. Since both vertices $x$ and $y$ can be adjacent to at most two binding edges, it follows immediately that there are at most four alternation edges adjacent to $l$. But in this case, $y$ is forced to be a singleton. Otherwise, $y$ could be incident to at most one alternation edge. This immediately gives the proof for (i) and (ii).
For (iii), consider Figure 1b. There can be at most two alternation edges inside $N(l)$ here, since, if there are two alternation edges, the remaining vertices in $N(l)$ are labeled zero and one, respectively. Thus, none of the remaining edges in $N(l)$ can be alternation edges. Please note that also the diagonal edge crossing $l$ cannot be an alternation edge, since the embedding is restricted to the rectangular grid lattice only.
Lemma 3.4. If there exist two orthogonal alternation edges which are adjacent to a horizontal/vertical loss edge $l$ as shown in Figure 2, then there are
(i) at most three alternation edges inside $N(l)$, if the vertex $z$ is a one, and
(ii) at most two alternation edges inside $N(l)$, if the vertex $z$ is a zero.
Proof. The edge $f_1$ cannot be an alternation edge since otherwise the vertex $y$ would be incident to three binding edges, which proves (i). If $z$ is a zero, then both endpoints of $f_2$ are zeros, hence $f_2$ is no alternation edge, which proves (ii).
Lemma 3.5. Let $p = 0^+ b_1 0^+ b_2 0^+ \dots 0^+ b_k 0^+$ be an input string for the α-DC-HP2d problem for some $0 \le \alpha \le 1$, where $b_i \in \{1\}^+$ for $1 \le i \le k$, and let $n = \sum_{i=1}^{k} |b_i|$ be the number of ones in $p$. Then the overall contact weight in any conformation is at most $2n + 4\alpha \cdot n - 2k \cdot \min\{2/3, \alpha\}$.
Figure 2. A vertical loss edge l in the neighborhood of at most three alternation edges. (Labels in the figure: the vertices y and z and the edges e1, e2, f1, f2, and l.)
Figure 3. The neighborhood of an alternation edge, panels (a)-(j). The alternation edge e under consideration is depicted by a bold line, the possible further alternation edges in its neighborhood are depicted by bold dashed lines, and the loss edges in its neighborhood are shown as thin lines.
Proof. Since every one can have at most two horizontal/vertical contacts and four diagonal contacts, we know that the overall contact weight is at most 2n + 4α·n. Let ϕ be any conformation of p. We will now analyze the weight of the indispensable losses around the alternation edges of p. For this, we will investigate the loss edges in the neighborhood of any alternation edge e. We will distinguish ten cases according to the labeling of the vertices in this neighborhood N(e). These cases are depicted in Figure 3; all cases not shown in this figure are obviously symmetric to one of the shown cases.
Table 1. The maximal total weights of loss edges in the neighborhood of an alternation edge.
case                 | (a), (j)    | (b), (h) | (c), (i)    | (d), (f) | (e) | (g)
weight of loss edges | 1/2 + 3/2·α | 3/2·α    | 1/3 + 1/2·α | α        | 2/3 | 2·α
Every subfigure shows all edges between zero-vertices and one-vertices within N(e). The more of these edges are also alternation edges, the smaller the weight of the loss edges in N(e) becomes. Thus, the maximal number of alternation edges inside N(e) is also shown in the figure. The weight of the remaining loss edges can be calculated as follows.
Case (a): In this case, there are one vertical and two diagonal loss edges. The upper diagonal loss edge is adjacent to two alternation edges and thus can be counted with weight α/2 according to Lemma 3.3 (iii), while the lower diagonal loss edge can be counted with weight α, since there cannot be another alternation edge besides e in its neighborhood. The vertical loss edge can be inside the neighborhoods of at most two alternation edges, as it satisfies the preconditions of Lemma 3.4 (ii), thus its weight can be counted as 1/2. Altogether, this gives a weight of 1/2 + 3/2·α in this case.
Case (b): In this case, N(e) contains three diagonal loss edges. All of these edges could be adjacent to two alternation edges, thus their weight can be counted as 3/2·α altogether.
Case (c): Here, the neighborhood of e contains one vertical and one diagonal loss edge. The vertical loss edge is adjacent to at most three alternation edges, according to Lemma 3.4 (i). Hence it contributes a weight of 1/3. The diagonal loss edge is adjacent to two alternation edges and thus contributes a weight of α/2. This adds up to an overall weight of 1/3 + 1/2·α.
Case (d): In this case, there are two diagonal loss edges, both of which are adjacent to two alternation edges. The overall weight thus sums up to α.
Case (e): The neighborhood of e contains two vertical loss edges in this case. Lemma 3.4 (i) is applicable to both of these edges, thus the total weight is 2/3 in this case.
Case (f): Here, there are two diagonal loss edges, which both can be adjacent to two alternation edges; this results in an overall weight of α.
Case (g): In this case, there are four diagonal loss edges. All of them could be adjacent to two alternation edges, hence the total weight can be estimated as 2·α.
Case (h): This case is symmetric to case (b).
Case (i): This case is symmetric to case (c).
Case (j): This case is symmetric to case (a).
We summarize the results of the particular cases in Table 1.
An easy calculation shows that the weight associated with one alternation edge takes its minimum value in case (e), if α > 2/3, and in cases (d) or (f), if α ≤ 2/3.
This proves the claim of the lemma.
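The "easy calculation" can be checked mechanically. The following sketch uses exact rational arithmetic and takes the six weights exactly as listed in Table 1; the function names `case_weights` and `min_loss` and the dictionary keys are hypothetical.

```python
from fractions import Fraction as F

def case_weights(alpha):
    """Loss-edge weights per case of Table 1 for a given rational alpha."""
    return {
        "a,j": F(1, 2) + F(3, 2) * alpha,
        "b,h": F(3, 2) * alpha,
        "c,i": F(1, 3) + F(1, 2) * alpha,
        "d,f": alpha,
        "e": F(2, 3),
        "g": 2 * alpha,
    }

def min_loss(alpha):
    """Smallest loss weight that is guaranteed per alternation edge."""
    return min(case_weights(alpha).values())
```

On a grid of sample values of α in [0, 1], `min_loss(alpha)` coincides with min{2/3, α}, the term that appears in the bound of Lemma 3.5. The only case that needs a second look is (c)/(i), whose weight 1/3 + α/2 is at least α exactly when α ≤ 2/3.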
Figure 4. A conformation of the string $p_2$ from Lemma 3.7. Binding edges are shown by bold lines, contact edges are depicted as dashed lines.
Theorem 3.6. Let $p = 0^+ b_1 0^+ b_2 0^+ \dots 0^+ b_k 0^+$ be an input string for the α-DC-HP2d problem for some $0 \le \alpha \le 1$, where $b_i \in \{1\}^+$ for $1 \le i \le k$, and let $n = \sum_{i=1}^{k} |b_i|$ be the number of ones in $p$. Then the overall contact weight in any conformation is at most
$$4\alpha \cdot n + \min\{2\mu,\; 2n - 2k \cdot \min\{2/3, \alpha\}\}.$$
Proof. The claim follows immediately from Lemmas 3.1 and 3.5.
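As an illustration, the bound of Theorem 3.6 can be evaluated for a concrete input string. The sketch below is not part of the paper: the function name `upper_bound` is hypothetical, positions are counted from 1, and the result is the bound in the per-one counting used throughout this section, where every contact edge is counted twice.

```python
from fractions import Fraction

def upper_bound(p, alpha):
    """Evaluate 4*alpha*n + min(2*mu, 2*n - 2*k*min(2/3, alpha)) for p."""
    blocks = [b for b in p.split("0") if b]      # maximal blocks of ones
    n = sum(len(b) for b in blocks)              # number of ones
    k = len(blocks)                              # number of blocks
    odds = sum(1 for pos, c in enumerate(p, start=1) if c == "1" and pos % 2 == 1)
    evens = n - odds
    mu = 2 * min(odds, evens)
    return 4 * alpha * n + min(2 * mu, 2 * n - 2 * k * min(Fraction(2, 3), alpha))
```

For the string `"0110"` and α = 1/2, the second term of the minimum is the smaller one (3 versus 2µ = 4), so the loss-edge argument of Lemma 3.5 is the binding constraint; for `"01010"` we have µ = 0 and only the diagonal term 4αn remains.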
The next lemma shows that the bound from Theorem 3.6 cannot be improved for the case that µ is small. To prove this, we construct a string that nearly meets the proved upper bound. The idea is to fold this string into the shape of a square, where on each side m ones interspaced by single zeros occur and the corners are labeled by zeros. This will result in a chessboard-like pattern. The string thus consists of $(2m+1)^2 = 4m^2 + 4m + 1 = 1 + 2 \cdot 2(m^2 + m)$ symbols, ones and zeros alternating with each other.
Lemma 3.7. For any $m \in \mathbb{N}$, the string $p_m = 0(10)^{2(m^2+m)}$ can be embedded into the grid lattice $L_\square$ such that it achieves an overall contact weight of $4\alpha \cdot 2(m^2+m) - 2\alpha \cdot 4m$ in the grid lattice with diagonals $L_\boxtimes$.
Proof. A conformation of $p_2$ is shown in Figure 4; the generalization to other values of m is straightforward. In this conformation, every inner one-vertex achieves four contacts of weight α, and every one-vertex at the border of the conformation achieves two such contacts, which adds up to the claimed weight.
Please note that, for the strings $p_m$ from Lemma 3.7, the parameter µ is zero since all ones are on even positions in the string. Thus, the conformation as described above meets the upper bound from Theorem 3.6 up to an additive second-order term of $\Omega(\alpha \cdot \sqrt{n})$, where n denotes as usual the number of ones.
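The count in Lemma 3.7 can be reproduced with a small experiment. The sketch below folds $p_m$ row by row in a snake-like fashion into a (2m+1) × (2m+1) square; this particular fold is an assumption of the sketch and may differ from the conformation drawn in Figure 4, although it produces the same chessboard pattern. The function names are hypothetical. Contacts are counted per one, i.e., every contact edge twice, as in the lemma. Horizontal/vertical contacts cannot occur, since all ones lie on grid points of the same parity.

```python
def snake_conformation(m):
    """Return (p_m, coords) for a row-by-row snake fold into a square."""
    w = 2 * m + 1
    p = "0" + "10" * (2 * (m * m + m))           # length (2m+1)^2
    coords = []
    for i in range(len(p)):
        r, c = divmod(i, w)
        # even rows run left to right, odd rows right to left
        coords.append((c if r % 2 == 0 else w - 1 - c, r))
    return p, coords

def diagonal_contacts_per_one(p, coords):
    """Sum over all ones of the number of diagonally adjacent ones."""
    ones = {xy for ch, xy in zip(p, coords) if ch == "1"}
    return sum((x + dx, y + dy) in ones
               for (x, y) in ones for dx in (-1, 1) for dy in (-1, 1))
```

Multiplying the returned count by α gives the overall contact weight in the per-one counting; for small m the count equals 4·2(m² + m) − 2·4m = 8m².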
4. Approximation algorithms
4.1. HP algorithms in the α-DC-HP2d model
Our goal is now to compute embeddings of 0/1 strings into the grid lattice $L_\square$ that achieve as many contacts as possible. In principle, all algorithms for the original HP problem, where we do not consider diagonal contact edges, yield feasible solutions for the α-DC-HP2d problem. So it seems to be meaningful to analyze, for instance, the algorithms given by Hart and Istrail [9] and Newman [10] with respect to this model.
However, these algorithms cannot guarantee any approximation ratio in general.
This is due to the fact that the structure of the embedding computed by these algorithms heavily depends on the upper bound derived from the bipartiteness of the underlying grid and therefore partitions the ones in the input string rigorously into an odd and an even part according to their index in the string. If, in the extreme, the input string contains only ones at even positions, the HP algorithms proposed so far guarantee no contact at all. Nevertheless, we will examine the algorithm proposed by Newman with respect to our approach of weighted diagonals, to investigate for which cases it might still be reasonable.
Clearly, if µ is about n, the number of ones in the input, Newman's algorithm will achieve a constant approximation ratio (of at most 9), since for at least 1/3 of all possible contacts (we have 4 potential diagonal contacts and 2 potential horizontal/vertical contacts) it achieves at least one third of the possible overall contact weight.
4.2. Newman’s algorithm in the α-DC-HP2d model
To be self-contained, we first rephrase the algorithm proposed by Newman in [10] and essentially perform the same analysis while additionally counting diagonal edges.
Since Newman's algorithm is originally designed for the classical HP model, where we ignore diagonal edges, it was compared in [10] to the upper bound of 2µ only. Therefore, it was quite reasonable and sufficient for proving a 3-approximation to assume that all inputs have the same number of even and odd ones. We will include this assumption also here, although it does not remain meaningful in our context. However, at the end of this section we will discuss the possibility of relaxations of this assumption.
So, let $p$ be a 0/1 string with evens(p) = odds(p) that starts and ends with a zero.² Furthermore, we may also consider the loop $p^\circ$ instead of $p$, where the end positions of $p$ are joined to form a cycle. If there exists a folding guaranteeing a certain overall contact weight for $p^\circ$, then the same folding obviously guarantees the same overall contact weight for $p$.
² As we deal with asymptotic approximation only, this restriction concerning the end positions of $p$ does not matter.
The algorithm first determines a suitable folding point in $p^\circ$ and then constructs an advantageous folding of the substrings on the left- and right-hand side of this point. This will eventually result in a staircase-like arrangement of odd ones on one side and even ones on the other side.
Newman established the following result to guarantee the existence of an ap- propriate folding point.
Lemma 4.1 (Newman [10], Lem. 2.2). Let $p = p_0 \dots p_{m-1}$ be a string over {0,1}, where $p_0 = p_{m-1} = 0$ and evens(p) = odds(p), and let $p^\circ$ denote the corresponding loop. Then there exists a point $p_x$ such that if we go around $p^\circ$ in one direction (i.e., either clockwise or counter-clockwise) starting at $p_x$ to any point $p_j$, then
$$\mathit{odds}(p_x p_{x+1} \dots p_j) \ge \mathit{evens}(p_x p_{x+1} \dots p_j),$$
and if we go around $p^\circ$ in the other direction from $p_{x-1}$ to any point $p_k$, then
$$\mathit{evens}(p_{x-1} p_{x-2} \dots p_k) \ge \mathit{odds}(p_{x-1} p_{x-2} \dots p_k).$$
(Here, the indices j and k are considered to be modulo m.)
Informally, we may simply say that there exists a point such that going into one direction starting from that particular point, we will always meet at least as many odd ones as even ones, and going into the other direction, we will always meet at least as many even ones as odd ones at any point.
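One way to find such a point is a prefix-minimum argument: assign +1 to every odd one and −1 to every even one, and start right after the first position where the running sum is minimal. This is a sketch of a standard technique and not necessarily the argument used in [10]; the function names are hypothetical, indices are 0-based as in Lemma 4.1, and the clockwise direction is taken to be the one with the odd surplus.

```python
def signed(p):
    """+1 for a one on an odd index, -1 for a one on an even index, else 0."""
    return [(1 if i % 2 == 1 else -1) if c == "1" else 0 for i, c in enumerate(p)]

def is_folding_point(p, x):
    """Check both conditions of Lemma 4.1 for the index x on the loop."""
    m, s = len(p), signed(p)
    balance = 0
    for step in range(m):                 # clockwise: x, x+1, ...
        balance += s[(x + step) % m]
        if balance < 0:
            return False
    balance = 0
    for step in range(m):                 # counter-clockwise: x-1, x-2, ...
        balance -= s[(x - 1 - step) % m]
        if balance < 0:
            return False
    return True

def folding_point(p):
    """Return an index x as in Lemma 4.1 via the minimal prefix sum."""
    s = signed(p)
    if sum(s) != 0:
        raise ValueError("evens(p) must equal odds(p)")
    best, balance, x = 0, 0, 0            # the empty prefix has sum 0
    for i, value in enumerate(s):
        balance += value
        if balance < best:
            best, x = balance, i + 1      # start right after the minimum
    return x
```

Since the total sum is zero, every partial sum taken clockwise from the returned index, and every partial sum with reversed signs taken counter-clockwise from its predecessor, is nonnegative; `is_folding_point` verifies exactly this.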
Having determined such a point as described in the previous lemma, we may use it as a folding point for the algorithm. Before we actually start with the presentation of the algorithm we introduce some notation that will allow us to specify the folding.
Let $p = p_0 \dots p_{m-1}$ be our input string over {0,1} and $p^\circ$ the corresponding loop.
Firstly, without loss of generality, we can assume that there is a point $p_x$ as in Lemma 4.1 such that, going in clockwise direction, we have $\mathit{odds}(p_x p_{x+1} \dots p_j) \ge \mathit{evens}(p_x p_{x+1} \dots p_j)$ for any point $p_j$, and vice versa for the counter-clockwise direction.
By the ith odd one we denote the i-th odd one starting from $p_{x+1}$ going in clockwise direction. Similarly, by the ith even one we denote the i-th even one starting from $p_{x-2}$ going in counter-clockwise direction³. We denote the substring starting at the element of $p^\circ$ directly following the (i−1)th odd one up to and including the ith odd one by $S_{\mathrm{odd}}(i)$. Its corresponding length is denoted as $l_{\mathrm{odd}}(i) + 1$. Similarly, we denote the substring starting at the element of $p^\circ$ directly following the (i−1)th even one up to and including the ith even one by $S_{\mathrm{even}}(i)$. Its corresponding length is denoted as $l_{\mathrm{even}}(i) + 1$. Note that $l_{\mathrm{odd}}(i)$ and $l_{\mathrm{even}}(i)$ thus denote the number of intermediate positions between two consecutive odd (respectively even) ones, and are always odd integers. Furthermore, by oddpos(i) we map the ith odd one to its actual position in the string, and in the same way we denote by evenpos(i) the mapping of the ith even one to its actual position.
Figure 5 illustrates these notations.
³ We use this particular definition for the ith ones here, since $p_{x-2}$ and $p_{x+1}$ will be paired in Newman's algorithm.
Figure 5. Illustration of the notions required in the presentation of Newman's algorithm. (The figure shows the positions $p_{x-2}, \dots, p_{x+2}$ around the folding point, the EVEN side with the substring $S_{\mathrm{even}}(i)$ between the (i−1)th and the ith even one, and the ODD side with the substring $S_{\mathrm{odd}}(j)$ between the (j−1)th and the jth odd one.)
Algorithm 1. Newman's algorithm
Input: A loop $p^\circ$ over {0,1}.
(1) For $p^\circ$, compute a folding point $p_x$ as considered in Lemma 4.1.
(2) Arrange the symbols around $p_x$ (i.e., $p_{x-2}, p_{x-1}, p_x, p_{x+1}$) according to Figure 6. Set i := 1 and j := 1 (these will be the counters for walking in counter-clockwise and clockwise direction, respectively).
(3) Distinguish four cases according to the lengths of $S_{\mathrm{even}}(i)$ and $S_{\mathrm{odd}}(j)$.
    (a) If $l_{\mathrm{even}}(i) = l_{\mathrm{odd}}(j) = 1$, then fold $S_{\mathrm{even}}(i)$, $S_{\mathrm{even}}(i+1)$ and $S_{\mathrm{odd}}(j)$, $S_{\mathrm{odd}}(j+1)$ according to Figure 7. Set i := i + 2 and j := j + 2.
    (b) If $l_{\mathrm{even}}(i) \ge 3$ and $l_{\mathrm{odd}}(j) \ge 3$, then perform essentially the same folding of $S_{\mathrm{even}}(i)$, $S_{\mathrm{even}}(i+1)$ and $S_{\mathrm{odd}}(j)$, $S_{\mathrm{odd}}(j+1)$ as in the previous case, additionally arranging the intermediate symbols in appropriate side arms (see Fig. 8). Set i := i + 2 and j := j + 2.
    (c) If $l_{\mathrm{even}}(i) = 1$ and $l_{\mathrm{odd}}(j) \ge 3$, then fold $S_{\mathrm{even}}(i)$, $S_{\mathrm{even}}(i+1)$ and $S_{\mathrm{odd}}(j)$ according to Figure 9. Set i := i + 2 and j := j + 1.
    (d) If $l_{\mathrm{even}}(i) \ge 3$ and $l_{\mathrm{odd}}(j) = 1$, then fold $S_{\mathrm{even}}(i)$ and $S_{\mathrm{odd}}(j)$, $S_{\mathrm{odd}}(j+1)$ according to Figure 10. Set i := i + 1 and j := j + 2.
(4) Iterate the folding process described in Step 3 until $S_{\mathrm{even}}(i)$ and $S_{\mathrm{odd}}(j)$ overlap at some point.
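The case distinction of Step 3 and the resulting counter updates can be written down compactly. The sketch below covers only this bookkeeping, not the geometric folding patterns of Figures 7 to 10; the function name `newman_case` and its return format are hypothetical.

```python
def newman_case(l_even, l_odd):
    """Return (case, increment of i, increment of j) for Step 3.

    l_even and l_odd are the numbers of intermediate positions before the
    next even and odd one; both are odd positive integers.
    """
    for length in (l_even, l_odd):
        if length < 1 or length % 2 == 0:
            raise ValueError("gap lengths must be odd positive integers")
    if l_even == 1 and l_odd == 1:
        return "a", 2, 2
    if l_even >= 3 and l_odd >= 3:
        return "b", 2, 2
    if l_even == 1:                       # here l_odd >= 3
        return "c", 2, 1
    return "d", 1, 2                      # l_even >= 3 and l_odd == 1
```

The asymmetric increments in cases (c) and (d) are what the later analysis has to compensate for by pairing type (c) folds with type (d) folds.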
Figure 6. Arrangement of the symbols around the folding point $p_x$.
We are now ready to present Newman's algorithm. As usual, we describe the algorithm in terms of folding patterns to avoid a cumbersome and confusing index-based notation. Although Newman's algorithm was originally not designed to account for diagonals, we also include diagonal contacts in the drawings of the folding patterns.
There is no special reason for pairing $p_{x-2}$ and $p_{x+1}$ instead of $p_{x-1}$ and $p_{x+2}$ in Step 2. However, a pairing of $p_{x-1}$ and $p_{x+1}$ is impossible due to the parity constraints in the grid lattice, so without loss of generality, we have to decide for one of the two.
The situation described in the cases 3(a) and 3(b) is quite advantageous. Unfortunately, it is not clear how to come up with an equally favorable folding for the case where the number of intermediate zeros is one in one of the considered substrings and at least 3 in the other one (cases 3(c) and 3(d)).
Next, we follow the same lines in the analysis of Newman’s algorithm as in [10], but additionally account for diagonal edges.
Lemma 4.2. Newman's algorithm asymptotically guarantees an overall contact weight of at least $\frac{2}{3}\mu + \frac{1}{3}\alpha\mu$ for the α-DC-HP2d problem.
Proof. Let $p = p_0 \dots p_{m-1}$ be a string over {0,1} and $p^\circ$ its corresponding loop.
Denote by $i^*$ and $j^*$ the values of i and j after the last iteration of Newman's algorithm on the input $p^\circ$, i.e., $i^*$ and $j^*$ are the first values for i and j such that $p_{x-2} p_{x-3} \dots p_{\mathrm{evenpos}(i^*)}$ and $p_{x+1} p_{x+2} \dots p_{\mathrm{oddpos}(j^*)}$ overlap each other.
Before we go into the details of the proof, we will first give an informal outline and describe some necessary considerations. Our plan is to estimate the number of odd ones participating in the folding and thus contributing to the overall contact weight. To do so, we have to look at all odd ones that are considered by the folding performed by Newman's algorithm; these are roughly the odd ones in $p_{x+1} p_{x+2} \dots p_{\mathrm{oddpos}(j^*)}$. But for a rigorous estimation we have to take into account that, due to the overlapping of $p_{x+1} p_{x+2} \dots p_{\mathrm{oddpos}(j^*)}$ and $p_{x-2} p_{x-3} \dots p_{\mathrm{evenpos}(i^*)}$, some odd ones "at the end" of $p_{x+1} p_{x+2} \dots p_{\mathrm{oddpos}(j^*)}$ might not contribute to the folding. In particular, $p_{\mathrm{oddpos}(j^*)}$ cannot be paired by the algorithm, since it occurs inside $p_{x-2} p_{x-3} \dots p_{\mathrm{evenpos}(i^*)}$. Moreover, also $p_{\mathrm{oddpos}(j^*-1)}$ might be unpaired if the situation of Case (a) or Case (b) occurs but $S_{\mathrm{even}}(i+1)$ and $S_{\mathrm{odd}}(j+1)$ already overlap in some point. Then $i^* = i + 1$ and $j^* = j + 1$, and we do not pair $j = j^* - 1$.
Furthermore, to estimate the number of odd ones in $p_{x+1} p_{x+2} \dots p_{\mathrm{oddpos}(j^*)}$ with respect to the overall number of odd ones or to µ (which will eventually help us to establish an approximation ratio), we have to account for the situation around the folding point, too. Here, Newman's algorithm folds the positions $p_{x-2}$, $p_{x-1}$, $p_x$, and $p_{x+1}$ according to Figure 6. At most two of these positions may be odd ones and therefore we have to consider them in our estimation.
Figure 7. Folding pattern if $l_{\mathrm{even}}(i) = l_{\mathrm{odd}}(j) = 1$.
Figure 8. Folding pattern if both $l_{\mathrm{even}}(i) \ge 3$ and $l_{\mathrm{odd}}(j) \ge 3$.
Now, after estimating the number of odd ones participating in the folding, we will have a closer look at the types of the foldings, determine the corresponding contact weight for each type, and sum up. We will finally express the estimated contact weight in terms of $\mathit{odds}(p^\circ)$ and ignore all additive constants, since we are interested in the asymptotic amount of contact weight only.
Now, let us continue with the proof. From the setting of $i^*$ and $j^*$, we can conclude that at least $\mathit{odds}(p_{x+1} p_{x+2} \dots p_{\mathrm{oddpos}(j^*)}) - 2$ odd ones participate in some contacts. We have to subtract 2, since $p_{\mathrm{oddpos}(j^*-2)}$ might be the last odd one (in clockwise direction) that is paired by the algorithm. On the other hand, at most $\mathit{odds}(p_{x-2} p_{x-3} \dots p_{\mathrm{evenpos}(i^*)}) + 1$ odd ones potentially do not participate in any contact; the additional 1 is needed since either $p_{x-1}$ or $p_x$ might be an odd one, too. The overlapping of $p_{x-2} p_{x-3} \dots p_{\mathrm{evenpos}(i^*)}$ and $p_{x+1} p_{x+2} \dots p_{\mathrm{oddpos}(j^*)}$ guarantees that we do not underestimate the number of unpaired odd ones.
By Lemma 4.1 we know
$$\mathit{odds}(p_{x-2} p_{x-3} \dots p_{\mathrm{evenpos}(i^*)}) \le \mathit{evens}(p_{x-2} p_{x-3} \dots p_{\mathrm{evenpos}(i^*)}). \tag{1}$$
Furthermore, since $p_{x+1} p_{x+2} \dots p_{\mathrm{oddpos}(j^*)}$ and $p_{x-2} p_{x-3} \dots p_{\mathrm{evenpos}(i^*)}$ are overlapping and either $p_{x-1}$ or $p_x$ might be an odd one, too, the following holds:
$$\mathit{odds}(p^\circ) \le \mathit{odds}(p_{x+1} p_{x+2} \dots p_{\mathrm{oddpos}(j^*)}) + \mathit{odds}(p_{x-2} p_{x-3} \dots p_{\mathrm{evenpos}(i^*)}) + 1. \tag{2}$$
Combining equations (1) and (2), we obtain
$$\mathit{odds}(p^\circ) \le \mathit{odds}(p_{x+1} p_{x+2} \dots p_{\mathrm{oddpos}(j^*)}) + \mathit{evens}(p_{x-2} p_{x-3} \dots p_{\mathrm{evenpos}(i^*)}) + 1. \tag{3}$$
In the following analysis, we will pair together as many of the foldings of type (c) and (d) as possible and denote these as type (c-d) folds. We assume without loss of generality that u type (c) folds remain unpaired. (The case of unpaired type (d) folds is symmetric.) Then, the number of odd ones participating in type (a), (b), or (c-d) folds is
$$\mathit{odds}(p_{x+1} p_{x+2} \dots p_{\mathrm{oddpos}(j^*)}) - 2 - 2u, \tag{4}$$
since there occur 2 odd ones in each type (c) fold.
In these folds of type (a), (b), or (c-d), the number of odd ones matches the number of even ones. Moreover, there are u additional even ones involved in type (c) folds. Again, we have to take into account that two even ones may remain unconsidered in px−2px−3. . . pevenpos(i∗) due to its overlapping with px+1px+2. . . poddpos(j∗). Hence,
evens(px−2px−3. . . pevenpos(i∗))≤odds(px+1px+2. . . poddpos(j∗))−u+ 2. (5) Combining equations (3) and (5) yields
odds(p◦)≤odds(px+1px+2. . . poddpos(j∗)) +odds(px+1px+2. . . poddpos(j∗))−u+ 3. (6) This is equivalent to
odds(px+1px+2. . . poddpos(j∗))≥ odds(p◦)
2 +u
2 −3
2· (7)
Next, we consider how much contact weight is contributed by the odd ones in $p_{x+1}p_{x+2}\ldots p_{\mathrm{oddpos}(j^*)}$. For the two odd ones participating in a type (a) or type (b) fold, we obtain a contact weight of $6 + 4\alpha$, since the folding establishes three horizontal/vertical contact edges and two diagonal contact edges. For the three odd ones participating in a type (c-d) fold, we obtain a contact weight of $8 + 4\alpha$. In these cases we can thus guarantee a contact weight of $\frac{6+4\alpha}{2} = 3 + 2\alpha$ or $\frac{8+4\alpha}{3} = \frac{8}{3} + \frac{4}{3}\alpha$ for each odd one on average.

Figure 9. Folding pattern if $l_{\mathrm{even}}(i) = 1$ and $l_{\mathrm{odd}}(j) \ge 3$.

Figure 10. Folding pattern if both $l_{\mathrm{even}}(i) \ge 3$ and $l_{\mathrm{odd}}(j) = 1$.
For the $u$ remaining type (c) folds, we achieve a contact weight of $2 + \alpha$ for each odd one. Furthermore, in each of the $u$ remaining folds of type (c), two odd ones occur.
This implies that we can guarantee at least a contact weight of
$$\left(\frac{8}{3} + \frac{4}{3}\alpha\right)\cdot\left(\mathrm{odds}(p_{x+1}p_{x+2}\ldots p_{\mathrm{oddpos}(j^*)}) - 2 - 2u\right) + 2\cdot(2+\alpha)u. \tag{8}$$
Estimating $\mathrm{odds}(p_{x+1}p_{x+2}\ldots p_{\mathrm{oddpos}(j^*)})$ according to equation (7), we can bound this value from below as follows:
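$$\begin{aligned}
&\left(\frac{8}{3} + \frac{4}{3}\alpha\right)\cdot\left(\mathrm{odds}(p_{x+1}p_{x+2}\ldots p_{\mathrm{oddpos}(j^*)}) - 2 - 2u\right) + 2\cdot(2+\alpha)u \\
&\quad\ge \left(\frac{8}{3} + \frac{4}{3}\alpha\right)\cdot\left(\frac{\mathrm{odds}(p^{\circ})}{2} + \frac{u}{2} - 2u - \frac{7}{2}\right) + 2\cdot(2+\alpha)u \\
&\quad= \frac{4}{3}\,\mathrm{odds}(p^{\circ}) - 4u + \frac{2}{3}\alpha\,\mathrm{odds}(p^{\circ}) - 2\alpha u + 4u + 2\alpha u - \frac{7}{2}\left(\frac{8}{3} + \frac{4}{3}\alpha\right).
\end{aligned}$$
The terms in $u$ cancel. We neglect the additive constant $-\frac{7}{2}\left(\frac{8}{3} + \frac{4}{3}\alpha\right)$ and obtain an asymptotic contact weight of at least
$$\frac{4}{3}\,\mathrm{odds}(p^{\circ}) + \frac{2}{3}\alpha\,\mathrm{odds}(p^{\circ}).$$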
Since we assumed $\mathrm{odds}(p^{\circ}) = \mathrm{evens}(p^{\circ})$, we can set $\mathrm{odds}(p^{\circ}) = \frac{\mu}{2}$, leading to an overall contact weight achieved by Newman’s algorithm of
$$\frac{2}{3}\mu + \frac{1}{3}\alpha\mu,$$
which completes the proof.
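The algebra in the last step of the proof can be checked mechanically. The following is a minimal sketch in exact rational arithmetic; the function names are hypothetical and only restate expressions (7) and (8) from the text, they are not part of the paper.

```python
from fractions import Fraction as F

def bound_after_substitution(odds_total, u, alpha):
    """Expression (8) with the right-hand side of (7) substituted in."""
    odds_side = F(odds_total, 2) + F(u, 2) - F(3, 2)   # right-hand side of (7)
    per_odd = F(8, 3) + F(4, 3) * alpha                 # average per odd one used in (8)
    return per_odd * (odds_side - 2 - 2 * u) + 2 * (2 + alpha) * u

def leading_term(odds_total, alpha):
    """Asymptotic contact weight 4/3 * odds + 2/3 * alpha * odds."""
    return (F(4, 3) + F(2, 3) * alpha) * odds_total

def additive_constant(alpha):
    """Constant dropped in the asymptotic statement: 7/2 * (8/3 + 4/3 * alpha)."""
    return F(7, 2) * (F(8, 3) + F(4, 3) * alpha)
```

For every value of `u`, `bound_after_substitution` equals `leading_term` minus `additive_constant`, which reflects that the number of unpaired type (c) folds does not influence the bound. Setting `odds_total` to half of $\mu$ reproduces the weight $\frac{2}{3}\mu + \frac{1}{3}\alpha\mu$.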
Theorem 4.3. Newman’s algorithm is an asymptotic approximation algorithm for the $\alpha$-DC-HP2d problem with a ratio of
$$\frac{4\alpha\cdot n + \min\left\{2\mu,\; 2n - 2k\cdot\min\left\{\frac{2}{3}, \alpha\right\}\right\}}{\frac{2}{3}\mu + \frac{1}{3}\alpha\mu}.$$
Proof. To compute the approximation ratio of Newman’s algorithm, we divide the upper bound from Theorem 3.6 by the overall contact weight guaranteed by the algorithm according to Lemma 4.2 and obtain a ratio of
$$\frac{4\alpha\cdot n + \min\left\{2\mu,\; 2n - 2k\cdot\min\left\{\frac{2}{3}, \alpha\right\}\right\}}{\frac{2}{3}\mu + \frac{1}{3}\alpha\mu}.$$
To give an idea of the ratio established above, we now discuss the corresponding ratios for specific values of $\alpha$ and $\mu$.
• Clearly, if $\mu = 0$, i.e., if there cannot be any horizontal/vertical contacts at all, the approximation ratio is infinite.
• For $\alpha = 0$, the $\alpha$-DC-HP2d problem corresponds to the original HP problem and the approximation ratio is 3 (as already shown in [10]).
• For $\mu = n$, we can guarantee a ratio of $\frac{6+12\alpha}{2+\alpha}$, which is worst for $\alpha = 1$ and yields a ratio of 6 in this case.
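The special cases above can be reproduced numerically. The following is a minimal sketch that evaluates the ratio of Theorem 4.3 in exact arithmetic. It assumes that the flattened term in the numerator reads $2n - 2k\cdot\min\{\frac{2}{3}, \alpha\}$ and treats $k$ (defined with Theorem 3.6, which is not reproduced in this section) as a plain non-negative parameter; the function name is hypothetical.

```python
from fractions import Fraction as F

def newman_ratio(alpha, n, mu, k):
    """Ratio of Theorem 4.3; returns None when the guaranteed weight is zero."""
    # Upper bound on the optimum (numerator), under the reading stated above.
    upper = 4 * alpha * n + min(2 * mu, 2 * n - 2 * k * min(F(2, 3), alpha))
    # Asymptotic contact weight guaranteed by Lemma 4.2 (denominator).
    lower = (F(2, 3) + F(1, 3) * alpha) * mu
    if lower == 0:
        return None  # mu = 0: the ratio is unbounded
    return upper / lower
```

With $\alpha = 0$ and $\mu \le n$ the function returns 3 for any $k$, and with $\mu = n$, $k = 0$ it returns $\frac{6+12\alpha}{2+\alpha}$, which is 6 for $\alpha = 1$.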
Extending the analysis of Newman’s algorithm to cover all ones in the input, without assuming $\mathrm{odds}(p^{\circ}) = \mathrm{evens}(p^{\circ})$, appears to be quite problematic for the following reasons.
• Applying the folding strategy of Newman’s algorithm inevitably implies that only $\min\{\mathrm{odds}(p^{\circ}), \mathrm{evens}(p^{\circ})\}$ ones are considered.
• Refining the analysis with respect to the block length seems to be a reasonable idea, since longer blocks of consecutive ones that participate in the staircase-like folding will imply even more contact weight. Nevertheless, it remains unclear how we can guarantee that certain blocks really participate in the folding. However, it appears to be a suitable heuristic to obtain the condition $\min\{\mathrm{odds}(p^{\circ}), \mathrm{evens}(p^{\circ})\}$ of the algorithm by ignoring singletons first and longer blocks after that. Because both odd and even ones occur in each block that is not a singleton, we will obtain an improved contact weight in this case, independently of whether the block belongs to the “odd side” or to the “even side”.
4.3. Algorithm Meander
In this section, we present an alternative approach for computing a good folding for the $\alpha$-DC-HP2d problem. This algorithm is not based on the computation of a folding point, but computes a folding while traversing the input string and analyzing the particular situation locally.
Algorithm 2 Meander
Input: A string $p$ over $\{0, 1\}$.
Execute: Walk along $p$ and embed each long block (of length at least 4) in a meander-like way into two or three consecutive rows of the grid such that there are either two or three consecutive ones in the leftmost and rightmost column of the embedding of this block, as shown in Figure 11m and n. Embed each shorter block into a single column of the grid. Place consecutive blocks next to each other as shown in Figure 11a to l, arranging the zeros in appropriate side-arms.
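The first step of Meander, splitting the input into blocks of consecutive ones and deciding which embedding rule applies, can be sketched as follows. This is only an illustration of the block decomposition, not of the embedding itself; the function name and the labels `"long"` and `"short"` are hypothetical, and the input is treated as a plain (non-circular) Python string.

```python
from itertools import groupby

def classify_blocks(p):
    """Return (length, kind) for each maximal block of ones in p, left to right."""
    blocks = []
    for symbol, run in groupby(p):
        if symbol != "1":
            continue  # zeros are later arranged in side-arms
        length = len(list(run))
        # Long blocks (length >= 4) are embedded meander-like over two or
        # three rows; shorter blocks go into a single column.
        kind = "long" if length >= 4 else "short"
        blocks.append((length, kind))
    return blocks

# classify_blocks("0110111101") returns [(2, 'short'), (4, 'long'), (1, 'short')]
```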
Lemma 4.4. For a given input $p$, algorithm Meander guarantees an overall contact weight of at least $l_1\cdot 2\alpha + (n - l_1)\cdot\min\left\{1 + \frac{9}{5}\alpha,\; \frac{4}{3} + \frac{2}{3}\alpha\right\}$, where $l_1$ denotes the number of singletons, and $n$ denotes the overall number of ones in $p$.
Proof. In the following, let a transition denote a pair of embedded consecutive blocks. To prove this lemma, we count the average contact weight contributed by a block participating in the transitions shown in Figure 11a–l. Here, we always assume a worst-case scenario.
Let us consider the different types of blocks separately.
(1) Let $x$ denote a singleton. With respect to the achieved contact weight, the worst case in the folding performed by algorithm Meander occurs if $x$ is framed by two other singletons which are both connected to $x$ by an odd number of intermediate zeros. In this case $x$ contributes a contact weight of $2\alpha$ (see Fig. 11b).
(2) Let $y$ denote a pair of ones. In this case, we have to consider different kinds of transitions, namely all types of transitions in which a pair may participate, i.e., those shown in Figure 11c, d, g–j. Actually, we would have to consider all possible combinations of these cases for the transitions on the right- and left-hand side of $y$. But, since we are only interested in the worst-case situation, we can restrict ourselves to the consideration of the same transitions on both sides. The worst case there is clearly also the worst case concerning all possible combinations.
Figure 11. Transitions between blocks (panels (a) to (n)); solid lines denote binding edges, dashed lines denote contact edges, and parts of the embedding which are not considered (including all zeros) are indicated by dotted lines.
• $y$ is framed by two transitions of type Figure 11c or d.
We determine the contribution to the overall contact weight of this particular transition and then multiply it by two to account for the second transition. The ones in $y$ are incident to contacts of total weight $1 + \alpha$. Moreover, the involved singleton also contributes $1 + \alpha$, but according to the worst-case scenario discussed in the previous case, we only counted $\alpha$ for this transition of this singleton. Thus, we may count an additional 1 for the contact weight of $y$ in this case.
Summing up, $y$ contributes $2 + \alpha$ for each transition it participates in, and hence $2\cdot(2 + \alpha)$ altogether. Finally, on average each one in $y$ contributes $\frac{1}{2}\cdot 2\cdot(2 + \alpha) = 2 + \alpha$.
• $y$ is framed by two transitions of type Figure 11g.
Here, the ones in $y$ contribute $2\cdot(1 + \alpha)$ for each transition. Multiplying by two (to account for two transitions) and dividing by two (to compute the average) again gives a contact weight of $2\cdot(1 + \alpha)$ guaranteed by each one in $y$ in this case. (As no singleton participates in this transition, we do not count any additional contact weight.)
• $y$ is framed by two transitions of type Figure 11h.
Here, the ones in $y$ contribute $1 + 2\alpha$ for each transition. Following the same argument as above, this leads to a contact weight of $1 + 2\alpha$ for each one in $y$ on average.
• $y$ is framed by two transitions of type Figure 11i or j.
Here, the ones in $y$ contribute $2 + 3\alpha$ for each transition. Following the same argument as above, this leads to a contact weight of $2 + 3\alpha$ for each one in $y$ on average.
This implies that we can guarantee a contact weight of at least $\min\{2 + \alpha,\; 2\cdot(1 + \alpha),\; 1 + 2\alpha,\; 2 + 3\alpha\} = 1 + 2\alpha$ for each one occurring in a pair.
(3) Let $z$ denote a triple of ones. In this case, we have to consider different kinds of transitions, namely all types of transitions in which a triple may participate, i.e., those shown in Figure 11e, f, i–l. Again, we only have to consider pairs of the same transitions to detect the worst case.
• $z$ is framed by two transitions of type Figure 11e.
Here, the ones in $z$ contribute $1 + \alpha$ for each transition. Moreover, for each transition we can guarantee an additional contact weight of 1 for the involved singleton. Thus, both transitions contribute $4 + 2\alpha$ in total. Hence, each one in $z$ contributes $\frac{4}{3} + \frac{2}{3}\alpha$ on average.
• $z$ is framed by two transitions of type Figure 11f.
Here, the ones in $z$ contribute $1 + 2\alpha$ for each transition. Moreover, for each transition we can guarantee an additional contact weight of $1 + \alpha$ for the involved singleton. Thus, both transitions contribute $4 + 6\alpha$ in total. Hence, each one in $z$ contributes $\frac{4}{3} + 2\alpha$ on average.
• $z$ is framed by two transitions of type Figure 11i or j.
Here, the ones in $z$ contribute $2 + 3\alpha$ for each transition. Thus, both transitions contribute $4 + 6\alpha$ in total. Hence, each one in $z$ contributes $\frac{4}{3} + 2\alpha$ on average.
• $z$ is framed by two transitions of type Figure 11k.
Here, the ones in $z$ contribute $3 + 4\alpha$ for each transition. Thus, both transitions contribute $6 + 8\alpha$ in total. Hence, each one in $z$ contributes $2 + \frac{8}{3}\alpha$ on average.
• $z$ is framed by two transitions of type Figure 11l.
Here, the ones in $z$ contribute $2 + 4\alpha$ for each transition. Thus, both transitions contribute $4 + 8\alpha$ in total. Hence, each one in $z$ contributes $\frac{4}{3} + \frac{8}{3}\alpha$ on average.
For triples of ones, we can thus guarantee a contact weight of at least
$$\min\left\{\frac{4}{3} + \frac{2}{3}\alpha,\; \frac{4}{3} + 2\alpha,\; \frac{4}{3} + 2\alpha,\; 2 + \frac{8}{3}\alpha,\; \frac{4}{3} + \frac{8}{3}\alpha\right\} = \frac{4}{3} + \frac{2}{3}\alpha$$
for each one on average.
(4) Consider blocks of ones of length $m$, where $m \ge 4$ and $m$ is even (see Fig. 11m). To compute the average contribution of each one in this case, we can simply determine the total contribution to the contact weight by inner ones and then additionally add half of the worst-case contribution of a pair for each of the two borders. Let therefore $c_p = 2 + 4\alpha$ denote the minimum contribution for a pair of ones (for the whole pair and not for each one in a pair on average). To easily compute the contributed contact weight, observe that the folding shown in Figure 11m contains $\frac{m-2}{2}$ square-shaped regions, and each of these contributes two diagonal contact edges and one horizontal contact edge. Additionally, we have to consider the contact edges incident to the borders of the folding. Then, we obtain an average contact weight of
$$\frac{\frac{m-2}{2}\cdot(4\alpha + 2) + 2\cdot\frac{1}{2}\cdot c_p}{m} = 1 + 2\alpha.$$
(5) Let $w_o$ denote blocks of ones of length $m$, where $m \ge 5$ and $m$ is odd (see Fig. 11n). To compute the average contribution of each one in this case, we can simply determine the total contribution to the contact weight by inner ones and then additionally add half of the worst-case contribution of a pair for one border and half of the worst-case contribution of a triple for the other border. Let, therefore, $c_p = 2 + 4\alpha$ denote the minimum contribution for a pair of ones (for the whole pair and not for each one in a pair on average), and similarly let $c_t = 4 + 2\alpha$ denote the minimum contribution for a triple of ones. Again, we count the number of square-shaped regions in the folding shown in Figure 11n, which is $\frac{m-3}{2}$ here.
Then, we obtain an average contact weight of
$$\frac{\frac{m-3}{2}\cdot(4\alpha + 2) + 2\alpha + \frac{1}{2}\cdot c_p + \frac{1}{2}\cdot c_t}{m} \ge 1 + 2\alpha - \frac{1}{5}\alpha = 1 + \frac{9}{5}\alpha.$$
Here, the worst case occurs for blocks of length 5.
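The two averages for long blocks can be evaluated directly from the formulas in cases (4) and (5). The sketch below uses exact rational arithmetic; the function names are hypothetical, while $c_p$ and $c_t$ are the constants from the text.

```python
from fractions import Fraction as F

def avg_even_block(m, alpha):
    """Average contact weight per one for an even block of length m >= 4."""
    c_p = 2 + 4 * alpha                      # minimum contribution of a whole pair
    total = F(m - 2, 2) * (4 * alpha + 2) + 2 * F(1, 2) * c_p
    return total / m

def avg_odd_block(m, alpha):
    """Average contact weight per one for an odd block of length m >= 5."""
    c_p = 2 + 4 * alpha                      # minimum contribution of a whole pair
    c_t = 4 + 2 * alpha                      # minimum contribution of a whole triple
    total = F(m - 3, 2) * (4 * alpha + 2) + 2 * alpha + F(1, 2) * c_p + F(1, 2) * c_t
    return total / m
```

Expanding the odd case gives $1 + 2\alpha - \frac{\alpha}{m}$, which is smallest for the shortest admissible block, $m = 5$, in line with the bound $1 + \frac{9}{5}\alpha$.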
Now, in what follows, we analyze the singletons and non-singletons separately. For singletons we can in the worst case only guarantee a contact weight of $2\alpha$ for each one. To estimate the least contribution of contact weight made by other blocks, we have to compute the minimum of all the above computed average contributions per one, i.e., we have to determine
$$\min\left\{1 + 2\alpha,\; \frac{4}{3} + \frac{2}{3}\alpha,\; 1 + 2\alpha,\; 1 + \frac{9}{5}\alpha\right\} = \begin{cases} 1 + \frac{9}{5}\alpha, & \text{if } 0 \le \alpha \le \frac{5}{17}, \\[4pt] \frac{4}{3} + \frac{2}{3}\alpha, & \text{if } \frac{5}{17} < \alpha \le 1. \end{cases}$$
Namely, except for singletons, the worst case will either occur for the situation shown in Figure 11n or for the one shown in Figure 11e, depending on the choice of $\alpha$.
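The case distinction and the resulting bound of Lemma 4.4 can be sketched in a few lines. The function names are hypothetical; the four candidate values are the per-one averages for pairs, triples, even blocks, and odd blocks derived above, and the two worst-case lines $1 + \frac{9}{5}\alpha$ and $\frac{4}{3} + \frac{2}{3}\alpha$ intersect at $\alpha = \frac{5}{17}$.

```python
from fractions import Fraction as F

def worst_non_singleton(alpha):
    """Smallest guaranteed average contact weight per one over all non-singleton blocks."""
    candidates = [
        1 + 2 * alpha,                  # pairs
        F(4, 3) + F(2, 3) * alpha,      # triples
        1 + 2 * alpha,                  # even blocks of length >= 4
        1 + F(9, 5) * alpha,            # odd blocks of length >= 5
    ]
    return min(candidates)

def meander_guarantee(l1, n, alpha):
    """Lower bound of Lemma 4.4 for l1 singletons among n ones."""
    return l1 * 2 * alpha + (n - l1) * worst_non_singleton(alpha)
```

For example, with two singletons among ten ones and $\alpha = \frac{1}{2}$, the triple case is the binding one and the bound evaluates to $\frac{46}{3}$.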