
3D Image Processing and Communication in Camera Sensor Networks

– Free Viewpoint Television Networking –

A Dissertation Presented to the

Faculty of the Graduate School of Engineering, Department of Information Electronics

Nagoya University

in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

By

Mehrdad Panahpour Tehrani

September 2004


Acknowledgements

The presented thesis is the product of much direction, support, and encouragement. First of all, I would like to thank my supervisor, Professor Masayuki Tanimoto, for pointing me toward an interesting topic, and for his constant support, guidance, insight, enthusiasm, and hard-working attitude, which will always inspire me. I would also like to thank the other member of Tanimoto Laboratory, Professor Toshiaki Fujii, for his deep guidance and fruitful collaboration over my doctoral course. Sincere thanks to Professor Tadahiko Kimoto for his insightful guidance during the first two years of my study. I would also like to thank Professor Ken-ichi Sato for his insightful comments on my thesis. My sincere thanks to all colleagues in our laboratory for helping me whenever I needed it and for the congenial and companionable atmosphere that I had the pleasure to enjoy during my student life in Japan. I would like to express my special thanks to my colleague and good friend, Purim Na Bangchang, for his friendship and the insightful discussions and suggestions during my research that made this work come this far. Michael Droese deserves my sincere thanks for his great help with his scientific expertise and computer skills, which made part of my own work possible in the first place, during his one-year stay in our laboratory as a researcher. I also want to express my gratitude to Kenji Yamamoto of Tanimoto Laboratory, who has always been there to help with his useful suggestions during the preparation of this thesis. I am also thankful to the Japanese Ministry of Education, Culture, Sports, Science and Technology (Monbukagakusho) for awarding me my scholarship. Above all, however, I wish to express my deepest-felt gratitude to my parents, whose loving support brought me to this level of academic achievement.


Contents

Chapter 1: Introduction 1

1.1 Sensor Networking Applications and Motivation 3

1.2 Sensor and Related Networks 6

1.3 Free viewpoint TV (FTV) 9

1.4 Camera Sensor Networks (CSNs) 9

1.4.1 CSNs Performance Objectives 10

1.4.2 Arbitrary Viewpoint Generation in CSNs 13

1.5 Thesis Organization 14

References 16

Chapter 2: Background and Related Works 19

2.1 Ray-Space Representation 19

2.1.1 Definition of Ray-Space 19

2.1.2 Ray-Space Acquisition and Rendering 22

2.2 3D Representation of Free Viewpoint Methods 24

2.2.1 Components 24

2.2.1.1 Data Acquisition 24

2.2.1.2 Data Compression 25

2.2.1.3 Display 25

2.2.2 Representation Methods 25

2.2.2.1 Image Domain 25

2.2.2.2 Integral Photography 26

2.2.2.3 Ray-Space, Light Field, and Lumigraph (IBR) 26

2.2.2.4 Model Based (MBR) 27

2.2.2.5 Surface Light Field Mapping (IBR & MBR) 29

2.2.3 Overview 30

2.3 3D Free Viewpoint Systems 30

2.3.1 Image Domain 30

2.3.1.1 3D-TV 30

2.3.1.2 Eye Vision 32

2.3.2 Integral Photography 33

2.3.2.1 Integral Photography System 33


2.3.2.2 1D-II 3D Display System 36

2.3.3 IBR 38

2.3.3.1 FTV 38

2.3.3.2 Light Field Video Camera System 39

2.3.4 MBR 40

2.3.4.1 3D Room 40

2.3.4.2 3D Video 42

2.3.4.3 Multi-texturing 43

2.3.4.4 3D Mixed-Reality Video 45

2.3.5 IBR & MBR 47

2.3.5.1 Bird’s-Eye View System 47

2.3.5.2 Surface Light Field System 49

References 51

Chapter 3: Camera Sensor Networks - Free viewpoint TeleVision 55

3.1 FTV 55

3.1.1 FTV Structure and Components 56

3.1.2 FTV Classification 58

3.1.3 Prototype System of FTV 60

3.1.3.1 1D Line Camera Array 60

3.1.3.2 1D Arc Camera Array 61

3.1.3.3 2D Camera Array 61

3.1.4 Processing Speed: Classified by Processing Tasks 62

3.1.4.1 General Processing Tasks 63

3.1.4.2 Algorithm-related Tasks 64

3.1.5 Processing Speed: Classified by Camera Configuration 64

3.1.6 Rendering Image Quality 66

3.1.6.1 1D Line Camera Array 67

3.1.6.2 1D Arc Camera Array 67

3.1.6.3 2D Camera Array 68

3.1.6.4 Summary of Rendered Image Quality 69

3.2 CSNs 70

3.3 Applications 71

References 72

Chapter 4: Offset-Block Matching 75

4.1 Ray-Space Interpolation 77

4.2 Matching Techniques 79

4.2.1 Feature-based Stereo Matching 80

4.2.2 Area-based Stereo Matching 80

4.2.3 Phase-based Stereo Matching 81

4.2.4 Intensity-based Stereo Matching 81

4.3 Correspondence Search Methods 82

4.3.1 Pixel-based Matching Method 82

4.3.2 Block-based Matching Method 83

4.3.3 Hierarchical Matching Method 84

4.4 Offset-Block Matching Method 85

4.5 Experiment 87

4.6 Discussion 92

4.7 Summary 93

References 95


Chapter 5: Adaptive Distributed Source Coding 99

5.1 Coding Architecture 100

5.2 Coding Method 102

5.2.1 Encoding 103

5.2.2 Decoding 105

5.3 Experiment 107

5.3.1 Comparison of Adaptive and Fixed Coders 107

5.3.1.1 CNs 108

5.3.1.2 CN 109

5.3.2 Slepian-Wolf Bound 110

5.3.2.1 CNs 110

5.3.2.2 CN 113

5.3.3 Effect of Quality Control on Slepian-Wolf Bound 114

5.4 Summary 115

References 116

Chapter 6: Optimization of Distributed Processing 119

6.1 CSN Tasks 121

6.1.1 Nodes 121

6.1.2 Arbitrary View Generation 121

6.2 Processing Methods 123

6.2.1 CP Method 123

6.2.2 DP Method 123

6.2.2.1 F-DP Method 124

6.2.2.2 P-DP Method 124

6.3 Theoretical Analysis of CP and DP 124

6.3.1 Assumptions 124

6.3.2 Processing Time 125

6.3.2.1 Single User 125

6.3.2.2 Multiuser 128

6.3.3 Communication Amount 129

6.3.4 Time-Average Power Consumption 130

6.4 Optimization of “n” in DP 131

6.4.1 Optimization for Processing Time 131

6.4.2 Optimization for Communication Amount 132

6.4.3 Optimization for Time-Average Power Consumption 132

6.4.4 Optimization for All Parameters 134

6.5 Comparison of CP and DP 134

6.5.1 Comparison for Optimal Processing Time 134

6.5.2 Comparison for Optimal Communication Amount 135

6.5.3 Comparison for Optimal Time-Average Power Consumption 137

6.5.4 Comparison for Optimal Operation of All Parameters 137

6.6 Experiment 138

6.7 Discussion 141

6.7.1 Frame Rate and Communication Amount 141

6.7.2 Improvement of Optimization 142

6.7.3 Virtual Camera Location 143

6.7.4 Arbitrary View Generation Accuracy 143

6.7.5 Multiuser 143

6.8 Summary 144

References 146


Chapter 7: Conclusion 149

7.1 Contributions of Thesis 149

7.2 Future Contributions 151

Appendix A: Camera Calibration 153

A.1 Camera Geometry 153

A.2 Orthographic Projection 156

A.3 Solving for the Calibration Matrix 157

A.4 Algorithms for Estimating Calibration Matrix (C) 159

A.4.1 Linear Method for Estimating Calibration Matrix (C) 160

A.4.2 Non-linear Method for Estimating Calibration Matrix (C) 162

A.5 FTV Calibration 163

References 164

Appendix B: Rectification 165

B.1 Full Perspective Projection Camera Model 166

B.2 Epipolar Geometry 167

B.3 A Compact Algorithm for Rectification of Stereo Pairs 173

B.4 Rectification of Cameras in FTV 177

References 178

Publications 179


Chapter 1

Introduction

Visualization of scenes has been a challenge throughout human history in capturing and communicating our observations of the three-dimensional world. Three-dimensional (3D) representation of scenes using two-dimensional (2D) images is an important research goal of computer graphics. To realize a 3D representation of a scene when its direct observation by human eyes is difficult, a network of cameras is used as virtual eyes to capture the 3D structure of the scene. Such a network of cameras is called a Camera Sensor Network (CSN). The Human Visual System (HVS) enables us to generate a 3D view; however, the observation of a scene from different viewpoints requires well-known rendering methods.

The demand for realtime 2D projection of three-dimensional scenes captured by multi-view systems has increased recently. The planar projection needs to be generated interactively using accelerated rendering methods. Traditionally, 3D scenes are described by illumination models, object surfaces, and texture information using the Geometry-Based Rendering (GBR) method. For rendering real-world scenes, however, this approach has limited versatility. The derivation of correct illumination models from real-world scenes and the acquisition of accurate 3D geometry are time-consuming tasks. Hence, rendering quality as well as frame rate is determined by scene complexity and the computation time available for simulating global illumination effects. Considering realtime operation for 3D representation, therefore, GBR is not a good choice. Image-Based Rendering (IBR), an alternative rendering approach, has lately attracted much attention. This method is based on visual appearance rather than a physical model. Image-Based Rendering can attain photo-realistic rendering results of real-world scenes. The scene can be any real-world object or a computer model, and no restrictions are placed on its content: hair, fur, and smoke can be easily rendered using IBR techniques. The speed of free view generation using IBR is thereby independent of scene complexity, as rendering time is proportional only to the number of rendered pixels or to the maximum disparity between the multiview images captured in the CSNs. Free view generation from multiview images requires a correspondence search for the matched areas to be used in the linear interpolation.

The number of available views captured in CSNs determines the quality of the arbitrary view generation. Covering an area using many cameras is not cost-efficient; therefore, a sub-sampled image set should be used for interpolation. For high-quality and fast IBR in CSNs, an accelerated block matching method with good detection of correspondences is needed. However, realtime implementation of arbitrary view generation is generally not practical if only a centralized architecture is considered. Hence, inter-node communication is needed to share the processing task in the network, and a distributed architecture can be optimal in CSNs.

Furthermore, the highly correlated observations in such a network require efficient compression when a distributed architecture is used. The coding method should be able to exploit the spatial redundancy of the multiview images in CSNs when there is communication between nodes whose observations are highly correlated. Note that the problems regarding the storage of all captured data and its transmission, e.g. via the Internet, need an efficient data compression that should be considered separately when only the centralized architecture is required. To achieve convincing performance results for CSNs, optimization of the network parameters is desired. In addition, such a network interfaces with many users, so the optimization process should be designed with the multiuser case in mind.

In the following, the application and motivation of Camera Sensor Networks (CSNs) and the Free viewpoint TeleVision (FTV) system, together with their performance objectives, are investigated in general [1] to clarify the position of the CSN presented in this thesis. Finally, the organization of the thesis is introduced.

1.1 Sensor Networking Applications and Motivation

In recent years, the desire for connectivity has caused an exponential growth in network communication. Data networks, in particular, have led this trend due to the increasing exchange of data in Internet services such as the World Wide Web, e-mail, and data file transfers. The capabilities needed to deliver such services are characterized by an increasing need for data throughput in the network; applications now under development, such as wireless multimedia distribution in the home, indicate that this trend will continue. Wireless Local Area Networks (WLANs) provide an example of this phenomenon. The original (1997) Institute of Electrical and Electronics Engineers (IEEE) WLAN standard, 802.11, had a gross data rate of 2 megabits per second (Mb/s); the most popular variant now is 802.11b, with a rate of 11 Mb/s; and 802.11a, with a rate of 54 Mb/s, is now entering the market. Wireless Personal Area Networks (WPANs), defined as networks employing no fixed infrastructure and having communication links less than 10 meters in length centered on an individual, form another example: the HomeRF 1.0 specification, released in January 1999 by the HomeRF Working Group, has a raw data rate of 800 kb/s with an optional 1.6 Mb/s mode; the Bluetooth 1.0 specification, released in July 1999 by the Bluetooth Special Interest Group (SIG) and later standardized as IEEE 802.15.1, has a raw data rate of 1 Mb/s; and IEEE 802.15.3, released in June 2003, has a maximum raw data rate of 55 Mb/s. Both the 802.11 and 802.15 organizations have begun the definition of protocols with data throughputs greater than 100 Mb/s.

Other potential network applications exist, however. These applications, which have relaxed throughput requirements often measured in a few bits per day, include industrial control and monitoring; home automation and consumer electronics; security; asset tracking and supply chain management; intelligent agriculture; and health monitoring. Because most of these low-data-rate applications involve sensing of one form or another, networks supporting them have been called sensor networks, or Low-Rate WPANs (LR-WPANs), because they require short-range links without a preexisting infrastructure. An overview of applications for sensor networks follows.

Industrial Control and Monitoring

A large industrial facility typically has a relatively small control room surrounded by a relatively large physical plant. The control room has indicators and displays that describe the state of the plant (the state of valves, the condition of equipment, the temperature and pressure of stored materials, etc.), as well as input devices that control actuators in the physical plant (valves, heaters, etc.) that affect the observed state of the plant. CSNs can play an important role in industrial control and monitoring by providing arbitrary viewpoints of the plant.

Home Automation and Consumer Electronics

One application is the “universal” remote control, a personal digital assistant (PDA)-type device that can control not only the television, DVD player, stereo, and other home electronic equipment, but also the lights, curtains, and locks that are equipped with a wireless sensor network connection. With the universal remote control, one may control the house from the comfort of one’s armchair. Its most intriguing potential, however, comes from the combination of multiple services, such as having the curtains closed automatically when the television is turned on, or perhaps automatically muting the home entertainment system when a call is received on the telephone or the doorbell rings. With the scale and personal computer both connected via a sensor network, one’s weight may be automatically recorded without the need for manual intervention (and the possibility of stretching the truth “just this once”).

Security

The security system described above for the home can be augmented for use in industrial security applications. Such systems, employing proprietary communication protocols, have existed for several years. They can support multiple sensors relevant to industrial security, including camera, passive infrared, magnetic door opening, smoke, and broken glass sensors, and sensors for direct human intervention (the “panic button” sensor requesting immediate assistance).

Asset Tracking and Supply Chain Management

A very large unit volume application of sensor networks is expected to be asset tracking and supply chain management. Asset tracking can take many forms. One example is the tracking of shipping containers in a large port. Such port facilities may have tens of thousands of containers, some of which are empty and in storage, while others are bound for many different destinations. Sensor networks can be used to advantage in such a situation; by placing sensors on each container, its location can always be determined.

Intelligent Agriculture and Environmental Sensing

Large farms and ranches may cover several square miles, and they may receive rain only sporadically and only on some portions of the farm. Irrigation is expensive, so it is important to know which fields have received rain, so that irrigation may be omitted, and which fields have not and must be irrigated. Such an application is ideal for sensor networks. The amount of data sent over the network can be very low (as low as one bit - “yes or no” - in response to the “Did it rain today?” query), and the message latency can be on the order of minutes. Yet costs must be low, and power consumption must be low enough for the entire network to last an entire growing season. CSNs may also be used for environmental sensing; in that case, the power consumption and the data rate are not low and need to be optimized.

Health Monitoring

A market for sensor networks that is expected to grow quickly is the field of health monitoring. “Health monitoring” is usually defined as “monitoring of non-life-critical health information”, to differentiate it from medical telemetry, although the definition is broad and nonspecific, and some medical telemetry applications can be considered for sensor networks. Two general classes of health monitoring applications are available for sensor networks. One class is athletic performance monitoring, for example, tracking one’s pulse and respiration rate via wearable sensors and sending the information to a personal computer for later analysis. The other class is at-home health monitoring, for example, personal weight management or cameras to observe the situation of a patient; the patient’s condition may be sent to a personal computer for storage. Other examples are daily blood sugar monitoring and recording by a diabetic, and remote monitoring of patients with chronic disorders.

1.2 Sensor and Related Networks

Several networks have so far been designed and operated around the world. Basically, there are two types of networks: “Data Networks” and “Sensor Networks”.

There are several data networks, such as the ALOHA system [2], the PRNET system [2,3], amateur packet radio networks [4], Wireless Local Area Networks (WLANs), and Wireless Personal Area Networks (WPANs). This thesis, however, emphasizes networks with sensing ability, and in the following a brief summary of several proposed sensor networks is given. Similar to the development of other personal communications systems, the present-day development of sensor networks has many roots, going back at least to the 1978 DARPA-sponsored Distributed Sensor Nets Workshop at Carnegie-Mellon University in Pittsburgh, Pennsylvania (e.g., Lacoss and Walton). Interest driven by surveillance systems led to work on the communication and computation trade-offs of sensor networks, including their use in ubiquitous computing, defined as requiring a high level of user mobility, and pervasive computing, requiring a low level of user mobility. This line of work meshes with the Low-power Wireless Integrated Microsensors (LWIM) project of the mid-1990s and continued with the launch of the SensIT project in 1998, which focuses on wireless, ad hoc networks for large distributed sensor systems. In the following, several sensor networks are introduced.

WINS

The University of California at Los Angeles, often working in collaboration with the Rockwell Science Center, has had a Wireless Integrated Network Sensors (WINS) project since 1993 [5,6]. It has now been commercialized with the founding of the Sensoria Corporation, San Diego, California, USA, in 1998. The program covers almost every aspect of sensor network design, from microelectromechanical system (MEMS) sensor and transceiver integration at the circuit level, signal processing architectures, and network protocol design, to the study of fundamental principles of sensing and detection theory. The group envisions that WINS will “provide distributed network and Internet access to sensors, controls, and processors deeply embedded in equipment, facilities, and the environment.” The WINS communication protocol data link layer is based on a Time Division Multiple Access (TDMA) structure; separate slots are negotiated between each pair of nodes at network initiation; the physical layer employs RF spread spectrum techniques for jamming resistance.

PicoRadio

Jan M. Rabaey of the University of California at Berkeley started the PicoRadio program in 1999 to support “the assembly of an ad hoc network of self-contained mesoscale, low-cost, low-energy sensor and monitor nodes” [7,8]. The physical layer proposed for the PicoRadio network is also direct sequence spread spectrum; the Medium Access Control (MAC) protocol proposed “combines the best of spread spectrum techniques and Carrier Sense Multiple Access (CSMA)”. A node would randomly select a channel (e.g., a code or time slot) and check for activity. If the channel were active, the node would select another channel from the remaining available channels, until an idle channel was found and the message sent. If an idle channel were not found, the node would back off, setting a random timeout timer for each channel. It would then use the channel with the first expired timer and clear the timers of the other channels. Note that the PicoRadio program at Berkeley is distinct from its perhaps better-known “Smart Dust” program, in which MEMS-based motes “could be small enough to remain suspended in air, buoyed by air currents, sensing and communicating for hours or days on end.” The Smart Dust physical layer would be based on passive optical transmission, employing a MEMS corner reflector to modulate its reflection of an incident optical signal.

μAMPS

The μAMPS [9] program, led by Principal Investigator Anantha Chandrakasan at the Massachusetts Institute of Technology, Cambridge, Massachusetts, USA, is focused on the development of a complete system for sensor networks, emphasizing the need for low-power operation. Their work has led to the development of a sensor network communication protocol called Low Energy Adaptive Clustering Hierarchy (LEACH) [10]. LEACH features a node-clustering algorithm that randomizes the assignment of the high-power-consuming cluster head function to multiple nodes in the network. This spreading of the cluster head function to multiple self-elected nodes lengthens the lifetime of the network.
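As a rough illustration of this clustering idea, the following Python sketch implements the randomized self-election rule commonly reported for LEACH; the threshold expression follows the form cited for [10], while the parameter names and the surrounding bookkeeping are simplifying assumptions for illustration.

```python
import random

def leach_elect_heads(nodes, p=0.05, round_no=0):
    """One round of LEACH-style cluster-head self-election.

    nodes    : dict mapping node id -> True if the node has NOT yet served
               as a cluster head in the current epoch (the eligible set)
    p        : desired fraction of cluster heads per round
    round_no : current round number r
    Returns the list of node ids that elected themselves cluster head.
    """
    epoch = int(1 / p)                      # heads rotate over 1/p rounds
    threshold = p / (1 - p * (round_no % epoch))
    heads = []
    for node_id, eligible in nodes.items():
        # Each eligible node draws a uniform random number; if it falls
        # below the threshold, the node becomes a cluster head this round.
        if eligible and random.random() < threshold:
            heads.append(node_id)
            nodes[node_id] = False          # ineligible for rest of epoch
    return heads

# Example: 100 eligible nodes, about 5 heads expected in round 0.
network = {i: True for i in range(100)}
print(leach_elect_heads(network, p=0.05, round_no=0))
```

Because each node decides locally from its own random draw, no negotiation traffic is needed to rotate the energy-hungry head role, which is what lengthens the network lifetime.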

Terminodes, MANET, and Other Mobile Ad Hoc Networks

A field related to sensor networks is the study of mobile ad hoc networks; the Terminodes project [11,12] and the Mobile Ad Hoc Networks (MANET) [13] Working Group of the Internet Engineering Task Force (IETF) are two examples. Although these programs are concerned with ad hoc networks of nodes with low power consumption, the main issue addressed by them is the routing problem created in an ad hoc network when mobility of network nodes is added. This problem does not arise in sensor networks, the nodes of which are assumed substantially stationary. Further, mobile ad hoc networks are usually assumed to be composed of Internet Protocol (IP)-addressable nodes; the ability to handle Transmission Control Protocol/Internet Protocol (TCP/IP) is generally considered to be beyond the capability of sensor network nodes due to cost (e.g., required buffer memory size) and power concerns.

Underwater Acoustic and Deep Space Networks

A sensor network protocol has design features in common with protocols for a wide range of network types, from underwater acoustic networks to deep space radio networks [14-16]. The need in oceanography for underwater acoustic networks, which share the low power consumption, low data throughput, large physical network coverage area, and high message latency tolerance of sensor networks, led to the development of several systems by the early 1990s. State-of-the-art underwater acoustic networks employ phase shift keying (PSK) in the physical layer, a Multiple Access with Collision Avoidance-derived (MACA-derived) protocol for medium access control, and multi-hop routing techniques - all features that would be familiar to the designer of a sensor network protocol. Similarly, the strict power consumption constraints, ad hoc network architecture, and tolerance of message latency common in deep space communication networks are also shared with sensor networks.


1.3 Free viewpoint TV (FTV)

Nagoya University, Japan (Graduate School of Engineering, Department of Information Electronics) has been involved since the year 2000 in designing a new generation of TV called Free viewpoint TV (FTV) [17]. In the Free viewpoint TeleVision (FTV) system, a user can freely control the viewpoint position of any dynamic real-world scene. This system can cover a limited space.

The system consists of 16 client nodes and a server node. Each node is a general-purpose PC with an 800 MHz Intel Pentium III CPU, 256 MB of RAM, and an image capturing board mounted on a PCI bus; together they form a PC cluster. A Gigabit Ethernet connects the sensor node PCs with the central node PC. Three camera configurations were installed on the FTV system: 16 cameras mounted on a line at 2-cm intervals, a two-dimensional camera array on a plane at 2-cm intervals in the x and y directions, and a one-dimensional arc array with cameras set at 3-degree (20 mm) intervals. The closest object is assumed to be about 35 cm from the camera plane. The FTV system uses colour images of 160 x 120 pixels. It requires no special hardware such as dedicated computers or depth sensors; the user requests the viewpoint from the server side.

1.4 Camera Sensor Networks (CSNs)

This thesis introduces Camera Sensor Networks (CSNs) [18]. To expand the coverage of FTV to a wider area, a distributed sensor network can be used. In the Smart Dust paradigm, hundreds or thousands of sensor nodes of cubic-millimeter dimensions can be scattered about any desired environment [19,20]. A CSN, as an extension of the FTV system, has hundreds of cameras distributed densely throughout the environment.

Unlike conventional sensor networks, the sensor network presented in this thesis captures image data and communicates it among nodes. Therefore, the efficient operation of the FTV-CSN is important to investigate, which is the goal of this thesis.


1.4.1 CSNs Performance Objectives

To meet the requirements of the applications described in section 1.1, a successful sensor network design must have several unique features. The need for these features leads to a combination of interesting technical issues not found in other networks.

Fast Processing Time

Some sensor network applications require realtime processing of the observed data, so the processing time needs to be optimized, for instance by using a distributed architecture.

Communication Rate

As already mentioned, sensor networks have limited data throughput requirements when compared with other networks. For design purposes, the maximum desired data rate, when averaged over a one-hour period, may be set to be 512 b/s, although this figure is somewhat arbitrary. The typical data rate is expected to be significantly below that; perhaps 1 b/s or lower in some applications.

Note that this is the data throughput, not the raw data rate as transmitted over the channel, which may be significantly higher. In the case of CSNs, however, the communication rate is large, owing to the huge amount of image data captured by the many cameras in the network. Therefore an efficient communication protocol, including a coding method suited to the multi-view image concept of CSNs, is required.
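To see how far CSNs depart from the 512 b/s figure above, a back-of-the-envelope calculation with the FTV prototype parameters of section 1.3 is instructive; the 30 frames/s rate and the 24-bit colour depth are assumed here for illustration only.

```python
# Rough raw data rate of the FTV prototype of section 1.3.
# Camera count and image size are from the text; the frame rate (30 fps)
# and 3 bytes/pixel (24-bit colour) are illustrative assumptions.
cameras = 16
width, height = 160, 120
bytes_per_pixel = 3          # assumed 24-bit RGB
fps = 30                     # assumed frame rate

rate = cameras * width * height * bytes_per_pixel * fps
print(f"{rate / 1e6:.1f} MB/s uncompressed")   # ~27.6 MB/s
```

Even this modest prototype produces on the order of tens of megabytes per second of raw data, roughly nine orders of magnitude above the design throughput of a conventional sensor network.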

Low Power Consumption

Sensor network applications typically require network components with average power consumption that is as low as possible. Applications involving the monitoring and control of industrial equipment require exceptionally long battery life so that the existing maintenance schedules of the monitored equipment are not compromised. Other applications, such as environmental monitoring over large areas, may require a very large number of devices, which makes frequent battery replacement impractical. Also, certain applications cannot employ a battery at all; network nodes in these applications must obtain their energy by scavenging it from the environment. An example of this type is the automobile tire pressure sensor, for which it is desirable to obtain energy from the mechanical or thermal energy present in the tire instead of from a battery that may need to be replaced before the tire does.


In addition to average power consumption, primary power sources with limited average power sourcing capability often have limited peak power sourcing capabilities as well; this factor should also be considered in the system design.

Multiuser Availability

Many applications of sensor networks require the network to be capable of multiuser operation. Generally, many users request the processed data from the network, as in the Video on Demand (VoD) network architecture. In the case of CSNs and free viewpoint generation for multiple users, the optimization process needs to consider the network parameters, the capability of the nodes, and the design priority.

Message Latency

Sensor networks have very liberal Quality of Service (QoS) requirements, because, in general, they do not support isochronous or synchronous communication, and have data throughput limitations that prohibit the transmission of real-time video and, in many applications, voice.

The message latency requirement for sensor networks is, therefore, very relaxed in comparison to that of other WPANs; in many applications, a latency of seconds or minutes is quite acceptable. The realtime operation of CSNs for arbitrary view generation, however, cannot tolerate such message latency.

Mobility

Sensor network applications, in general, do not require mobility. Because the network is therefore released from the burden of identifying open communication routes, sensor networks suffer less control traffic overhead and may employ simpler routing methods than mobile ad hoc networks (e.g., MANET). A mobile node in a CSN, however, poses the significant challenge of rectifying the mobile camera sensor node in realtime against stationary nodes in its neighborhood, which is still an open field of research.

Network Type

A conventional star network employing a single master and one or more slave devices may satisfy many applications. Because the transmission power of the network devices is limited by government regulation and battery life concerns, however, this network design limits the physical area a network may serve to the range of a single device (the master). When additional range is needed, network types that support multi-hop routing (e.g., mesh or cluster types) must be employed; the additional memory and computational cost for routing tables or algorithms, in addition to network maintenance overhead, must be supported without excessive cost or power consumption. It should be recognized that for many applications, sensor networks are of relatively large order (e.g., >256 nodes); device density may also be high. Note that in the distributed architecture of CSNs, multi-hop routing must be employed.

Worldwide Availability

Many of the proposed applications of sensor networks require the network to be capable of operation worldwide. Although, in theory, this capability may be obtained by employing Global Positioning System (GPS) or Global Navigation Satellite System (GLONASS) receivers in each network node and adjusting node behavior according to its location, the cost of adding a second receiver, plus the additional performance flexibility required to meet the varying worldwide requirements, makes this approach economically unviable. It is, therefore, desirable to employ a single worldwide band - one that has minimal variation in government regulatory requirements from country to country - to maximize the total available market for sensor networks.

Security

The security of sensor networks has two facets of equal importance - how secure the network actually is and how secure the network is perceived to be by users and (especially) potential users. The perception of security is important because users have a natural concern when their data (whatever it may be) is transmitted over the air for anyone to receive. Often, an application employing sensor networks replaces an earlier wired version in which users could physically see the wires or cables carrying their information and know, with reasonable certainty, that no one else was receiving their information or injecting false information for them to receive. The wireless application must work to regain that confidence to attain the wide market needed to lower costs.

Low Cost

Cost plays a fundamental role in applications adding wireless connectivity to inexpensive or disposable products, and in applications with a large number of nodes in the network, such as wireless supermarket price tags. These potential applications require wireless links of low complexity that are low in cost relative to the total product cost. To meet this objective, the communication protocol and network design must avoid the need for high-cost components, such as discrete filters, by employing relaxed analog tolerances wherever possible, and must minimize silicon area by minimizing protocol complexity and memory requirements. In addition, it should be recognized that one of the largest costs of many networks is administration and maintenance. To be a true low-cost system, the network should be ad hoc and capable of self-configuration and self-maintenance.

An “ad hoc” network in this context is defined to be a network without a predetermined physical distribution, or logical topology, of the nodes. “Self-configuration” is defined to be the ability of network nodes to detect the presence of other nodes and to organize into a structured, functioning network without human intervention. “Self-maintenance” is defined to be the ability of the network to detect, and recover from, faults appearing in either network nodes or communication links, again without human intervention.

1.4.2 Arbitrary Viewpoint Generation in CSNs

Given the huge amount of information transferred in such a network, the energy efficiency and functionality of the network should be optimized. A camera sensor network presents a significant trade-off between the power consumed by processing and that consumed by communication. Communication power costs can vastly exceed the demands of today’s power-efficient processors. Generally, then, developers strive to process information locally to reduce the data transmitted. The layered architecture was designed to help sensor networks capitalize on the collective behavior of these complex systems by dynamically increasing the communication load (moving to a higher layer) only when doing so is beneficial. Lower layers can be performed independently at any node; higher layers require communication between nodes. The processing-layer architecture is specified based on the network processing task. For arbitrary view generation in a camera sensor network, capturing and rectification of captured images belong to the lower layers, while interpolation is classified in the higher layers. Hence, distributed processing is needed for arbitrary view generation in CSNs.


Arbitrary viewpoint generation is the main task of the CSNs presented in this thesis, as in FTV. The important objectives of this task for the construction of a large-scale FTV as a 3D visual communication system are therefore:

(1) Camera calibration and multiview image rectification: the main preprocessing tasks before arbitrary view generation and coding of multiview images.

(2) Realtime arbitrary view generation in CSNs: this involves correspondence search among multi-view images, or ray-space interpolation. Hence, an accelerated block matching algorithm with correct detection of correspondences is needed.

(3) Adaptive distributed source coding: to make communication among nodes in CSNs efficient, considering the highly spatially redundant views, when the processing and communication abilities of the nodes are limited.

(4) Optimization of CSNs for multiuser operation: to calculate the optimum number of nodes that minimizes the processing time, communication data amount, and time-average power consumption in CSNs using Distributed Processing (DP), separately and jointly.

1.5 Thesis Organization

The thesis is organized as follows. Chapter 2 presents the theoretical background and related works: the ray-space method and its rendering are reviewed, and an overview of 3D visual communication systems is given. The Camera Sensor Network is introduced as a new kind of sensor network in chapter 3, together with an overview of the experimental system (FTV) as a CSN. Free viewpoint generation using Offset-Block Matching (OBM), described in chapter 4, represents a combined approach that gains the advantages of the pixel-based and block-based matching methods at the same time. Adaptive Distributed Source Coding (ADSC) of multi-view images is described in chapter 5. The optimization process of multiuser CSNs for arbitrary view generation is explored in chapter 6. The thesis concludes with chapter 7, summarizing the accomplished results. In this thesis, camera calibration and view image rectification are preliminary tasks; however, they are not the goal of this thesis. Hence, given their importance as prerequisites to any processing, the theoretical concepts of these algorithms are presented in appendices A and B, respectively.


References

[1] Edgar H. Callaway, Jr., “Wireless Sensor Networks: Architectures and Protocols”, Auerbach Publications, 2004.

[2] N. Abramson, “The ALOHA System”, Proc. AFIPS Fall Joint Computer Conf., vol. 37, pp. 281-285, 1970.

[3] V. G. Cerf, “Packet Communication Technology”, in F. F. Kuo (Ed.), Protocols and Techniques for Data Communication Networks, Prentice Hall, 1981.

[4] P. R. Karn, H. E. Price, and R. J. Diersing, “Packet Radio in the Amateur Service”, IEEE Journal Selected Area in Communication, vol. SAC-3, no. 3, pp. 431-439, May 1985.

[5] http://www.janet.ucla.edu.

[6] G. J. Pottie, and W. J. Kaiser, “Wireless Integrated network sensors”, Communication ACM, vol. 45, no. 12, pp. 17-21, Dec. 2002.

[7] http://bwrc.eecs.berkeley.edu.

[8] J. M. Rabaey et al., “PicoRadio Supports Ad hoc Ultra-low Power Wireless Networking”, IEEE Computer, vol. 33, no. 7, pp. 42-48, July 2000.

[9] http://www-mtl.mit.edu.

[10] W. R. Heinzelman, A. Chandrakasan, and H. Balakrishnan, “Energy-Efficient Communication Protocol for Wireless Microsensor Networks”, Proc. 33rd Annu. Hawaii Int. Conf. on System Sciences, pp. 3005-3014, 2000.

[11] http://www.terminodes.org.

[12] J. P. Hubaux et al., “Toward Self-Organized Mobile Ad hoc Networks: The Terminodes Project”, IEEE Communication, vol. 39, no. 1, pp. 118-124, Jan. 2001.

[13] http://www.ietf.org.

[14] B. Baund, “A Communication Protocol for Acoustic Ad Hoc Networks of Autonomous Underwater Vehicles”, Master’s Thesis, Florida Atlantic University, 2001.

[15] E. M. Sozer, M. Stojanovic, and J. G. Proakis, “Underwater Acoustic Networks”, IEEE Journal of Oceanic Eng., vol. 25, no. 1, pp. 71-83, Jan. 2000.

[16] J. G. Proakis et al., “Shallow Water Acoustic Networks”, IEEE Communication, vol. 39, no. 11, pp. 114-119, Nov. 2001.

[17] P. Na Bangchang, T. Fujii, M. Tanimoto, “Experimental System of Free Viewpoint Television”, Proc. of IS&T/SPIE Symposium on Electronic Imaging, Santa Clara, CA, USA, Vol. 5006-66, pp. 554-563, Jan. 2003.

[18] M. P. Tehrani, P. Na Bangchang, T. Fujii, M. Tanimoto, “The Optimization of Distributed Processing for Arbitrary View Generation in Camera Sensor Networks”, IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, Special Section on DSP, vol. E87-A, no. 8, pp. 1863-1870, Aug. 2004.

[19] J. M. Kahn, R. H. Katz, and K. S. J. Pister, “Mobile Networking for Smart Dust”, ACM/IEEE International Conference on Mobile Computing and Networking, Seattle, WA, Aug. 1999.

[20] C. C. Shen, C. Srisathapornphat, C. Jaikaeo, “Sensor information networking architecture and applications”, IEEE Personal Communications, vol. 8, pp. 52-59, Aug. 2001.


Chapter 2

Background and Related Works

This chapter reviews the basic background and related works involved with 3D visual communication systems. The discussion is separated into three main parts. The first part reviews the basic background of the ray-space representation and its definition. The second part covers the general 3D visual communication system. The last part introduces some examples of 3D visual systems, classified by their new-image rendering methods, and discusses the advantages and disadvantages of each system approach.

2.1 Ray-Space Representation

2.1.1 Definition of Ray-Space

In visual communication, a neutral representation method that is independent of the input/output systems is very necessary. This representation should be able to handle the visual data of a variety of systems, such as stereo camera, multi-camera, and hologram systems, in the same manner. Ray-space [1,2] was originally proposed to serve this purpose, as a common data format for 3D image communication [3]. A similar idea has been proposed in the computer graphics field for generating photo-realistic images in computer-generated virtual worlds [4,5]. It is one of the “image-based rendering” techniques, and is called the Lumigraph [6,7]. It has been widely used to create photo-realistic virtual worlds. Both are based on the idea that a view image from an arbitrary viewpoint can be generated from a collection of real view images. The ray-space method describes 3D spatial information as the information of the rays that travel in the space. Since human visual sensation is excited by rays of light incident on the retina, the ray-space method suggests that light rays can serve as the primitive element of visual data. Everything that can be seen can be decomposed and converted into a set of light ray data, called ray-space data.

Figure 2.1: Definition of plenoptic function [8]

The representation of a 3D scene using light ray information is not a new concept. It is similar to the concept of the plenoptic function proposed about 30 years ago [8], as shown in Figure 2.1. In order to measure the plenoptic function, one can imagine placing an idealized eye at every possible (x,y,z) location and recording the intensity of the light rays passing through the center of the pupil at every possible angle (θ,φ), for every wavelength λ, at every time t. It is simplest to have the eye always look in the same direction, so that the angles (θ,φ) are always computed with respect to an optic axis that is parallel to the z axis. The resulting function takes the form of the 7D function shown in Equation (2.1).

f = f(x, y, z, θ, φ, λ, t)    (2.1)

Figure 2.2: Ray-space representation

However, the practical difficulty of capturing, processing, and storing the 7D plenoptic function has limited its application to relatively small areas. The ray-space representation thus proposes a better way to look at the plenoptic concept. It suggests that:

(1) In homogeneous media, the effect of λ can be neglected.

(2) The ray-space representation was first proposed for static scenes, so it further assumes that the effect of time t is ignorable. However, this thesis concerns dynamic scenes, and variations in time must be taken into account. For simplicity, this chapter will continue by neglecting the effect of time; in the rest of this thesis, however, ray-space data means the ray-space data of a dynamic scene, and the effect of time cannot be neglected.

(3) In free space, since the radiance of a ray traveling through empty space remains constant, any ray traveling inside the space of interest can be uniquely assigned to the ray that intersects the plane z = 0 at position (x,y). Thus the effect of z can also be neglected. For example, the light rays that go through points A and B in Figure 2.2 can be represented by the rays that cut through the object plane at position (x0, y0).

2.1.2 Ray-Space Acquisition and Rendering

Figure 2.3 shows the definition of ray-space. Let (x,y,z) be the three space coordinates, and (θ,φ) the parameters of direction. x and y are the coordinates of the intersection of the ray with the XY-plane. θ and φ are the angles, in the horizontal and vertical directions respectively, between the Z-axis and the rays passing through the XY-plane, as shown in Figure 2.3.

Figure 2.3: Acquisition of ray-space data

In free space, a ray going through the space is uniquely designated by its intersection (x,y) with the plane z = 0 and its direction (θ,φ). These ray parameters construct a 4D space; in fact, this is a 4D subspace of the 5D ray-space (x,y,z,θ,φ). On this ray parameter space, a function f is defined whose value corresponds to the intensity of the specified ray. Thus, all the intensity data of rays can be expressed by Equation (2.2). This ray parameter space is called the ray-space.

f = f(x, y, θ, φ),  −π/2 ≤ θ ≤ π/2,  −π ≤ φ ≤ π    (2.2)

In general, the rays that pass through a specific plane can be captured using a pinhole camera. When a camera is set at (x0, y0, z0), the intensity data of the rays that go through the pinhole are given by Equation (2.3).

f = f(x, y, z, θ, φ),  x = x0, y = y0, z = z0    (2.3)


For simplicity, only the 2D subspace f(x,θ) of the 5D ray-space is considered, in which the vertical parallax (φ) and vertical position (y) are neglected. This 2D plane of the ray-space data is the Epipolar Plane Image (EPI), which is the cut of the ray-space data of Figure 2.4 parallel to the xu-plane for a given y. In other words, the EPI is the intersection of an epipolar plane with the camera planes in stereovision, which is well known in the computer graphics field. Therefore, if there is an array of cameras that all share a common epipolar plane, the intersection of the epipolar plane with all camera planes can be shown in one image; this image is called the EPI. Let X, Z be real-space coordinates, and x, u the ray-space coordinates, as shown in Figure 2.4. The rays that pass through a point (X,Z) in the real space form a line in the ray-space given by Equation (2.4).

X = x + tZ,  t = tan θ    (2.4)

This gives the most important and interesting characteristic of the 4D ray-space representation: a view image corresponds to a cross-section image of the ray-space data. Therefore, the view acquisition/display process can be considered the recording/extracting process of the ray-space data along a locus. In the acquisition process, the ray data is sampled in the real space and recorded as ray-space data along the locus, as shown in Figure 2.4.

Figure 2.4: The construction of ray-space data

In the display process, the ray-space is cut along the locus and the section image of the ray-space is extracted, as shown in Figure 2.5. In the case of non-dense ray-space data, however, interpolation is needed. It generates the missing rays between two EPI lines. The EPI lines run in the x direction for a given u, as 1D subspaces of the 5D ray-space data or, in the computer graphics field, as the intersection of an epipolar plane with one camera plane. The interpolation task should be done on the 2D ray-space data, i.e., the EPI: the interpolation generates an EPI line between two existing EPI lines. In the next section the ray-space interpolation is explained under the EPI constraint.

Figure 2.5: Rendering of ray-space data
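To make the interpolation step concrete, the sketch below generates the EPI line halfway between two captured EPI lines. It is a deliberately simplified stand-in, using fixed blocks and a SAD search, for the adaptive-filtering interpolation used in the FTV system [10,11]; the function name and the block parameters are illustrative.

```python
import numpy as np

def interpolate_epi_line(line_a, line_b, max_disp=16, block=8):
    """Generate the EPI line midway between two adjacent EPI lines.

    line_a, line_b : 1D grey-value arrays from the same EPI
                     (same y, neighbouring camera positions u).
    For every block in line_a, the best-matching block in line_b is
    searched within +/- max_disp pixels (SAD criterion); the matched
    pair is then blended at half the disparity.
    """
    n = len(line_a)
    mid = np.zeros(n, dtype=np.float64)
    weight = np.zeros(n, dtype=np.float64)
    for start in range(0, n - block + 1, block):
        ref = line_a[start:start + block].astype(np.float64)
        best_d, best_cost = 0, np.inf
        for d in range(-max_disp, max_disp + 1):
            s = start + d
            if s < 0 or s + block > n:
                continue
            cost = np.abs(ref - line_b[s:s + block]).sum()
            if cost < best_cost:
                best_cost, best_d = cost, d
        # Place the blended block at half the disparity (clamped).
        t = min(max(start + best_d // 2, 0), n - block)
        mid[t:t + block] += 0.5 * (ref + line_b[start + best_d:
                                                start + best_d + block])
        weight[t:t + block] += 1.0
    weight[weight == 0] = 1.0
    return mid / weight
```

Blending matched blocks at half the disparity is plain linear interpolation along the EPI lines of Equation (2.4); chapter 4 replaces this naive matcher with the offset-block method.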

2.2 3D Representation of Free Viewpoint Methods

In this section, the basic concepts of 3D representation of free viewpoint methods are explained. 3D representation of free viewpoint can be applied to both static and dynamic scenes and objects.

2.2.1 Components

In general, it is possible to separate the 3D representation of free viewpoint into three main parts: acquisition, transmission, and display, as demonstrated in Figure 2.6.

Figure 2.6: Basic block diagram of 3D visual communication (3D data acquisition, 3D data transmission, 3D data display)

2.2.1.1 Data Acquisition

In order to capture dynamic 3D contents, the data acquisition system has to be extended from a single-camera system to stereo or multi-camera systems. The multi-camera system outweighs the stereo one, as it can acquire a larger amount of 3D scene information. The number of cameras varies depending on the system size.

2.2.1.2 Data Compression

The transmission of 3D content is generally similar to the transmission of 2D content in a video-based system. It means that the 3D data can be transmitted through a common communication channel, such as the Internet or a wireless channel. However, the 3D information of a scene is usually much larger than the 2D information of the same scene, so compression is compulsory [9]. In a scenario with limited processing and communication abilities of nodes, a distributed source coding scheme is suitable.

2.2.1.3 Display

There are various ways to display the captured 3D information, based on the display devices. Those display devices can be separated into two groups. The first group is the display devices used in 2D visual systems, such as a normal 2D display or television. The other group includes special display devices designed for 3D visual systems, such as binocular glasses or a 3D television.

2.2.2 Representation Methods

There are five main representation methods that have been widely used to represent 3D information: (1) Image Domain, (2) Integral Photography, (3) Image-Based Rendering (IBR), (4) Model-Based Rendering (MBR), and (5) a combination of IBR and MBR representations.

2.2.2.1 Image Domain

In image domain representation, original multi-view images are directly obtained from a multiple camera system.

Data Acquisition: The camera setup can be parallel, convergent, or divergent. To generate a new view image by, e.g., an image warping technique, the corresponding points between the original images have to be specified. Automatically finding correspondences incurs computation costs, such as feature extraction, matching, and correlation calculation. In addition, if we want to recover the depth value of the corresponding points, the camera intrinsic and extrinsic parameters have to be measured through camera calibration.

Data compression: In image domain representation, the original multi-view images can be compressed as moving pictures. Each view image can be compressed by using, e.g., JPEG, and the multi-view image set can be compressed by using disparity compensation. A more sophisticated approach is to use the shape or depth of the scene. In [3], 21 view images are compressed to the amount of 1-2 view images.

Display: In image domain representation, no special rendering process is required; it may be warping of the original multi-view images or rendering from a 3D object to a 2D image, and these processes can be done in real-time.

2.2.2.2 Integral Photography

In Integral Photography representation, similar to Image domain representation, original multi-view images are directly obtained from a multiple camera system.

Data Acquisition: In this system, an array of CCD cameras densely captures the views. The camera setup is precisely aligned, and the captured views are optically converted.

Data compression: The IP representation is similar to the image domain representation, so the original multi-view images can be compressed as moving pictures.

Display: In IP representation, no special rendering process is required. The same views are back-projected through a microlens array, and the human eye performs the rendering.

Note that Integral Photography represents the 3D information in the image domain; however, the amount of captured information is large, which makes it similar to the ray-space representation concept.

2.2.2.3 Ray-Space, Light Field, and Lumigraph (IBR)

Ray-Space representation is obtained by converting the original multi-view images to “ray” parameters.

Data Acquisition: Since the 4D Ray-Space f(x,y,θ,φ) is defined in the world coordinate system, the conversion requires the camera intrinsic and extrinsic parameters; therefore, camera calibration has to be done. In real experimental systems, it is difficult to obtain dense Ray-Space data. To solve this problem, a good solution is to generate intermediate ray data, i.e., to “interpolate” the Ray-Space from sparsely sampled Ray-Space data. In the Free-viewpoint TV system, real-time Ray-Space interpolation by adaptive filtering is realized; details are given in [10,11].

The light field or Lumigraph L(u,v,s,t) is another format of the Ray-Space; therefore, the procedure and computation costs to obtain this representation are the same as for the Ray-Space described above.

Data Compression: For Ray-Space compression, the following redundancies of the Ray-Space data can be utilized. One is the same redundancy that 2D still images have: a cross section of the Ray-Space corresponds to a 2D still image, so the Ray-Space has the same statistical properties as 2D still images. The other redundancy comes from the fact that the intensity of a surface point is similar regardless of the viewing direction. The extreme case is that of objects covered with Lambertian surfaces: the intensities of all the rays that come from the same point are then constant regardless of the viewing direction. This redundancy is so high that it should be exploited for efficient Ray-Space coding. In [12-14], some experiments on Ray-Space coding, such as subband coding and vector quantization, are briefly described.

As described above, the light field or Lumigraph is the same concept as the Ray-Space; therefore, the same compression schemes can be applied to light field compression. The original light field paper [6] mentions compression briefly: vector quantization was applied together with Lempel-Ziv coding, attaining a 24:1 compression ratio by vector quantization and 5:1 by entropy coding, for a total compression of 120:1. In [29] the light field is coded using a block-adaptive coder, yielding between a 208:1 compression ratio at 44.9 dB and 945:1 at 36.2 dB.

Display: Since a view image is a 2D subspace of the 4D space, the renderer only resamples a 2D slice from the 4D Ray-Space or light field. This process requires only memory accesses and no computation costs.
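Assuming the Ray-Space is held in memory as a dense 4D array with quantised angles, the rendering step really is just an array slice, as the following sketch shows (all resolutions are illustrative):

```python
import numpy as np

# Ray-space stored as a 4D array f[x, y, theta, phi] with quantised
# angles. Rendering a view for a fixed viewing direction is then a pure
# memory access: one 2D slice of the 4D array, as described in the text.
X, Y, TH, PH = 160, 120, 16, 16        # illustrative resolutions
f = np.random.rand(X, Y, TH, PH)       # stand-in for captured ray data

theta_idx, phi_idx = 8, 8              # requested viewing direction
view = f[:, :, theta_idx, phi_idx]     # the rendered image, no computation
print(view.shape)                      # (160, 120)
```

For an arbitrary viewpoint, the slice follows the locus of Equation (2.4) rather than a fixed angle index, but it remains a lookup rather than a computation.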

2.2.2.4 Model Based (MBR)

Model-based representation is obtained by converting the original multi-view images to a 3D model and texture information.


Data Acquisition: Model-based representation also needs camera calibration and registration, like the Ray-Space and light field. Model-Based Rendering usually recovers the geometry of the real scene and then renders it from the desired virtual viewpoints. Methods for the automatic construction of 3D models have found applications in many fields, including mobile robotics, virtual reality, and entertainment. These methods generally fall into two categories: active methods [15-21], using lasers, and passive methods [22-24], using multiple 2D photographs of a scene.

Data Compression: To robustly reconstruct a 3D model of a scene, the Multi-Hypothesis Volumetric Reconstruction (MHVR) algorithm is used [25]. In contrast to methods that rely on feature matching [26-28], no corresponding points need to be found in MHVR. A voxel model is built directly from calibrated images that depict a static scene from multiple viewpoints, and no image segmentation is necessary. The algorithm is based on discretizing the space containing the object into volume elements (voxels); the discretized volume typically comprises some 200³ to 300³ voxels. Initially, each surface voxel is projected into all images visible from its location, and numerous voxel color hypotheses are collected. The voxel is assigned a color hypothesis if its projections into at least two different images agree. After all surface voxels have been considered, each voxel is again projected into the visible images, and the voxel’s color hypotheses are checked for consistency with the images. If, after considering all visible images, no color hypothesis remains, the voxel is set to be transparent, and a new inner voxel becomes visible. After all surface voxels have been tested, the algorithm starts over, collecting color hypotheses for all now-visible voxels and afterwards eliminating wrong hypotheses. The algorithm keeps iterating until no more voxels can be set transparent. A volumetric model of an object consists of several hundred thousand voxel cubes. For model-based compression, the redundancy of the data can be utilized: in [29], at compression ratios from 132:1 to 2850:1, the image is reconstructed at PSNR values ranging from 42.0 dB down to 32.9 dB, respectively.

Display: In the model-based method, the required viewpoint is constructed simply from the texture map and the 3D model information; the texture is mapped using the 3D information.


2.2.2.5 Surface Light Field Mapping (IBR & MBR)

Surface light field mapping is a representation method that combines the IBR and MBR approaches.

Data Acquisition: The surface light field also needs camera calibration and registration, like the Ray-Space and light field. The difference lies in that in a surface light field f(r,s,θ,φ), the first pair of parameters (r,s) describes the surface location. This means that we need the shape of the 3D object. In the case of a synthetic scene, since the computer has a model of the object, the surface light field can be easily obtained. In a real scene, however, it is a troublesome procedure to measure the 3D object and construct a triangular mesh. Another process of the surface light field is data resampling, which includes normalization of texture sizes and resampling of viewing directions. It can be seen as an “interpolation” process of light rays; but unlike the Ray-Space interpolation described above, the viewing direction is dense enough, and only 0-order interpolation is sufficient.

Data compression: In the surface light field approach, the light field data is approximated using dimensionality reduction over small surface patches. Precisely, the discrete 4-dimensional surface light field function f is approximated as a sum of a small number of products of lower-dimensional functions:

f(r, s, θ, φ) ≈ Σ_{k=1}^{K} g_k(r, s) h_k(θ, φ)    (2.5)

where the functions g_k and h_k are referred to as surface maps and view maps, respectively.

The compression ratio at this stage depends on the truncation of the sum obtained from the decomposition. In [30], they report that a 100:1 compression ratio is obtained by the approximation, a further 10:1 by vector quantization, and 6:1 by hardware texture compression.
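The decomposition of Equation (2.5) can be illustrated with a truncated SVD per surface patch, which is one standard way to realise such a rank-K factorisation; the data layout and the use of SVD here are assumptions for illustration, not necessarily the exact method of [30].

```python
import numpy as np

def factor_patch(f_patch, K=3):
    """Rank-K approximation of one surface patch, Equation (2.5).

    f_patch : 4D array indexed as [r, s, theta, phi] holding the sampled
              light field of a single patch (illustrative layout).
    Returns K surface maps g_k(r, s) and K view maps h_k(theta, phi)
    whose sum of outer products approximates f_patch.
    """
    R, S, T, P = f_patch.shape
    M = f_patch.reshape(R * S, T * P)          # rows: surface, cols: view
    U, sig, Vt = np.linalg.svd(M, full_matrices=False)
    g = (U[:, :K] * sig[:K]).reshape(R, S, K)  # surface maps
    h = Vt[:K].reshape(K, T, P)                # view maps
    return g, h

# The compression comes from storing only K*(R*S + T*P) samples per
# patch instead of R*S*T*P.
patch = np.random.rand(8, 8, 16, 16)
g, h = factor_patch(patch, K=3)
approx = np.einsum('rsk,ktp->rstp', g, h)
print(np.abs(approx - patch).mean())
```

The sum of outer products also explains the rendering procedure described next: the surface-map and view-map textures are multiplied pixel by pixel and accumulated over k.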

Display: The surface light field requires the most complicated rendering process. First, the algorithm texture-maps the surface primitive using the surface map. Then, it texture-maps the surface primitive using the fragment of the view map determined by the view-dependent texture coordinates. Finally, it performs a pixel-by-pixel multiplication of the results of the two texture mapping operations. Although complicated, the algorithm is tuned to use the hardware engine.
