Sony Patent | Transmission Apparatus, Transmission Method, Reception Apparatus, And Reception Method

Patent: Transmission Apparatus, Transmission Method, Reception Apparatus, And Reception Method

Publication Number: 20200294188

Publication Date: 20200917

Applicants: Sony

Abstract

Improvement of the display performance in VR reproduction is achieved. Encoded streams corresponding to respective divided regions (partitions) of a wide viewing angle image are transmitted together with information of the number of pixels and a frame rate of each divided region. On the reception side, the number of divided regions to be decoded corresponding to a display region can be easily set to a decodable maximum on the basis of the decoding capacity and the information of the number of pixels and the frame rate of each divided region of the wide viewing angle image. Therefore, the frequency of switching of the encoded stream with a movement of the display region can be reduced as far as possible, and improvement of the display performance in VR reproduction can be made.

TECHNICAL FIELD

[0001] The present technology relates to a transmission apparatus, a transmission method, a reception apparatus, and a reception method, and particularly to a transmission apparatus and so forth for transmitting a wide viewing angle image.

BACKGROUND ART

[0002] Recently, delivery of VR (Virtual Reality) contents is considered. For example, PTL 1 describes that, on the transmission side, a spherical captured image is plane packed to obtain a projection picture as a wide viewing angle image, and encoded image data of the projection picture is transmitted to the reception side such that VR reproduction is performed on the reception side.

CITATION LIST

Patent Literature

[PTL 1]

[0003] Japanese Patent Laid-Open No. 2016-194784

SUMMARY

Technical Problem

[0004] The feature of VR reproduction resides in implementation of viewer interactive display. If image data of a projection picture is transmitted by one encoded stream, then the decoding load on the reception side is high. It is conceivable to divide a projection picture and transmit encoded streams corresponding to the individual divided regions. On the reception side, it is only necessary to decode an encoded stream of part of the divided regions corresponding to a display region, and increase of the decoding load can be prevented.

[0005] In this case, switching of an encoded stream to be decoded becomes necessary together with movement of the display region. However, upon switching of an encoded stream, there is the possibility that deterioration of the display performance may be caused by disagreement between a motion of the user and the display. Therefore, it is demanded to minimize the frequency of switching of an encoded stream with a movement of a display region.

[0006] The object of the present technology resides in achievement of improvement of the display performance in VR reproduction.

Solution to Problem

[0007] A concept of the present technology resides in a transmission apparatus including a transmission section configured to transmit an encoded stream corresponding to each of divided regions of a wide viewing angle image and transmit information of the number of pixels and a frame rate of each of the divided regions.

[0008] In the present technology, encoded streams corresponding to each of the divided regions (each of the partitions) of the wide viewing angle image are transmitted, and the information of the number of pixels and the frame rate of each of the divided regions is transmitted by the transmission section. For example, the wide viewing angle image may include a projection picture obtained by cutting out and plane packing part or the entirety of a spherical captured image.

[0009] For example, the encoded stream corresponding to each of the divided regions of the wide viewing angle image may be hierarchically encoded. In this case, on the reception side, temporal partial decode can be performed readily. Further, for example, the transmission section may transmit the information of the number of pixels and the frame rate of the divided region together with a container that includes the encoded stream. In this case, the information of the number of pixels and the frame rate of the divided region can be acquired without decoding the encoded streams.

[0010] For example, the encoded stream corresponding to each divided region of the wide viewing angle image may be obtained by individually encoding the divided region of the wide viewing angle image. Further, for example, the encoded stream corresponding to each divided region of the image may be obtained by performing encoding using a tile function for converting each divided region of the wide viewing angle image into a tile. In this case, each of the encoded streams of the divided regions can be decoded independently.

[0011] For example, the transmission section may transmit encoded streams corresponding to all of the respective divided regions of the wide viewing angle image. Alternatively, the transmission section may transmit an encoded stream corresponding to a requested divided region from among the respective divided regions of the wide viewing angle image.

[0012] In this manner, in the present technology, the information of the number of pixels and the frame rate of each of divided regions of the wide viewing angle image is transmitted. Therefore, on the reception side, the number of divided regions to be decoded corresponding to the display region can be easily set to a decodable maximum on the basis of the decoding capacity and the information of the number of pixels and the frame rate of the divided regions of the wide viewing angle image. Consequently, the frequency of switching of the encoded stream with a movement of the display region can be reduced as far as possible and improvement of the display performance in VR reproduction can be achieved.

[0013] Further, another concept of the present technology resides in a reception apparatus including a control section configured to control a process for decoding encoded streams of a predetermined number of divided regions corresponding to a display region from among respective divided regions of a wide viewing angle image to obtain image data of the display region, and a process for calculating a value of the predetermined number on the basis of a decoding capacity and information of the number of pixels and a frame rate associated with each of the encoded streams corresponding to the respective divided regions of the wide viewing angle image.

[0014] In the present technology, the process for decoding encoded streams of a predetermined number of the divided regions corresponding to the display region from among the respective divided regions of the wide viewing angle image to obtain the image data of the display region is controlled by the control section. Further, the process for calculating the value of the predetermined number on the basis of the decoding capacity and the information of the number of pixels and the frame rate associated with each of the encoded streams corresponding to the respective divided regions of the wide viewing angle image is controlled by the control section. For example, the control section may further control a process for requesting a distribution server for transmission of the encoded streams of the predetermined number of divided regions and receiving the encoded streams of the predetermined number of divided regions from the distribution server.

[0015] In this manner, in the present technology, the number of divided regions to be decoded corresponding to the display region is calculated on the basis of the decoding capacity and the information of the number of pixels and the frame rate of the divided region. Therefore, the number of divided regions to be decoded corresponding to the display region can be set easily to a maximum, and the frequency of switching of the encoded stream with a movement of the display region can be reduced as far as possible, so that improvement of the display performance in VR reproduction can be made.

[0016] It is to be noted that, in the present technology, for example, the control section may further control a process for predicting that the display region exceeds a decode range and switching the decode range. This makes it possible to perform display suitable for a destination of movement even in the case where the display region moves. Further, in this case, for example, the control section may predict that the display region exceeds the decode range and switch a decode method to temporal partial decode to enlarge the decode range, and may further control a process for predicting that the display region converges into the decode range before the enlargement and switching the decode method to temporal full decode to reduce the decode range. In this case, by switching the decode method to temporal partial decode, decode becomes possible even if the decode range is expanded. Further, by expanding the decode range, the frequency of switching of the encoded stream, namely, of the decode range, with respect to movement of the display region different from the prediction can be reduced, and further improvement of the display performance in VR reproduction can be made.

Advantageous Effects of Invention

[0017] With the present technology, improvement of the display performance in VR reproduction can be achieved. It is to be noted that the effect described here is not necessarily limited and may be any of advantageous effects described in the present disclosure.

BRIEF DESCRIPTION OF DRAWINGS

[0018] FIG. 1 is a block diagram depicting an example of a configuration of an MPEG-DASH-based stream delivery system.

[0019] FIG. 2 is a view depicting an example of a relationship of respective structures arranged hierarchically in an MPD file.

[0020] FIG. 3 is a block diagram depicting an example of a configuration of a transmission and reception system as an embodiment.

[0021] FIG. 4 is a view schematically depicting an example of a configuration of the entire transmission and reception system.

[0022] FIG. 5 is a view illustrating plane packing for obtaining a projection picture from a spherical captured image.

[0023] FIG. 6 is a view depicting an example of division of a projection picture.

[0024] FIG. 7 is a view illustrating hierarchical encoding.

[0025] FIG. 8 is a view illustrating encoding using a tile function for converting each partition into a tile.

[0026] FIG. 9 is a view depicting an example of a structure of a partition descriptor.

[0027] FIG. 10 is a view depicting the content of principal information in the structure example of the partition descriptor.

[0028] FIG. 11 is a view depicting an example of a description of an MPD file corresponding to a tile-based MP4 stream (tile-based container).

[0029] FIG. 12 is a view depicting an example of description of an MPD file corresponding to an MP4 stream of each partition.

[0030] FIG. 13 is a view schematically depicting an example of an MP4 stream (track) in the case where encoding using a tile function for converting each partition into a tile is performed.

[0031] FIG. 14 is a view schematically depicting an example of an MP4 stream (track) in the case where each partition is encoded individually.

[0032] FIG. 15 is a view depicting an example in which a projection picture of the 8K/60 Hz class is divided by a partition size of 1920×1080 (Full HD).

[0033] FIG. 16 is a view depicting an example in which a projection picture of the 8K/60 Hz class is divided by a partition size of 1280×960 (4 VGA).

[0034] FIG. 17 is a view depicting an example in which a projection picture exceeding 8K/60 Hz is divided by a partition size of 1280×960 (4 VGA).

[0035] FIG. 18 is a view depicting an example in which a projection picture of the 8K/60 Hz class is divided by a partition size of 1280×720 (720p HD).

[0036] FIG. 19 is a view collectively depicting the maximum number of decodable partitions according to partition sizes in a “Level 5.1” decoder.

[0037] FIG. 20 is a view collectively depicting the maximum number of decodable partitions according to partition sizes in a “Level 5.2” decoder.

[0038] FIG. 21 is a view depicting a case in which the number of pixels of each partition is not uniform.

[0039] FIG. 22 is a view depicting an example of movement control of a display region in the case where an HMD is used as a display apparatus.

[0040] FIG. 23 is a view depicting an example of movement control of a display region in the case where a display panel is used as a display apparatus.

[0041] FIG. 24 is a view depicting an example of switching of a delivery stream set with a movement of a display region.

[0042] FIG. 25 is a view depicting an example of switching of a delivery stream set with a movement of a display region.

[0043] FIG. 26 is a view illustrating a case in which it is predicted that a display region exceeds a decode range.

[0044] FIG. 27 is a view depicting a state of switching of a decode range in the case where a display region successively moves.

[0045] FIG. 28 is a view depicting a state of switching of a decode range in the case where a display region successively moves (wide decode mode introduction).

[0046] FIG. 29 is a view depicting a frame rate of each partition in the case where video encoding is ready for a tile.

[0047] FIG. 30 is a view depicting a frame rate of a partition in the case where video encoding encodes each partition into an independent stream.

[0048] FIG. 31 is a view illustrating convergence prediction of a display region.

[0049] FIG. 32 is a view depicting an example of mode change control.

[0050] FIG. 33 is a flow chart depicting an example of a control process for decode range change and mode change by a control section of a service receiver.

[0051] FIG. 34 is a block diagram depicting an example of a configuration of a service transmission system.

[0052] FIG. 35 is a block diagram depicting an example of a configuration of the service receiver.

[0053] FIG. 36 is a view depicting an example of a configuration of a transport stream in the case where video encoding is ready for a tile.

[0054] FIG. 37 is a view depicting an example of a configuration of an MMT stream in the case where video encoding is ready for a tile.

[0055] FIG. 38 is a view depicting an example of a description of an MPD file in the case where a tile stream has a single stream configuration.

[0056] FIG. 39 is a view schematically depicting an example of an MP4 stream (track) in the case where a tile stream has a single stream configuration.

[0057] FIG. 40 is a view depicting an example of a configuration of a transport stream in the case where a tile stream has a single stream configuration.

[0058] FIG. 41 is a view depicting an example of a configuration of an MMT stream in the case where a tile stream has a single stream configuration.

[0059] FIG. 42 is a view schematically depicting another example of an MP4 stream (track) in the case where encoding is performed using a tile function for converting each partition into a tile.

[0060] FIG. 43 is a view schematically depicting a further example of an MP4 stream (track) in the case where each partition is encoded individually.

[0061] FIG. 44 is a view schematically depicting an example of an MP4 stream (track) in the case where a tile stream has a single stream configuration.

DESCRIPTION OF EMBODIMENT

[0062] In the following, a mode for carrying out the invention (hereinafter referred to as an “embodiment”) is described. It is to be noted that the description is given in the following order.

[0063] 1. Embodiment

[0064] 2. Modifications

1. Embodiment

[Overview of MPEG-DASH-Based Stream Delivery System]

[0065] First, an overview of an MPEG-DASH-based stream delivery system to which the present technology can be applied is described.

[0066] FIG. 1 depicts an example of a configuration of an MPEG-DASH-based stream delivery system 30. In this configuration example, a media stream and an MPD (Media Presentation Description) file are transmitted through a communication network transmission line (communication transmission line). The stream delivery system 30 is configured such that N service receivers 33-1, 33-2, … , 33-N are connected to a DASH stream file server 31 and a DASH MPD server 32 through a CDN (Content Delivery Network) 34.

[0067] The DASH stream file server 31 generates a stream segment of the DASH specification (hereinafter referred to suitably as a “DASH segment”) on the basis of media data of a predetermined content (video data, audio data, subtitle data and so forth) and sends out the segment in response to an HTTP request from a service receiver. The DASH stream file server 31 may be a server dedicated to streaming, or a web server may sometimes double as the DASH stream file server 31.

[0068] Further, the DASH stream file server 31 transmits, in response to a request for a segment of a predetermined stream sent thereto from a service receiver 33 (33-1, 33-2, … , 33-N) through the CDN 34, the segment of the stream to the receiver of the request source through the CDN 34. In this case, the service receiver 33 refers to the value of a rate described in an MPD (Media Presentation Description) file to select a stream of an optimum rate in response to a state of a network environment in which the client is placed, and performs requesting.

[0069] The DASH MPD server 32 is a server that generates an MPD file for acquiring a DASH segment generated by the DASH stream file server 31. The DASH MPD server 32 generates an MPD file on the basis of content metadata from a content management server (not depicted) and an address (url) of the segment generated by the DASH stream file server 31. It is to be noted that the DASH stream file server 31 and the DASH MPD server 32 may be physically the same server.

[0070] In the format of the MPD, for each stream of video, audio and so forth, an attribute is described using an element called a representation (Representation). For example, in an MPD file, a rate is described in a separate representation for each of a plurality of video data streams of different rates. The service receiver 33 can refer to the values of the rates to select an optimum stream in response to a state of the network environment in which the service receiver 33 is placed as described hereinabove.

[0071] FIG. 2 depicts an example of a relationship of respective structures arranged hierarchically in an MPD file. As depicted in FIG. 2(a), in a media presentation (Media Presentation) as the entire MPD file, a plurality of periods (Period) partitioned by time intervals exists. For example, the first period starts from 0 second, the next period starts from 100 seconds, and so forth.

[0072] As depicted in FIG. 2(b), in a period, a plurality of adaptation sets (AdaptationSet) exists. Each adaptation set relies upon a difference in media type such as a video, audio or the like, a difference in language even in the same media type, a difference in visual point and so forth. As depicted in FIG. 2(c), in an adaptation set, a plurality of representations (Representation) exists. Each representation relies upon a difference in stream attribute such as a rate.

[0073] As depicted in FIG. 2(d), in a representation, segment info (SegmentInfo) is included. In this segment info, as depicted in FIG. 2(e), an initialization segment (Initialization Segment) and a plurality of media segments (Media Segment) describing information for each of segments (Segment) into which a period is partitioned further finely, exist. In a media segment, information of an address (url) for actually acquiring segment data of video, audio and so forth and other information exist.

[0074] It is to be noted that, between a plurality of representations included in an adaptation set, switching of a stream can be performed freely. Consequently, a stream of an optimum rate can be selected in response to a state of the network environment of the reception side, and video delivery free from interruption can be achieved.
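The rate-based selection described above can be sketched as follows. This is a minimal illustration only: the MPD snippet is a hypothetical, heavily trimmed example, the representation ids and bandwidth values are invented for illustration, and the helper name pick_representation is an assumption. It picks the highest-rate representation that fits the measured throughput and falls back to the lowest one otherwise.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed MPD; ids and bandwidths are invented for illustration.
MPD_XML = """
<MPD xmlns="urn:mpeg:dash:schema:mpd:2011">
  <Period start="PT0S">
    <AdaptationSet mimeType="video/mp4">
      <Representation id="v-low" bandwidth="2000000"/>
      <Representation id="v-mid" bandwidth="8000000"/>
      <Representation id="v-high" bandwidth="20000000"/>
    </AdaptationSet>
  </Period>
</MPD>
"""

NS = {"dash": "urn:mpeg:dash:schema:mpd:2011"}


def pick_representation(mpd_xml, available_bps):
    """Return the id of the highest-rate representation not exceeding available_bps."""
    root = ET.fromstring(mpd_xml.strip())
    reps = root.findall("./dash:Period/dash:AdaptationSet/dash:Representation", NS)
    ranked = sorted(reps, key=lambda rep: int(rep.get("bandwidth")))
    chosen = ranked[0]  # fall back to the lowest rate if nothing fits
    for rep in ranked:
        if int(rep.get("bandwidth")) <= available_bps:
            chosen = rep
    return chosen.get("id")


print(pick_representation(MPD_XML, 10_000_000))
```

A real receiver would re-run such a selection as its throughput estimate changes, which is what allows switching between representations inside one adaptation set.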

[Example of Configuration of Transmission and Reception System]

[0075] FIG. 3 depicts an example of a configuration of a transmission and reception system 10 as the embodiment. The transmission and reception system 10 is configured from a service transmission system 100 and a service receiver 200. In the transmission and reception system 10, the service transmission system 100 corresponds to the DASH stream file server 31 and the DASH MPD server 32 of the stream delivery system 30 depicted in FIG. 1 described hereinabove. In the transmission and reception system 10, the service receiver 200 corresponds to the service receiver 33 (33-1, 33-2, … , 33-N) of the stream delivery system 30 depicted in FIG. 1 described hereinabove.

[0076] The service transmission system 100 transmits DASH/MP4, namely, an MPD file as a meta file and an MP4 (ISOBMFF) stream including media streams (media segments) of video, audio and so forth, through a communication network transmission line (refer to FIG. 1).

[0077] In the embodiment, the MP4 stream includes an encoded stream (encoded image data) corresponding to a divided region (partition) obtained by dividing a wide viewing angle image. Here, although the wide viewing angle image is a projection picture obtained by cutting out and plane packing part or the entirety of a spherical captured image, this is not restrictive.

[0078] An encoded stream corresponding to each divided region of a wide viewing angle image is obtained, for example, by individually encoding each divided region of the wide viewing angle image or by performing encoding using a tile function for converting each divided region of a wide viewing angle image into a tile. In the present embodiment, an encoded stream is in a hierarchically encoded form in order to make it possible for the reception side to easily perform temporal partial decoding.

[0079] An encoded stream corresponding to each divided region of a wide viewing angle image is transmitted together with information of the number of pixels and a frame rate of the divided region. In the embodiment, in MP4 that is a container in which an encoded stream of each divided region is included, a descriptor having the number of pixels and the frame rate of the divided region is included.

[0080] It is to be noted that, although it is also conceivable to transmit all encoded streams corresponding to divided regions of a wide viewing angle image, in the present embodiment, an encoded stream or streams corresponding to a divided region or regions requested are transmitted. This prevents the transmission band from being occupied needlessly and achieves efficient use of the transmission band.

[0081] The service receiver 200 receives the above-described MP4 (ISOBMFF) stream sent thereto from the service transmission system 100 through the communication network transmission line (refer to FIG. 1). The service receiver 200 acquires meta information regarding the encoded stream corresponding to each divided region of the wide viewing angle image from the MPD file.

[0082] The service receiver 200 requests the service transmission system (distribution server) 100 for transmission of a predetermined number of encoded streams corresponding to a display region, receives and decodes the predetermined number of encoded streams to obtain image data of the display region, and displays an image. Here, in the service receiver 200, the value of the predetermined number is set to the decodable maximum on the basis of the decoding capacity and the information of the number of pixels and the frame rate associated with the encoded stream corresponding to each divided region of the wide viewing angle image. Consequently, it becomes possible to reduce, as far as possible, the frequency of switching of a delivery encoded stream with a movement of the display region caused by a motion or an operation of a user, and the display performance in VR reproduction is improved.
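As a rough illustration of this calculation, the sketch below divides a decoder's pixel-rate budget by the pixel rate of one partition. It is an assumption-laden sketch, not the method of the embodiment: the decoding capacity is modeled solely by the maximum luma sample rate of HEVC Level 5.1 and Level 5.2 (values quoted from memory of the HEVC level limits), all partitions are assumed to share one size and frame rate, and the function name max_decodable_partitions is hypothetical.

```python
# Assumed decoder budgets: maximum luma sample rate per HEVC level, in samples per second.
MAX_LUMA_SAMPLE_RATE = {"5.1": 534_773_760, "5.2": 1_069_547_520}


def max_decodable_partitions(level, width, height, frame_rate):
    """Largest number of equal-sized partitions whose total pixel rate fits the level budget."""
    pixels_per_second = width * height * frame_rate  # cost of one partition
    return MAX_LUMA_SAMPLE_RATE[level] // pixels_per_second


for size in [(1920, 1080), (1280, 960), (1280, 720)]:
    print(size, max_decodable_partitions("5.1", *size, 60))
```

Other level constraints, such as maximum picture size or coding-block alignment, are ignored here, so a real receiver may arrive at smaller values.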

[0083] Further, in the present embodiment, in the service receiver 200, in the case where it is predicted that the display region exceeds the decode range, the decode method is switched from temporal full decode to temporal partial decode, and then in the case where it is predicted that the display region converges into the decode range, the decode method is switched from the temporal partial decode to the temporal full decode. By switching the decode method to the temporal partial decode, the number of divided regions that can be decoded can be increased, and the frequency of switching of the delivery encoded stream with respect to a movement of the display region different from the prediction can be reduced. Thus, the display performance in VR reproduction is further improved.
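The mode switching described above can be pictured as a two-state machine. The sketch below is illustrative only: the two prediction results are passed in as booleans because how they are computed is not covered here, the names DecodeMode, next_mode, and partition_budget are hypothetical, and the assumption that decoding at 1/2 frame rate doubles the number of decodable partitions presumes that decoding cost scales with pixel rate.

```python
from enum import Enum


class DecodeMode(Enum):
    FULL = "temporal full decode"
    PARTIAL = "temporal partial decode"


def next_mode(mode, predicted_to_exceed, predicted_to_converge):
    """Switch to partial decode when the display region is predicted to leave the
    decode range, and back to full decode when it is predicted to converge."""
    if mode is DecodeMode.FULL and predicted_to_exceed:
        return DecodeMode.PARTIAL
    if mode is DecodeMode.PARTIAL and predicted_to_converge:
        return DecodeMode.FULL
    return mode


def partition_budget(full_rate_budget, mode, rate_divisor=2):
    """Assumed model: decoding at 1/rate_divisor of the frame rate lets
    rate_divisor times as many partitions fit the same decoding capacity."""
    if mode is DecodeMode.PARTIAL:
        return full_rate_budget * rate_divisor
    return full_rate_budget


mode = next_mode(DecodeMode.FULL, predicted_to_exceed=True, predicted_to_converge=False)
print(mode.value, partition_budget(4, mode))
```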

[0084] FIG. 4 schematically depicts an example of a configuration of the entire transmission and reception system 10. The service transmission system 100 includes a 360° picture capture section 102, a plane packing section 103, a video encoder 104, a container encoder 105, and a storage 106.

[0085] The 360° picture capture section 102 images an imaging target by a predetermined number of cameras to obtain image data of a wide viewing angle image, that is, in the present embodiment, a spherical captured image (360° VR image). For example, the 360° picture capture section 102 performs imaging by a back to back (Back to Back) method using fisheye lenses to obtain a front face image and a rear face image of a very wide viewing angle having a viewing angle of 180° or more individually captured as a spherical captured image.

[0086] The plane packing section 103 cuts out and plane packs part or the entirety of the spherical captured image obtained by the 360° picture capture section 102 to obtain a projection picture. In this case, as the format type of the projection picture, for example, an equirectangular (Equirectangular) format, a cross cubic (Cross-cubic) format or the like is selected. It is to be noted that the plane packing section 103 carries out scaling for the projection picture as occasion demands to obtain a projection picture of a predetermined resolution.

[0087] FIG. 5(a) depicts an example of a front face image and a rear face image of a very wide viewing angle as a spherical captured image obtained by the 360° picture capture section 102. FIG. 5(b) depicts an example of a projection picture obtained by the plane packing section 103. This example is an example in the case where the format type of the projection picture is the equirectangular format. This example is an example of a case in which the respective images depicted in FIG. 5(a) are cut out along latitudes indicated by broken lines. Further, FIG. 5(c) depicts another example of a projection picture obtained by the plane packing section 103. This example is an example of a case in which the format type of the projection picture is the cross cubic format.
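For the equirectangular format, the relationship between a direction on the sphere and a position in the projection picture can be sketched as below. This uses the common textbook mapping and is not taken from the embodiment: it assumes the picture covers the entire sphere, that longitude 0° lies at the horizontal center, and that latitude +90° is the top row. The function name equirect_pixel is hypothetical.

```python
def equirect_pixel(lon_deg, lat_deg, width, height):
    """Map a direction (longitude -180..180, latitude -90..90, in degrees)
    to integer pixel coordinates in a full-sphere equirectangular picture."""
    u = (lon_deg + 180.0) / 360.0 * width   # longitude grows to the right
    v = (90.0 - lat_deg) / 180.0 * height   # latitude +90 is the top row
    # Clamp so that the extreme edges stay inside the picture.
    return min(int(u), width - 1), min(int(v), height - 1)


print(equirect_pixel(0.0, 0.0, 3840, 1920))
```

Because longitude and latitude map linearly to the two picture axes, a rectangular partition of such a picture corresponds to a fixed range of longitudes and latitudes.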

[0088] Referring back to FIG. 4, the video encoder 104 carries out encoding, for example, MPEG4-AVC or HEVC encoding, for the image data of the projection picture from the plane packing section 103 to obtain encoded image data and generates an encoded stream including this encoded image data. In this case, the video encoder 104 divides the projection picture into a plurality of partitions (divided regions) and obtains an encoded stream corresponding to each of the partitions.

[0089] FIG. 6(a) depicts an example of division in the case where the format type of the projection picture is the equirectangular format. Meanwhile, FIG. 6(b) depicts an example of division in the case where the format type of the projection picture is the cross cubic format. It is to be noted that the way of division of a projection picture is not limited to these examples, and, for example, a case in which all partitions have sizes that are not same as each other is also conceivable.

[0090] The video encoder 104 performs, in order to obtain an encoded stream corresponding to each partition of a projection picture, for example, individual encoding of the partitions, collective encoding of the entire projection picture, or encoding using a tile function of converting each partition into a tile. This makes it possible to decode the encoded streams corresponding to the partitions independently of each other on the reception side.

[0091] Here, the video encoder 104 obtains encoded streams corresponding to the partitions by hierarchically encoding the partitions. FIG. 7(a) depicts an example of hierarchical encoding. The axis of ordinate indicates hierarchies. The axis of abscissa indicates a display order (POC: picture order of composition), and the left side is earlier in display time while the right side is later in display time. Each rectangular frame indicates a picture, and a numeral indicates a display order number. A solid line arrow mark indicates a reference relationship between pictures in encoding.

[0092] This example is an example in which the pictures are classified into three hierarchies of a sublayer 2 (Sub layer 2), a sublayer 1 (Sub layer 1), and a full layer (Full layer), and encoding is carried out for image data of pictures in the individual hierarchies. This example is an example in which M=4, namely, three b (B) pictures exist between an I picture and a P picture. It is to be noted that, although a b picture does not become a reference picture, a B picture becomes a reference picture. Here, a picture of “0” corresponds to an I picture; a picture of “1” corresponds to a b picture; a picture of “2” corresponds to a B picture; a picture of “3” corresponds to a b picture; and a picture of “4” corresponds to a P picture.

[0093] In this hierarchical encoding, only the sublayer 2 can be selectively decoded, and in this case, image data of the 1/4 frame rate is obtained. Further, in this hierarchical encoding, the sublayer 1 and the sublayer 2 can be selectively decoded, and in this case, image data of the 1/2 frame rate is obtained. Furthermore, in the present hierarchical encoding, all of the sublayer 1, sublayer 2, and full layer can be decoded, and in this case, image data of the full frame rate is obtained.

[0094] Meanwhile, FIG. 7(b) depicts another example of hierarchical encoding. The axis of ordinate indicates hierarchies. The axis of abscissa indicates a display order (POC: picture order of composition), and the left side indicates earlier display time while the right side indicates later display time. Each of rectangular frames indicates a picture, and a numeral indicates a display order number. A solid line arrow mark indicates a reference relationship between pictures in encoding.

[0095] This example is an example in which pictures are classified into two hierarchies of a sublayer 1 (Sub layer 1) and a full layer (Full Layer), and encoding is carried out for image data of pictures of the individual hierarchies. This example is an example in which M=4, namely, three b pictures exist between an I picture and a P picture. Here, the picture of “0” corresponds to an I picture; the pictures of “1” to “3” correspond to b pictures; and the picture of “4” corresponds to a P picture.

[0096] In this hierarchical encoding, only the sublayer 1 can be selectively decoded, and in this case, image data of the 1/4 frame rate is obtained. Further, in this hierarchical encoding, all of the sublayer 1 and the full layer can be decoded, and in this case, image data of the full frame rate is obtained.
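The effect of selectively decoding sublayers can be illustrated with a short sketch. The assignment of pictures to sublayers below is inferred from the picture types and frame-rate fractions stated above for M=4 (pictures 0 and 4 in the lowest hierarchy, picture 2 in the middle hierarchy of the three-hierarchy case, and the remaining pictures in the full layer); the figures themselves are not reproduced here, so this mapping and the function names are assumptions.

```python
def sublayer_of(poc, structure="three"):
    """Return the assumed sublayer of the picture with display order `poc` (M=4)."""
    phase = poc % 4
    if phase == 0:  # I or P picture
        return "sublayer 2" if structure == "three" else "sublayer 1"
    if structure == "three" and phase == 2:  # B picture used as a reference
        return "sublayer 1"
    return "full layer"  # non-reference b pictures


def decoded_pocs(num_pictures, layers, structure="three"):
    """Display order numbers that remain when only `layers` are decoded."""
    return [p for p in range(num_pictures) if sublayer_of(p, structure) in layers]


print(decoded_pocs(8, {"sublayer 2"}))                # 1/4 frame rate
print(decoded_pocs(8, {"sublayer 2", "sublayer 1"}))  # 1/2 frame rate
print(decoded_pocs(8, {"sublayer 1"}, "two"))         # 1/4 frame rate, two hierarchies
```

Dropping the upper hierarchies never removes a picture that a retained picture refers to, which is why temporal partial decode is possible without re-encoding.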

[0097] The container encoder 105 generates a container including an encoded stream generated by the video encoder 104, here, an MP4 stream, as a delivery stream. In this case, a plurality of MP4 streams individually including encoded streams corresponding to partitions is generated. In the case where encoding using a tile function of converting each partition into a tile is performed, it is also possible to form one MP4 stream including encoded streams corresponding to all partitions as sub streams. However, in the present embodiment, it is assumed that a plurality of MP4 streams each including an encoded stream corresponding to each partition is generated.

[0098] It is to be noted that, in the case where encoding is performed using a tile function for converting each partition into a tile, the container encoder 105 generates a base MP4 stream (base container) including a parameter set of SPS including sublayer information and so forth in addition to a plurality of MP4 streams each including an encoded stream corresponding to the partition.

[0099] Here, encoding using a tile function for converting each partition into a tile is described with reference to FIG. 8. Tiles are obtained by dividing a picture in horizontal and vertical directions. Since in-screen prediction in a picture, the loop filter, and entropy encoding are refreshed at each tile, the regions obtained as tiles by the division can be encoded and decoded independently of each other.

[0100] FIG. 8(a) depicts an example of a case in which a picture is divided into two partitions in each of vertical and horizontal directions and accordingly into a total of four partitions, and encoding is performed on each of the partitions as a tile. In this case, in regard to the partitions (tiles) a, b, c, and d obtained by the tile division, a list of the byte position of top data of each tile is described in the slice header as depicted in FIG. 8(b) to make independent decoding possible.

[0101] Since the position of the start block of a tile in a picture can be recognized as a relative position from the top left of the picture, the original picture can be reconstructed by the reception side also in the case where an encoded stream of each partition (tile) is container-transmitted by a different packet. For example, if the encoded streams of the partitions b and d each surrounded by a rectangular frame of a chain line as depicted in FIG. 8(c) are decoded, then display of the partitions (tiles) of b and d becomes possible.

[0102] It is to be noted that, also in the case where an encoded stream of each partition (tile) is container-transmitted by a different packet, sublayer information is arranged in one SPS in a picture. Therefore, meta information such as a parameter set is placed into a tile-based MP4 stream (tile-based container). Then, in the MP4 stream (tile container) of each partition, an encoded stream corresponding to the partition is placed as slice information.

[0103] Further, the container encoder 105 inserts information of the number of pixels and a frame rate of a partition into the layer of the container. In the present embodiment, a partition descriptor is inserted into an initialization segment (IS) of the MP4 stream. In this case, a plurality of partition descriptors may be inserted, at a maximum frequency of once per picture.

[0104] FIG. 9 depicts an example of a structure (Syntax) of the partition descriptor. Meanwhile, FIG. 10 depicts the content of major information (Semantics) in the structure example. An 8-bit field of “partition_descriptor_tag” indicates a descriptor type and here indicates that the descriptor is a partition descriptor. An 8-bit field of “partition_descriptor_length” indicates a length (size) of the descriptor and indicates the number of succeeding bytes as the length of the descriptor.

[0105] An 8-bit field of “frame_rate” indicates a frame rate (full frame rate) of a partition (division picture). A 1-bit field of “tile_partition_flag” indicates whether or not picture division is performed by a tile method. For example, “1” indicates that the partition is picture-divided by a tile method, and “0” indicates that the partition is not picture-divided by a tile method. A 1-bit field of “tile_base_flag” indicates, in the case of a tile method, whether or not the partition descriptor belongs to a base container. For example, “1” indicates that the partition descriptor belongs to a base container, and “0” indicates that the partition descriptor belongs to a container other than the base container.

[0106] An 8-bit field of “partition_ID” indicates an ID of the partition. A 16-bit field of “whole_picture_size_horizontal” indicates the number of horizontal pixels of the entire picture. A 16-bit field of “whole_picture_size_vertical” indicates the number of vertical pixels of the entire picture.

[0107] A 16-bit field of “partition_horizontal_start_position” indicates a horizontal start pixel position of the partition. A 16-bit field of “partition_horizontal_end_position” represents a horizontal end pixel position of the partition. A 16-bit field of “partition_vertical_start_position” indicates a vertical start pixel position of the partition. A 16-bit field of “partition_vertical_end_position” represents a vertical end pixel position of the partition. The fields configure position information of the partition with respect to the entire picture and configure information of the number of pixels of the partition.

[0108] An 8-bit field of “number_of_sublayers” indicates the number of sublayers in hierarchical encoding of the partition. An 8-bit field of “sublayer_id” and an 8-bit field of “sublayer_frame_rate” are repeated in a for loop by a number of times equal to the number of sublayers. The field of “sublayer_id” indicates a sublayer ID of the partition, and the field of “sublayer_frame_rate” indicates the frame rate of the sublayer of the partition.
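
The field list of paragraphs [0104] to [0108] can be illustrated by a small parser. This is only a sketch under stated assumptions: the fields are assumed to appear in the order in which the text describes them, in big-endian byte order, and the two 1-bit flags are assumed to occupy the two most significant bits of one byte whose remaining six bits are reserved. The actual layout is defined by FIG. 9, which is not reproduced here, so the byte offsets, the function name, the tag value 0xF0, and all sample values are hypothetical.

```python
import struct


def parse_partition_descriptor(buf):
    """Parse a partition descriptor under the ASSUMED byte layout described above."""
    # 8-bit tag, 8-bit length, 8-bit frame_rate, flag byte, 8-bit partition_ID
    tag, length, frame_rate, flags, partition_id = struct.unpack_from(">5B", buf, 0)
    # Six 16-bit fields: whole picture size, then partition start/end positions
    whole_w, whole_h, h_start, h_end, v_start, v_end = struct.unpack_from(">6H", buf, 5)
    (num_sublayers,) = struct.unpack_from(">B", buf, 17)
    sublayers = []
    offset = 18
    for _ in range(num_sublayers):
        sub_id, sub_rate = struct.unpack_from(">BB", buf, offset)
        sublayers.append({"sublayer_id": sub_id, "sublayer_frame_rate": sub_rate})
        offset += 2
    return {
        "partition_descriptor_tag": tag,
        "partition_descriptor_length": length,
        "frame_rate": frame_rate,
        "tile_partition_flag": (flags >> 7) & 1,  # assumed bit position
        "tile_base_flag": (flags >> 6) & 1,       # assumed bit position
        "partition_ID": partition_id,
        "whole_picture_size_horizontal": whole_w,
        "whole_picture_size_vertical": whole_h,
        "partition_horizontal_start_position": h_start,
        "partition_horizontal_end_position": h_end,
        "partition_vertical_start_position": v_start,
        "partition_vertical_end_position": v_end,
        "sublayers": sublayers,
    }


# Made-up sample: tag 0xF0, 20 succeeding bytes, tile partition, not a base container,
# partition 1 of a 7680x4320 picture, two sublayers.
sample = struct.pack(">5B6HB", 0xF0, 20, 60, 0b10000000, 1,
                     7680, 4320, 0, 1919, 0, 1079, 2) + bytes([1, 30, 2, 60])
descriptor = parse_partition_descriptor(sample)
print(descriptor["partition_ID"], descriptor["sublayers"])
```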

[0109] Referring back to FIG. 4, the storage 106 temporarily accumulates MP4 streams of partitions generated by the container encoder 105. It is to be noted that, in the case where the MP4 streams are divided by the tile method, the storage 106 accumulates also the tile-based MP4 streams. Of the MP4 streams accumulated in this manner, the MP4 stream of a partition whose transmission request is received is transmitted to the service receiver 200. It is to be noted that, in the case where the MP4 streams are in a form divided by the tile method, also the base MP4 stream is transmitted at the same time.

[0110] FIG. 11 depicts an example of a description of an MPD file compatible with a tile-based MP4 stream (tile-based container). In this MPD file, an adaptation set (AdaptationSet) corresponding to one MP4 stream (track) as a tile-based container exists.

[0111] In the adaptation set, by the description of <AdaptationSet mimeType="video/mp4" codecs="hev1.xx.xx.Lxxx,xx,hev1.yy.yy.Lxxx,yy">, an adaptation set (AdaptationSet) with respect to the video stream exists, the video stream is supplied with an MP4 file structure, and presence of an HEVC-encoded video stream (encoded image data) is indicated.

[0112] By the description of <SupplementaryDescriptor schemeIdUri="urn:brdcst:video:format_type" value/>, a format type of the projection picture is indicated. By the description of <SupplementaryDescriptor schemeIdUri="urn:brdcst:video:framerate" value/>, a frame rate of pictures is indicated.

[0113] By the description of <SupplementaryDescriptor schemeIdUri="urn:brdcst:video:tilepartitionflag" value="1"/>, it is indicated that the partition is picture-divided by the tile method. By the description of <SupplementaryDescriptor schemeIdUri="urn:brdcst:video:tilebaseflag" value/>, it is indicated that the partition is a tile-based container.

[0114] Further, in the adaptation set, a representation (Representation) corresponding to the video stream exists. In this representation, by the descriptions of width=" " height=" " frameRate=" ", codecs="hev1.xx.xx.Lxxx,xx", and level="0", a resolution, a frame rate, and a codec type are indicated, and further, it is indicated that, as tag information, the level “0” is applied. Further, by the description of <BaseURL>videostreamVR.mp4</BaseURL>, the location destination of the MP4 stream is indicated as videostreamVR.mp4.

[0115] FIG. 12 depicts an example of description of an MPD file corresponding to the MP4 stream of each partition. In this MPD file, adaptation sets (AdaptationSet) individually corresponding to a plurality of MP4 streams (tracks) exist. It is to be noted that, in the example depicted, for simplification of the drawing, only two adaptation sets (AdaptationSet) are depicted.

[0116] Description is given of the first adaptation set, and since the other adaptation sets are similar, description of them is omitted. In the adaptation set, by the description of <AdaptationSet mimeType="video/mp4" codecs="hev1.xx.xx.Lxxx,xx,hev1.yy.yy.Lxxx,yy">, an adaptation set (AdaptationSet) with respect to the video stream exists, the video stream is supplied with the MP4 file structure, and presence of the HEVC-encoded video stream (encoded image data) is indicated.

[0117] By the description of <SupplementaryDescriptor schemeIdUri="urn:brdcst:video:format_type" value/>, a format type of the projection picture is indicated. By the description of <SupplementaryDescriptor schemeIdUri="urn:brdcst:video:framerate" value/>, a frame rate of partitions (full frame rate) is indicated.

[0118] By the description of <SupplementaryDescriptor schemeIdUri="urn:brdcst:video:tilepartitionflag" value="1"/>, it is indicated that the partition is picture-divided by the tile method. By the description of <SupplementaryDescriptor schemeIdUri="urn:brdcst:video:tilebaseflag" value="0"/>, it is indicated that the partition is a container other than the tile-based container. By the description of <SupplementaryDescriptor schemeIdUri="urn:brdcst:video:partitionid" value="1"/>, it is indicated that the partition ID is 1.

[0119] By the description of <SupplementaryDescriptor schemeIdUri="urn:brdcst:video:wholepicturesizehorizontal" value/>, the number of horizontal pixels of the whole picture is indicated. By the description of <SupplementaryDescriptor schemeIdUri="urn:brdcst:video:wholepicturesizevertical" value/>, the number of vertical pixels of the whole picture is indicated.

[0120] By the description of <SupplementaryDescriptor schemeIdUri="urn:brdcst:video:partitionstartpositionhorizontal" value/>, a horizontal start pixel position of the partition is indicated. By the description of <SupplementaryDescriptor schemeIdUri="urn:brdcst:video:partitionstartpositionvertical" value/>, a vertical start pixel position of the partition is indicated. By the description of <SupplementaryDescriptor schemeIdUri="urn:brdcst:video:partitionendpositionhorizontal" value/>, a horizontal end pixel position of the partition is indicated. By the description of <SupplementaryDescriptor schemeIdUri="urn:brdcst:video:partitionendpositionvertical" value/>, a vertical end pixel position of the partition is indicated.

[0121] By the description of <SupplementaryDescriptor schemeIdUri="urn:brdcst:video:partitionsublayerid" value/>, a sublayer ID of the partition is indicated. By the description of <SupplementaryDescriptor schemeIdUri="urn:brdcst:video:partitionsublayerframerate" value/>, a frame rate of the sublayer of the partition is indicated. The two descriptions are repeated by a number of times equal to the number of sublayers.

[0122] Further, in the adaptation set, a representation (Representation) corresponding to the video stream exists. In this representation, by the descriptions of width=" " height=" " frameRate=" ", codecs="hev1.xx.xx.Lxxx,xx", and level="0", a resolution, a frame rate, and a codec type are indicated, and further, it is indicated that, as tag information, the level “0” is provided. Further, by the description of <BaseURL>videostreamVR0.mp4</BaseURL>, the location destination of the MP4 stream is indicated as videostreamVR0.mp4.
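
How a receiver might read the partition information of paragraphs [0116] to [0122] out of one adaptation set can be sketched as below. This is an illustration only: the XML fragment is a made-up, namespace-free stand-in for FIG. 12, all attribute values in it are hypothetical, and the function name is not taken from the specification. Only the element name, the schemeIdUri strings, and the BaseURL follow the description above.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; every value is invented for the example.
SAMPLE_ADAPTATION_SET = """
<AdaptationSet mimeType="video/mp4">
  <SupplementaryDescriptor schemeIdUri="urn:brdcst:video:framerate" value="60"/>
  <SupplementaryDescriptor schemeIdUri="urn:brdcst:video:tilepartitionflag" value="1"/>
  <SupplementaryDescriptor schemeIdUri="urn:brdcst:video:tilebaseflag" value="0"/>
  <SupplementaryDescriptor schemeIdUri="urn:brdcst:video:partitionid" value="1"/>
  <SupplementaryDescriptor schemeIdUri="urn:brdcst:video:partitionsublayerid" value="1"/>
  <SupplementaryDescriptor schemeIdUri="urn:brdcst:video:partitionsublayerframerate" value="30"/>
  <SupplementaryDescriptor schemeIdUri="urn:brdcst:video:partitionsublayerid" value="2"/>
  <SupplementaryDescriptor schemeIdUri="urn:brdcst:video:partitionsublayerframerate" value="60"/>
  <Representation width="1920" height="1080" frameRate="60">
    <BaseURL>videostreamVR0.mp4</BaseURL>
  </Representation>
</AdaptationSet>
"""

PREFIX = "urn:brdcst:video:"


def read_partition_info(xml_text):
    """Collect the descriptor values and the representation attributes of one adaptation set."""
    root = ET.fromstring(xml_text.strip())
    info = {}
    for desc in root.findall("SupplementaryDescriptor"):
        key = desc.get("schemeIdUri", "")
        if key.startswith(PREFIX):
            # Repeated descriptors (one pair per sublayer) are kept in document order.
            info.setdefault(key[len(PREFIX):], []).append(desc.get("value"))
    rep = root.find("Representation")
    info["width"] = int(rep.get("width"))
    info["height"] = int(rep.get("height"))
    info["base_url"] = rep.findtext("BaseURL")
    return info


partition_info = read_partition_info(SAMPLE_ADAPTATION_SET)
print(partition_info["partitionid"], partition_info["base_url"])
```

The number of pixels (width and height) and the frame rate gathered in this way are the inputs that the transmission request section needs for deciding how many partitions to request.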

[0123] FIG. 13 schematically depicts an MP4 stream (track) in the case where encoding using a tile function for converting each partition into a tile is performed. In this case, one tile-based MP4 stream (tile-based container) and MP4 streams (tile containers) of four partitions exist. Each of the MP4 streams is configured such that each random access period begins with an initialization segment (IS), which is followed by boxes of “styp,” “sidx (Segment index box),” “ssix (Sub-segment index box),” “moof (Movie fragment box),” and “mdat (Media data box).”

[0124] The initialization segment (IS) has a box (Box) structure based on ISOBMFF (ISO Base Media File Format). The partition descriptor (refer to FIG. 9) is inserted in the initialization segment (IS). In the tile-based MP4 stream (tile-based container), the partition descriptor is “tile base flag=1.” Meanwhile, in the MP4 streams (tile containers) of the first to fourth partitions, “partition ID” is 1 to 4.

[0125] In the “styp” box, segment type information is placed. In the “sidx” box, range information of each track is placed, and a position of “moof”/“mdat” is indicated while also a position of each sample (picture) in “mdat” is indicated. In the “ssix” box, classification information of the track is placed, and classification into I/P/B types is made.

[0126] In the “moof” box, control information is placed. In the “mdat” box of the tile-based MP4 stream (tile-based container), NAL units of “VPS,” “SPS,” “PPS,” “PSEI,” and “SSEI” are placed. Meanwhile, in the “mdat” box of the MP4 stream (tile container) of each partition, a NAL unit of “SLICE” having encoded image data of the individual partition is placed.
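
The box sequence described for FIG. 13 can be walked with a few lines of code. The sketch below relies only on the general ISOBMFF box header (a 32-bit big-endian size followed by a four-character type, with size 1 meaning that a 64-bit size follows and size 0 meaning that the box extends to the end of the data). The function names are hypothetical, and the sample segment carries dummy payload bytes rather than valid box contents.

```python
import struct


def list_top_level_boxes(data):
    """Return (type, size) for each top-level ISOBMFF box found in data."""
    boxes = []
    offset = 0
    while offset + 8 <= len(data):
        size, box_type = struct.unpack_from(">I4s", data, offset)
        header = 8
        if size == 1:
            # A 64-bit "largesize" follows the type field.
            if offset + 16 > len(data):
                break
            (size,) = struct.unpack_from(">Q", data, offset + 8)
            header = 16
        elif size == 0:
            # The box runs to the end of the data.
            size = len(data) - offset
        if size < header:
            break  # malformed box; stop instead of looping forever
        boxes.append((box_type.decode("ascii", "replace"), size))
        offset += size
    return boxes


def make_box(box_type, payload=b""):
    """Build a box with a plain 32-bit size header (illustration only)."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload


# Dummy segment in the order given in the text; the payloads are placeholders.
segment = b"".join(
    make_box(t, b"\x00" * 4) for t in (b"styp", b"sidx", b"ssix", b"moof", b"mdat")
)
print(list_top_level_boxes(segment))
```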

[0127] FIG. 14 schematically depicts an MP4 stream (track) in the case where each partition is encoded individually. In this case, MP4 streams of four partitions exist. Similarly, each of the MP4 streams is configured such that each random access period begins with an initialization segment (IS), which is followed by boxes of “styp,” “sidx (Segment index box),” “ssix (Sub-segment index box),” “moof (Movie fragment box),” and “mdat (Media data box).”

[0128] The initialization segment (IS) has a box (Box) structure based on ISOBMFF (ISO Base Media File Format). The partition descriptor (refer to FIG. 9) is inserted in the initialization segment (IS). In the MP4 streams of the first to fourth partitions, “partition ID” is 1 to 4.

[0129] In the “styp” box, segment type information is placed. In the “sidx” box, range information of each track is placed, and a position of “moof”/“mdat” is indicated while also a position of each sample (picture) in “mdat” is indicated. In the “ssix” box, classification information of the track is placed, and classification into I/P/B types is made.

[0130] In the “moof” box, control information is placed. In the “mdat” box of the MP4 stream of each partition, NAL units of “VPS,” “SPS,” “PPS,” “PSEI,” “SLICE,” and “SSEI” are placed.

[0131] Referring back to FIG. 4, the service receiver 200 includes a container decoder 203, a video decoder 204, a renderer 205, and a transmission request section 206. The transmission request section 206 requests the service transmission system 100 for transmission of MP4 streams of a predetermined number of partitions corresponding to a display region from among partitions of a projection picture.

[0132] In this case, the transmission request section 206 determines the value of the predetermined number as the maximum decodable value or a value close to it, on the basis of the decoding capacity and the information of the number of pixels and the frame rate of the encoded stream of each partition of the projection picture. Here, the information of the number of pixels and the frame rate of the encoded stream of each partition can be acquired from an MPD file (refer to FIG. 12) received from the service transmission system 100 in advance.

[Example of Calculation of Maximum Value]

[0133] FIG. 15 depicts an example in which a projection picture of the 8K/60 Hz class is divided by a partition size of 1920×1080 (Full HD). In this case, the number of in-plane pixels of the partition is 1920*1080=2073600, and the pixel rate is 1920*1080*60=124416000. In this case, the level value of the complexity required for decoding of the partition is “Level 4.1.”

[0134] For example, in the case where the service receiver 200 includes a decoder of “Level 5.1” for decoding of 4K/60 Hz, the maximum number of in-plane Luma pixels is 8912896, and the pixel rate (the maximum number of pixels processable every second) is 534773760. Therefore, in this case, 534773760/124416000=4.29 … , and the maximum value is calculated as 4. In this case, the service receiver 200 can decode a maximum of four partitions. Four partitions indicated by an arrow mark P depict an example of the partitions corresponding to the display region selected in this case.

[0135] On the other hand, in the case where the service receiver 200 includes a decoder of “Level 5.2” for decoding of 4K/120 Hz, the maximum number of in-plane Luma pixels is 8912896, and the pixel rate (the maximum number of pixels processable every second) is 1069547520. Therefore, in this case, 1069547520/124416000=8.59 … , and the maximum value is calculated as 8. In this case, the service receiver 200 can decode a maximum of eight partitions. Eight partitions indicated by an arrow mark Q depict an example of the partitions corresponding to the display region selected in this case.

[0136] FIG. 16 depicts an example in which a projection picture of the 8K/60 Hz class is divided by a partition size of 1280×960 (4VGA). In this case, the number of in-plane pixels of the partition is 1280*960=1228800, and the pixel rate is 1280*960*60=73728000. In this case, the level value of the complexity required for decoding of the partition is “Level 4.1.”

[0137] For example, in the case where the service receiver 200 includes a decoder of “Level 5.1” for decoding of 4K/60 Hz, the maximum number of in-plane Luma pixels is 8912896, and the pixel rate (the maximum number of pixels processable every second) is 534773760. Therefore, in this case, 534773760/73728000=7.25 … , and the maximum value is calculated as 7. In this case, the service receiver 200 can decode a maximum of 7 partitions. Six partitions indicated by an arrow mark P depict an example of the partitions corresponding to the display region selected in this case.

[0138] On the other hand, in the case where the service receiver 200 includes a decoder of “Level 5.2” for decoding of 4K/120 Hz, the maximum number of in-plane Luma pixels is 8912896, and the pixel rate (the maximum number of pixels processable every second) is 1069547520. Therefore, in this case, 1069547520/73728000=14.5 … , and the maximum value is calculated as 14. In this case, the service receiver 200 can decode a maximum of 14 partitions. Twelve partitions indicated by an arrow mark Q depict an example of the partitions corresponding to the display region selected in this case.

[0139] FIG. 17 depicts an example in which a projection picture exceeding the 8K/60 Hz class is divided by a partition size of 1280×960 (4VGA). In this case, the number of in-plane pixels of the partition is 1280*960=1228800, and the pixel rate is 1280*960*60=73728000. In this case, the level value of the complexity required for decoding of the partition is “Level 4.1.”

[0140] For example, in the case where the service receiver 200 includes a decoder of “Level 5.1” for decoding of 4K/60 Hz, the maximum number of in-plane Luma pixels is 8912896, and the pixel rate (the maximum number of pixels processable every second) is 534773760. Therefore, in this case, 534773760/73728000=7.25 … , and the maximum value is calculated as 7. In this case, the service receiver 200 can decode a maximum of 7 partitions. Seven partitions indicated by an arrow mark P depict an example of the partitions corresponding to the display region selected in this case.

[0141] On the other hand, in the case where the service receiver 200 includes a decoder of “Level 5.2” for decoding of 4K/120 Hz, the maximum number of in-plane Luma pixels is 8912896, and the pixel rate (the maximum number of pixels processable every second) is 1069547520. Therefore, in this case, 1069547520/73728000=14.5 … , and the maximum value is calculated as 14. In this case, the service receiver 200 can decode a maximum of 14 partitions. Fourteen partitions indicated by an arrow mark Q depict an example of the partitions corresponding to the display region selected in this case.

[0142] FIG. 18 depicts an example in which a projection picture exceeding the 8K/60 Hz class is divided by a partition size of 1280×720 (720p HD). In this case, the number of in-plane pixels of the partition is 1280*720=921600, and the pixel rate is 1280*720*60=55296000. In this case, the level value of the complexity required for decoding of the partition is “Level 4.”

[0143] For example, in the case where the service receiver 200 includes a decoder of “Level 5.1” for decoding of 4K/60 Hz, the maximum number of in-plane Luma pixels is 8912896, and the pixel rate (the maximum number of pixels processable every second) is 534773760. Therefore, in this case, 534773760/55296000=9.67 … , and the maximum value is calculated as 9. In this case, the service receiver 200 can decode a maximum of 9 partitions. Eight partitions indicated by an arrow mark P depict an example of the partitions corresponding to the display region selected in this case.

[0144] On the other hand, in the case where the service receiver 200 includes a decoder of “Level 5.2” for decoding of 4K/120 Hz, the maximum number of in-plane Luma pixels is 8912896, and the pixel rate (the maximum number of pixels processable every second) is 1069547520. Therefore, in this case, 1069547520/55296000=19.34 … , and the maximum value is calculated as 19. In this case, the service receiver 200 can decode a maximum of 19 partitions. Eighteen partitions indicated by an arrow mark Q depict an example of the partitions corresponding to the display region selected in this case.
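
The calculations of paragraphs [0133] to [0144] all follow the same pattern: the pixel rate that the decoder can process is divided by the pixel rate of one partition, and the fractional part is discarded. A minimal sketch of that arithmetic is given below. The function and table names are hypothetical, the two decoder pixel rates are the values quoted in the text, and a frame rate of 60 Hz is assumed as in the examples.

```python
# Maximum number of pixels processable every second, as quoted in the text.
LEVEL_MAX_PIXEL_RATE = {
    "5.1": 534_773_760,
    "5.2": 1_069_547_520,
}


def max_decodable_partitions(level, width, height, frame_rate=60):
    """Return how many partitions of the given size and frame rate fit
    within the pixel rate of a decoder of the given level."""
    partition_pixel_rate = width * height * frame_rate
    return LEVEL_MAX_PIXEL_RATE[level] // partition_pixel_rate


for name, (w, h) in {
    "Full HD": (1920, 1080),
    "4VGA": (1280, 960),
    "720p HD": (1280, 720),
}.items():
    print(name,
          max_decodable_partitions("5.1", w, h),
          max_decodable_partitions("5.2", w, h))
```

The sketch checks only the pixel rate, as the worked examples do; it does not consider the maximum number of in-plane Luma pixels.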

[0145] FIG. 19 collectively depicts the maximum number of decodable partitions according to partition sizes in a decoder of “Level 5.1.” In the case where the partition size is 1920×1080 (Full HD), while the maximum number of pixels processable every second by the decoder is 534773760, the pixel rate of the partition is 124416000 (equivalent to Level 4.1), and the maximum number of decodable partitions is 4. On the other hand, in the case where the partition size is 1280×960 (4VGA), while the maximum number of pixels processable every second by the decoder is 534773760, the pixel rate of the partition is 73728000 (equivalent to Level 4.1), and the maximum number of decodable partitions is 7.

[0146] Further, in the case where the partition size is 1280×720 (720p HD), while the maximum number of pixels processable every second by the decoder is 534773760, the pixel rate of the partition is 55296000 (equivalent to Level 4), and the maximum number of decodable partitions is 9. Further, in the case where the partition size is 960×540 (Q HD), while the maximum number of pixels processable every second by the decoder is 534773760, the pixel rate of the partition is 33177600 (equivalent to Level 3.1), and the maximum number of decodable partitions is 16.

[0147] FIG. 20 collectively depicts the maximum number of decodable partitions according to partition sizes in a decoder of “Level 5.2.” In the case where the partition size is 1920×1080 (Full HD), while the maximum number of pixels processable every second by the decoder is 1069547520, the pixel rate of the partition is 124416000 (equivalent to Level 4.1), and the maximum number of decodable partitions is 8. On the other hand, in the case where the partition size is 1280×960 (4VGA), while the maximum number of pixels processable every second by the decoder is 1069547520, the pixel rate of the partition is 73728000 (equivalent to Level 4.1), and the maximum number of decodable partitions is 14.

……
……
……
