Sony Patent | Reception device, reception method, transmission device, and transmission method

Patent: Reception device, reception method, transmission device, and transmission method


Publication Number: 20210006769

Publication Date: 2021-01-07

Applicant: Sony

Assignee: Sony Corporation

Abstract

Depth control when superimposing and displaying superimposition information is easily implemented by using efficiently transmitted depth information. A video stream obtained by encoding image data of a wide viewing angle image for each of left-eye and right-eye pictures, and depth meta information including position information and a representative depth value of the predetermined number of angle areas in the wide viewing angle image for each picture are received. Left-eye and right-eye display area image data is extracted from the image data of the wide viewing angle image for each of the left-eye and right-eye pictures obtained by decoding the video stream. Superimposition information data is superimposed on the left-eye and right-eye display area image data for output. When superimposing the superimposition information data on the left-eye and right-eye display area image data, parallax is given on the basis of the depth meta information.

Claims

  1. A reception device comprising: a reception unit configured to receive a video stream obtained by encoding image data of a wide viewing angle image for each of left-eye and right-eye pictures, and depth meta information including position information and a representative depth value of a predetermined number of angle areas in the wide viewing angle image for each of the pictures; and a processing unit configured to extract left-eye and right-eye display area image data from the image data of the wide viewing angle image for each of the left-eye and right-eye pictures obtained by decoding the video stream and to superimpose superimposition information data on the left-eye and right-eye display area image data for output, wherein when superimposing the superimposition information data on the left-eye and right-eye display area image data, the processing unit gives parallax to the superimposition information data to be superimposed on each of the left-eye and right-eye display area image data on a basis of the depth meta information.

  2. The reception device according to claim 1, wherein the reception unit receives the depth meta information for each of the pictures by using a timed metadata stream associated with the video stream.

  3. The reception device according to claim 1, wherein the reception unit receives the depth meta information for each of the pictures in a state of being inserted into the video stream.

  4. The reception device according to claim 1, wherein when superimposing the superimposition information data on the left-eye and right-eye display area image data, the processing unit gives the parallax on a basis of a minimum value of the representative depth value of the predetermined number of angle areas corresponding to a superimposition range, the representative depth value being included in the depth meta information.

  5. The reception device according to claim 1, wherein the depth meta information further includes position information indicating which position in the areas the representative depth value of the predetermined number of angle areas relates to, and when superimposing the superimposition information data on the left-eye and right-eye display area image data, the processing unit gives the parallax on a basis of the representative depth value of the predetermined number of areas corresponding to a superimposition range and the position information included in the depth meta information.

  6. The reception device according to claim 1, wherein the position information on the angle areas is given as offset information based on a position of a predetermined viewpoint.

  7. The reception device according to claim 1, wherein the depth meta information further includes a depth value corresponding to depth of a screen as a reference for the depth value.

  8. The reception device according to claim 1, wherein the superimposition information includes subtitles and/or graphics.

  9. The reception device according to claim 1, further comprising a display unit configured to display a three-dimensional image on a basis of the left-eye and right-eye display area image data on which the superimposition information data is superimposed.

  10. The reception device according to claim 9, wherein the display unit includes a head mounted display.

  11. A reception method comprising: receiving a video stream obtained by encoding image data of a wide viewing angle image for each of left-eye and right-eye pictures, and depth meta information including position information and a representative depth value of a predetermined number of angle areas in the wide viewing angle image for each of the pictures; and extracting left-eye and right-eye display area image data from the image data of the wide viewing angle image for each of the left-eye and right-eye pictures obtained by decoding the video stream and superimposing superimposition information data on the left-eye and right-eye display area image data for output, wherein when superimposing the superimposition information data on the left-eye and right-eye display area image data, parallax is given to the superimposition information data to be superimposed on each of the left-eye and right-eye display area image data on a basis of the depth meta information.

  12. The reception method according to claim 11, wherein the depth meta information for each of the pictures is received by using a timed metadata stream associated with the video stream.

  13. The reception method according to claim 11, wherein the depth meta information for each of the pictures is received in a state of being inserted into the video stream.

  14. The reception method according to claim 11, wherein when superimposing the superimposition information data on the left-eye and right-eye display area image data, the parallax is given on a basis of a minimum value of the representative depth value of the predetermined number of angle areas corresponding to a superimposition range, the representative depth value being included in the depth meta information.

  15. The reception method according to claim 11, wherein the depth meta information further includes position information indicating which position in the areas the representative depth value of the predetermined number of angle areas relates to, and when superimposing the superimposition information data on the left-eye and right-eye display area image data, the parallax is given on a basis of the representative depth value of the predetermined number of areas corresponding to a superimposition range and the position information included in the depth meta information.

  16. The reception method according to claim 11, wherein the position information on the angle areas is given as offset information based on a position of a predetermined viewpoint.

  17. The reception method according to claim 11, wherein the depth meta information further includes a depth value corresponding to depth of a screen as a reference for the depth value.

  18. The reception method according to claim 11, wherein the superimposition information includes subtitles and/or graphics.

  19. A transmission device comprising: a transmission unit configured to transmit a video stream obtained by encoding image data of a wide viewing angle image for each of left-eye and right-eye pictures and depth meta information for each of the pictures, wherein the depth meta information includes position information and a representative depth value of a predetermined number of angle areas in the wide viewing angle image.

  20. A transmission method comprising: transmitting a video stream obtained by encoding image data of a wide viewing angle image for each of left-eye and right-eye pictures and depth meta information for each of the pictures, wherein the depth meta information includes position information and a representative depth value of a predetermined number of angle areas in the wide viewing angle image.

Description

TECHNICAL FIELD

[0001] The present technology relates to a reception device, a reception method, a transmission device, and a transmission method, and more particularly, the present technology relates to a reception device and the like that VR-displays a stereoscopic image.

BACKGROUND ART

[0002] In a case where a stereoscopic image is virtual reality (VR)-displayed, it is important for stereoscopic vision that subtitles and graphics be superimposed at a position closer to the viewer than an interactively displayed object. For example, Patent Document 1 shows a technology for transmitting depth information for each pixel or evenly divided block of an image together with image data of left-eye and right-eye images, and for using the depth information for depth control when superimposing and displaying subtitles and graphics on the receiving side. However, for a wide viewing angle image, it is necessary to secure a large transmission band for transmitting such depth information.

CITATION LIST

Patent Document

[0003] Patent Document 1: WO 2013/105401

SUMMARY OF THE INVENTION

Problems to be Solved by the Invention

[0004] An object of the present technology is to easily implement depth control when superimposing and displaying superimposition information by using depth information that is efficiently transmitted.

Solutions to Problems

[0005] A concept of the present technology is a reception device including:

[0006] a reception unit configured to receive a video stream obtained by encoding image data of a wide viewing angle image for each of left-eye and right-eye pictures, and depth meta information including position information and a representative depth value of a predetermined number of angle areas in the wide viewing angle image for each of the pictures; and

[0007] a processing unit configured to extract left-eye and right-eye display area image data from the image data of a wide viewing angle image for each of the left-eye and right-eye pictures obtained by decoding the video stream and to superimpose superimposition information data on the left-eye and right-eye display area image data for output,

[0008] in which when superimposing the superimposition information data on the left-eye and right-eye display area image data, the processing unit gives parallax to the superimposition information data to be superimposed on each of the left-eye and right-eye display area image data on the basis of the depth meta information.

[0009] In the present technology, the reception unit receives a video stream obtained by encoding image data of a wide viewing angle image for each of left-eye and right-eye pictures, and depth meta information including position information and a representative depth value of a predetermined number of angle areas in the wide viewing angle image for each of the pictures. For example, the reception unit may receive the depth meta information for each of the pictures by using a timed metadata stream associated with the video stream. Furthermore, for example, the reception unit may receive the depth meta information for each of the pictures, the depth meta information being inserted into the video stream. Furthermore, for example, the position information on the angle areas may be given as offset information based on a position of a predetermined viewpoint.

[0010] The left-eye and right-eye display area image data is extracted by the processing unit from the image data of a wide viewing angle image for each of the left-eye and right-eye pictures obtained by decoding the video stream. The superimposition information data is superimposed on the left-eye and right-eye display area image data for output. Here, when superimposing the superimposition information data on the left-eye and right-eye display area image data, on the basis of the depth meta information, parallax is added to the superimposition information display data that is superimposed on each of the left-eye and right-eye display area image data. For example, the superimposition information may include subtitles and/or graphics.

[0011] For example, when superimposing the superimposition information data on the left-eye and right-eye display area image data, the processing unit may give the parallax on the basis of a minimum value of the representative depth value of the predetermined number of areas corresponding to a superimposition range, the representative depth value being included in the depth meta information. Furthermore, for example, the depth meta information may further include position information indicating which position in the areas the representative depth value of the predetermined number of angle areas relates to. When superimposing the superimposition information data on the left-eye and right-eye display area image data, the processing unit may give the parallax on the basis of the representative depth value of the predetermined number of areas corresponding to the superimposition range and the position information, the representative depth value being included in the depth meta information. Furthermore, the depth meta information may further include a depth value corresponding to depth of a screen as a reference for the depth value.

[0012] Furthermore, for example, a display unit may be included that displays a three-dimensional image on the basis of the left-eye and right-eye display area image data on which the superimposition information data is superimposed. In this case, for example, the display unit may include a head mounted display.

[0013] In this way, in the present technology, when superimposing the superimposition information data on the left-eye and right-eye display area image data, parallax is given to the superimposition information data superimposed on each of the left-eye and right-eye display area image data on the basis of the depth meta information including position information and a representative depth value of the predetermined number of angle areas in the wide viewing angle image. Therefore, depth control when superimposing and displaying subtitles and graphics by using depth information that is efficiently transmitted can be easily implemented.

[0014] Furthermore, another concept of the present technology is

[0015] a transmission device including:

[0016] a transmission unit configured to transmit a video stream obtained by encoding image data of a wide viewing angle image for each of left-eye and right-eye pictures and depth meta information for each of the pictures,

[0017] in which the depth meta information includes position information and a representative depth value of a predetermined number of angle areas in the wide viewing angle image.

[0018] In the present technology, the transmission unit transmits the video stream obtained by encoding image data of a wide viewing angle image for each of the left-eye and right-eye pictures, and the depth meta information for each of the pictures. Here, the depth meta information includes position information and a representative depth value of the predetermined number of angle areas in the wide viewing angle image.

[0019] In this way, in the present technology, the video stream obtained by encoding image data of a wide viewing angle image for each of left-eye and right-eye pictures, and the depth meta information including position information and a representative depth value of the predetermined number of angle areas in the wide viewing angle image for each picture are transmitted. Therefore, depth information in the wide viewing angle image can be efficiently transmitted.

Effects of the Invention

[0020] According to the present technology, depth control when superimposing and displaying the superimposition information by using depth information that is efficiently transmitted can be easily implemented. Note that advantageous effects described here are not necessarily restrictive, and any of the effects described in the present disclosure may be applied.

BRIEF DESCRIPTION OF DRAWINGS

[0021] FIG. 1 is a block diagram showing a configuration example of a transmission-reception system as an embodiment.

[0022] FIG. 2 is a block diagram showing a configuration example of a service transmission system.

[0023] FIG. 3 is a diagram for describing planar packing for obtaining a projection image from a spherical capture image.

[0024] FIG. 4 is a diagram showing a structure example of an SPS NAL unit in HEVC encoding.

[0025] FIG. 5 is a diagram for describing causing a center O(p,q) of a cutout position to agree with a reference point RP (x,y) of the projection image.

[0026] FIG. 6 is a diagram showing a structure example of rendering metadata.

[0027] FIG. 7 is a diagram for describing each piece of information in the structure example of FIG. 6.

[0028] FIG. 8 is a diagram for describing each piece of information in the structure example of FIG. 6.

[0029] FIG. 9 is a diagram showing a concept of depth control of graphics by a parallax value.

[0030] FIG. 10 is a diagram schematically showing an example of setting an angle area under an influence of one viewpoint.

[0031] FIG. 11 is a diagram for describing a representative depth value of the angle area.

[0032] FIG. 12 is diagrams each showing part of a spherical image corresponding to each of left-eye and right-eye projection images.

[0033] FIG. 13 is a diagram showing definition of the angle area.

[0034] FIG. 14 is a diagram showing a structure example of a component descriptor and details of main information in the structure example.

[0035] FIG. 15 is a diagram schematically showing an MP4 stream as a distribution stream.

[0036] FIG. 16 is a diagram showing a structure example of timed metadata for one picture including depth meta information.

[0037] FIG. 17 is a diagram showing details of main information in the configuration example of FIG. 16.

[0038] FIG. 18 is a diagram showing a description example of an MPD file.

[0039] FIG. 19 is a diagram showing a structure example of a PSVP/SEI message.

[0040] FIG. 20 is a diagram schematically showing the MP4 stream in a case where the depth meta information is inserted into a video stream and transmitted.

[0041] FIG. 21 is a block diagram showing a configuration example of a service receiver.

[0042] FIG. 22 is a block diagram showing a configuration example of a renderer.

[0043] FIG. 23 is a view showing one example of a display area for the projection image.

[0044] FIG. 24 is a diagram for describing that a depth value for giving parallax to subtitle display data differs depending on a size of the display area.

[0045] FIG. 25 is a diagram showing one example of a method of setting the depth value for giving parallax to the subtitle display data at each movement position in the display area.

[0046] FIG. 26 is a diagram showing one example of the method of setting the depth value for giving parallax to the subtitle display data at each movement position in a case where the display area transitions between a plurality of angle areas set in the projection image.

[0047] FIG. 27 is a diagram showing one example of setting the depth value in a case where an HMD is used as a display unit.

[0048] FIG. 28 is a flowchart showing one example of a procedure for obtaining a subtitle depth value in a depth processing unit.

[0049] FIG. 29 is a diagram showing an example of depth control in a case where superimposition positions of subtitles and graphics partially overlap each other.

MODE FOR CARRYING OUT THE INVENTION

[0050] A mode for carrying out the invention (hereinafter referred to as an embodiment) will be described below.

[0051] Note that the description will be made in the following order.

[0052] 1. Embodiment

[0053] 2. Modification

  1. Embodiment

[0054] [Configuration Example of Transmission-Reception System]

[0055] FIG. 1 shows a configuration example of a transmission-reception system 10 as the embodiment. The transmission-reception system 10 includes a service transmission system 100 and a service receiver 200.

[0056] The service transmission system 100 transmits DASH/MP4, that is, an MPD file as a metafile and MP4 (ISOBMFF) including media streams such as video and audio through a communication network transmission path or an RF transmission path. In this embodiment, a video stream obtained by encoding image data of a wide viewing angle image for each of left-eye and right-eye pictures is included as the media stream.

[0057] Furthermore, the service transmission system 100 transmits depth meta information for each picture together with the video stream. The depth meta information includes position information and a representative depth value of the predetermined number of angle areas in the wide viewing angle image. In this embodiment, the depth meta information further includes position information indicating which position in the areas the representative depth value relates to. For example, the depth meta information for each picture is transmitted by using a timed metadata stream associated with the video stream, or inserted into the video stream and transmitted.

[0058] The service receiver 200 receives the above-described MP4 (ISOBMFF) transmitted from the service transmission system 100 through the communication network transmission path or the RF transmission path. The service receiver 200 acquires, from the MPD file, meta information regarding the video stream, and furthermore, meta information regarding the timed metadata stream in a case where the timed metadata stream exists.

[0059] Furthermore, the service receiver 200 extracts left-eye and right-eye display area image data from the image data of the wide viewing angle image for each of the left-eye and right-eye pictures obtained by decoding the video stream. The service receiver 200 superimposes superimposition information data such as subtitles and graphics on the left-eye and right-eye display area image data for output. In this case, the display area changes interactively on the basis of a user’s action or operation. When superimposing the superimposition information data on the left-eye and right-eye display area image data, on the basis of the depth meta information, parallax is given to the superimposition information data superimposed on each of the left-eye and right-eye display area image data.

[0060] For example, parallax is given on the basis of the minimum value of the representative depth value of the predetermined number of areas corresponding to a superimposition range included in the depth meta information. Furthermore, for example, in a case where the depth meta information further includes position information indicating which position in the areas the representative depth value relates to, parallax is added on the basis of the representative depth value of the predetermined number of areas corresponding to the superimposition range and the position information included in the depth meta information.
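The minimum-depth selection described in the paragraph above can be sketched as follows. This is an illustrative sketch only: the area representation, the rectangle convention, and all function names are hypothetical and not taken from the patent; the depth-to-parallax conversion uses formula (2) given later in the description.

```python
# Hypothetical sketch of giving parallax from depth meta information:
# take the minimum representative depth value of the angle areas that
# the superimposition range overlaps, then convert it to a parallax
# value with S = (D - K) * E / K (formula (2) in the description).

def parallax_for_overlay(overlay_rect, angle_areas, viewing_distance, eye_baseline):
    """overlay_rect and each area's "rect" are (x1, y1, x2, y2) tuples."""
    def overlaps(a, b):
        # True when rectangles a and b share any area.
        return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

    depths = [area["depth"] for area in angle_areas
              if overlaps(overlay_rect, area["rect"])]
    if not depths:
        return 0.0  # no depth info for this range: leave overlay at screen depth
    k = min(depths)  # nearest object within the superimposition range
    return (viewing_distance - k) * eye_baseline / k
```

With two adjacent angle areas at depths 200 and 50, an overlay straddling both is driven by the nearer area (depth 50), so the subtitle or graphic is placed in front of the nearest object it covers.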

[0061] “Configuration Example of Service Transmission System”

[0062] FIG. 2 shows a configuration example of the service transmission system 100. The service transmission system 100 includes a control unit 101, a user operation unit 101a, a left camera 102L, a right camera 102R, planar packing units 103L and 103R, a video encoder 104, a depth generation unit 105, a depth meta information generation unit 106, a subtitle generation unit 107, a subtitle encoder 108, a container encoder 109, and a transmission unit 110.

[0063] The control unit 101 includes a central processing unit (CPU), and controls an operation of each unit of the service transmission system 100 on the basis of a control program. The user operation unit 101a constitutes a user interface for the user to perform various operations, and includes, for example, a keyboard, a mouse, a touch panel, a remote controller, and the like.

[0064] The left camera 102L and the right camera 102R constitute a stereo camera. The left camera 102L captures a subject to obtain a spherical capture image (360° VR image). Similarly, the right camera 102R captures the subject to obtain a spherical capture image (360° VR image). For example, the cameras 102L and 102R perform image capturing by a back-to-back method and obtain super wide viewing angle front and rear images, each having a viewing angle of 180° or more and captured using a fisheye lens, as spherical capture images (see FIG. 3(a)).

[0065] The planar packing units 103L and 103R cut out a part or all of the spherical capture images obtained with the cameras 102L and 102R, respectively, and perform planar packing to obtain a rectangular projection image (projection picture) (see FIG. 3(b)). In this case, as a format type of the projection image, for example, equirectangular, cross-cubic, or the like is selected. Note that the planar packing units 103L and 103R cut out the projection image as necessary and perform scaling to obtain the projection image with a predetermined resolution (see FIG. 3(c)).

[0066] The video encoder 104 performs, for example, encoding such as HEVC on image data of the left-eye projection image from the planar packing unit 103L and image data of the right-eye projection image from the planar packing unit 103R to obtain encoded image data and generate a video stream including the encoded image data. For example, the image data of left-eye and right-eye projection images are combined by a side-by-side method or a top-and-bottom method, and the combined image data is encoded to generate one video stream. Furthermore, for example, the image data of each of the left-eye and right-eye projection images is encoded to generate two video streams.

[0067] Cutout position information is inserted into an SPS NAL unit of the video stream. For example, in encoding of HEVC, “default_display_window” corresponds thereto.

[0068] FIG. 4 shows a structure example (syntax) of the SPS NAL unit in HEVC encoding. The field of “pic_width_in_luma_samples” indicates the horizontal resolution (pixel size) of the projection image. The field of “pic_height_in_luma_samples” indicates the vertical resolution (pixel size) of the projection image. Then, when the “default_display_window_flag” is set, the cutout position information “default_display_window” exists. The cutout position information is offset information with the upper left of the decoded image as a base point (0,0).

[0069] The field of “def_disp_win_left_offset” indicates the left end position of the cutout position. The field of “def_disp_win_right_offset” indicates the right end position of the cutout position. The field of “def_disp_win_top_offset” indicates the upper end position of the cutout position. The field of “def_disp_win_bottom_offset” indicates the lower end position of the cutout position.

[0070] In this embodiment, the center of the cutout position indicated by the cutout position information can be set to agree with the reference point of the projection image. Here, when the center of the cutout position is O(p,q), p and q are each represented by the following formula.

p=(def_disp_win_right_offset-def_disp_win_left_offset)*1/2+def_disp_win_left_offset

q=(def_disp_win_bottom_offset-def_disp_win_top_offset)*1/2+def_disp_win_top_offset
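As a quick numerical check of the center formulas above, the midpoint arithmetic can be sketched as follows; the offset values are illustrative samples, not taken from the patent.

```python
# Center O(p, q) of the cutout position, computed per the formulas above.
# The patent treats all four def_disp_win offsets as positions measured
# from the upper-left base point (0, 0) of the decoded image.

def cutout_center(left, right, top, bottom):
    p = (right - left) // 2 + left
    q = (bottom - top) // 2 + top
    return p, q

# Illustrative example: a 1920x1080 window cut out with its
# upper-left corner at (320, 180).
p, q = cutout_center(left=320, right=2240, top=180, bottom=1260)
# p = (2240 - 320)//2 + 320 = 1280, q = (1260 - 180)//2 + 180 = 720
```

Setting the reference point RP (x, y) of the projection image to (1280, 720) in this example would satisfy the backward-compatibility condition described next.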

[0071] FIG. 5 shows that the center O(p,q) of the cutout position agrees with the reference point RP (x,y) of the projection image. In the illustrated example, “projection_pic_size_horizontal” indicates the horizontal pixel size of the projection image, and “projection_pic_size_vertical” indicates the vertical pixel size of the projection image. Note that a receiver that supports VR display can obtain a display view (display image) by rendering the projection image, and the default view is centered on the reference point RP (x,y). Note that the reference point can be made to correspond to physical space by aligning it with a specified actual compass direction (north, south, east, or west).

[0072] Furthermore, the video encoder 104 inserts an SEI message having rendering metadata (meta information for rendering) in the “SEIs” part of the access unit (AU). FIG. 6 shows a structure example (syntax) of the rendering metadata (Rendering_metadata). Furthermore, FIG. 8 shows details of main information (Semantics) in each structure example.

[0073] The 16-bit field of “rendering_metadata_id” is an ID that identifies the rendering metadata structure. The 16-bit field of “rendering_metadata_length” indicates the rendering metadata structure byte size.

[0074] The 16-bit field of each of “start_offset_sphere_latitude”, “start_offset_sphere_longitude”, “end_offset_sphere_latitude”, and “end_offset_sphere_longitude” indicates the cutout range information in a case where the spherical capture image undergoes planar packing (see FIG. 7(a)). The field of “start_offset_sphere_latitude” indicates the latitude (vertical direction) of the cutout start offset from the sphere. The field of “start_offset_sphere_longitude” indicates the longitude (horizontal direction) of the cutout start offset from the sphere. The field of “end_offset_sphere_latitude” indicates the latitude (vertical direction) of the cutout end offset from the sphere. The field of “end_offset_sphere_longitude” indicates the longitude (horizontal direction) of the cutout end offset from the sphere.

[0075] The 16-bit field of each of “projection_pic_size_horizontal” and “projection_pic_size_vertical” indicates size information on the projection image (projection picture) (see FIG. 7(b)). The field of “projection_pic_size_horizontal” indicates the horizontal pixel count from the top-left, that is, the horizontal size of the projection image. The field of “projection_pic_size_vertical” indicates the vertical pixel count from the top-left, that is, the vertical size of the projection image.

[0076] The 16-bit field of each of “scaling_ratio_horizontal” and “scaling_ratio_vertical” indicates the scaling ratio from the original size of the projection image (see FIGS. 3(b), (c)). The field of “scaling_ratio_horizontal” indicates the horizontal scaling ratio from the original size of the projection image. The field of “scaling_ratio_vertical” indicates the vertical scaling ratio from the original size of the projection image.

[0077] The 16-bit field of each of “reference_point_horizontal” and “reference_point_vertical” indicates position information of the reference point RP (x,y) of the projection image (see FIG. 7(b)). The field of “reference_point_horizontal” indicates the horizontal pixel position “x” of the reference point RP (x,y). The field of “reference_point_vertical” indicates the vertical pixel position “y” of the reference point RP (x,y).

[0078] The 5-bit field of “format_type” indicates the format type of the projection image. For example, “0” indicates equirectangular, “1” indicates cross-cubic, and “2” indicates partitioned cross cubic.

[0079] The 1-bit field of “backwardcompatible” indicates whether or not backward compatibility has been set, that is, whether or not the center O(p,q) of the cutout position indicated by the cutout position information inserted in the video stream layer has been set to match the reference point RP (x,y) of the projection image. For example, “0” indicates that backward compatibility has not been set, and “1” indicates that backward compatibility has been set.

[0080] The depth generation unit 105 determines a depth value, which is depth information, for each block by using the left-eye and right-eye projection images from the planar packing units 103L and 103R. In this case, the depth generation unit 105 obtains a parallax (disparity) value by determining the sum of absolute differences (SAD) for each pixel block of 4×4, 8×8, or the like, and further converts the parallax (disparity) value into the depth value.
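The per-block SAD search described above can be sketched as follows. This is a pure-Python illustration only: the row-major flat-list image layout, the block size, and the search range are assumptions for clarity, and a real encoder would use an optimized block-matching implementation.

```python
# Illustrative SAD-based disparity search: for a block in the left image,
# find the horizontal shift in the right image that minimizes the sum of
# absolute differences. Images are row-major flat lists of pixel values.

def block_sad(left, right, width, bx, by, block, shift):
    """SAD between the block at (bx, by) in `left` and the block shifted
    `shift` pixels to the left in `right`. Requires bx >= shift."""
    total = 0
    for y in range(by, by + block):
        for x in range(bx, bx + block):
            total += abs(left[y * width + x] - right[y * width + x - shift])
    return total

def block_disparity(left, right, width, bx, by, block=4, max_shift=8):
    """Disparity (in pixels) minimizing the SAD over the search range."""
    return min(range(0, max_shift + 1),
               key=lambda s: block_sad(left, right, width, bx, by, block, s))
```

The winning shift is the parallax (disparity) value for that block, which the conversion formulas below turn into a depth value.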

[0081] Here, the conversion from the parallax value to the depth value will be described. FIG. 9 shows, for example, a concept of depth control of graphics by using the parallax value. In a case where the parallax value is a negative value, the parallax is given such that the graphics for the left-eye display shifts to the right and the graphics for the right-eye display shifts to the left on the screen. In this case, the graphics are displayed in front of the screen. Conversely, in a case where the parallax value is a positive value, the parallax is given such that the graphics for the left-eye display shifts to the left and the graphics for the right-eye display shifts to the right on the screen. In this case, the graphics are displayed behind the screen.
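The shift rule above can be sketched as a small helper that splits a parallax value symmetrically between the two eye images; the function name and the half-shift-per-eye split are assumptions for illustration:

```python
def shifted_positions(x, s):
    """Apply a parallax value s to a graphics x-position: the left-eye copy
    is shifted by -s/2 and the right-eye copy by +s/2. A negative s moves
    the left-eye graphics right and the right-eye graphics left (perceived
    in front of the screen); a positive s does the opposite (behind it)."""
    return x - s / 2.0, x + s / 2.0
```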

[0082] In FIG. 9, (θ0−θ2) shows the parallax angle in the same-side direction, and (θ0−θ1) shows the parallax angle in the crossing direction. Furthermore, D indicates the distance between the screen and the installation surface of the camera (human eyes) (viewing distance), E indicates the installation interval (eye_baseline) of the camera (human eyes), K indicates the depth value, which is the distance to an object, and S indicates the parallax value.

[0083] At this time, K is calculated by the following formula (1) from a ratio of S and E and a ratio of D and K. By transforming this formula, formula (2) is obtained. Formula (1) constitutes a conversion formula for converting the parallax value S into the depth value K. Conversely, formula (2) constitutes a conversion formula for converting the depth value K into the parallax value S.

K=D/(1+S/E) (1)

S=(D-K)E/K (2)
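Formulas (1) and (2) can be written directly as a pair of mutually inverse helper functions; the function and parameter names are assumptions, while the arithmetic follows the formulas above:

```python
def parallax_to_depth(s, d, e):
    """Formula (1): K = D / (1 + S/E), converting parallax S into depth K
    given viewing distance D and eye baseline E."""
    return d / (1.0 + s / e)

def depth_to_parallax(k, d, e):
    """Formula (2): S = (D - K) * E / K, the inverse conversion."""
    return (d - k) * e / k
```

An object at screen depth (K = D) yields zero parallax, and the two functions round-trip exactly.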

[0084] Returning to FIG. 2, the depth meta information generation unit 106 generates the depth meta information. The depth meta information includes the position information and the representative depth value of the predetermined number of angle areas set on the projection image. In this embodiment, the depth meta information further includes the position information indicating which position in the areas the representative depth value relates to.

[0085] Here, the predetermined number of angle areas is set by the user operating the user operation unit 101a. In this case, the predetermined number of viewpoints is set, and the predetermined number of angle areas under the influence of each viewpoint is further set. The position information of each angle area is given as offset information based on the position of the corresponding viewpoint.

[0086] Furthermore, the representative depth value of each angle area is the minimum of the per-block depth values within that angle area, among the depth values of the blocks generated by the depth generation unit 105.

[0087] FIG. 10 schematically shows an example of setting the angle areas under the influence of one viewpoint. FIG. 10(a) shows an example in a case where the angle area AR includes equally spaced divided areas, and nine angle areas AR1 to AR9 are set. FIG. 10(b) shows an example in a case where the angle area AR includes divided areas with flexible sizes, and six angle areas AR1 to AR6 are set. Note that the angle areas do not necessarily have to be spatially contiguous.

[0088] FIG. 11 shows one angle area ARi set on the projection image. In the figure, the outer rectangular frame shows the entire projection image, and a depth value dv(j, k) exists in block units corresponding to this projection image; combined, these constitute a depth map (depthmap).

[0089] The representative depth value DPi in the angle area ARi is the minimum value among a plurality of depth values dv(j, k) included in the angle area ARi, and is represented by formula (3) below.

[Formula 1]

DPi = min_{(j, k) ∈ ARi} dv(j, k) (3)
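A minimal sketch of formula (3), assuming the depth map is stored as a 2-D array of per-block values and the angle area is given as a block-index rectangle (the representation of `area` is an assumption for illustration):

```python
import numpy as np

def representative_depth(depth_map, area):
    """Formula (3): DPi = min over (j, k) in ARi of dv(j, k).
    `depth_map` holds the per-block depth values dv(j, k); `area` is an
    assumed (j0, k0, j1, k1) block-index rectangle, end-exclusive."""
    j0, k0, j1, k1 = area
    return depth_map[j0:j1, k0:k1].min()
```

Because a smaller depth value means a nearer object, taking the minimum selects the foremost object in the area.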

[0090] FIGS. 12(a) and 12(b) show part of spherical images corresponding to the left-eye and right-eye projection images obtained by the planar packing units 103L and 103R, respectively. “C” indicates the center position corresponding to the viewing position. In the illustrated example, in addition to the reference point RP of the projection image, eight viewpoints from VpA to VpH that are the reference for the angle area are set.

[0091] The position of each point is indicated by an azimuth angle φ and an elevation angle θ. The position of each angle area (not shown in FIG. 12) is given by the offset angle from the corresponding viewpoint. Here, the azimuth angle φ and the elevation angle θ each indicate an angle in the arrow direction, and the angle at the base point position of the arrow is 0 degrees. For example, as in the illustrated example, the azimuth angle φ of the reference point (RP) is set at φr = 0°, and the elevation angle θ of the reference point (RP) is set at θr = 90° (π/2).

[0092] FIG. 13 shows the definition of the angle area. In the illustrated example, the outer rectangular frame shows the entire projection image. Furthermore, in the illustrated example, three angle areas under the influence of the viewpoint VP, namely AG_1, AG_2, and AG_3, are shown. Each angle area is represented by the angles AG_tl and AG_br, which are position information on the upper-left start point and the lower-right end point of the rectangular angle area with respect to the viewpoint position. Here, AG_tl and AG_br are horizontal and vertical two-dimensional angles with respect to the viewpoint VP, and D is the estimated distance between the display position and the estimated viewing position.
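Under the flat-screen approximation suggested by the estimated distance D, a viewpoint-relative corner angle of an angle area could be resolved to an on-screen position roughly as follows; this is a sketch only, and all parameter names and the pixel scale are assumptions not taken from the patent:

```python
import math

def area_corner(vp_x, vp_y, angle_h, angle_v, d, px_per_unit):
    """Map one corner of an angle area, given as a horizontal/vertical
    angle pair (radians) relative to the viewpoint VP, to an on-screen
    position: the offset from the viewpoint is d * tan(angle), scaled
    into pixels by the assumed factor px_per_unit."""
    return (vp_x + d * math.tan(angle_h) * px_per_unit,
            vp_y + d * math.tan(angle_v) * px_per_unit)
```

A zero-angle corner coincides with the viewpoint itself, and the offset grows with both the angle and the estimated distance D.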

[0093] Note that in the above description, the depth meta information generation unit 106 determines the representative depth value of each angle area by using the depth value of each block generated by the depth generation unit 105. However, as shown as a broken line in FIG. 2, it is also possible to determine the representative depth value of each angle area by using the depth value for each pixel or each block obtained by a depth sensor 111. In that case, the depth generation unit 105 is unnecessary.

[0094] The subtitle generation unit 107 generates subtitle data to be superimposed on the image. The subtitle encoder 108 encodes the subtitle data generated by the subtitle generation unit 107 to generate a subtitle stream. Note that, by referring to the depth value for each block generated by the depth generation unit 105, the subtitle encoder 108 adds to the subtitle data the depth value that can be used for depth control of subtitles during default-view display centered on the reference point RP (x,y) of the projection image, or the parallax value obtained by converting that depth value. Note that it is also conceivable to further add to the subtitle data the depth value or parallax value that can be used during view display centered on each viewpoint set in the depth meta information described above.

[0095] Returning to FIG. 2, the container encoder 109 generates, as the distribution stream STM, a container, here an MP4 stream, including the video stream generated by the video encoder 104, the subtitle stream generated by the subtitle encoder 108, and the timed metadata stream having the depth meta information for each picture generated by the depth meta information generation unit 106. In this case, the container encoder 109 inserts the rendering metadata (see FIG. 6) into the MP4 stream including the video stream. Note that in this embodiment, the rendering metadata is inserted into both the video stream layer and the container layer, but may be inserted into only either one.

[0096] Furthermore, the container encoder 109 inserts a descriptor having various types of information into the MP4 stream including the video stream in association with the video stream. One such descriptor is the conventionally well-known component descriptor (component_descriptor).

[0097] FIG. 14(a) shows a structure example (syntax) of the component descriptor, and FIG. 14(b) shows details of main information (semantics) in the structure example. The 4-bit field of “stream_content” indicates the encoding method of the video, audio, or subtitle. In this embodiment, this field is set at “0x9” and indicates HEVC encoding.

[0098] The 4-bit field of “stream_content_ext” indicates details of the encoding target by being used in combination with the above-described “stream_content.” The 8-bit field of “component_type” indicates variation in each encoding method. In this embodiment, “stream_content_ext” is set at “0x2” and “component_type” is set at “0x5” to indicate “distribution of stereoscopic VR by encoding HEVC Main10 Profile UHD”.

[0099] The transmission unit 110 puts the MP4 distribution stream STM obtained by the container encoder 109 on a broadcast wave or a network packet and transmits the MP4 distribution stream STM to the service receiver 200.

[0100] FIG. 15 schematically shows an MP4 stream. FIG. 15 shows an MP4 stream including the video stream (video track) and an MP4 stream including the timed metadata stream (timed metadata track). Although omitted here, an MP4 stream including the subtitle stream (subtitle track) and the like also exists.

[0101] The MP4 stream (video track) has a configuration in which each random access period starts with an initialization segment (IS), which is followed by boxes of “styp”, “sidx (segment index box)”, “ssix (sub-segment index box)”, “moof (movie fragment box)” and “mdat (media data box).”

[0102] The initialization segment (IS) has a box structure based on an ISO base media file format (ISOBMFF). Rendering metadata and component descriptors are inserted in this initialization segment (IS).

[0103] The “styp” box contains segment type information. The “sidx” box contains range information on each track, indicates the position of “moof”/“mdat”, and also indicates the position of each sample (picture) in “mdat”. The “ssix” box contains track classification information, which classifies tracks into I/P/B types.

……
……
……
