Nvidia Patent | Mixed Primary Display With Spatially Modulated Backlight
Patent: Mixed Primary Display With Spatially Modulated Backlight
Publication Number: 10636336
Publication Date: 20200428
Applicants: Nvidia
Abstract
A method, computer readable medium, and system are disclosed for generating mixed-primary data for display. The method includes the steps of receiving a source image that includes a plurality of pixels, dividing the source image into a plurality of blocks, analyzing the source image based on an image decomposition algorithm, encoding chroma information and modulation information to generate a video signal, and transmitting the video signal to a mixed-primary display. The chroma information and modulation information correspond with two or more mixed-primary color components and are generated by the image decomposition algorithm to minimize error between a reproduced image and the source image. The two or more mixed-primary colors selected for each block of the source image are not limited to any particular set of colors and each mixed-primary color component may be selected from any color capable of being reproduced by the mixed-primary display.
FIELD OF THE INVENTION
The present invention relates to graphics processing, and more particularly to generating image data for a mixed primary display.
BACKGROUND
Display technology has been advancing with cathode ray tube (CRT) monitors replaced with liquid crystal display (LCD), flat-panel monitors, light emitting diode (LED) backlights, and even organic LED (OLED) monitors, as well as others. Current display technology is also quickly evolving towards higher pixel densities and higher resolutions such as 4K. While these advanced technologies are impressive, manufacturing cheap monitors that implement such technologies is still a challenge. For example, the resolution and pixel densities of common, mass-produced display technology are still too low for a high quality light-field display or for virtual reality headsets. While it is technically feasible to produce displays with high enough resolutions, such displays are currently expensive and require high bandwidths for communication to receive frame buffer data at frame rates of 60 Hz or higher. Such displays also typically have increased power requirements compared to current common display technology. Thus, there is a need for addressing these issues and/or other issues associated with the prior art.
SUMMARY
A method, computer readable medium, and system are disclosed for generating mixed-primary data for display. The method includes the steps of receiving a source image that includes a plurality of pixels, dividing the source image into a plurality of blocks, analyzing the source image based on an image decomposition algorithm, encoding chroma information and modulation information to generate a video signal, and transmitting the video signal to a mixed-primary display. The chroma information and modulation information correspond with two or more mixed-primary color components and are generated by the image decomposition algorithm to minimize error between a reproduced image and the source image. The two or more mixed-primary colors selected for each block of the source image are not limited to any particular set of colors and each mixed-primary color component may be selected from any color capable of being reproduced by the mixed-primary display.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 illustrates a flowchart of a method for generating image data for a mixed primary display, in accordance with one embodiment;
FIG. 2 illustrates a parallel processing unit (PPU), in accordance with one embodiment;
FIG. 3A illustrates a general processing cluster of the PPU of FIG. 2, in accordance with one embodiment;
FIG. 3B illustrates a partition unit of the PPU of FIG. 2, in accordance with one embodiment;
FIG. 4 illustrates the streaming multi-processor of FIG. 3A, in accordance with one embodiment;
FIG. 5 illustrates a system-on-chip including the PPU of FIG. 2, in accordance with one embodiment;
FIG. 6 is a conceptual diagram of a graphics processing pipeline implemented by the PPU of FIG. 2, in accordance with one embodiment;
FIG. 7A illustrates a mixed primary display, in accordance with one embodiment;
FIG. 7B illustrates a technique for displaying images on the mixed primary display using temporal multiplexing, in accordance with one embodiment;
FIG. 8 illustrates a mixed primary display, in accordance with one embodiment;
FIG. 9 illustrates a flowchart of a method for generating image data for a mixed primary display, in accordance with another embodiment; and
FIG. 10 illustrates an exemplary system in which the various architecture and/or functionality of the various previous embodiments may be implemented.
DETAILED DESCRIPTION
A new display technology is proposed that exploits the physiological characteristics of the human eye, is power efficient, requires a smaller bandwidth to receive information for each frame of image data, and offers both a wide color gamut and high dynamic range. The human eye is made up of millions of photoreceptor cells, commonly referred to as rods and cones. Rods are extremely sensitive to light but are only reactive to one particular range of wavelengths and, therefore, cannot resolve colors. Rods are responsible for vision in low light conditions (i.e., night vision) and are found in higher concentrations at the periphery of the retina. Cones are not as sensitive to light, but there are three different types of cones that are sensitive to three different ranges of wavelengths. Thus, cones are used to resolve colors. The human retina contains roughly 5-6 million cones and 100 million rods. The human brain resolves images based on the signals from all of these photoreceptor cells. It is believed that colors are perceived based on differences between signals of the different cone types, similar to how CMOS-type photoreceptor sites work on an image sensor. In trichromatic vision, levels of short-wavelength, medium-wavelength, and long-wavelength signals from the different types of cones in different areas of the retina are processed to perceive particular colors.
Again, rods are responsible for seeing in low light (scotopic vision), but the rods typically have low visual acuity, making it difficult for rods to determine spatial relationships. This is partly because many rods converge into a single bipolar cell, and ganglion cell, to produce signals for the brain, which reduces the spatial resolution of signals from the rods. Cones on the other hand have a higher visual acuity because multiple cones do not converge on a single bipolar cell. The result is that the human eye is much more sensitive to luminance components of color than chrominance components of color. Studies have also shown that the brain has a tendency to discard some hue and saturation information and perceive more details based on differences in light and dark. In other words, the human eye responds more acutely to differences in luminance rather than differences in chrominance.
These differences in perception can be exploited to create a display with a higher color gamut and dynamic range than that of conventional displays. A mixed-primary display is proposed that includes a first, low-resolution layer for displaying chrominance information for an image and a second, high-resolution layer for modulating luminance at each high-resolution pixel site. Such mixed-primary displays may enable high resolution image data to be compressed for transmission at a lower bandwidth. The high resolution image data may be processed into a low resolution chrominance image and a high resolution luminance image, each image corresponding to one of the two layers of the display. Furthermore, a single image frame may be split into multiple sub-frames, each sub-frame corresponding with a particular mixed-primary color component, and then the multiple sub-frames may be displayed in quick succession such that the viewer perceives a single image.
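The bandwidth argument can be made concrete with a rough count of bits per frame. The sketch below is illustrative only: the 8-bit channel depth, 8-bit modulation level, 16 by 16 block size, and two mixed-primary components are assumptions chosen for the arithmetic, and the function names are hypothetical.

```python
def rgb_bits_per_frame(width, height, bits_per_channel=8):
    """Bits needed to send every pixel as three primary-color channels."""
    return width * height * 3 * bits_per_channel


def mixed_primary_bits_per_frame(width, height, block=16, components=2,
                                 modulation_bits=8, bits_per_channel=8):
    """Bits for per-pixel modulation levels plus one color per block, per component.

    All default values are assumptions made for this illustration.
    """
    blocks = (width // block) * (height // block)
    modulation = width * height * modulation_bits   # high-resolution layer
    chroma = blocks * 3 * bits_per_channel          # low-resolution layer
    return components * (modulation + chroma)


full = rgb_bits_per_frame(3840, 2160)
mixed = mixed_primary_bits_per_frame(3840, 2160)
print(f"RGB: {full} bits, mixed-primary: {mixed} bits, ratio: {mixed / full:.3f}")
```

Under these assumptions the per-block chroma data is small compared with the per-pixel modulation data, so the saving comes mainly from sending two modulation values per pixel instead of three color channels.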
FIG. 1 illustrates a flowchart of a method 100 for generating image data for a mixed-primary display, in accordance with one embodiment. It will be appreciated that the method 100 is described within the scope of software executed by a processor; however, in some embodiments, the method 100 may be implemented in hardware or some combination of hardware and software. The method 100 begins at step 102, where a parallel processing unit receives a source image for display. The source image may be a high resolution image that matches a resolution of a top layer of the mixed-primary display. Of course, in the case that the resolution of the source image does not match, the source image may be pre-processed to match the resolution of the top layer of the mixed-primary display. In one embodiment, the source image is received in a particular image format such as RGBA (i.e., red, green, blue, alpha). In other embodiments, the image format may be a different format, such as RGB, YUV, and the like.
At step 104, the parallel processing unit divides the source image into a plurality of blocks. In one embodiment, the image is divided into a plurality of N pixel by N pixel blocks. For example, each block may be 32 pixels by 32 pixels, 16 pixels by 16 pixels, or 4 pixels by 4 pixels. Of course, in some embodiments, the number of horizontal pixels by the number of vertical pixels may be different such that each block is N pixels by M pixels. Each block corresponds to a single pixel of a bottom layer of the mixed-primary display.
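Step 104 can be sketched in a few lines. This is a minimal illustration, assuming the source image is held as a list of pixel rows and that its dimensions are exact multiples of the block size; the function name `split_into_blocks` and its parameters are hypothetical.

```python
def split_into_blocks(image, block_w, block_h=None):
    """Split a list of pixel rows into blocks of block_w by block_h pixels.

    Returns a dict mapping (block_row, block_col) to that block's pixel rows.
    Each key corresponds to one pixel of the low-resolution bottom layer.
    """
    block_h = block_w if block_h is None else block_h
    height, width = len(image), len(image[0])
    if width % block_w or height % block_h:
        raise ValueError("image size must be a multiple of the block size")
    blocks = {}
    for by in range(height // block_h):
        for bx in range(width // block_w):
            rows = image[by * block_h:(by + 1) * block_h]
            blocks[(by, bx)] = [row[bx * block_w:(bx + 1) * block_w] for row in rows]
    return blocks


# An 8x8 source image of (r, g, b) tuples divided into 4x4 blocks.
source = [[(x, y, 0) for x in range(8)] for y in range(8)]
blocks = split_into_blocks(source, 4)
```

Passing different values for `block_w` and `block_h` covers the non-square case in which each block is N pixels by M pixels.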
At step 106, the parallel processing unit analyzes the source image based on an image decomposition algorithm. The image decomposition algorithm may transform pixel values in a first color space into new pixel values in a second color space. The second color space may be associated with a number of mixed-primary color components. For example, pixel values for the image may be represented as a combination of three components in an RGB color space having a red primary color, a green primary color, and a blue primary color. These pixel values may be mapped to a close approximation to new pixel values represented as a combination of two components in a custom color space having two mixed-primary colors. As used herein, a mixed-primary color is any color capable of being reproduced as a combination of one or more primary colors (such as red, green, and blue).
In one embodiment, each block of the source image is associated with a different custom color space associated with two mixed-primary color components. The two mixed-primary color components for the custom color space may be any of the colors represented in the first color space (i.e., any combination of RGB values). The two mixed-primary color components that define the new color space for each block are generated as chroma information associated with the source image and then the pixel values in the source image are converted to new pixel values in the new custom color spaces for the blocks. The new pixel values will have two components that comprise the modulation information associated with the source image, each component being a value associated with one of the corresponding mixed-primary color components of the custom color space.
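One way to picture the conversion into a block's custom color space is a per-pixel least-squares fit. The sketch below is not the decomposition algorithm of the disclosure, which is not specified in this passage; it assumes the two mixed-primary colors `c1` and `c2` for a block have already been chosen, treats colors as RGB triples in the range 0 to 1, and clamps the fitted levels to that range as a simplification. All function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def clamp01(t):
    return min(1.0, max(0.0, t))


def fit_pixel(pixel, c1, c2):
    """Least-squares levels (a, b) so that a*c1 + b*c2 approximates pixel."""
    g11, g12, g22 = dot(c1, c1), dot(c1, c2), dot(c2, c2)
    r1, r2 = dot(c1, pixel), dot(c2, pixel)
    det = g11 * g22 - g12 * g12
    if abs(det) < 1e-12:
        # Degenerate case: the two colors are parallel, so fit with c1 only.
        a, b = (r1 / g11 if g11 else 0.0), 0.0
    else:
        # Solve the 2x2 normal equations with Cramer's rule.
        a = (r1 * g22 - r2 * g12) / det
        b = (r2 * g11 - r1 * g12) / det
    # Clamping is a simplification, not an exact constrained optimum.
    return clamp01(a), clamp01(b)


def decompose_block(block, c1, c2):
    """Return per-pixel (a, b) levels and the total squared error for one block."""
    modulation, error = [], 0.0
    for row in block:
        levels = []
        for pixel in row:
            a, b = fit_pixel(pixel, c1, c2)
            approx = [a * x + b * y for x, y in zip(c1, c2)]
            error += sum((p - q) ** 2 for p, q in zip(pixel, approx))
            levels.append((a, b))
        modulation.append(levels)
    return modulation, error


c1, c2 = (1.0, 0.5, 0.0), (0.0, 0.5, 1.0)
block = [[(0.5, 0.375, 0.25), (1.0, 0.5, 0.0)],
         [(0.0, 0.5, 1.0), (0.0, 0.0, 0.0)]]
modulation, error = decompose_block(block, c1, c2)
```

The returned error is the quantity a fuller algorithm would try to reduce when selecting `c1` and `c2` for each block; that selection step is deliberately left out here.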
At step 108, the parallel processing unit encodes chroma information and modulation information derived from the image decomposition algorithm into a video signal for the mixed-primary display. The encoding may include generating a number of sub-frames, each sub-frame corresponding to one mixed-primary color component. Each sub-frame may include chroma information for specifying a particular color for the corresponding mixed-primary color component for each block. Each sub-frame may also include modulation information that identifies a level of the mixed-primary color component for each pixel of each block. At step 110, the parallel processing unit transmits the video signal to the mixed primary display.
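The sub-frame organization described for step 108 can be sketched as a simple data structure. This is an illustration only: the dictionary layout, the field names, and the function `encode_subframes` are assumptions made for the example and do not describe an actual video signal format.

```python
def encode_subframes(block_data, num_components=2):
    """Build one sub-frame per mixed-primary color component.

    block_data maps a block index to a dict with:
      "primaries":  list of num_components (r, g, b) colors for the block
      "modulation": rows of per-pixel tuples, one level per component
    """
    subframes = []
    for k in range(num_components):
        # Chroma information: the color of component k for each block.
        chroma = {idx: data["primaries"][k] for idx, data in block_data.items()}
        # Modulation information: the level of component k for each pixel.
        modulation = {
            idx: [[levels[k] for levels in row] for row in data["modulation"]]
            for idx, data in block_data.items()
        }
        subframes.append({"component": k, "chroma": chroma, "modulation": modulation})
    return subframes


block_data = {
    (0, 0): {
        "primaries": [(1.0, 0.5, 0.0), (0.0, 0.5, 1.0)],
        "modulation": [[(0.5, 0.25), (1.0, 0.0)]],
    }
}
subframes = encode_subframes(block_data)
```

Each element of `subframes` carries everything needed to show one mixed-primary color component: one color per block and one level per pixel.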
More illustrative information will now be set forth regarding various optional architectures and features with which the foregoing framework may or may not be implemented, per the desires of the user. It should be strongly noted that the following information is set forth for illustrative purposes and should not be construed as limiting in any manner. Any of the following features may be optionally incorporated with or without the exclusion of other features described.
Parallel Processing Architecture
FIG. 2 illustrates a parallel processing unit (PPU) 200, in accordance with one embodiment. In one embodiment, the PPU 200 is a multi-threaded processor that is implemented on one or more integrated circuit devices. The PPU 200 is a latency hiding architecture designed to process a large number of threads in parallel. A thread (i.e., a thread of execution) is an instantiation of a set of instructions configured to be executed by the PPU 200. In one embodiment, the PPU 200 is a graphics processing unit (GPU) configured to implement a graphics rendering pipeline for processing three-dimensional (3D) graphics data in order to generate two-dimensional (2D) image data for display on a display device such as a liquid crystal display (LCD) device. In other embodiments, the PPU 200 may be utilized for performing general-purpose computations. While one exemplary parallel processor is provided herein for illustrative purposes, it should be strongly noted that such processor is set forth for illustrative purposes only, and that any processor may be employed to supplement and/or substitute for the same.
As shown in FIG. 2, the PPU 200 includes an Input/Output (I/O) unit 205, a host interface unit 210, a front end unit 215, a scheduler unit 220, a work distribution unit 225, a hub 230, a crossbar (XBar) 270, one or more general processing clusters (GPCs) 250, and one or more partition units 280. The PPU 200 may be connected to a host processor or other peripheral devices via a system bus 202. The PPU 200 may also be connected to a local memory comprising a number of memory devices 204. In one embodiment, the local memory may comprise a number of dynamic random access memory (DRAM) devices.
The I/O unit 205 is configured to transmit communications (i.e., commands, data, etc.) to, and receive communications from, a host processor (not shown) over the system bus 202. The I/O unit 205 may communicate with the host processor directly via the system bus 202 or through one or more intermediate devices such as a memory bridge. In one embodiment, the I/O unit 205 implements a Peripheral Component Interconnect Express (PCIe) interface for communications over a PCIe bus. In alternative embodiments, the I/O unit 205 may implement other types of well-known interfaces for communicating with external devices.
The I/O unit 205 is coupled to a host interface unit 210 that decodes packets received via the system bus 202. In one embodiment, the packets represent commands configured to cause the PPU 200 to perform various operations. The host interface unit 210 transmits the decoded commands to various other units of the PPU 200 as the commands may specify. For example, some commands may be transmitted to the front end unit 215. Other commands may be transmitted to the hub 230 or other units of the PPU 200 such as one or more copy engines, a video encoder, a video decoder, a power management unit, etc. (not explicitly shown). In other words, the host interface unit 210 is configured to route communications between and among the various logical units of the PPU 200.
In one embodiment, a program executed by the host processor encodes a command stream in a buffer that provides workloads to the PPU 200 for processing. A workload may comprise a number of instructions and data to be processed by those instructions. The buffer is a region in a memory that is accessible (i.e., read/write) by both the host processor and the PPU 200. For example, the host interface unit 210 may be configured to access the buffer in a system memory connected to the system bus 202 via memory requests transmitted over the system bus 202 by the I/O unit 205. In one embodiment, the host processor writes the command stream to the buffer and then transmits a pointer to the start of the command stream to the PPU 200. The host interface unit 210 provides the front end unit 215 with pointers to one or more command streams. The front end unit 215 manages the one or more streams, reading commands from the streams and forwarding commands to the various units of the PPU 200.
The front end unit 215 is coupled to a scheduler unit 220 that configures the various GPCs 250 to process tasks defined by the one or more streams. The scheduler unit 220 is configured to track state information related to the various tasks managed by the scheduler unit 220. The state may indicate which GPC 250 a task is assigned to, whether the task is active or inactive, a priority level associated with the task, and so forth. The scheduler unit 220 manages the execution of a plurality of tasks on the one or more GPCs 250.
The scheduler unit 220 is coupled to a work distribution unit 225 that is configured to dispatch tasks for execution on the GPCs 250. The work distribution unit 225 may track a number of scheduled tasks received from the scheduler unit 220. In one embodiment, the work distribution unit 225 manages a pending task pool and an active task pool for each of the GPCs 250. The pending task pool may comprise a number of slots (e.g., 32 slots) that contain tasks assigned to be processed by a particular GPC 250. The active task pool may comprise a number of slots (e.g., 4 slots) for tasks that are actively being processed by the GPCs 250. As a GPC 250 finishes the execution of a task, that task is evicted from the active task pool for the GPC 250 and one of the other tasks from the pending task pool is selected and scheduled for execution on the GPC 250. If an active task has been idle on the GPC 250, such as while waiting for a data dependency to be resolved, then the active task may be evicted from the GPC 250 and returned to the pending task pool while another task in the pending task pool is selected and scheduled for execution on the GPC 250.
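The interplay between the two pools can be modeled in software. The following is a toy sketch, not a description of the hardware: the class name `GpcTaskPools`, its methods, and the oldest-first selection policy are all assumptions, while the 32-slot and 4-slot sizes are the example values given above.

```python
from collections import deque


class GpcTaskPools:
    """Toy model of the pending and active task pools kept for one GPC."""

    def __init__(self, pending_slots=32, active_slots=4):
        self.pending_slots = pending_slots
        self.active_slots = active_slots
        self.pending = deque()
        self.active = []

    def submit(self, task):
        """Place a task in the pending pool; return False if the pool is full."""
        if len(self.pending) >= self.pending_slots:
            return False
        self.pending.append(task)
        return True

    def schedule(self):
        """Move pending tasks into free active slots, oldest first (assumed policy)."""
        while self.pending and len(self.active) < self.active_slots:
            self.active.append(self.pending.popleft())

    def finish(self, task):
        """Evict a completed task and schedule a replacement."""
        self.active.remove(task)
        self.schedule()

    def stall(self, task):
        """Evict an idle task, schedule another, and return the idle task to pending."""
        self.active.remove(task)
        self.schedule()
        self.pending.append(task)


pools = GpcTaskPools()
for task_id in range(6):
    pools.submit(task_id)
pools.schedule()
pools.finish(1)   # task 1 completes, so a pending task takes its slot
pools.stall(0)    # task 0 waits on a dependency and goes back to pending
```

In this model a stalled task rejoins the back of the pending pool, so it competes with other pending tasks the next time an active slot becomes free.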
The work distribution unit 225 communicates with the one or more GPCs 250 via XBar 270. The XBar 270 is an interconnect network that couples many of the units of the PPU 200 to other units of the PPU 200. For example, the XBar 270 may be configured to couple the work distribution unit 225 to a particular GPC 250. Although not shown explicitly, one or more other units of the PPU 200 are coupled to the host interface unit 210. The other units may also be connected to the XBar 270 via a hub 230.
The tasks are managed by the scheduler unit 220 and dispatched to a GPC 250 by the work distribution unit 225. The GPC 250 is configured to process the task and generate results. The results may be consumed by other tasks within the GPC 250, routed to a different GPC 250 via the XBar 270, or stored in the memory 204. The results can be written to the memory 204 via the partition units 280, which implement a memory interface for reading and writing data to/from the memory 204. In one embodiment, the PPU 200 includes a number U of partition units 280 that is equal to the number of separate and distinct memory devices 204 coupled to the PPU 200. A partition unit 280 will be described in more detail below in conjunction with FIG. 3B.
In one embodiment, a host processor executes a driver kernel that implements an application programming interface (API) that enables one or more applications executing on the host processor to schedule operations for execution on the PPU 200. An application may generate instructions (i.e., API calls) that cause the driver kernel to generate one or more tasks for execution by the PPU 200. The driver kernel outputs tasks to one or more streams being processed by the PPU 200. Each task may comprise one or more groups of related threads, referred to herein as a warp. A thread block may refer to a plurality of groups of threads including instructions to perform the task. Threads in the same group of threads may exchange data through shared memory. In one embodiment, a group of threads comprises 32 related threads.
FIG. 3A illustrates a GPC 250 of the PPU 200 of FIG. 2, in accordance with one embodiment. As shown in FIG. 3A, each GPC 250 includes a number of hardware units for processing tasks. In one embodiment, each GPC 250 includes a pipeline manager 310, a pre-raster operations unit (PROP) 315, a raster engine 325, a work distribution crossbar (WDX) 380, a memory management unit (MMU) 390, and one or more Texture Processing Clusters (TPCs) 320. It will be appreciated that the GPC 250 of FIG. 3A may include other hardware units in lieu of or in addition to the units shown in FIG. 3A.
In one embodiment, the operation of the GPC 250 is controlled by the pipeline manager 310. The pipeline manager 310 manages the configuration of the one or more TPCs 320 for processing tasks allocated to the GPC 250. In one embodiment, the pipeline manager 310 may configure at least one of the one or more TPCs 320 to implement at least a portion of a graphics rendering pipeline. For example, a TPC 320 may be configured to execute a vertex shader program on the programmable streaming multiprocessor (SM) 340. The pipeline manager 310 may also be configured to route packets received from the work distribution unit 225 to the appropriate logical units within the GPC 250. For example, some packets may be routed to fixed function hardware units in the PROP 315 and/or raster engine 325 while other packets may be routed to the TPCs 320 for processing by the primitive engine 335 or the SM 340.
The PROP unit 315 is configured to route data generated by the raster engine 325 and the TPCs 320 to a Raster Operations (ROP) unit in the partition unit 280, described in more detail below. The PROP unit 315 may also be configured to perform optimizations for color blending, organize pixel data, perform address translations, and the like.
The raster engine 325 includes a number of fixed function hardware units configured to perform various raster operations. In one embodiment, the raster engine 325 includes a setup engine, a coarse raster engine, a culling engine, a clipping engine, a fine raster engine, and a tile coalescing engine. The setup engine receives transformed vertices and generates plane equations associated with the geometric primitive defined by the vertices. The plane equations are transmitted to the coarse raster engine to generate coverage information (e.g., an x,y coverage mask for a tile) for the primitive. The output of the coarse raster engine may be transmitted to the culling engine where fragments associated with the primitive that fail a z-test are culled, and transmitted to a clipping engine where fragments lying outside a viewing frustum are clipped. Those fragments that survive clipping and culling may be passed to a fine raster engine to generate attributes for the pixel fragments based on the plane equations generated by the setup engine. The output of the raster engine 325 comprises fragments to be processed, for example, by a fragment shader implemented within a TPC 320.
Each TPC 320 included in the GPC 250 includes an M-Pipe Controller (MPC) 330, a primitive engine 335, one or more SMs 340, and one or more texture units 345. The MPC 330 controls the operation of the TPC 320, routing packets received from the pipeline manager 310 to the appropriate units in the TPC 320. For example, packets associated with a vertex may be routed to the primitive engine 335, which is configured to fetch vertex attributes associated with the vertex from the memory 204. In contrast, packets associated with a shader program may be transmitted to the SM 340.
In one embodiment, the texture units 345 are configured to load texture maps (e.g., a 2D array of texels) from the memory 204 and sample the texture maps to produce sampled texture values for use in shader programs executed by the SM 340. The texture units 345 implement texture operations such as filtering operations using mip-maps (i.e., texture maps of varying levels of detail). The texture unit 345 is also used as the Load/Store path for SM 340 to MMU 390. In one embodiment, each TPC 320 includes two (2) texture units 345.
The SM 340 comprises a programmable streaming processor that is configured to process tasks represented by a number of threads. Each SM 340 is multi-threaded and configured to execute a plurality of threads (e.g., 32 threads) from a particular group of threads concurrently. In one embodiment, the SM 340 implements a SIMD (Single-Instruction, Multiple-Data) architecture where each thread in a group of threads (i.e., a warp) is configured to process a different set of data based on the same set of instructions. All threads in the group of threads execute the same instructions. In another embodiment, the SM 340 implements a SIMT (Single-Instruction, Multiple Thread) architecture where each thread in a group of threads is configured to process a different set of data based on the same set of instructions, but where individual threads in the group of threads are allowed to diverge during execution. In other words, when an instruction for the group of threads is dispatched for execution, some threads in the group of threads may be active, thereby executing the instruction, while other threads in the group of threads may be inactive, thereby performing a no-operation (NOP) instead of executing the instruction. The SM 340 is described in more detail below in conjunction with FIG. 4.
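The active and inactive behavior of threads under SIMT divergence can be imitated with a per-lane mask. The sketch below is a software analogy only, assuming a 32-lane warp and a two-way branch; the helper names `execute_masked` and `simt_branch` are hypothetical and do not correspond to any hardware interface.

```python
WARP_SIZE = 32


def execute_masked(op, registers, active_mask):
    """Apply op on lanes whose mask entry is True; other lanes perform a no-op."""
    return [op(value) if active else value
            for value, active in zip(registers, active_mask)]


def simt_branch(registers, predicate, then_op, else_op):
    """Run both sides of a branch one after the other, each under its own mask."""
    taken = [predicate(value) for value in registers]
    not_taken = [not flag for flag in taken]
    registers = execute_masked(then_op, registers, taken)      # first path
    registers = execute_masked(else_op, registers, not_taken)  # second path
    return registers


lanes = list(range(WARP_SIZE))
result = simt_branch(lanes,
                     predicate=lambda v: v % 2 == 0,
                     then_op=lambda v: v // 2,
                     else_op=lambda v: 3 * v + 1)
```

Each instruction is issued once for the whole group, and the mask decides which lanes actually change state, which is why a divergent branch costs roughly the sum of both paths in this model.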
The MMU 390 provides an interface between the GPC 250 and the partition unit 280. The MMU 390 may provide translation of virtual addresses into physical addresses, memory protection, and arbitration of memory requests. In one embodiment, the MMU 390 provides one or more translation lookaside buffers (TLBs) for improving translation of virtual addresses into physical addresses in the memory 204.
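The role of a translation lookaside buffer can be illustrated with a small software model. This is a sketch under stated assumptions: the 4096-byte page size, the four-entry capacity, the least-recently-used replacement policy, and the class name `TinyMmu` are all chosen for illustration and are not taken from the disclosure.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size, for illustration only


class TinyMmu:
    """Translate virtual addresses using a page table fronted by a small TLB."""

    def __init__(self, page_table, tlb_entries=4):
        self.page_table = page_table      # virtual page number -> physical page number
        self.tlb = OrderedDict()          # recently used translations
        self.tlb_entries = tlb_entries
        self.hits = 0
        self.misses = 0

    def translate(self, virtual_address):
        page, offset = divmod(virtual_address, PAGE_SIZE)
        if page in self.tlb:
            self.hits += 1
            self.tlb.move_to_end(page)    # mark as most recently used
        else:
            self.misses += 1
            if page not in self.page_table:
                raise KeyError(f"no mapping for virtual page {page}")
            self.tlb[page] = self.page_table[page]
            if len(self.tlb) > self.tlb_entries:
                self.tlb.popitem(last=False)   # drop the least recently used entry
        return self.tlb[page] * PAGE_SIZE + offset


mmu = TinyMmu({0: 7, 1: 3})
first = mmu.translate(4100)    # misses the TLB and consults the page table
second = mmu.translate(4104)   # same page, served from the TLB
```

Repeated accesses to the same page skip the page-table lookup, which is the improvement a TLB is meant to provide.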
FIG. 3B illustrates a partition unit 280 of the PPU 200 of FIG. 2, in accordance with one embodiment. As shown in FIG. 3B, the partition unit 280 includes a Raster Operations (ROP) unit 350, a level two (L2) cache 360, a memory interface 370, and an L2 crossbar (XBar) 365. The memory interface 370 is coupled to the memory 204. Memory interface 370 may implement 16-, 32-, 64-, or 128-bit data buses, or the like, for high-speed data transfer. In one embodiment, the PPU 200 comprises U memory interfaces 370, one memory interface 370 per partition unit 280, where each partition unit 280 is connected to a corresponding memory device 204. For example, PPU 200 may be connected to up to U memory devices 204, such as graphics double-data-rate, version 5, synchronous dynamic random access memory (GDDR5 SDRAM). In one embodiment, the memory interface 370 implements a DRAM interface and U is equal to 8.
In one embodiment, the PPU 200 implements a multi-level memory hierarchy. The memory 204 is located off-chip in SDRAM coupled to the PPU 200. Data from the memory 204 may be fetched and stored in the L2 cache 360, which is located on-chip and is shared between the various GPCs 250. As shown, each partition unit 280 includes a portion of the L2 cache 360 associated with a corresponding memory device 204. Lower level caches may then be implemented in various units within the GPCs 250. For example, each of the SMs 340 may implement a level one (L1) cache. The L1 cache is private memory that is dedicated to a particular SM 340. Data from the L2 cache 360 may be fetched and stored in each of the L1 caches for processing in the functional units of the SMs 340. The L2 cache 360 is coupled to the memory interface 370 and the XBar 270.
The ROP unit 350 includes a ROP Manager 355, a Color ROP (CROP) unit 352, and a Z ROP (ZROP) unit 354. The CROP unit 352 performs raster operations related to pixel color, such as color compression, pixel blending, and the like. The ZROP unit 354 implements depth testing in conjunction with the raster engine 325. The ZROP unit 354 receives a depth for a sample location associated with a pixel fragment from the culling engine of the raster engine 325. The ZROP unit 354 tests the depth against a corresponding depth in a depth buffer for a sample location associated with the fragment. If the fragment passes the depth test for the sample location, then the ZROP unit 354 updates the depth buffer and transmits a result of the depth test to the raster engine 325. The ROP Manager 355 controls the operation of the ROP unit 350. It will be appreciated that the number of partition units 280 may be different than the number of GPCs 250 and, therefore, each ROP unit 350 may be coupled to each of the GPCs 250. Therefore, the ROP Manager 355 tracks packets received from the different GPCs 250 and determines the GPC 250 to which a result generated by the ROP unit 350 is routed. The CROP unit 352 and the ZROP unit 354 are coupled to the L2 cache 360 via an L2 XBar 365.
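The depth test performed for a sample location can be sketched as follows. This is a minimal illustration, assuming a "nearer is smaller" comparison and a depth buffer held as a list of rows; the comparison function and the name `depth_test` are assumptions, since the passage above does not specify them.

```python
def depth_test(depth_buffer, sample, fragment_depth):
    """Compare a fragment's depth with the stored depth at a sample location.

    Returns True and updates the buffer when the fragment is nearer
    (assumed less-than comparison); otherwise leaves the buffer unchanged.
    """
    x, y = sample
    if fragment_depth < depth_buffer[y][x]:
        depth_buffer[y][x] = fragment_depth
        return True
    return False


# A 2x2 depth buffer cleared to the far plane (1.0).
depth_buffer = [[1.0, 1.0], [1.0, 1.0]]
passed_near = depth_test(depth_buffer, (0, 1), 0.4)   # nearer than 1.0
passed_far = depth_test(depth_buffer, (0, 1), 0.9)    # behind the stored 0.4
```

The boolean result stands in for the depth-test result that is reported back for the fragment.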
FIG. 4 illustrates the streaming multi-processor 340 of FIG. 3A, in accordance with one embodiment. As shown in FIG. 4, the SM 340 includes an instruction cache 405, one or more scheduler units 410, a register file 420, one or more processing cores 450, one or more special function units (SFUs) 452, one or more load/store units (LSUs) 454, an interconnect network 480, a shared memory 470 and an L1 cache 490.
As described above, the work distribution unit 225 dispatches tasks for execution on the GPCs 250 of the PPU 200. The tasks are allocated to a particular TPC 320 within a GPC 250 and, if the task is associated with a shader program, the task may be allocated to an SM 340. The scheduler unit 410 receives the tasks from the work distribution unit 225 and manages instruction scheduling for one or more groups of threads (i.e., warps) assigned to the SM 340. The scheduler unit 410 schedules threads for execution in groups of parallel threads, where each group is called a warp. In one embodiment, each warp includes 32 threads. The scheduler unit 410 may manage a plurality of different warps, scheduling the warps for execution and then dispatching instructions from the plurality of different warps to the various functional units (i.e., cores 450, SFUs 452, and LSUs 454) during each clock cycle.