Evaluating involved expressions on pyopencl.array.Array instances by using overloaded operators can be somewhat inefficient, because a new temporary is created for each intermediate result. The functionality in the module pyopencl.elementwise contains tools to help generate kernels that evaluate multi-stage expressions on one or several operands in a single pass.
pyopencl.elementwise.ElementwiseKernel(context, arguments, operation, name="kernel", preamble="", options=[])¶A kernel that takes a number of scalar or vector arguments and performs an operation specified as a snippet of C on these arguments.
Warning
Using a return statement in operation will lead to incorrect results, as some elements may never get processed. Use PYOPENCL_ELWISE_CONTINUE instead.
Changed in version 2013.1: Added PYOPENCL_ELWISE_CONTINUE.
__call__(*args, wait_for=None)¶Invoke the generated scalar kernel. The arguments may either be scalars or pyopencl.array.Array instances.
Returns a new pyopencl.Event. wait_for may either be None or a list of pyopencl.Event instances for whose completion this command waits before starting execution.
Here’s a usage example. (You can find the complete example as examples/demo_elementwise.py in the PyOpenCL distribution.)
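The snippet below is a minimal sketch in the same spirit (the array sizes, variable names, and the linear-combination operation are illustrative, not a verbatim copy of the shipped file):
import numpy as np
import pyopencl as cl
import pyopencl.array
from pyopencl.elementwise import ElementwiseKernel

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

a = cl.array.to_device(queue, np.random.rand(50000).astype(np.float32))
b = cl.array.to_device(queue, np.random.rand(50000).astype(np.float32))

# A single generated kernel evaluates the whole expression in one pass,
# without temporaries for intermediate results.
lin_comb = ElementwiseKernel(ctx,
        "float k1, float *x, float k2, float *y, float *out",
        "out[i] = k1*x[i] + k2*y[i]",
        "lin_comb")

out = cl.array.empty_like(a)
lin_comb(np.float32(2), a, np.float32(3), b, out)

assert np.allclose(out.get(), 2*a.get() + 3*b.get())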
pyopencl.reduction.ReductionKernel(ctx, dtype_out, neutral, reduce_expr, map_expr=None, arguments=None, name="reduce_kernel", options=[], preamble="")¶Generate a kernel that takes a number of scalar or vector arguments (at least one vector argument), performs the map_expr on each entry of the vector argument and then the reduce_expr on the outcome of that. neutral serves as an initial value. preamble offers the possibility to add preprocessor directives and other code (such as helper functions) to be added before the actual reduction kernel code.
Vectors in map_expr should be indexed by the variable i. reduce_expr uses the formal values “a” and “b” to indicate two operands of a binary reduction operation. If you do not specify a map_expr, in[i] is assumed automatically and treated as the only input argument.
dtype_out specifies the numpy.dtype in which the reduction is performed and in which the result is returned. neutral is specified as a float or integer formatted as a string. reduce_expr and map_expr are specified as string-formatted operations, and arguments is specified as a string formatted as a C argument list. name specifies the name under which the kernel is compiled. options are passed unmodified to pyopencl.Program.build(). preamble specifies a string of code that is inserted before the actual kernels.
__call__(*args, queue=None, wait_for=None, return_event=False, out=None)¶wait_for may either be None or a list of pyopencl.Event instances for whose completion this command waits before starting execution.
With out, the single-entry pyopencl.array.Array receiving the result can be specified. Because offsets are supported, the result can be stored anywhere (e.g. out=a[3]).
Returns: the resulting scalar as a single-entry pyopencl.array.Array if return_event is False, otherwise a tuple (scalar_array, event).
Note
The returned pyopencl.Event corresponds only to part of the execution of the reduction. It is not suitable for profiling.
New in version 2011.1.
Changed in version 2014.2: Added out parameter.
Here’s a usage example:
a = pyopencl.array.arange(queue, 400, dtype=numpy.float32)
b = pyopencl.array.arange(queue, 400, dtype=numpy.float32)
krnl = ReductionKernel(ctx, numpy.float32, neutral="0",
        reduce_expr="a+b", map_expr="x[i]*y[i]",
        arguments="__global float *x, __global float *y")
my_dot_prod = krnl(a, b).get()
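The out parameter can direct the result into a slot of an existing array rather than a freshly allocated single-entry array; a sketch continuing the example above (the target array and the offset 3 are illustrative):
result = pyopencl.array.zeros(queue, 10, dtype=numpy.float32)
krnl(a, b, out=result[3])              # offsets are supported, so any slot works
print(result.get()[3], my_dot_prod)    # both hold the same dot product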
A prefix sum is a running sum of an array, as provided by e.g. numpy.cumsum:
>>> import numpy as np
>>> a = [1,1,1,1,1,2,2,2,2,2]
>>> np.cumsum(a)
array([ 1, 2, 3, 4, 5, 7, 9, 11, 13, 15])
This is a very simple example of what a scan can do. It turns out that scans are significantly more versatile. They are a basic building block of many non-trivial parallel algorithms. Many of the operations enabled by scans seem difficult to parallelize because of loop-carried dependencies.
See also GenericScanKernel.
This example illustrates the implementation of a simplified version of pyopencl.algorithm.copy_if(), which copies integers from an array into the (variable-size) output if they are greater than 300 (here, a is an integer pyopencl.array.Array; see the GenericScanKernel usage example below for how such an array might be created):
knl = GenericScanKernel(
        ctx, np.int32,
        arguments="__global int *ary, __global int *out",
        input_expr="(ary[i] > 300) ? 1 : 0",
        scan_expr="a+b", neutral="0",
        output_statement="""
            if (prev_item != item) out[item-1] = ary[i];
            """)
out = a.copy()
knl(a, out)
a_host = a.get()
out_host = a_host[a_host > 300]
assert (out_host == out.get()[:len(out_host)]).all()
The value being scanned over is a number of flags indicating whether each array element is greater than 300. These flags are computed by input_expr. The prefix sum over this array gives a running count of array items greater than 300. The output_statement then compares prev_item (the previous item’s scan result, i.e. index) to item (the current item’s scan result, i.e. index). If they differ, i.e. if the predicate was satisfied at this position, then the item is stored in the output at the computed index.
This example does not make use of the more advanced features also available in PyOpenCL, such as segmented scans (via is_segment_start_expr); see the implementation of unique() for an example.
New in version 2013.1.
pyopencl.scan.GenericScanKernel(ctx, dtype, arguments, input_expr, scan_expr, neutral, output_statement, is_segment_start_expr=None, input_fetch_exprs=[], index_dtype=<type 'numpy.int32'>, name_prefix='scan', options=[], preamble='', devices=None)¶Generates and executes code that performs prefix sums (“scans”) on arbitrary types, with many possible tweaks.
Usage example:
from pyopencl.scan import GenericScanKernel
knl = GenericScanKernel(
        context, np.int32,
        arguments="__global int *ary",
        input_expr="ary[i]",
        scan_expr="a+b", neutral="0",
        output_statement="ary[i+1] = item;")
a = cl.array.arange(queue, 10000, dtype=np.int32)
knl(a, queue=queue)
The first array in the argument list determines the size of the index space over which the scan is carried out, and thus the values over which the index i, which occurs in a number of the code fragments above, will vary.
All code fragments further have access to N, the number of elements being processed in the scan.
__call__(*args, allocator=None, queue=None, size=None, wait_for=None)¶queue and allocator default to the ones provided on the first pyopencl.array.Array in args. size may specify the length of the scan to be carried out. If not given, this length is inferred from the first array argument passed.
Returns a new pyopencl.Event. wait_for may either be None or a list of pyopencl.Event instances for whose completion this command waits before starting execution.
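A short sketch of these parameters in use, reusing knl and a from the usage example above (the size value is illustrative):
evt = knl(a, size=5000)    # scan only the first 5000 entries of a
evt.wait()                 # or pass [evt] as wait_for to a later kernel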
Note
The returned pyopencl.Event corresponds only to part of the execution of the scan. It is not suitable for profiling.
pyopencl.scan.GenericDebugScanKernel¶Performs the same function and has the same interface as GenericScanKernel, but uses a dead-simple, sequential scan. Works best on CPU platforms, and helps isolate bugs in scans by removing the potential for issues originating in parallel execution.
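Since the interface matches GenericScanKernel, swapping it in for debugging amounts to changing the class name; a sketch reusing the constructor arguments from the usage example above:
from pyopencl.scan import GenericDebugScanKernel

debug_knl = GenericDebugScanKernel(
        context, np.int32,
        arguments="__global int *ary",
        input_expr="ary[i]",
        scan_expr="a+b", neutral="0",
        output_statement="ary[i+1] = item;")

debug_knl(a, queue=queue)   # same call convention as GenericScanKernel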
pyopencl.scan.ExclusiveScanKernel(ctx, dtype, scan_expr, neutral, name_prefix="scan", options=[], preamble="", devices=None)¶Generates a kernel that can compute a prefix sum using any associative operation given as scan_expr. scan_expr uses the formal values “a” and “b” to indicate two operands of an associative binary operation. neutral is the neutral element of scan_expr, obeying scan_expr(a, neutral) == a.
dtype specifies the type of the arrays being operated on. name_prefix is used for kernel names to ensure recognizability in profiles and logs. options is a list of compiler options to use when building. preamble specifies a string of code that is inserted before the actual kernels. devices may be used to restrict the set of devices on which the kernel is meant to run (defaults to all devices in the context ctx).
__call__(self, input_ary, output_ary=None, allocator=None, queue=None)¶
pyopencl.scan.InclusiveScanKernel(dtype, scan_expr, neutral=None, name_prefix="scan", options=[], preamble="", devices=None)¶Works like ExclusiveScanKernel.
Changed in version 2013.1: neutral is now always required.
For the array [1,2,3], inclusive scan results in [1,3,6], and exclusive scan results in [0,1,3].
Here’s a usage example:
knl = InclusiveScanKernel(context, np.int32, "a+b")
n = 2**20-2**18+5
host_data = np.random.randint(0, 10, n).astype(np.int32)
dev_data = cl_array.to_device(queue, host_data)
knl(dev_data)
assert (dev_data.get() == np.cumsum(host_data, axis=0)).all()
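An exclusive counterpart to this example, sketched under the assumption that the constructor mirrors the signature above and that the scan runs in place when output_ary is omitted (as it does in the inclusive example):
from pyopencl.scan import ExclusiveScanKernel

exknl = ExclusiveScanKernel(context, np.int32, "a+b", neutral="0")
dev_data = cl_array.to_device(queue, host_data)
exknl(dev_data)
# Each entry now holds the sum of all *preceding* entries.
assert (dev_data.get() == np.cumsum(host_data) - host_data).all()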
pyopencl.algorithm.copy_if(ary, predicate, extra_args=[], preamble='', queue=None, wait_for=None)¶Copy the elements of ary satisfying predicate to an output array.
Returns: a tuple (out, count, event) where out is the output array, count is an on-device scalar (fetch to host with count.get()) indicating how many elements satisfied predicate, and event is a pyopencl.Event.
New in version 2013.1.
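A minimal usage sketch (assuming a populated integer pyopencl.array.Array a, as in the scan examples above; inside the predicate string, the array under test is referred to as ary and indexed by i):
from pyopencl.algorithm import copy_if

out, count, evt = copy_if(a, "ary[i] > 300")
evt.wait()
n = count.get()            # number of elements that satisfied the predicate
print(out.get()[:n])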
pyopencl.algorithm.remove_if(ary, predicate, extra_args=[], preamble='', queue=None, wait_for=None)¶Copy the elements of ary not satisfying predicate to an output array.
Returns: a tuple (out, count, event) where out is the output array, count is an on-device scalar (fetch to host with count.get()) indicating how many elements did not satisfy predicate, and event is a pyopencl.Event.
New in version 2013.1.
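A sketch mirroring the copy_if example above (same assumptions about a and the predicate syntax):
from pyopencl.algorithm import remove_if

out, count, evt = remove_if(a, "ary[i] > 300")
n = count.get()            # number of elements that did *not* satisfy the predicate
print(out.get()[:n])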
pyopencl.algorithm.partition(ary, predicate, extra_args=[], preamble='', queue=None, wait_for=None)¶Copy the elements of ary into one of two arrays depending on whether they satisfy predicate.
Returns: a tuple (out_true, out_false, count, event) where count is an on-device scalar (fetch to host with count.get()) indicating how many elements satisfied the predicate, and event is a pyopencl.Event.
New in version 2013.1.
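A sketch under the same assumptions as the copy_if example above:
from pyopencl.algorithm import partition

out_true, out_false, count, evt = partition(a, "ary[i] > 300")
n_true = count.get()
# The first n_true entries of out_true satisfy the predicate; the remaining
# input elements occupy the start of out_false.
print(out_true.get()[:n_true], out_false.get()[:len(a) - n_true])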
pyopencl.algorithm.unique(ary, is_equal_expr='a == b', extra_args=[], preamble='', queue=None, wait_for=None)¶Copy the elements of ary into the output if is_equal_expr, applied to the array element and its predecessor, yields false.
Works like the UNIX command uniq, with a potentially custom comparison. This operation is often used on sorted sequences.
Returns: a tuple (out, count, event) where out is the output array, count is an on-device scalar (fetch to host with count.get()) indicating how many elements satisfied the predicate, and event is a pyopencl.Event.
New in version 2013.1.
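A sketch on sorted input (sorted_ary is assumed to be an already-sorted pyopencl.array.Array; the default comparison a == b is used):
from pyopencl.algorithm import unique

out, count, evt = unique(sorted_ary)
uniq_host = out.get()[:count.get()]    # one representative per run of equal values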
pyopencl.algorithm.RadixSort(context, arguments, key_expr, sort_arg_names, bits_at_a_time=2, index_dtype=<type 'numpy.int32'>, key_dtype=<type 'numpy.uint32'>, scan_kernel=<class 'pyopencl.scan.GenericScanKernel'>, options=[])¶Provides a general radix sort on the compute device.
See also pyopencl.algorithm.BitonicSort.
New in version 2013.1.
__call__(*args, **kwargs)¶Run the radix sort. In addition to args, which must match the arguments specification on the constructor, additional keyword arguments are supported.
Returns: a tuple.
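A sketch of sorting a single integer array by its own value, assuming a context ctx, a queue, and an integer pyopencl.array.Array a as in the earlier examples; note that the argument-declaration string and the structure of the return value (a list of sorted arrays, one per entry of sort_arg_names, plus an event) are assumptions here rather than something this excerpt spells out:
from pyopencl.algorithm import RadixSort

sort_knl = RadixSort(ctx, "int *ary",
        key_expr="ary[i]",
        sort_arg_names=["ary"])

(sorted_ary,), evt = sort_knl(a, queue=queue)
assert (np.diff(sorted_ary.get()) >= 0).all()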
pyopencl.algorithm.ListOfListsBuilder(context, list_names_and_dtypes, generate_template, arg_decls, count_sharing=None, devices=None, name_prefix='plb_build_list', options=[], preamble='', debug=False, complex_kernel=False)¶Generates and executes code to produce a large number of variable-size lists, simply.
Note
This functionality is provided as a preview. Its interface is subject to change until this notice is removed.
New in version 2013.1.
Here’s a usage example:
from pyopencl.algorithm import ListOfListsBuilder
builder = ListOfListsBuilder(context, [("mylist", np.int32)], """
        void generate(LIST_ARG_DECL USER_ARG_DECL index_type i)
        {
            int count = i % 4;
            for (int j = 0; j < count; ++j)
            {
                APPEND_mylist(count);
            }
        }
        """, arg_decls=[])
result, event = builder(queue, 2000)
inf = result["mylist"]
assert inf.count == 3000
assert (inf.list.get()[-6:] == [1, 2, 2, 3, 3, 3]).all()
The function generate above is called once for each “input object”.
Each input object can then generate zero or more list entries.
The number of these input objects is given to __call__()
as n_objects.
List entries are generated by calls to APPEND_<list name>(value).
Multiple lists may be generated at once.
generate_template may use a number of C macros and identifiers, such as index_type, USER_ARG_DECL, LIST_ARG_DECL, and APPEND_<list name>(value), as seen in the example above.
All argument-list related macros have a trailing comma included if they are non-empty.
generate_template must supply a function:
void generate(USER_ARG_DECL LIST_ARG_DECL index_type i)
{
    APPEND_mylist(5);
}
Internally, the generate_template is expanded (at least) twice: once for a ‘counting’ stage, where the size of all the lists is determined, and a second time for a ‘generation’ stage, where the lists are actually filled. A generate function that has side effects beyond calling append is therefore ill-formed.
pyopencl.bitonic_sort.BitonicSort(context)¶Sort an array (or one axis of one) using a sorting network.
Will only work if the axis of the array to be sorted has a length that is a power of 2.
New in version 2015.2.
See also pyopencl.algorithm.RadixSort.
__call__(arr, idx=None, queue=None, wait_for=None, axis=0)¶
Returns: a tuple (sorted_array, event).
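A minimal sketch (assuming a context ctx and a queue; the dtype and size are illustrative, and the array length is a power of two as required above):
from pyopencl.bitonic_sort import BitonicSort

sorter = BitonicSort(ctx)
arr = cl.array.to_device(queue, np.random.rand(2**14).astype(np.float32))
arr_sorted, evt = sorter(arr, queue=queue)
assert (np.diff(arr_sorted.get()) >= 0).all()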