public class MinHashLSH extends Estimator<MinHashLSHModel>
LSH class for Jaccard distance.
The input can be dense or sparse vectors, but it is more efficient if it is sparse. For example,
Vectors.sparse(10, Array((2, 1.0), (3, 1.0), (5, 1.0)))
means there are 10 elements in the space. This set contains elements 2, 3, and 5. Also, any
input vector must have at least 1 non-zero index, and all non-zero values are
treated as binary "1" values.
References: Wikipedia on MinHash
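As an illustration of the distance this class approximates, the following minimal sketch treats each input vector as the set of its non-zero indices and computes the exact Jaccard distance between two such sets. The class `JaccardDistanceSketch`, its method, and the second set are hypothetical and are not part of the Spark API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Spark: each vector is represented by the
// set of its non-zero indices, since all non-zero values count as binary 1.
final class JaccardDistanceSketch {

    // Jaccard distance = 1 - |A intersect B| / |A union B|.
    static double jaccardDistance(Set<Integer> a, Set<Integer> b) {
        if (a.isEmpty() || b.isEmpty()) {
            // Mirrors the documented requirement of at least one non-zero index.
            throw new IllegalArgumentException("each set needs at least one element");
        }
        Set<Integer> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<Integer> union = new HashSet<>(a);
        union.addAll(b);
        return 1.0 - (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        // The set from the text: non-zero indices 2, 3 and 5 in a space of 10 elements.
        Set<Integer> first = new HashSet<>(Arrays.asList(2, 3, 5));
        // A second, made-up set for comparison.
        Set<Integer> second = new HashSet<>(Arrays.asList(3, 5, 7));
        // The intersection {3, 5} has 2 elements and the union {2, 3, 5, 7} has 4.
        System.out.println(jaccardDistance(first, second)); // prints 0.5
    }
}
```

The size of the space (10 in the example) does not enter the computation; only the non-zero indices matter, which is one reason sparse input is the natural fit.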
Constructor and Description |
---|
MinHashLSH() |
MinHashLSH(String uid) |
Modifier and Type | Method and Description |
---|---|
static Params | clear(Param<?> param) |
MinHashLSH | copy(ParamMap extra): Creates a copy of this instance with the same UID and some extra params. |
static String | explainParam(Param<?> param) |
static String | explainParams() |
static ParamMap | extractParamMap() |
static ParamMap | extractParamMap(ParamMap extra) |
static T | fit(Dataset<?> dataset) |
T | fit(Dataset<?> dataset): Fits a model to the input data. |
static M | fit(Dataset<?> dataset, ParamMap paramMap) |
static scala.collection.Seq<M> | fit(Dataset<?> dataset, ParamMap[] paramMaps) |
static M | fit(Dataset<?> dataset, ParamPair<?> firstParamPair, ParamPair<?>... otherParamPairs) |
static M | fit(Dataset<?> dataset, ParamPair<?> firstParamPair, scala.collection.Seq<ParamPair<?>> otherParamPairs) |
static <T> scala.Option<T> | get(Param<T> param) |
static <T> scala.Option<T> | getDefault(Param<T> param) |
static String | getInputCol() |
String | getInputCol() |
static int | getNumHashTables() |
int | getNumHashTables() |
static <T> T | getOrDefault(Param<T> param) |
static String | getOutputCol() |
String | getOutputCol() |
static Param<Object> | getParam(String paramName) |
static long | getSeed() |
static <T> boolean | hasDefault(Param<T> param) |
static boolean | hasParam(String paramName) |
static Param<String> | inputCol() |
Param<String> | inputCol(): Param for input column name. |
static boolean | isDefined(Param<?> param) |
static boolean | isSet(Param<?> param) |
static MinHashLSH | load(String path) |
static IntParam | numHashTables() |
IntParam | numHashTables(): Param for the number of hash tables used in LSH OR-amplification. |
static Param<String> | outputCol() |
Param<String> | outputCol(): Param for output column name. |
static Param<?>[] | params() |
static void | save(String path) |
static LongParam | seed() |
static <T> Params | set(Param<T> param, T value) |
MinHashLSH | setInputCol(String value) |
MinHashLSH | setNumHashTables(int value) |
MinHashLSH | setOutputCol(String value) |
MinHashLSH | setSeed(long value) |
static String | toString() |
StructType | transformSchema(StructType schema): Developer API. Check transform validity and derive the output schema from the input schema. |
String | uid(): An immutable unique ID for the object and its derivatives. |
StructType | validateAndTransformSchema(StructType schema): Transform the Schema for LSH. |
static MLWriter | write() |
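Methods inherited from class java.lang.Object: equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface Params: clear, copyValues, defaultCopy, defaultParamMap, explainParam, explainParams, extractParamMap, extractParamMap, get, getDefault, getOrDefault, getParam, hasDefault, hasParam, isDefined, isSet, paramMap, params, set, set, set, setDefault, setDefault, shouldOwn
Methods inherited from interface Identifiable: toString
Methods inherited from interface DefaultParamsWritable: write
Methods inherited from interface MLWritable: save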
public static MinHashLSH load(String path)
public static String toString()
public static Param<?>[] params()
public static String explainParam(Param<?> param)
public static String explainParams()
public static final boolean isSet(Param<?> param)
public static final boolean isDefined(Param<?> param)
public static boolean hasParam(String paramName)
public static Param<Object> getParam(String paramName)
public static final <T> scala.Option<T> get(Param<T> param)
public static final <T> T getOrDefault(Param<T> param)
public static final <T> scala.Option<T> getDefault(Param<T> param)
public static final <T> boolean hasDefault(Param<T> param)
public static final ParamMap extractParamMap()
public static M fit(Dataset<?> dataset, ParamPair<?> firstParamPair, scala.collection.Seq<ParamPair<?>> otherParamPairs)
public static M fit(Dataset<?> dataset, ParamPair<?> firstParamPair, ParamPair<?>... otherParamPairs)
public static final Param<String> inputCol()
public static final String getInputCol()
public static final Param<String> outputCol()
public static final String getOutputCol()
public static final IntParam numHashTables()
public static final int getNumHashTables()
public static void save(String path) throws java.io.IOException
Throws: java.io.IOException
public static MLWriter write()
public static T fit(Dataset<?> dataset)
public static final LongParam seed()
public static final long getSeed()
public String uid()
Description copied from interface: Identifiable
An immutable unique ID for the object and its derivatives.
public MinHashLSH setInputCol(String value)
public MinHashLSH setOutputCol(String value)
public MinHashLSH setNumHashTables(int value)
public MinHashLSH setSeed(long value)
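To give a feel for what the number of hash tables and the seed control, here is a generic MinHash sketch in plain Java. It is a sketch under stated assumptions, not Spark's implementation: the class `MinHashSketch`, the hash family `((1 + index) * a + b) mod PRIME`, and the choice of modulus are illustrative, and the seed is assumed to drive the random choice of the coefficients `a` and `b`.

```java
import java.util.Random;

// Hypothetical, generic MinHash sketch; names and constants are assumptions.
final class MinHashSketch {

    // A large prime used as the modulus of the hash family (assumed choice).
    static final long PRIME = 2038074743L;

    // Draws one (a, b) coefficient pair per hash table from the given seed.
    static long[][] randomCoefficients(int numHashTables, long seed) {
        Random rng = new Random(seed);
        long[][] coefs = new long[numHashTables][2];
        for (int i = 0; i < numHashTables; i++) {
            coefs[i][0] = 1 + rng.nextInt((int) PRIME - 1); // a in [1, PRIME - 1]
            coefs[i][1] = rng.nextInt((int) PRIME - 1);     // b in [0, PRIME - 2]
        }
        return coefs;
    }

    // One min-hash per table: the minimum hash value over the non-zero indices.
    static long[] signature(int[] nonZeroIndices, long[][] coefs) {
        if (nonZeroIndices.length == 0) {
            throw new IllegalArgumentException("at least one non-zero index is required");
        }
        long[] sig = new long[coefs.length];
        for (int i = 0; i < coefs.length; i++) {
            long min = Long.MAX_VALUE;
            for (int index : nonZeroIndices) {
                long h = ((1L + index) * coefs[i][0] + coefs[i][1]) % PRIME;
                min = Math.min(min, h);
            }
            sig[i] = min;
        }
        return sig;
    }

    public static void main(String[] args) {
        long[][] coefs = randomCoefficients(3, 42L);
        long[] sig = signature(new int[] {2, 3, 5}, coefs);
        // Three hash tables yield a signature with three values.
        System.out.println(java.util.Arrays.toString(sig));
    }
}
```

With the same seed and the same number of tables, the sketch produces the same coefficients and therefore the same signature for a given set, which is the reproducibility a fixed seed is meant to provide.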
public StructType transformSchema(StructType schema)
Description copied from class: PipelineStage
Check transform validity and derive the output schema from the input schema.
We check validity for interactions between parameters during transformSchema and raise an exception if any parameter value is invalid. Parameter value checks which do not depend on other parameters are handled by Param.validate().
A typical implementation should first verify the schema change and parameter validity, including complex parameter interaction checks.
Specified by: transformSchema in class PipelineStage
Parameters: schema - (undocumented)
public MinHashLSH copy(ParamMap extra)
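Description copied from interface: Params
Creates a copy of this instance with the same UID and some extra params. See defaultCopy().
Specified by: copy in interface Params
Specified by: copy in class Estimator<MinHashLSHModel>
Parameters: extra - (undocumented)
public T fit(Dataset<?> dataset)
Description copied from class: Estimator
Fits a model to the input data.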
public IntParam numHashTables()
LSH OR-amplification can be used to reduce the false negative rate. Higher values for this param lead to a reduced false negative rate, at the expense of added computational complexity.
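The trade-off can be made concrete with a small calculation. The sketch below assumes the standard MinHash property that a single hash table collides with probability equal to the Jaccard similarity s, and that the tables are independent; under OR-amplification two items become candidates if they collide in at least one table, so the candidate probability is 1 - (1 - s)^n and the false negative rate is (1 - s)^n. The class and method names are hypothetical.

```java
// Hypothetical helper, not part of Spark: models OR-amplification over
// independent hash tables.
final class OrAmplificationSketch {

    // Probability that two items with the given Jaccard similarity collide in
    // at least one of numHashTables independent tables.
    static double collisionProbability(double similarity, int numHashTables) {
        if (similarity < 0.0 || similarity > 1.0) {
            throw new IllegalArgumentException("similarity must be in [0, 1]");
        }
        if (numHashTables < 1) {
            throw new IllegalArgumentException("numHashTables must be at least 1");
        }
        return 1.0 - Math.pow(1.0 - similarity, numHashTables);
    }

    public static void main(String[] args) {
        double similarity = 0.5; // made-up similarity for illustration
        for (int n : new int[] {1, 3, 5, 10}) {
            double p = collisionProbability(similarity, n);
            // The false negative rate is the complement, (1 - s)^n.
            System.out.printf("numHashTables=%d candidate=%.4f falseNegative=%.4f%n", n, p, 1.0 - p);
        }
    }
}
```

Each additional table lowers the false negative rate geometrically, but also adds one more hash evaluation per input row and more candidate pairs to check, which is the added computational cost mentioned above.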
public int getNumHashTables()
public StructType validateAndTransformSchema(StructType schema)
Parameters: schema - The schema of the input dataset without outputCol.
Returns: The schema with outputCol added.
public Param<String> inputCol()
public String getInputCol()
public Param<String> outputCol()
public String getOutputCol()