public class RegexTokenizer extends UnaryTransformer<java.lang.String,scala.collection.Seq<java.lang.String>,RegexTokenizer>
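A regex-based tokenizer that extracts tokens either by using the provided regex pattern to split the text (the default) or by repeatedly matching the regex (if gaps is false).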
Optional parameters also allow filtering tokens using a minimal length.
It returns an array of strings that can be empty.

| Constructor and Description |
|---|
| RegexTokenizer() |
| RegexTokenizer(java.lang.String uid) |
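
The two tokenizing modes described above (splitting on the pattern versus collecting its matches) can be illustrated with a small plain-Java sketch. It uses only `java.util.regex` and is an approximation of the documented behavior, not Spark's implementation; the class name `TokenizerModesSketch` and its `tokenize` helper are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java approximation of the two modes; not Spark's own code.
class TokenizerModesSketch {

    // gaps = true:  the pattern describes delimiters, so split on it.
    // gaps = false: the pattern describes tokens, so collect every match.
    static List<String> tokenize(String text, String pattern, boolean gaps) {
        List<String> tokens = new ArrayList<>();
        Pattern re = Pattern.compile(pattern);
        if (gaps) {
            for (String piece : re.split(text)) {
                tokens.add(piece);
            }
        } else {
            Matcher m = re.matcher(text);
            while (m.find()) {
                tokens.add(m.group());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Spark ML, regex tokenizer";
        // Splitting on whitespace keeps the comma attached to "ML,".
        System.out.println(tokenize(text, "\\s+", true));
        // Matching word characters drops the comma.
        System.out.println(tokenize(text, "\\w+", false));
    }
}
```

On a RegexTokenizer instance, the corresponding configuration would go through setPattern and setGaps.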
| Modifier and Type | Method and Description |
|---|---|
| RegexTokenizer | copy(ParamMap extra): Creates a copy of this instance with the same UID and some extra params. |
| protected scala.Function1<java.lang.String,scala.collection.Seq<java.lang.String>> | createTransformFunc(): Creates the transform function using the given param map. |
| BooleanParam | gaps(): Indicates whether regex splits on gaps (true) or matches tokens (false). |
| boolean | getGaps() |
| int | getMinTokenLength() |
| java.lang.String | getPattern() |
| boolean | getToLowercase() |
| static RegexTokenizer | load(java.lang.String path) |
| IntParam | minTokenLength(): Minimum token length, >= 0. |
| protected DataType | outputDataType(): Returns the data type of the output column. |
| Param<java.lang.String> | pattern(): Regex pattern used to match delimiters if gaps is true or tokens if gaps is false. |
| RegexTokenizer | setGaps(boolean value) |
| RegexTokenizer | setMinTokenLength(int value) |
| RegexTokenizer | setPattern(java.lang.String value) |
| RegexTokenizer | setToLowercase(boolean value) |
| BooleanParam | toLowercase(): Indicates whether to convert all characters to lowercase before tokenizing. |
| java.lang.String | uid(): An immutable unique ID for the object and its derivatives. |
| protected void | validateInputType(DataType inputType): Validates the input type. |
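
How toLowercase and minTokenLength interact with tokenizing can be sketched in plain Java. This is an illustration under stated assumptions, not Spark's code: lowercasing is applied before splitting (as the toLowercase description says), tokens are kept when their length is at least the minimum (an assumed reading of "minimum token length"), and `Locale.ROOT` is this sketch's own choice. The class `TokenFilterSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Plain-Java illustration of lowercasing and length filtering; not Spark's own code.
class TokenFilterSketch {

    static List<String> tokenize(String text, String delimiterPattern,
                                 boolean toLowercase, int minTokenLength) {
        if (minTokenLength < 0) {
            throw new IllegalArgumentException("minTokenLength must be >= 0");
        }
        // Lowercase first, so the pattern sees the converted text.
        String input = toLowercase ? text.toLowerCase(Locale.ROOT) : text;
        List<String> kept = new ArrayList<>();
        for (String token : Pattern.compile(delimiterPattern).split(input)) {
            // Assumption: a token is kept when its length is >= the minimum.
            if (token.length() >= minTokenLength) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Leading whitespace makes the split produce an empty first piece,
        // which the length filter then removes.
        System.out.println(tokenize(" Hello  a World", "\\s+", true, 2));
        // An empty input leaves nothing once the filter is applied.
        System.out.println(tokenize("", "\\s+", true, 1));
    }
}
```

In this sketch the first call yields `[hello, world]` and the second yields an empty list, which is one way the returned array can end up empty.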
Methods inherited from class UnaryTransformer: setInputCol, setOutputCol, transform, transformSchema
Methods inherited from class Transformer: transform, transform, transform
Methods inherited from class PipelineStage: transformSchema
Methods inherited from class Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface Logging: initializeIfNecessary, initializeLogging, isTraceEnabled, log_, log, logDebug, logDebug, logError, logError, logInfo, logInfo, logName, logTrace, logTrace, logWarning, logWarning
Methods inherited from interface Params: clear, copyValues, defaultCopy, defaultParamMap, explainParam, explainParams, extractParamMap, extractParamMap, get, getDefault, getOrDefault, getParam, hasDefault, hasParam, isDefined, isSet, paramMap, params, set, set, set, setDefault, setDefault, shouldOwn, validateParams
Methods inherited from interface Identifiable: toString

public RegexTokenizer(java.lang.String uid)
public RegexTokenizer()
public static RegexTokenizer load(java.lang.String path)
public java.lang.String uid()
Declared in: interface Identifiable
public IntParam minTokenLength()
public RegexTokenizer setMinTokenLength(int value)
public int getMinTokenLength()
public BooleanParam gaps()
public RegexTokenizer setGaps(boolean value)
public boolean getGaps()
public Param<java.lang.String> pattern()
Regex pattern used to match delimiters if gaps is true or tokens if gaps is false.
Default: "\\s+"
public RegexTokenizer setPattern(java.lang.String value)
public java.lang.String getPattern()
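
The default is shown as a Java string literal: `"\\s+"` is the three-character regex `\s+`, which matches one or more whitespace characters. A minimal plain-Java sketch of what that pattern does when used as a delimiter follows; the class `DefaultPatternSketch` is hypothetical and uses `java.util.regex` directly rather than Spark.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Illustrates the documented default pattern outside of Spark.
class DefaultPatternSketch {

    // The documented default, written as a Java string literal.
    static final String DEFAULT_PATTERN = "\\s+";

    // Splits on runs of whitespace (spaces, tabs, newlines).
    static String[] splitOnDefault(String text) {
        return Pattern.compile(DEFAULT_PATTERN).split(text);
    }

    public static void main(String[] args) {
        // The literal holds three characters: backslash, 's', '+'.
        System.out.println(DEFAULT_PATTERN.length());
        // Mixed whitespace runs are each treated as a single delimiter.
        System.out.println(Arrays.toString(splitOnDefault("a  b\tc\nd")));
    }
}
```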
public final BooleanParam toLowercase()
public RegexTokenizer setToLowercase(boolean value)
public boolean getToLowercase()
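
Each setter above returns RegexTokenizer, which allows calls to be chained. The sketch below mirrors that shape with a hypothetical stand-in class, `FluentTokenizerConfig`, so the chaining can be tried without Spark. Only the pattern default comes from this page; the other initial values are placeholders chosen for the sketch.

```java
// Hypothetical stand-in that mirrors the setter/getter shape; not Spark's class.
class FluentTokenizerConfig {
    private String pattern = "\\s+";      // documented default
    private boolean gaps = true;          // placeholder initial value
    private int minTokenLength = 1;       // placeholder initial value
    private boolean toLowercase = true;   // placeholder initial value

    // Each setter returns this instance so calls can be chained.
    FluentTokenizerConfig setPattern(String value) { pattern = value; return this; }
    FluentTokenizerConfig setGaps(boolean value) { gaps = value; return this; }
    FluentTokenizerConfig setToLowercase(boolean value) { toLowercase = value; return this; }

    FluentTokenizerConfig setMinTokenLength(int value) {
        // The parameter is documented as ">= 0".
        if (value < 0) {
            throw new IllegalArgumentException("minTokenLength must be >= 0");
        }
        minTokenLength = value;
        return this;
    }

    String getPattern() { return pattern; }
    boolean getGaps() { return gaps; }
    int getMinTokenLength() { return minTokenLength; }
    boolean getToLowercase() { return toLowercase; }

    public static void main(String[] args) {
        FluentTokenizerConfig config = new FluentTokenizerConfig()
                .setPattern("\\w+")
                .setGaps(false)
                .setMinTokenLength(2)
                .setToLowercase(false);
        System.out.println(config.getPattern() + " gaps=" + config.getGaps());
    }
}
```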
protected scala.Function1<java.lang.String,scala.collection.Seq<java.lang.String>> createTransformFunc()
Declared in: createTransformFunc in class UnaryTransformer<java.lang.String,scala.collection.Seq<java.lang.String>,RegexTokenizer>
protected void validateInputType(DataType inputType)
Declared in: validateInputType in class UnaryTransformer<java.lang.String,scala.collection.Seq<java.lang.String>,RegexTokenizer>
Parameters: inputType - (undocumented)
protected DataType outputDataType()
Declared in: outputDataType in class UnaryTransformer<java.lang.String,scala.collection.Seq<java.lang.String>,RegexTokenizer>
public RegexTokenizer copy(ParamMap extra)
Declared in: copy in interface Params; copy in class UnaryTransformer<java.lang.String,scala.collection.Seq<java.lang.String>,RegexTokenizer>
Parameters: extra - (undocumented)
See also: defaultCopy()