org.apache.spark.ml.feature
Class RegexTokenizer
Object
org.apache.spark.ml.PipelineStage
org.apache.spark.ml.Transformer
org.apache.spark.ml.UnaryTransformer<String,scala.collection.Seq<String>,RegexTokenizer>
org.apache.spark.ml.feature.RegexTokenizer
- All Implemented Interfaces:
- java.io.Serializable, Logging, Params
public class RegexTokenizer
- extends UnaryTransformer<String,scala.collection.Seq<String>,RegexTokenizer>
:: Experimental ::
A regex-based tokenizer that extracts tokens either by using the provided regex pattern to split
the text (the default) or by repeatedly matching the regex (if gaps is false).
Optional parameters also allow filtering tokens by a minimum length.
It returns an array of strings, which can be empty.
- See Also:
- Serialized Form
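The two extraction modes described above can be illustrated without Spark. The following is a minimal sketch in plain Java using java.util.regex; the class RegexTokenizerSketch and its tokenize method are hypothetical names introduced only for illustration and are not part of the Spark API, and the sketch is not claimed to reproduce Spark's implementation in every detail.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a Spark class): imitates the documented behaviour
// of the gaps, pattern and minTokenLength parameters with java.util.regex.
class RegexTokenizerSketch {

    static List<String> tokenize(String text, String pattern, boolean gaps, int minTokenLength) {
        Pattern re = Pattern.compile(pattern);
        List<String> candidates = new ArrayList<>();
        if (gaps) {
            // gaps = true: the pattern describes delimiters, so split on it.
            for (String piece : re.split(text)) {
                candidates.add(piece);
            }
        } else {
            // gaps = false: the pattern describes tokens, so collect every match.
            Matcher m = re.matcher(text);
            while (m.find()) {
                candidates.add(m.group());
            }
        }
        // Keep only tokens that reach the minimum length.
        List<String> tokens = new ArrayList<>();
        for (String t : candidates) {
            if (t.length() >= minTokenLength) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "to be or-not";
        System.out.println(tokenize(text, "\\s+", true, 1));   // [to, be, or-not]
        System.out.println(tokenize(text, "\\w+", false, 1));  // [to, be, or, not]
        System.out.println(tokenize(text, "\\w+", false, 3));  // [not]
    }
}
```

With gaps set to true the hyphenated word survives as one token because only whitespace is treated as a delimiter; with gaps set to false and a word pattern, the hyphen is simply never matched.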
Methods inherited from class Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.apache.spark.Logging
initializeIfNecessary, initializeLogging, isTraceEnabled, log_, log, logDebug, logDebug, logError, logError, logInfo, logInfo, logName, logTrace, logTrace, logWarning, logWarning
Methods inherited from interface org.apache.spark.ml.param.Params
clear, copyValues, defaultCopy, defaultParamMap, explainParam, explainParams, extractParamMap, extractParamMap, get, getDefault, getOrDefault, getParam, hasDefault, hasParam, isDefined, isSet, paramMap, params, set, set, set, setDefault, setDefault, setDefault, shouldOwn, validateParams
RegexTokenizer
public RegexTokenizer(String uid)
RegexTokenizer
public RegexTokenizer()
uid
public String uid()
minTokenLength
public IntParam minTokenLength()
- Minimum token length, >= 0.
Default: 1, to avoid returning empty strings
- Returns:
- (undocumented)
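Why a default of 1 avoids empty strings can be seen with plain Java regex splitting, which yields an empty leading piece when the text starts with a delimiter. The sketch below assumes that splitting behaves like java.lang.String.split; the class MinTokenLengthSketch and its method are hypothetical names, not part of Spark.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not a Spark class): splits on whitespace and then
// applies a minimum-length filter, as minTokenLength is documented to do.
class MinTokenLengthSketch {

    static List<String> splitWithMinLength(String text, int minTokenLength) {
        List<String> kept = new ArrayList<>();
        for (String piece : text.split("\\s+")) {
            if (piece.length() >= minTokenLength) {
                kept.add(piece);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String text = " a b"; // note the leading space
        // A minimum length of 0 keeps the empty piece before the first delimiter.
        System.out.println(splitWithMinLength(text, 0).size()); // 3
        // A minimum length of 1 drops it.
        System.out.println(splitWithMinLength(text, 1));        // [a, b]
    }
}
```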
setMinTokenLength
public RegexTokenizer setMinTokenLength(int value)
getMinTokenLength
public int getMinTokenLength()
gaps
public BooleanParam gaps()
- Indicates whether regex splits on gaps (true) or matches tokens (false).
Default: true
- Returns:
- (undocumented)
setGaps
public RegexTokenizer setGaps(boolean value)
getGaps
public boolean getGaps()
pattern
public Param<String> pattern()
- Regex pattern used to match delimiters if gaps is true, or tokens if gaps is false.
Default: "\\s+"
- Returns:
- (undocumented)
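The default is shown above in string-literal form. As a small sketch, assuming the usual Java escaping rules apply to the value passed to setPattern, the literal "\\s+" denotes the three-character regex \s+ (one or more whitespace characters). The class DefaultPatternSketch is a hypothetical name used only for this illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration (not a Spark class) of how the documented
// default pattern reads as a Java string literal.
class DefaultPatternSketch {

    // Source literal "\\s+" holds the characters: backslash, 's', '+'.
    static final String DEFAULT_PATTERN = "\\s+";

    static List<String> splitOnDefault(String text) {
        return Arrays.asList(text.split(DEFAULT_PATTERN));
    }

    public static void main(String[] args) {
        System.out.println(DEFAULT_PATTERN.length());       // 3
        // Runs of mixed whitespace count as a single delimiter.
        System.out.println(splitOnDefault("a \t\n b"));     // [a, b]
    }
}
```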
setPattern
public RegexTokenizer setPattern(String value)
getPattern
public String getPattern()
copy
public RegexTokenizer copy(ParamMap extra)
- Description copied from interface: Params
- Creates a copy of this instance with the same UID and some extra params.
Subclasses should implement this method and set the return type properly.
- Specified by:
- copy in interface Params
- Overrides:
- copy in class UnaryTransformer<String,scala.collection.Seq<String>,RegexTokenizer>
- Parameters:
- extra - (undocumented)
- Returns:
- (undocumented)
- See Also:
- defaultCopy()