Questions tagged [distribution]

172 questions
11
votes
2 answers

How to check whether the distributions of the training set and testing set are similar

I have been taking part in a Kaggle competition and found that the distributions of the training set and testing set are different, so I am wondering how to check whether the distributions of the training set and testing set are similar. And…
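
One common way to approach the question above, offered here only as a minimal sketch and not as the accepted answer, is to compare each feature with a two-sample Kolmogorov-Smirnov statistic. The function names `ks_statistic` and `ks_threshold` and the synthetic Gaussian data are assumptions made for illustration; the threshold is the usual large-sample approximation.

```python
import math
import random
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    n, m = len(a_sorted), len(b_sorted)
    d = 0.0
    # The empirical CDFs only change at observed values, so checking those is enough.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / n
        cdf_b = bisect_right(b_sorted, x) / m
        d = max(d, abs(cdf_a - cdf_b))
    return d


def ks_threshold(n, m, alpha=0.05):
    """Approximate large-sample critical value for the two-sample KS test."""
    c = math.sqrt(-math.log(alpha / 2) / 2)
    return c * math.sqrt((n + m) / (n * m))


# Hypothetical data: one feature from a "train" set and two candidate "test" sets.
random.seed(0)
train = [random.gauss(0.0, 1.0) for _ in range(500)]
test_similar = [random.gauss(0.0, 1.0) for _ in range(500)]
test_shifted = [random.gauss(1.0, 1.0) for _ in range(500)]

for name, test in [("similar", test_similar), ("shifted", test_shifted)]:
    d = ks_statistic(train, test)
    flag = d > ks_threshold(len(train), len(test))
    print(f"{name}: D={d:.3f}, looks different at 5% level: {flag}")
```

This only compares one feature at a time; it says nothing about shifts in the joint distribution of several features.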
9
votes
3 answers

xgboost: Is there a way to perform regression on rates/percentages data?

I have a dependent variable, $Y$, that is made up of rates/percentages data, so each value is between $0$ and $1$. I was attracted to the xgboost library because it allows focusing in on specific subsets of the data in training itself, but I am…
Coolio2654
  • 300
  • 3
  • 10
7
votes
1 answer

Why do train, test, validation datasets need to have the same distribution?

I've found a lot of material on how to deal with differently distributed train/test/validation data sets; however, I'm struggling to find out why they need to have the same distribution. Can anyone explain to me in detail why these data sets…
zaza
  • 171
  • 1
  • 3
7
votes
3 answers

Boxplots or violinplots?

This is quite a general question, perhaps somewhat opinion-based. In most papers people use boxplots to visualize a certain distribution, yet violinplots are able to give more information. Violinplots are made by performing a kernel density…
Archie
  • 883
  • 8
  • 20
7
votes
3 answers

Which outlier detection can detect these outliers?

I have a vector and want to detect outliers in it. The following figure shows the distribution of the vector. Red points are outliers. Blue points are normal points. Yellow points are also normal. I need an outlier detection method (a…
7
votes
2 answers

Plotting different values in pandas histogram with different colors

I am working on a dataset. The dataset consists of 16 different features each feature having values belonging to the set (0, 1, 2). In order to check the distribution of values in each column, I used pandas.DataFrame.hist() method which gave me a…
enterML
  • 3,091
  • 9
  • 28
  • 38
7
votes
1 answer

What is the difference between Covariate Shift, Label Shift, Concept Shift, Concept Drift, and Prior Probability Shift?

As a beginner in MLOps, I was overwhelmed by some confusing definitions. As far as I understand, when we have a classifier or regressor with y = f(X) function: Covariate Shift is changing the distribution of independent variables (X), Label Shift…
6
votes
4 answers

Regression: How to deal with positive skewness in continuous target variable

I'm working on a regression problem. My aim is to "learn" the distribution of a continuous target $y$ as well as possible to make predictions. My model looks like: $$y_i=\beta X_i + u_i.$$ $y$ is right skewed (positive skewness) and consists of…
Peter
  • 7,896
  • 5
  • 23
  • 50
6
votes
1 answer

How can I plot/display a dataset or an image distribution?

I want to view a specific image or a dataset's distribution, and see if they are different. Does simply writing something like: # mydataset.shape = (50k,32,32,3) plt.hist(mydataset.reshape(-1)) do the trick? Or should I be doing something…
Hossein
  • 565
  • 6
  • 14
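
As a rough illustration of the idea in the excerpt above, the sketch below computes a normalized pixel-intensity histogram per colour channel instead of flattening everything into one histogram. The helper `channel_histograms`, the random stand-in dataset, and its reduced size of 100 images are assumptions made for this example, not part of the original question.

```python
import numpy as np

# Hypothetical stand-in for the dataset: 100 random 32x32 RGB images (uint8).
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(100, 32, 32, 3), dtype=np.uint8)


def channel_histograms(batch, bins=32):
    """Return one normalized intensity histogram per channel, plus the bin edges."""
    edges = np.linspace(0, 256, bins + 1)
    hists = []
    for c in range(batch.shape[-1]):
        # Flatten all pixels of channel c across every image in the batch.
        counts, _ = np.histogram(batch[..., c].ravel(), bins=edges)
        hists.append(counts / counts.sum())
    return np.stack(hists), edges


hists, edges = channel_histograms(images)
print(hists.shape)  # one row per channel, one column per bin
```

The same function can be called on a single image by passing `images[:1]`, and the resulting rows can be drawn with any plotting library to compare an image against the whole dataset.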
6
votes
1 answer

How to estimate the mutual information numerically?

Suppose I have a sample {$z_i$}$_{i\in[0,N]}$ = {($x_i,y_i$)}$_{i\in[0,N]}$ which comes from a probability distribution $p_z(z)$. How can I use it to estimate the mutual information between X and Y? $MI(X,Y) = \int_Y \int_X …
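
One simple approach to the question above, sketched here under stated assumptions and not taken from the original thread, is a plug-in estimate: discretize both variables into equal-width bins and sum $p(x,y)\log\frac{p(x,y)}{p(x)p(y)}$ over the occupied cells. The function name `mutual_information` and the choice of 10 bins are illustrative; the result is in nats and is known to be biased for small samples or many bins.

```python
import math
from collections import Counter


def mutual_information(xs, ys, bins=10):
    """Plug-in (histogram) estimate of MI(X, Y) in nats from paired samples."""

    def binned(values):
        lo, hi = min(values), max(values)
        width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal
        # Clamp so the maximum value lands in the last bin.
        return [min(int((v - lo) / width), bins - 1) for v in values]

    bx, by = binned(xs), binned(ys)
    n = len(xs)
    joint = Counter(zip(bx, by))      # counts for each occupied (x-bin, y-bin) cell
    px, py = Counter(bx), Counter(by)  # marginal counts

    mi = 0.0
    for (i, j), c in joint.items():
        # (c/n) * log( (c/n) / ((px/n) * (py/n)) )
        mi += (c / n) * math.log(c * n / (px[i] * py[j]))
    return mi


# Usage: Y identical to X gives the entropy of the binned X; constant Y gives 0.
xs = [float(v) for v in range(100)]
print(mutual_information(xs, xs))
print(mutual_information(xs, [0.0] * 100))
```

The estimate depends on the number of bins, so it is worth checking how stable it is across a few bin counts before reading much into the value.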
5
votes
2 answers

Working with Data which is not Normal/Gaussian

What happens if my data/feature is not normal? Can I still use machine learning algorithms to utilize such data for predictions? I noticed in many data science courses, there is always a strong assumption of using normal/Gaussian data. I have…
Newbie01
  • 53
  • 1
  • 4
4
votes
0 answers

Non-Gaussian like distributions - Classifier of source data fails on target data

I am asking for help with a classification problem (classes are represented by the numbers 0, 1 and 2). All features are extracted from time series data (the fundamental shape is sinusoidal). I have a source dataset with features, which do not follow a Gaussian…
4
votes
2 answers

evaluation metrics for multiple values per session

I have an application that executes my foo() function several times for each user session. There are two alternative algorithms that I can implement as the foo() function, and my goal is to evaluate them based on execution delay. The number of times foo() is…
4
votes
2 answers

Is it possible to train a probabilistic model to return several distributions?

I have nonlinear data of a function y(x), which is, let's say, parabolic. At some points of x there are several y's (look at the picture). Is it possible to train a probabilistic model to return several distributions (when needed) i.e. several means…
4
votes
2 answers

Why do seaborn.dist and pyplot.hist generate two different looking histograms on the same data?

I'm looking at telecom customers data. Two of the variables I'm looking at currently are: Monthly Charges - The total amount charged to the customer monthly. Is Senior Citizen - Whether the customer is a senior citizen. I'm trying to plot two…
helloworld
  • 43
  • 1
  • 5