Scala is a high-level language that combines functional and object-oriented programming and runs on high-performance runtimes. It was built for implementing scalable solutions, is widely used to crunch big data and support data science work that produces actionable insights, and is a good fit for large-scale projects. Designed to express common programming patterns in a concise, elegant, and type-safe way, it fuses imperative and functional programming styles.
Questions tagged [scala]
47 questions
15 votes
3 answers
How to calculate the mean of a dataframe column and find the top 10%
I am very new to Scala and Spark, and am working on some self-made exercises using baseball statistics. I am using a case class to create an RDD and assign a schema to the data, and am then turning it into a DataFrame so I can use SparkSQL to select…
the3rdNotch
- 253
- 1
- 2
- 7
15 votes
4 answers
Data Science Tools Using Scala
I know that Spark is fully integrated with Scala. Its use case is specifically large data sets. Which other tools have good Scala support? Is Scala best suited for larger data sets, or is it also suited for smaller ones?
sheldonkreger
- 1,169
- 8
- 20
6 votes
3 answers
Summarize and visualize a CSV in Java/Scala?
I would like to summarize (as in R) the contents of a CSV (possibly after loading it, or storing it somewhere, that's not a problem). The summary should contain the quartiles, mean, median, min and max of the data in a CSV file for each numeric…
Trylks
- 178
- 8
5 votes
1 answer
Saving Large Spark ML Pipeline to HDFS
I'm having trouble saving a large (relative to spark.rpc.message.maxSize) Spark ML pipeline to HDFS. Specifically, when I try to save the model to HDFS, it gives me an error related to Spark's maximum message size:
scala> val mod =…
Thomas Cleberg
- 1,525
- 7
- 22
5 votes
2 answers
What is the best deep learning library for scala?
Does anyone have a recommendation for which libraries to use for deep learning?
Soerendip
- 744
- 1
- 9
- 16
5 votes
2 answers
Distributed k-means in Spark
I want to implement the k-means algorithm in Spark. I am looking for a starting point and I found Berkeley's naive implementation. However, is that distributed?
I mean, I see no MapReduce operations. Or maybe, when submitted in Spark, the framework…
gsamaras
- 291
- 6
- 15
4 votes
1 answer
Sampling with replacement, specify the probabilities
I am trying to do sampling with replacement in Scala/Spark, defining the probabilities for each class.
This is how I would do it in R.
# Vector to sample from
x <- c("User1","User2","User3","User4","User5")
# Occurrences from which to obtain…
Stefano
- 41
- 2
4 votes
1 answer
Plotting libraries for Scala on Zeppelin
My main issue is that Zeppelin appears to limit the displayed results to 1000. I know that I can change this number, but when I do, Zeppelin becomes slow. It also looks like Zeppelin's default plotting tool only plots the first 1000…
Rami
- 604
- 2
- 6
- 16
4 votes
1 answer
Has anyone succeeded in finding a good Scala/Spark kernel for Jupyter?
The ones I've tried so far
Almond: Works very well for plain Scala, but you have to import dependencies, which gets tedious after a while. Unfortunately, it can't run when using Spark with YARN instead of local mode.
Spylon-kernel: The kernel connects, but…
Varun Gawande
- 141
- 3
4 votes
2 answers
How does MLlib in Spark select variables in logistic regression?
I have a question about MLlib in Spark (with Scala).
I'm trying to understand how LogisticRegressionWithLBFGS and LogisticRegressionWithSGD work. I usually use SAS or R to do logistic regressions but I now have to do it on Spark to be able to analyze…
SparkUser
- 113
- 2
- 6
4 votes
1 answer
How to set up multi-cluster Spark without Hadoop on Google Compute Engine
I'm new to Apache Spark.
Is it possible to configure multi-cluster Spark without Hadoop?
If so, can you please provide the steps?
I would like to create clusters on Google Compute Engine (1 master, 1 worker).
user4290511
- 101
- 6
4 votes
3 answers
Scala vs Java if you're NOT going to use Spark?
I'm facing some indecision when choosing how to allocate my scarce learning time for the next few months between Scala and Java.
I would like help objectively understanding the practical tradeoffs.
The reason I am interested in Java is that I think…
Hack-R
- 1,949
- 1
- 21
- 34
3 votes
3 answers
How to install Polynote on Windows?
I've been searching around the Internet for a while, but I have not been able to find detailed instructions on how to install Polynote (the polyglot notebook with first-class Scala support) on Windows for mixing multiple languages, Python and…
Pluviophile
- 4,203
- 14
- 32
- 56
3 votes
2 answers
Hashing trick with random forest in Scala
I am trying to apply the hashing trick and then train a random forest in Scala. I have the following code:
val documents: RDD[Seq[String]] = sc.textFile("hdfs:///tmp/new_cromosoma12v2.csv").map(_.split(",").toSeq)
val hashingTF = new HashingTF()
val…
keira
- 101
- 1
- 8
3 votes
3 answers
Task not serializable Error
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.cassandra.CassandraSQLContext
object Test {
val sparkConf = new SparkConf(true).set("spark.cassandra.connection.host", )
val…
Credosam
- 81
- 1
- 10